Border Gateway Protocol (BGP) Troubleshooting

VMware Cloud Foundation (VCF) 3.9.1 requires BGP to be configured before you can deploy a new instance of VCF. If BGP is not configured properly, the deployment will fail because it cannot validate that communication with the Edge Service Gateways has been configured properly. In this post, I’m going to quickly run through some methods you can use to troubleshoot a deployment that is failing due to a BGP issue.

For reference, consider the following diagram:

VMware Cloud Foundation 3.9.1 will automatically deploy two Edge Service Gateways, denoted here as ESG1 and ESG2. These ESGs provide for north-south traffic to the Application Virtual Networks (AVNs) deployed by VCF. To provide routing, the ESGs are configured to communicate via BGP to the corporate network.  

In this example, two Autonomous Systems (AS) are defined. AS 65001 is configured on the corporate network. AS 65003 is configured on the ESGs that will get deployed by VCF.

Two networks provide the uplinks to the corporate network. These are defined as 172.27.11.x and 172.27.12.x. A total of four IP addresses in these networks are assigned to the ESGs. These communicate with the router on the corporate network, which has the .1 address on each network. Each of the IP addresses on the ESGs is considered a neighbor of the .1 IP address on its network.
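To make the addressing concrete, here is a minimal Python sketch of the peering layout just described. The /24 prefix length and the ESG host numbers (.2 and .3) are assumptions for illustration only. The post gives the networks as 172.27.11.x and 172.27.12.x and the router as .1, nothing more.

```python
import ipaddress

# Assumption: both uplink networks are /24s (the post only says 172.27.11.x
# and 172.27.12.x).
UPLINKS = [
    ipaddress.ip_network("172.27.11.0/24"),
    ipaddress.ip_network("172.27.12.0/24"),
]

# Assumption: hypothetical host numbers for the two ESGs on each uplink.
ESG_HOSTS = {"ESG1": 2, "ESG2": 3}


def peerings():
    """Return (esg_name, esg_ip, router_ip) for every expected BGP session."""
    sessions = []
    for net in UPLINKS:
        router_ip = net.network_address + 1  # the .1 address on the corporate side
        for name, host in ESG_HOSTS.items():
            sessions.append((name, net.network_address + host, router_ip))
    return sessions


for name, esg_ip, router_ip in peerings():
    print(f"{name} {esg_ip} <-> router {router_ip}")
```

Two ESGs on two uplinks gives four sessions, which is why the router ends up with four BGP neighbors.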

Let’s assume you have set up a VyOS router as I mentioned here. If you didn’t, that’s fine. Just realize that the commands may change slightly.

First, log in to your router and check the status of BGP. You can do this using the ‘show ip bgp summary’ command with VyOS like so:

Here we can see that we have configured four BGP neighbors. You will also see the AS number defined for each neighbor as well as the router. Double check these and make sure that they match up with what is supposed to be configured in your environment.

Also note that the Up/Down column shows that BGP has never been able to communicate with the neighbors.
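If you save the summary to a file, a few lines of Python can pick out the neighbors that have never come up. This is only a sketch. It assumes the usual ten-column layout (Neighbor, V, AS, MsgRcvd, MsgSent, TblVer, InQ, OutQ, Up/Down, State/PfxRcd), and the sample text is hand-written for illustration with hypothetical neighbor addresses. It is not output captured from a real router.

```python
def never_established(summary_text):
    """Return neighbor IPs whose Up/Down column reads 'never'."""
    stuck = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Neighbor rows start with an IPv4 address; skip headers and blank lines.
        if len(fields) < 10 or fields[0].count(".") != 3:
            continue
        if fields[8].lower() == "never":  # ninth column is Up/Down (assumed layout)
            stuck.append(fields[0])
    return stuck


# Hand-written illustrative input, NOT real router output.
SAMPLE = """\
Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
172.27.11.2     4 65003       0       0      0   0    0 never    Active
172.27.11.3     4 65003      12      14      0   0    0 00:05:10 4
"""

print(never_established(SAMPLE))
```

If your router prints a different column layout, adjust the column index accordingly.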

Now that you know that BGP has not been able to communicate, we need to find out if this is due to a BGP issue or a network issue. From the router, attempt to ping each of the neighbors.
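The post runs ping directly on the router. If you would rather script the same reachability check from a machine that sits on the uplink networks (an assumption about your lab), a minimal Python sketch could look like this. The neighbor addresses in the usage comment are hypothetical.

```python
import platform
import subprocess


def ping_command(ip, count=3):
    """Build a ping command line for the local operating system."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), ip]


def is_reachable(ip, run=subprocess.run):
    """Return True if ping exits with status 0 for the given address."""
    result = run(ping_command(ip),
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


# Usage (hypothetical neighbor addresses):
#   for ip in ["172.27.11.2", "172.27.11.3", "172.27.12.2", "172.27.12.3"]:
#       print(ip, "ok" if is_reachable(ip) else "UNREACHABLE")
```

The `run` parameter only exists so the function can be exercised without sending real ICMP traffic.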

If you are unable to ping, then you need to resolve that issue first. Some suggestions of things to verify:

  • Are there any firewalls in place that would prevent the ping?
  • If you are running in a nested environment, have you verified the correct NICs are connected to the appropriate portgroups?
  • Are there any VLAN IDs set and are they correct?
  • Are you getting DUPs? Is there another system using that IP in the environment?

If you are able to ping all the neighbors, then we need to look at the BGP configuration. You can use a command similar to ‘show ip bgp summary’ on the router to verify the AS numbers configured.

You can also access the ESGs by connecting to the management IP via SSH. The username should be admin and the password would be what you provided to Cloud Builder to do the VCF deployment.

Once you are connected to an ESG, you need to find the VRF associated with the T0 router. You can do this using the ‘get logical router’ command. Identify the T0 router’s VRF and use the vrf command to switch to that context.

From here, you can try to ping the corporate router. This should work, as you’ve already tested it from the corporate router. From the ESG, you can now use the ‘get bgp neighbor summary’ command to check its configuration.

Make sure the AS numbers are correctly defined for each neighbor as well as the local AS. Also make sure the IP addresses used are correct.
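The check is symmetric: each side's remote AS must be the other side's local AS, and each side's peer IP must be the other side's own IP. Here is a minimal Python sketch of that comparison. The AS numbers come from this example, while the IP addresses and the dictionary layout are hypothetical.

```python
def check_session(router_view, esg_view):
    """Return a list of mismatches between the two ends of one BGP session.

    Each view is a dict with keys 'local_as', 'remote_as', 'local_ip' and
    'peer_ip', filled in by hand from what each device has configured.
    """
    problems = []
    if router_view["remote_as"] != esg_view["local_as"]:
        problems.append(
            f"router expects AS {router_view['remote_as']}, "
            f"ESG is AS {esg_view['local_as']}")
    if esg_view["remote_as"] != router_view["local_as"]:
        problems.append(
            f"ESG expects AS {esg_view['remote_as']}, "
            f"router is AS {router_view['local_as']}")
    if router_view["peer_ip"] != esg_view["local_ip"]:
        problems.append(
            f"router peers with {router_view['peer_ip']}, "
            f"ESG uplink is {esg_view['local_ip']}")
    if esg_view["peer_ip"] != router_view["local_ip"]:
        problems.append(
            f"ESG peers with {esg_view['peer_ip']}, "
            f"router is {router_view['local_ip']}")
    return problems


# AS numbers from this example; the ESG address .2 is hypothetical.
router = {"local_as": 65001, "remote_as": 65003,
          "local_ip": "172.27.11.1", "peer_ip": "172.27.11.2"}
esg = {"local_as": 65003, "remote_as": 65001,
       "local_ip": "172.27.11.2", "peer_ip": "172.27.11.1"}

print(check_session(router, esg) or "session definitions agree")
```

An empty list means the two ends agree; any entries point at the setting to fix.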

If you’re able to ping and you’ve verified all the AS numbers and neighbor information, then the next thing to check is whether the BGP password is correctly set. Refer to the information that was provided to Cloud Builder to begin Bringup. Then double check to ensure the password defined is the one used on the corporate router.

At this point, you should be able to check and see that BGP is connected. From an ESG, it would look like this:

If you have only performed these steps on one ESG, make sure you do it for the second ESG as well. Now you can go to the corporate router (i.e., your VyOS router) and check the BGP summary from there:

Another thing you should check is if the routes are being distributed. On VyOS, you can do this by using the ‘show ip route’ command. This command will show all the routes known by the router along with a code that shows how the router learned about the route. In the example below, the routes prefaced with a ‘B’ are routes that have been advertised through BGP.
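If the routing table is long, you can filter it for the BGP-learned entries. The sketch below assumes each route line begins with its code (for example ‘B>*’) followed by the prefix. The sample text and its prefixes are hand-written for illustration and are not real output from this environment.

```python
def bgp_routes(route_text):
    """Return the prefixes from lines whose route code starts with 'B'."""
    prefixes = []
    for line in route_text.splitlines():
        fields = line.split()
        # Route lines look like "<code> <prefix> ..."; legend lines have no prefix.
        if len(fields) >= 2 and fields[0].startswith("B") and "/" in fields[1]:
            prefixes.append(fields[1])
    return prefixes


# Hand-written illustrative input with made-up prefixes, NOT real router output.
SAMPLE_ROUTES = """\
Codes: K - kernel route, C - connected, S - static, B - BGP
C>* 172.27.11.0/24 is directly connected, eth1
B>* 10.0.1.0/24 [20/0] via 172.27.11.2, eth1, 00:01:23
B>* 10.0.2.0/24 [20/0] via 172.27.11.2, eth1, 00:01:23
"""

print(bgp_routes(SAMPLE_ROUTES))
```

An empty result after BGP has connected suggests the ESGs are not advertising anything to the router yet.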

Lastly, I’ve seen some people who have set up VyOS or some other router and have forgotten to add their management network as a network that needs to be advertised. In the example above, this would be the network that is listed. If you do not see this line, then you should execute a command like the following to configure advertisement of that network.

# set protocols bgp 65001 address-family ipv4-unicast network
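If you generate router configuration from a script, a small helper can validate the prefix before building that command. This is a minimal sketch. The helper name is my own, and 192.0.2.0/24 is a documentation-range placeholder, not the management network from this environment.

```python
import ipaddress


def advertise_command(local_as, prefix):
    """Format the VyOS command that advertises `prefix` from `local_as`.

    Raises ValueError if `prefix` is not a valid network, for example when
    host bits are set as in "192.0.2.5/24".
    """
    network = ipaddress.ip_network(prefix)
    return (f"set protocols bgp {local_as} "
            f"address-family ipv4-unicast network {network}")


# Placeholder prefix from the documentation range; substitute your own network.
print(advertise_command(65001, "192.0.2.0/24"))
```

Catching a mistyped prefix here is easier than working out later why the route never shows up.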

Again, I was using VyOS in this example. If you are using different hardware or a different software-based router, then the commands may be slightly different. They should be very similar though.

If this fixed your BGP issue, then you should be able to go back to your VCF deployment and retry the bringup operation and continue on!