Removing a Failed Edge Cluster

Sometimes things just don’t work right. I am writing another blog entry that required me to deploy an edge cluster and, well, I fat-fingered an IP address, so the deployment failed. While it would be nice to just change the submitted configuration, we can’t in this instance. Instead, we need to remove the edge cluster and then redeploy it. There is a KB for this, but I like visuals to go along with my text, so let’s walk through it together!

First I’d like to take a brief look inside NSX Manager to see what we’ll be cleaning up. I really like the network topo that NSX puts together. We can see the two Edge Uplink segments and a Tier-0 router with 4 interfaces.

The KB we’re going through can be found here: https://kb.vmware.com/s/article/78635

After reading through the KB, I copied the link to download the script, SSH’d into my SDDC Manager as vcf, and su’d to root. If your SDDC Manager does not have internet access, you can download the script elsewhere and copy it over using other means, like WinSCP.

Once logged in to the SDDC Manager, I used wget to pull the script down, adding -O to save it under an intelligible filename:

wget https://kb.vmware.com/sfc/servlet.shepherd/version/download/0685G00000NHZoBQAX -O edge_cluster_cleaner.tar.gz

Once the file downloaded I untar-ungzip’d it with the following:

tar -xf edge_cluster_cleaner.tar.gz
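If you would rather check the download before unpacking it, something like the sketch below lists the archive first and then extracts it into a directory of your choice. The function name extract_archive is my own invention and not part of the KB; it only uses standard tar options.

```shell
# Hypothetical helper (not from the KB): validate a .tar.gz, then extract it.
extract_archive() {
    archive="$1"
    dest="${2:-.}"    # extract into the current directory by default

    # Refuse to continue if the download is missing or empty.
    if [ ! -s "$archive" ]; then
        echo "archive missing or empty: $archive" >&2
        return 1
    fi

    # Listing the archive fails if it is not a readable gzip'd tarball,
    # for example if wget saved an HTML error page instead.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "not a valid tar.gz: $archive" >&2
        return 1
    fi

    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example usage (file name as used in this post):
# extract_archive edge_cluster_cleaner.tar.gz ~
```

The listing step is cheap and catches the case where the download link returned something other than the script bundle.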

This creates a “cleanup” directory; in my case it landed in root’s home (~) directory. I then changed to that directory and ran the removal script with the “-h” flag to see the options.

Looking at the options, it seems fairly straightforward. Below is what I’m going to use for the first run, including the --dryrun option. It’s a nice option to have considering I already fat-fingered one thing to get here 🙂

./remove_edge_cluster.sh --cluster WLD-1-EDGES --workload WLD-1 --user administrator@vsphere.local --dryrun
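Since a typo is what got me here in the first place, here is a small sketch that keeps the cluster, workload domain, and user in variables so the dry run and the real run are guaranteed to use the same values. The function edge_cleanup_cmd and the variable names are hypothetical and mine, not part of the KB script; only the flags shown in this post are used.

```shell
# Values from this post; change them to match your environment.
CLUSTER="WLD-1-EDGES"
WORKLOAD="WLD-1"
SSO_USER="administrator@vsphere.local"

# Hypothetical wrapper: prints the command line for a dry run or a real run.
# It only prints; nothing is executed against NSX or vCenter.
edge_cleanup_cmd() {
    mode="$1"    # "dryrun" or "real"
    cmd="./remove_edge_cluster.sh --cluster $CLUSTER --workload $WORKLOAD --user $SSO_USER"
    case "$mode" in
        dryrun) echo "$cmd --dryrun" ;;
        real)   echo "$cmd" ;;
        *)      echo "usage: edge_cleanup_cmd dryrun|real" >&2; return 1 ;;
    esac
}

# Print the dry-run command for review:
edge_cleanup_cmd dryrun
```

Printing rather than executing is deliberate: you get to read the exact command line before pasting it into the cleanup directory.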

After about 3 minutes I had a comprehensive list of everything the script would do on the next run, without --dryrun! I also had a new appreciation for everything VCF automates when it comes to NSX. Yes, I had to enter 50 or so things into a wizard, but VCF sticks them in all the right places! Let’s break down what we got back:

If you don’t pass it a --password when you run the script, it will ask you for one interactively! It then uses that info to connect to the vCenter that “owns” the specified workload domain.

From there it starts gathering all the information on the Edge Cluster. Then, since we haven’t created any Tier-1 routers, it has nothing to clean up on that front. Next, it removes the BGP configuration and interfaces on the Tier-0 router and ultimately deletes it entirely. Then it removes the edge cluster. It’s important to note that at this point we haven’t removed the edge VMs themselves yet; we have only removed configuration settings and the Tier-0, which was a container running on the edges. We also remove the “Edge Cluster”, which is simply a construct inside of NSX.

The last part of the script run is where the Edge Node VMs get deleted. Then the uplink segments in NSX get removed, along with the transport zone, port groups, and resource pools. This puts NSX back in a state where you can deploy an Edge Cluster again, hopefully with the correct IPs in place 🙂
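Because the dry run is effectively a full report of what will be removed, it can be worth keeping a copy to compare against the real run. Below is a minimal sketch using tee; the helper name run_logged and the log file name are hypothetical, and I have not checked how the script’s interactive password prompt behaves when its output is piped, so treat this as a starting point.

```shell
# Hypothetical helper: run a command, show its output, and keep a copy in a log file.
run_logged() {
    log="$1"
    shift
    # 2>&1 sends error messages to the same log as normal output.
    # Note: the return status is that of tee, not of the wrapped command.
    "$@" 2>&1 | tee "$log"
}

# Example usage from the cleanup directory (flags as used in this post):
# run_logged dryrun.log ./remove_edge_cluster.sh --cluster WLD-1-EDGES --workload WLD-1 --user administrator@vsphere.local --dryrun
```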

Let’s run it for real this time and check our output! This is the command, same as above, just without the --dryrun flag:

./remove_edge_cluster.sh --cluster WLD-1-EDGES --workload WLD-1 --user administrator@vsphere.local

Heh, better to stop now than after, I guess! When you don’t do a --dryrun, it needs one further validation:

Now, on with the show!

After less than 3 minutes, everything is cleaned up and removed, and for completeness, here’s the “new” topo in NSX Manager!

LOL, I hope my mistake helps someone who needs to remove their Edge cluster for a more valiant purpose! Thanks for reading!