You performed maintenance, updates, whatever, and suddenly your vSAN configuration breaks. VMs start dropping like flies and nothing will come up. Then you find out the vSAN configuration is broken, which makes you very sad.
vSAN depends heavily on vCenter to send changes to every other node. If for some reason vCenter loses its connection while sending changes, your vSAN can break.
The unicast settings are then left in an incomplete state in which some hosts are missing from the other hosts' lists. This can be fixed by manually configuring the unicast settings so the nodes find each other again.
Step 1 - Access to ESXi hosts:
You need access to the ESXi hosts to make the changes via SSH. By default SSH is disabled so you need to enable it.
- Log in to the ESXi host using your favorite browser
- In the top left menu, select Manage
- Click on the Services tab
- Find the service TSM-SSH, select it and click Start
- Repeat the four steps above for every ESXi host you have, including the vSAN witness server!
Step 2 - Collecting information:
In this step you find out which node is missing which information, and what the current status is. Without that, you wouldn't know which configuration changes you need to make.
- Connect to one of your ESXi hosts by SSH and use the following command to get the current status of your vSAN network:
esxcli vsan health cluster list
If anything related to vSAN shows its status in red, your vSAN has failed.
- Now connect to all your ESXi hosts by SSH and run the following command on each of them, including the witness server:
esxcli vsan cluster get
There are three values you need to look out for:
- Local Node UUID - Take note of this; depending on which node is missing, you will need this UUID
- Sub-Cluster Member Count - This should be the same as the number of nodes in your vSAN network, including the witness server
- Sub-Cluster Member HostNames - This should contain the hostnames of all the vSAN nodes in your network, including the witness server
- Now you need to check whether any node is missing other nodes in its unicast list. Use this command on all your ESXi hosts:
esxcli vsan cluster unicastagent list
You should see a list of node UUIDs with their IP addresses. The list should contain all your vSAN nodes minus the one you are currently SSH'd into, so if you have 5 nodes, you should see 4 nodes.
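If you don't feel like eyeballing the `esxcli vsan cluster get` output on every host, a small shell helper can pull out the three values for you. This is only a sketch: `cluster_field` is a made-up helper name, and it assumes every line of the output looks like `Field Name: value` and that a POSIX-style awk is available (on the host, or on your workstation if you saved the output to a file).

```shell
# Print the value of one field from `esxcli vsan cluster get` output on stdin.
# Hypothetical helper, not part of esxcli.
# Usage: esxcli vsan cluster get | cluster_field "Local Node UUID"
cluster_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # drop the leading indentation
            prefix = key ": "
            # Case-insensitive match on the full field name at line start,
            # so "Sub-Cluster Member Count" never matches "Sub-Cluster Member UUIDs".
            if (index(tolower(line), tolower(prefix)) == 1) {
                print substr(line, length(prefix) + 1)
            }
        }'
}
```

Running it for Sub-Cluster Member Count on each host makes it easy to spot the hosts that see fewer members than the rest.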
With the information collected, you need to make a comparison. For example, with 4 nodes: if node #1 lists all the other nodes but nodes #2, #3 and #4 are missing node #1, then those three nodes are at fault and need to be fixed.
In this case you know that node #1 is the missing one, so take note of that node's IP address and Local Node UUID.
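The comparison can also be scripted. The sketch below assumes you saved the output of `esxcli vsan cluster unicastagent list` from one host to a file, and that you know the vSAN IP addresses that host is supposed to see. `missing_neighbours` is a hypothetical helper name, and the IP addresses in the usage comment are placeholders. It only looks for each IP as a whole word in the file, so it does not depend on the exact column layout.

```shell
# Print every expected vSAN IP that does NOT appear in a saved unicast list.
# Hypothetical helper, not part of esxcli.
# Usage:
#   esxcli vsan cluster unicastagent list > /tmp/unicast.txt
#   missing_neighbours /tmp/unicast.txt 10.0.0.11 10.0.0.12 10.0.0.13
missing_neighbours() {
    list_file=$1
    shift
    for ip in "$@"; do
        # -F: treat the dots literally
        # -w: whole-word match, so 10.0.0.1 does not match 10.0.0.11
        if ! grep -qFw -- "$ip" "$list_file"; then
            echo "$ip"
        fi
    done
}
```

No output means the host sees every address you passed in. Remember to leave the host's own IP out of the expected list, since a node never lists itself.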
Step 3 - Fixing the problem:
Now that you know which node is missing from the rest of the nodes, you can start fixing the issue.
- Connect to all ESXi hosts that are missing the vSAN node and use the following command to add it:
esxcli vsan cluster unicastagent add -t node -u <Local Node UUID> -U true -a <IP Address of Missing Node> -p 12321
This adds the missing node back to the unicast list, thus restoring vSAN.
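Typing a UUID by hand on several hosts is an easy way to make a typo. A minimal sketch of a helper that fills in the command and prints it for review, instead of running it, could look like this. `unicast_add_cmd` is a hypothetical name; the flags and port 12321 are copied straight from the command above.

```shell
# Print (do not run) the unicast add command for one missing node.
# Hypothetical helper, not part of esxcli.
# Usage: unicast_add_cmd <Local Node UUID> <IP address of missing node>
unicast_add_cmd() {
    uuid=$1
    ip=$2
    if [ -z "$uuid" ] || [ -z "$ip" ]; then
        echo "usage: unicast_add_cmd <uuid> <ip>" >&2
        return 1
    fi
    # Same flags and port as the manual command in this step.
    echo "esxcli vsan cluster unicastagent add -t node -u $uuid -U true -a $ip -p 12321"
}
```

Check the printed line against your notes from Step 2, then paste it on each host that is missing the node.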
Step 4 - Checking vSAN:
You can use the following steps to check whether your vSAN is functioning again:
- Run the following command:
esxcli vsan cluster get
Check if the lines Sub-Cluster Member Count and Sub-Cluster Member HostNames now contain the right number of nodes and their hostnames.
- Run the following command:
esxcli vsan debug object health summary get
Every line should be 0 except for the line Healthy, which shows that your VMs are healthy and up and running. It's not the end of the world if other lines have a few objects in them. This might not be related to vSAN at all but to something that was already present in your environment.
- Run the following command:
esxcli vsan health cluster list
Everything vSAN-related should now report back green. Anything still reporting red or yellow might not be related to vSAN but to other problems you already had.
- Check vCenter: log in and see if you can manage your VMs.
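To avoid scanning the object health summary by eye, you can filter it down to the lines that are not 0. This is a sketch under the assumption that each row of `esxcli vsan debug object health summary get` is a state name followed by an object count as the last column. `unhealthy_states` is a hypothetical helper name, and the layout may differ on your ESXi version.

```shell
# List every state other than "healthy" that has a non-zero object count.
# Hypothetical helper, not part of esxcli.
# Usage: esxcli vsan debug object health summary get | unhealthy_states
unhealthy_states() {
    awk '
        {
            # Header and separator rows do not end in a plain number: skip them.
            if ($NF !~ /^[0-9]+$/) next
            count = $NF + 0
            name = $0
            sub(/[ \t]+[0-9]+[ \t]*$/, "", name)   # strip the count column
            sub(/^[ \t]+/, "", name)               # strip leading indentation
            if (tolower(name) != "healthy" && count > 0) {
                print name ": " count
            }
        }'
}
```

No output means every state other than healthy is at 0.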