You performed maintenance, updates, or some other change, and suddenly your vSAN configuration breaks. VMs start dropping like flies and nothing will come up.
Most of the time this is caused by an incomplete vSAN node configuration.
vSAN depends heavily on vCenter to push configuration changes to every node. If vCenter loses its connection while sending those changes, your vSAN can break.
The unicast settings are then left in an incomplete state in which some hosts are missing from the peer lists of others. This can be fixed by manually configuring the unicast settings so the nodes can find each other again.
Step 1 - Access to ESXi hosts:
You need access to the ESXi hosts to make the changes via SSH. By default SSH is disabled so you need to enable it.
- Log in to the ESXi host using your favorite browser
- On the top left menu select Manage
- Click on the tab Services
- Find the service TSM-SSH, select it and click Start
- Repeat the steps above for every ESXi host you have, including the vSAN witness server!
Step 2 - Collecting information:
In this step it's important to find out which node is missing what information and the current status. Without it, you wouldn't know what configuration changes you need to make.
- Connect to one of your ESXi hosts by SSH and use the following command to get the current status of your vSAN network:
esxcli vsan health cluster list
If anything related to vSAN shows a red status, your vSAN has failed.
- Now connect to all your ESXi hosts by SSH, including the witness server, and run the following command on each of them:
esxcli vsan cluster get
There are three values you need to look out for:
- Local Node UUID - Take a note of this; depending on which node is missing, you will need this UUID
- Sub-Cluster Member Count - This should be the same as the number of nodes in your vSAN network, including the witness server
- Sub-Cluster Member HostNames - This should contain the hostnames of all the vSAN nodes in your network, including the witness server
- Now check whether any node is missing other nodes from its unicast list. Run this command on all your ESXi hosts:
esxcli vsan cluster unicastagent list
You should see a list of node UUIDs with their IP addresses. The list should contain all your vSAN nodes except the one you are currently connected to, so if you have 5 nodes, you should see 4 entries.
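The values above can also be pulled out of the command output with a small script. This is a minimal sketch, assuming a POSIX shell with awk, that the cluster output uses indented "Label: value" lines, and that each peer row in the unicast list starts with a UUID. The function names vsan_field and count_unicast_peers are hypothetical helpers, not part of esxcli.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of esxcli.

# Print the value of one "Label: value" line read from stdin.
# $1 = label exactly as it appears in the output, e.g. "Local Node UUID"
vsan_field() {
    awk -v label="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop leading indentation
            prefix = label ": "
            if (index(line, prefix) == 1) {   # line starts with the label
                print substr(line, length(prefix) + 1)
                exit
            }
        }
    '
}

# Count peer rows in unicast list output read from stdin.
# Assumption: each peer row starts with a UUID (hex digits, then "-");
# the header and separator rows do not.
count_unicast_peers() {
    awk '/^[ \t]*[0-9a-fA-F]+-/ { n++ } END { print n + 0 }'
}

# On a host (not run here):
#   esxcli vsan cluster get | vsan_field "Local Node UUID"
#   esxcli vsan cluster unicastagent list | count_unicast_peers
```

On a 5-node cluster the peer count printed on each host should be 4. A lower number points at a host with an incomplete unicast list.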
With the information collected, you need to make a comparison. For example, in a 4-node cluster, if node #1 lists all other nodes but nodes #2, #3 and #4 are missing node #1 from their lists, then those three nodes are at fault and need to be fixed.
In this case you know that node #1 is the missing one, so take a note of that node's IP address and Local Node UUID.
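The comparison can be done by hand, but a short script makes it harder to overlook an entry. The sketch below assumes you have already written down the Local Node UUID of every other host, and that the UUID is the first column of the unicast list output. The function name missing_peers is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: missing_peers is a hypothetical helper, not part of esxcli.

# Print every expected UUID that is absent from the unicast list.
# stdin = output of `esxcli vsan cluster unicastagent list`
# args  = Local Node UUIDs of all OTHER hosts (collected by hand)
# Assumption: the UUID is the first whitespace-separated column of a row.
missing_peers() {
    mp_list=$(cat)
    for mp_uuid in "$@"; do
        # awk exits 0 when a row with this UUID exists, 1 otherwise
        printf '%s\n' "$mp_list" |
            awk -v u="$mp_uuid" '$1 == u { found = 1 } END { exit !found }' ||
            printf '%s\n' "$mp_uuid"
    done
}

# On a host (not run here), with placeholder UUIDs:
#   esxcli vsan cluster unicastagent list | missing_peers <UUID of node 1> <UUID of node 2>
```

Any UUID it prints belongs to a node that the current host cannot see. No output means the list on this host is complete.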
Step 3 - Fixing the problem:
Now that you know which node is missing from the other nodes, you can start fixing the issue.
- Connect to all ESXi hosts missing the vSAN node and use the following command to add it:
esxcli vsan cluster unicastagent add -t node -u <Local Node UUID> -U true -a <IP Address of Missing Node> -p 12321
This adds the missing node back to the unicast list on that host. Once every affected host has the entry, the nodes can find each other again and vSAN is restored.
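To avoid typos when filling in the placeholders, the command can be generated first and reviewed before it is run. This is a minimal sketch: unicast_add_cmd is a hypothetical helper that only prints the command from the step above, and the UUID and IP address in the usage comment are made-up example values.

```shell
#!/bin/sh
# Sketch only: unicast_add_cmd is a hypothetical helper.
# It prints the add command for review and does not run it.
# $1 = Local Node UUID of the missing node
# $2 = IP address of the missing node
unicast_add_cmd() {
    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
        echo "usage: unicast_add_cmd <uuid> <ip>" >&2
        return 1
    fi
    printf 'esxcli vsan cluster unicastagent add -t node -u %s -U true -a %s -p 12321\n' "$1" "$2"
}

# Example with made-up values:
#   unicast_add_cmd 11111111-aaaa-bbbb-cccc-000000000001 192.0.2.11
```

Copy the printed line into the SSH session of each host that is missing the node.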
Step 4 - Checking vSAN:
You can use the following steps to check if your vSAN is functioning again:
- Run the following command:
esxcli vsan cluster get
Check that the lines Sub-Cluster Member Count and Sub-Cluster Member HostNames now contain the right number of nodes and their hostnames.
- Run the following command:
esxcli vsan debug object health summary get
Every line except Healthy should be 0; objects counted as Healthy mean that your VMs are healthy and up and running. It's not the end of the world if other lines have a few objects in them. They might not be related to this problem at all but to something that was already present in your environment.
- Run the following command:
esxcli vsan health cluster list
Everything vSAN related should now report green. Anything still reporting red or yellow might not be related to vSAN but to other problems you already had.
- Check vCenter: log in and see if you can manage your VMs.
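The object health check can be reduced to a single number. The sketch below assumes the summary output is a table whose rows end in an object count and that the healthy row is labelled "healthy" (case is ignored). The function name unhealthy_objects is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: unhealthy_objects is a hypothetical helper.
# Sum the object counts of every row except the healthy one.
# stdin = output of `esxcli vsan debug object health summary get`
# Assumption: data rows look like "<status> <count>"; header and
# separator rows do not end in a number and are skipped.
unhealthy_objects() {
    awk '
        NF >= 2 && $NF ~ /^[0-9]+$/ && tolower($1) != "healthy" { total += $NF }
        END { print total + 0 }
    '
}

# On a host (not run here):
#   esxcli vsan debug object health summary get | unhealthy_objects
```

A result of 0 means every object is counted as healthy. A small non-zero number is worth investigating but, as noted above, may predate the unicast problem.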