How do you recover from a host failure when DRS didn't help because it wasn't enabled? That is a good question. Normally HA will kick in and restart the workload on another host after a failure. This morning, our HA agent must have slept in. SMH. When I got into the office (virtually) this morning, I found that I had a host failure, and 57 VMs had gone down with the ship thinking they were the captains. This is definitely not a good start to the morning. Luckily, I have faced issues like this before, so it's like a normal Thursday to me. After doing some triage, I found that the host wasn't going to power back on due to a major hardware failure.
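For the triage piece, PowerCLI can at least confirm which hosts vCenter has lost contact with before you decide a host is truly dead. A minimal sketch, nothing environment-specific assumed:
# Quick triage: list any hosts vCenter can no longer reach.
Get-VMHost -State NotResponding, Disconnected | Select-Object Name, ConnectionState, PowerState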
So, in an effort to get my environment back to working order, I dusted off an old script that gathers each machine's name and VMX file location. The dust was pretty thick, as I haven't had issues like this since the hardware refresh several years back. My plan is to remove the defective host and add the VMs back to inventory, so as it stands, I'm halfway home. The script below gathers the information needed from the disconnected host and outputs it to a CSV file that will be needed later.
# Grab each VM's name and VMX path from the failed host and dump them to a CSV.
Get-VMHost VMhost_Name_Here | Get-VM | Add-Member -MemberType ScriptProperty -Name 'VMXPath' -Value {$this.ExtensionData.Config.Files.VmPathName} -PassThru -Force | Select-Object Name,VMXPath | Export-Csv C:\Scripts\logs\VMHost_Name.csv -NoTypeInformation -Append
Even though the host is disconnected, you can still target it within a PowerCLI script to gather the information. Once the script finishes running, you will have a CSV file that looks like the following:
Name,VMXPath
VDI_System_01,[VMware_Datastore_01] VDI_System_01/VDI_System_01.vmx
VDI_System_02,[VMware_Datastore_01] VDI_System_02/VDI_System_02.vmx
VDI_System_03,[VMware_Datastore_01] VDI_System_03/VDI_System_03.vmx
VDI_System_04,[VMware_Datastore_03] VDI_System_04/VDI_System_04.vmx
VDI_System_05,[VMware_Datastore_02] VDI_System_05/VDI_System_05.vmx
VDI_System_06,[VMware_Datastore_03] VDI_System_06/VDI_System_06.vmx
VDI_System_07,[VMware_Datastore_01] VDI_System_07/VDI_System_07.vmx
VDI_System_08,[VMware_Datastore_01] VDI_System_08/VDI_System_08.vmx
VDI_System_09,[VMware_Datastore_03] VDI_System_09/VDI_System_09.vmx
VDI_System_10,[VMware_Datastore_03] VDI_System_10/VDI_System_10.vmx
At this point, I disconnect the host from vCenter and then remove it from inventory; a rough PowerCLI sketch of that step is below. Doing this also removes from inventory all of the VMs that were showing as disconnected in vSphere. So, using the list of VMX locations that was gathered earlier, I can dust off another script. This one, shown right after the sketch, takes the output of the earlier script and adds the VMs back to inventory.
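For reference, the disconnect-and-remove step can be scripted as well. This is only a sketch; the host name is the same placeholder used above, and removing a host from inventory is not something to do casually:
# Disconnect the failed host from vCenter, then remove it from inventory.
# This also drops its orphaned VMs from inventory, which the next script puts back.
$DeadHost = Get-VMHost -Name 'VMhost_Name_Here'
Set-VMHost -VMHost $DeadHost -State Disconnected -Confirm:$false
Remove-VMHost -VMHost $DeadHost -Confirm:$false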
# Re-register each VM from the CSV on a random host in the Desktops cluster.
$VMs = Import-Csv C:\scripts\logs\VMsToAddToInventory.csv
Foreach ($VM in $VMs){
    # Get-Random spreads the VMs across the surviving hosts, since DRS isn't there to balance them.
    New-VM -VMFilePath $VM.VMXPath -VMHost (Get-Random (Get-Cluster Desktops | Get-VMHost))
}
The output of the script looks like the following:
Name          PowerState Num CPUs MemoryGB
----          ---------- -------- --------
VDI_System_01 PoweredOff 4        16.000
VDI_System_02 PoweredOff 2        8.000
VDI_System_03 PoweredOff 2        6.000
VDI_System_04 PoweredOff 4        8.000
VDI_System_05 PoweredOff 2        6.000
VDI_System_06 PoweredOff 2        8.000
VDI_System_07 PoweredOff 2        8.000
VDI_System_08 PoweredOff 4        16.000
VDI_System_09 PoweredOff 4        16.000
VDI_System_10 PoweredOff 4        16.000
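As the output shows, the re-registered VMs come back in a PoweredOff state. If they don't get powered on for you by something else in your environment, a quick loop over the same CSV takes care of it. A minimal sketch, assuming every VM in the file should be started:
# Power on each VM that was just re-registered, using the same CSV as the source of truth.
$VMs = Import-Csv C:\scripts\logs\VMsToAddToInventory.csv
Foreach ($VM in $VMs){
    Get-VM -Name $VM.Name | Start-VM -Confirm:$false
}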
From inside of vSphere, you can see the VMs registering and then powering on. Once the script finished adding all of the VMs back to inventory, I did a quick spot check and everything was back to normal.
-Stuart