How do you recover from a host failure when DRS didn't help because it wasn't enabled? That is a good question. Normally HA will kick in and restart the workload on another host after a failure. This morning, our HA agent must have slept in. SMH. When I got into the office (virtually) this morning, I found that I had a host failure, and 57 VMs had gone down with the ship thinking they were the captains. This is definitely not a good start to the morning. Luckily, I have faced issues like this before, so it's like a normal Thursday to me. After doing some triage, I found that the host wasn't going to power back on due to a major hardware failure.
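For the triage piece, PowerCLI can at least confirm which hosts vCenter has lost contact with before you decide a host is truly dead. A minimal sketch, nothing environment-specific assumed:
# Quick triage: list any hosts vCenter can no longer reach.
Get-VMHost -State NotResponding, Disconnected | Select-Object Name, ConnectionState, PowerState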
So, in an effort to get my environment back to working order, I dusted off an old script that gathers each machine's name and VMX file location. The dust was pretty thick, as I haven't had issues like this since the hardware refresh several years back. My plan is to remove the defective host and add the VMs back to inventory, so as it stands, I'm halfway home. The script below gathers the information needed from the disconnected host and outputs it to a CSV file that will be needed later.
# Grab each VM's name and VMX path from the failed host and dump them to a CSV.
Get-VMHost VMhost_Name_Here | Get-VM | Add-Member -MemberType ScriptProperty -Name 'VMXPath' -Value {$this.ExtensionData.Config.Files.VmPathName} -PassThru -Force | Select-Object Name,VMXPath | Export-Csv C:\Scripts\logs\VMHost_Name.csv -NoTypeInformation -Append
Even though the host is disconnected, you can still target it within a PowerCLI script to gather the information. Once the script finishes running, you will have a CSV file that looks like the following:
Name,VMXPath
VDI_System_01,[VMware_Datastore_01] VDI_System_01/VDI_System_01.vmx
VDI_System_02,[VMware_Datastore_01] VDI_System_02/VDI_System_02.vmx
VDI_System_03,[VMware_Datastore_01] VDI_System_03/VDI_System_03.vmx
VDI_System_04,[VMware_Datastore_03] VDI_System_04/VDI_System_04.vmx
VDI_System_05,[VMware_Datastore_02] VDI_System_05/VDI_System_05.vmx
VDI_System_06,[VMware_Datastore_03] VDI_System_06/VDI_System_06.vmx
VDI_System_07,[VMware_Datastore_01] VDI_System_07/VDI_System_07.vmx
VDI_System_08,[VMware_Datastore_01] VDI_System_08/VDI_System_08.vmx
VDI_System_09,[VMware_Datastore_03] VDI_System_09/VDI_System_09.vmx
VDI_System_10,[VMware_Datastore_03] VDI_System_10/VDI_System_10.vmx
At this point, I disconnect the host from vCenter and then remove it from inventory; a rough PowerCLI sketch of that step is below. Doing this also removes from inventory all of the VMs that were showing as disconnected in vSphere. So, using the list of VMX locations that was gathered earlier, I can dust off another script. This one, shown right after the sketch, takes the output of the earlier script and adds the VMs back to inventory.
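For reference, the disconnect-and-remove step can be scripted as well. This is only a sketch; the host name is the same placeholder used above, and removing a host from inventory is not something to do casually:
# Disconnect the failed host from vCenter, then remove it from inventory.
# This also drops its orphaned VMs from inventory, which the next script puts back.
$DeadHost = Get-VMHost -Name 'VMhost_Name_Here'
Set-VMHost -VMHost $DeadHost -State Disconnected -Confirm:$false
Remove-VMHost -VMHost $DeadHost -Confirm:$false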
# Re-register each VM from the CSV on a random host in the Desktops cluster.
$VMs = Import-Csv C:\scripts\logs\VMsToAddToInventory.csv
Foreach ($VM in $VMs){
    # Get-Random spreads the VMs across the surviving hosts, since DRS isn't there to balance them.
    New-VM -VMFilePath $VM.VMXPath -VMHost (Get-Random (Get-Cluster Desktops | Get-VMHost))
}
The output of the script looks like the following:
Name          PowerState Num CPUs MemoryGB
----          ---------- -------- --------
VDI_System_01 PoweredOff 4        16.000
VDI_System_02 PoweredOff 2        8.000
VDI_System_03 PoweredOff 2        6.000
VDI_System_04 PoweredOff 4        8.000
VDI_System_05 PoweredOff 2        6.000
VDI_System_06 PoweredOff 2        8.000
VDI_System_07 PoweredOff 2        8.000
VDI_System_08 PoweredOff 4        16.000
VDI_System_09 PoweredOff 4        16.000
VDI_System_10 PoweredOff 4        16.000
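As the output shows, the re-registered VMs come back in a PoweredOff state. If they don't get powered on for you by something else in your environment, a quick loop over the same CSV takes care of it. A minimal sketch, assuming every VM in the file should be started:
# Power on each VM that was just re-registered, using the same CSV as the source of truth.
$VMs = Import-Csv C:\scripts\logs\VMsToAddToInventory.csv
Foreach ($VM in $VMs){
    Get-VM -Name $VM.Name | Start-VM -Confirm:$false
}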
From inside of vSphere, you can see the VMs registering and then powering on. Once the script finished adding all of the VMs back to inventory, I did a quick spot check and everything was back to normal.
-Stuart