Nagios sent us an alert suggesting a virtual machine had a very high load. The guest OS looked fine and nothing appeared to be wrong with the server. Upon firing up vCenter I noticed that the virtual machine in question needed consolidating. Looking at the Veeam backups from the previous night, it seemed the job had failed. Right-clicking the VM and choosing to consolidate produced this error:
Consolidate virtual machine disk files <hostname> Unable to access file <unspecified filename> since it is locked
The first thing to try, after finding a relevant KB, was to restart the management agents on ESXi3. Afterwards the vCenter server could NOT reconnect to the ESXi3 node, and HA tried to kick in and migrate the VMs. Although alerts were sent out suggesting a VM failover had occurred, this was not the case and the VMs were all still running on ESXi3. The vmkernel.log showed timeouts to a QNAP box which we use as a pRDM. This is the same NAS that the Veeam backups use. Logging into the QNAP via SSH or the web interface resulted in a hang, so the device was restarted. After the reboot I could reconnect the host to vCenter and try the consolidation again.
The Veeam service was restarted, but the consolidation still failed with the same error.
I SSH'd to the host (ESXi3), changed into the VM's directory on the VMFS volume and ran the following command:
vmkfstools -D myVirtualMachine_1-flat.vmdk
This showed the following output:
Lock [type 10c00001 offset 39976960 v 17973, hb offset 3825664
gen 9, mode 2, owner 00000000-00000000-0000-000000000000 mtime 374 nHld 2 nOvf 0]
RO Owner HB Offset 3825664 52136fed-83a478e4-d355-001018f4ef3e
RO Owner HB Offset 3309568 543cebff-bbd46ab3-ac2f-d067e5f051a4
Addr <4, 56, 64>, gen 539, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 322122547200, nb 297316 tbz 0, cow 0, newSinceEpoch 0, zla 4304, bs 1048576
From the output you'll notice there are two read-only (RO) owners of this VMDK file. The last segment of each owner ID is the MAC address of a NIC on the host holding the lock. Tracking down the two MAC addresses showed that one NIC belonged to ESXi3 and the other to ESXi2. The virtual machine itself resided on ESXi3. I then decided to reboot ESXi2, thinking that this host may have locked the VMDK file when HA kicked in earlier because the host could not reconnect.
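As a rough illustration of how the owner lines map to MAC addresses, here is a minimal Python sketch that pulls the last segment out of each owner ID in the vmkfstools -D output above and formats it as a MAC address. The function name lock_owner_macs and the SAMPLE string are made up for this example, and the all-zero owner is skipped on the assumption that it means no exclusive owner is set.

```python
import re

# Lock section of the `vmkfstools -D` output shown above, pasted as a string.
SAMPLE = """\
Lock [type 10c00001 offset 39976960 v 17973, hb offset 3825664
gen 9, mode 2, owner 00000000-00000000-0000-000000000000 mtime 374 nHld 2 nOvf 0]
RO Owner HB Offset 3825664 52136fed-83a478e4-d355-001018f4ef3e
RO Owner HB Offset 3309568 543cebff-bbd46ab3-ac2f-d067e5f051a4
"""

# Owner IDs have the shape 8-8-4-12 hex digits; the final 12 digits are the MAC.
OWNER_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})\b", re.IGNORECASE
)


def lock_owner_macs(output):
    """Return the distinct MAC addresses found in the owner IDs, in order."""
    macs = []
    for match in OWNER_RE.finditer(output):
        raw = match.group(1).lower()
        # Assumption: an all-zero owner ID is a placeholder, not a real host.
        if set(raw) == {"0"}:
            continue
        mac = ":".join(raw[i:i + 2] for i in range(0, 12, 2))
        if mac not in macs:
            macs.append(mac)
    return macs


for mac in lock_owner_macs(SAMPLE):
    print(mac)
```

Each printed address can then be compared against the NIC list of every host in the cluster (for example the output of esxcli network nic list on each ESXi host) to find out which hosts hold the lock.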
After the reboot I tried the consolidation again, but the same error appeared. I then looked at the Veeam server and noticed that there was an extra HDD on the VM. Veeam uses hot-add, so it looks like the VMDK was still attached to the backup server. The Veeam server was located on ESXi2, so that is where the second MAC address on the lock came from. I removed the VMDK from the Veeam VM (making sure I selected "Delete from VM" and KEEP the files). The lock disappeared and the consolidation was allowed to begin.