Setup: 2 ESXi 5.0 hosts connecting to a Dell MD3200i SAN via stacked Cisco switches. The setup has 3 LUNs.
Problem: The small, private cloud with a webserver and backend DB server (amongst a few other servers) became inaccessible from the outside world via HTTP, SSH, etc. On inspection it appeared that the iSCSI storage was flapping.
vCenter was located on the same storage, so checking anything through it was intermittent. Connecting to the hosts via the vSphere Client was also problematic, as it kept freezing and timing out. In my experience storage trouble also has a knock-on effect on the hosts.
SSH was the best option, so I logged onto both hosts and checked vmkernel.log. Both were showing iSCSI timeouts:
2014-09-03T09:51:50.682Z cpu0:2068)FS3Misc: 1440: Long VMFS3 rsv time on 'ManagementLUN' (held for 118611 msecs). # R: 0, # W: 0 bytesXfer: 0 sectors
2014-09-03T09:53:25.837Z cpu2:2069)FS3Misc: 1440: Long VMFS3 rsv time on 'StorageLUN1' (held for 118599 msecs). # R: 0, # W: 0 bytesXfer: 0 sectors
2014-09-03T09:54:13.404Z cpu6:2067)FS3Misc: 1440: Long VMFS3 rsv time on 'VMLUN' (held for 150307 msecs). # R: 1, # W: 0 bytesXfer: 1 sectors
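Lines like these can be pulled out of the log with a simple filter. A sketch, shown against a sample line so the snippet is self-contained (on a live ESXi 5.x host you would grep /var/log/vmkernel.log directly, as the comment notes):

```shell
# Sample line standing in for /var/log/vmkernel.log content, so this runs anywhere.
sample='2014-09-03T09:51:50.682Z cpu0:2068)FS3Misc: 1440: Long VMFS3 rsv time'
# Match both the long-reservation and heartbeat-timeout messages.
printf '%s\n' "$sample" | grep -E 'Long VMFS3 rsv time|HBX.*Timeout'
# On a live host: grep -E 'Long VMFS3 rsv time|HBX' /var/log/vmkernel.log
```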
The reservation times to the LUNs were as high as 150 seconds (150,307 ms). These messages normally appear in the logs when the reservation time exceeds 200 ms, the maximum hold time VMware expects.
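To put those numbers in perspective, the "held for N msecs" values can be extracted and compared against the 200 ms threshold with a small awk one-liner. A sketch, fed the two sample messages inline so it is self-contained:

```shell
# Pull the "held for N msecs" value out of each message and report
# anything over the 200 ms threshold, converted to seconds.
awk 'match($0, /held for [0-9]+ msecs/) {
       ms = substr($0, RSTART + 9, RLENGTH - 15)   # just the digits
       if (ms + 0 > 200) printf "%.1f s over threshold\n", ms / 1000
     }' <<'EOF'
Long VMFS3 rsv time on 'ManagementLUN' (held for 118611 msecs).
Long VMFS3 rsv time on 'VMLUN' (held for 150307 msecs).
EOF
# → 118.6 s over threshold
# → 150.3 s over threshold
```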
Other messages in the vmkernel.log included:
HBX: 231: Reclaimed heartbeat for volume 4f5096b8-67f3f45e-74e7-d4bed9b5a69f (StorageLUN1): [Timeout] [HB state abcdef02 offset 3457024 gen 417 stampUS 14126371426996 uuid 532f5035-cecfff64-1597-d4bed9b5a69f jrnl <FB 9$
Waiting for timed out [HB state abcdef02 offset 3457024 gen 35 stampUS 14126368571853 uuid 532f5035-cecfff64-1597-d4bed9b5a69f jrnl
This pointed towards a storage issue. The Cisco switches were checked and no errors were appearing on the iSCSI interfaces. I then logged into the Dell MD3200i and again no errors were showing. However, as both hosts were having the same issue, the SAN was the main suspect. I called Dell support, who agreed nothing looked wrong on the face of it. They recommended upgrading the firmware on the storage controllers to the latest version.
An upgrade of the MDSM (Modular Disk Storage Manager) software was required first, before the controller updates. Once the controller firmware had gone from 07.80.41.60 -> 07.84.56.60, the errors on the ESXi hosts disappeared and the storage came back to life. A reboot of the affected VMs, followed by a file-system check, and we were back in business.
Dell weren't sure whether it was a bug in the firmware or simply the controller failover during the upgrade that resolved the issue.