Intermittent Loss of Connectivity to Storage – ESXi

Network:  A small private cloud containing two ESXi 5.0 hosts and a Dell PowerVault SAN connected via stacked Cisco 3750s. A pRDM attached to one VM is mapped directly through to a QNAP TS-859U-RP+ on the storage network. Each ESXi host runs three VMs.

Scenario:  All three VMs on ESXi1 became unavailable. They still responded to ping, but all other services were down and the VMs would not restart or reset.

Upon investigation, both ESXi hosts were showing large numbers of the following messages in the Events tab:

Lost access to volume xxxxx.xxxxxxxx.xxxxxxxx (datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Successfully restored access to volume xxxxx.xxxxxxxx.xxxxxxxx (datastore) following connectivity issues.
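The volume UUID in these events can be mapped back to a datastore name and backing device from the ESXi shell; a quick sketch using standard ESXi 5.x commands:

esxcli storage vmfs extent list   # lists each datastore with its VMFS UUID and backing naa device
esxcfg-scsidevs -m                # similar device : VMFS UUID : datastore name mapping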

It appeared that all the guests on ESXi1 had lost access to their disks (on the Dell PowerVault). The SAN, switches, cabling and hosts were all checked but nothing looked out of place. After much head scratching, and since both ESXi hosts were suffering the same issue, the logical conclusion was a switch or SAN problem.

Looking in the VMkernel log we saw a lot of these messages on both ESXi hosts:

2014-03-05T23:26:09.032Z cpu2:2050)ScsiDeviceIO: 2305: Cmd(0x4124007c4200) 0x12, CmdSN 0x40aeb2 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.

2014-03-05T23:26:09.036Z cpu2:2050)ScsiDeviceIO: 2305: Cmd(0x4124007c4200) 0x12, CmdSN 0x40aeb2 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.

2014-03-05T23:26:09.077Z cpu6:2054)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x12 (0x4124007c4200) to dev "naa.xxxxxxxxxxxxxxxxxxx" on path "vmhba42:C0:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.Act: NONE
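For context, Cmd 0x12 in these entries is a SCSI INQUIRY. To confirm which device a given naa ID (and runtime path such as vmhba42:C0:T1:L0) belongs to, something like the following can be run from the ESXi shell (the naa ID below is a placeholder):

esxcfg-scsidevs -l                           # long listing with vendor, model and display name for each device
esxcli storage core path list -d naa.xxxxx   # shows the runtime paths (vmhbaX:C0:TY:LZ) for that device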

The VMware error code for this is:

[SCSI Code]: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0
ASC/ASCQ:    24h / 00h
ASC Details:    INVALID FIELD IN CDB
Status:    02h – CHECK CONDITION

Description:
When the target returns a Check Condition in response to a command it is indicating that it has entered a contingent allegiance condition. This means that an error occurred when it attempted to execute a SCSI command. The initiator usually then issues a SCSI Request Sense command in order to obtain a Key Code Qualifier (KCQ) from the target. [Reference]
Host byte:    0x00 – DID_OK – No error
Sense key:    Bh
Sense key MSG:    ABORTED COMMAND
Sense key Notes:    Indicates that the device server aborted the command. The application client may be able to recover by trying the command again.
Plugin notes:    VMK_SCSI_PLUGIN_GOOD – No error.

The device these errors related to was the QNAP box. A LUN reset (vmkfstools -L lunreset naa.xxxxxx.xxxxx.xxx) and a reboot of the QNAP box failed to resolve the issue. A hexdump (hexdump -C naa.xxxxx.xxxxx.xxxxxx) on ESXi2 showed everything was fine at disk level; on ESXi1 the same command just hung. The command to show the multipath information (esxcfg-mpath -b) on ESXi1 also hung.
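For reference, vmkfstools and hexdump take the full device path under /vmfs/devices/disks/; a rough sketch of the syntax from the ESXi shell, with placeholder naa IDs:

vmkfstools -L lunreset /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx   # issue a LUN reset to the device
hexdump -C /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx | head        # read the start of the device as a basic sanity check
esxcfg-mpath -b -d naa.xxxxxxxxxxxxxxxx                           # list only the paths to that device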

The VMkernel log had not updated for three weeks on ESXi1. On further investigation it appeared that there was a fault on the QNAP device and the two ESXi hosts were being inundated with iSCSI communication errors over the storage network fabric. These incessant, repetitive messages filled up the VMkernel buffers, which in turn caused intermittent connectivity issues accessing the LUNs located on the Dell PowerVault.
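A quick way to spot a stalled log is to check its timestamps from the ESXi shell, for example:

ls -l /var/log/vmkernel.log      # the last-modified time shows when the VMkernel last wrote to the log
tail -n 5 /var/log/vmkernel.log  # the most recent entries - on ESXi1 these were three weeks old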

ESXi2 managed to survive the intermittent loss of iSCSI connections and all VMs on that host remained available. However, the timeout limits on ESXi1 were exceeded, leaving its VMs in an unusable state.

ESXi1 was barely responding to any commands, and attempts to reboot it from the GUI and the CLI (reboot -f) just hung. In the end a hard reboot was initiated and the VMs automatically vMotioned to ESXi2. They then had to be individually rebooted and fortunately all came back up. The pRDM was removed from the VM and the iSCSI path to the QNAP was then detached on both ESXi hosts, after which the errors disappeared.
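Detaching the device can be done from the vSphere Client (Devices view, right-click > Detach) or from the ESXi shell; a minimal sketch, assuming the naa ID of the QNAP LUN (placeholder below):

esxcli storage core device list                                      # identify the QNAP LUN by its vendor/model and naa ID
esxcli storage core device set --state=off -d naa.xxxxxxxxxxxxxxxx   # detach (take offline) the device on this host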

QNAP support could not find any errors on their device, but the recommendation was to upgrade the firmware to the latest version.