I inherited a vSphere 4.1 environment attached to two Dell MD3000i arrays. One of the MD3000i units functions flawlessly, and both did before I made changes to the second.
Last month, I decommissioned the 10 LUNs served by the second SAN. I created 5 new LUNs and presented them to my cluster.
The datastore shows up, has paths over each of the iSCSI vmkernel ports, and looks fine.
This month, I created new guests on the datastore. Whenever there is any sort of IO from a guest on this datastore, I get an email containing the following:
([Event alarm expression: Lost Storage Connectivity] OR [Event alarm expression: Lost Storage Path Redundancy] OR [Event alarm expression: Degraded Storage Path Redundancy])
Occasionally I also receive:
Issue detected on esx03 in datacenter: ScsiDeviceIO: 2368:Failed write command to write-quiesced partition naa.60024e800070282800004dd04f7c6bf0:1
(34:06:21:27.996 cpu15:4438724)
The host logs:
Jun 7 07:26:43 esx03 vmkernel: 34:06:21:27.996 cpu15:4438724)Fil3: 1035: Sync WRITE error ('') (ioFlags: 16) : IO was aborted
Jun 7 07:26:43 esx03 vmkernel: 34:06:21:27.997 cpu4:6566)Fil3: 1035: Sync READ error ('.fbb.sf') (ioFlags: 8) : IO was aborted by VMFS via a virt-reset on the device
Jun 7 07:26:43 esx03 iscsid: Kernel reported iSCSI connection 9:0 error (1017) state (3)
Jun 7 07:26:43 esx03 vmkernel: 34:06:21:28.717 cpu7:4103)ScsiDeviceIO: 1688: Command 0x2a to device "naa.60024e800070282800004dd54f7c6cbe" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jun 7 07:26:43 esx03 vmkernel: 34:06:21:28.717 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027efd8340) to NMP device "naa.60024e800070282800004dd54f7c6cbe" failed on physical path "vmhba33:C0:T1:L45" H:0x2 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.
(plenty more of the same, for multiple LUNs)
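To get a sense of how widespread the aborts are, I've been tallying the NMP failures per device from a saved copy of the vmkernel log. A quick sketch (the sample file here just reuses two of the lines above; on the host I point the same pipeline at a copy of /var/log/vmkernel):

```shell
# Sample of the vmkernel lines from above, saved so the tally can run anywhere;
# on a real host, substitute a copy of /var/log/vmkernel for this file.
cat > vmkernel.log <<'EOF'
Jun 7 07:26:43 esx03 vmkernel: NMP: nmp_CompleteCommandForPath: Command 0x2a to NMP device "naa.60024e800070282800004dd54f7c6cbe" failed on physical path "vmhba33:C0:T1:L45"
Jun 7 07:26:44 esx03 vmkernel: NMP: nmp_CompleteCommandForPath: Command 0x2a to NMP device "naa.60024e800070282800004dd54f7c6cbe" failed on physical path "vmhba33:C0:T1:L45"
EOF

# Tally path-failure events per device NAA ID, most-affected LUN first.
grep 'nmp_CompleteCommandForPath' vmkernel.log \
  | grep -o 'naa\.[0-9a-f]\{1,\}' \
  | sort | uniq -c | sort -rn
```

With the two sample lines, this prints a count of 2 for that one NAA ID; against the full log it shows whether the aborts hit all five new LUNs evenly or cluster on a few.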
After receiving that error, I can see visible path failures in the VI Client. Within about four minutes, those errors clear up and all the pathing registers as healthy. The guests running on that datastore don't seem to notice any blips or problems.
VMware support told me to talk to my storage vendor.
Dell support diagnosed a failed controller [0,1], and replaced it.
The problems continue.
I have walked through the Dell MD3000i setup guide for iSCSI with ESX/ESXi to verify my configuration.
I have continuous vmkpings running between all hosts and the SAN controllers.
I have the current firmware on the SAN controllers.
I have the current 4.1 patches installed on all hosts.
I have switched hosts, switched paths, and tried both MRU and Round Robin path policies, and the errors continue.
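For reference, the path checks and policy switches above were done with the 4.1 console tools, roughly as sketched below (the controller IP is a placeholder for my actual iSCSI target ports; the device ID is one of the LUNs from the logs). These only run in Tech Support Mode on the host itself:

```shell
# Placeholder controller IP; I run one of these per controller port per host.
vmkping 192.168.130.101

# Brief list of every path and its current state (active/dead/standby).
esxcfg-mpath -b

# Show the current path selection policy (MRU vs RR) per device.
esxcli nmp device list

# Switch one device to Round Robin (4.x syntax); reverse with VMW_PSP_MRU.
esxcli nmp device setpolicy --device naa.60024e800070282800004dd54f7c6cbe \
    --psp VMW_PSP_RR
```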
I'm hoping for community suggestions, or maybe other Dell MD3000i customers who have run into something similar can tell me what I'm missing.
Thanks.