Windows Clustering on physical hardware is a pain at the best of times. Just getting it to work can take a fair bit of trial and effort… with a whole lot of luck. Getting clustering to work in VMware is just cruel.
So when tasked to create a VM of a physical Windows Cluster for a test environment, boy was I excited! {Sarcasm sign}.
Actually creating the VM within ESX wasn’t that difficult. Using Converter I created a VM of the OS. Then using our DELL EqualLogic SAN I made clone copies of the cluster volumes. I presented those volumes to the newly created VM as RDMs. The process seemed to work really well up to a point. The OS booted up and I could see all my presented volumes. Issues began when I tried to start the Clustering Service and take it out of manual mode. Out of the 6 volumes I had, only one would ever come Online while all the others would (after some time) fail.
I spent days working through the issue (I’m pretty sure this is why I’m balding). Articles seemed to lead me to DISKPART: changing the SAN Online Policy, manually onlining the disk, clearing the READONLY attribute. None of these seemed to work, I’m assuming because there was an attribute that marked the disk as Clustered and prevented me from making any changes. By then I suspected I was on the wrong ‘path’ and began looking for a lower level issue at the ESX level.
The crux of my issue turned out to be an iSCSI multipathing problem. DELL EqualLogic SANs run in an Active / Active pathing method where I/O is sent over all paths. DELL has a third party Storage API plugin for ESXi that changes the default behaviour of how multipathing works. This is normally a good thing but for Windows Clustering in ESX… this is bad.
The fix is fairly simple. The steps below are a rough outline of how to identify and change the multipathing policy.
Using vSphere vCenter, the changes are made under Storage Adapters on the host’s Configuration tab. In this case it’s the iSCSI Software Adapter.
In the bottom pane select the paths view. Expand the Target column and identify one of the cluster volumes with issues. In this example I have a Dead path due to a recently removed SAN volume which is safe to ignore. The one below is of interest as it’s one of the clustered volumes. Remember the Runtime Name in the left column.
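If it helps to know what you are remembering, the Runtime Name encodes the adapter, channel, target and LUN of a path in the form vmhbaN:C#:T#:L#. Below is a minimal Python sketch that splits one apart. The function name and the sample value vmhba37:C0:T1:L0 are my own illustrations, not taken from my environment.

```python
import re
from typing import NamedTuple


class RuntimeName(NamedTuple):
    """The four parts of an ESX path Runtime Name."""
    adapter: str   # e.g. "vmhba37", the storage adapter
    channel: int   # the C# part
    target: int    # the T# part
    lun: int       # the L# part


# Matches names of the form vmhbaN:C#:T#:L#
_RUNTIME_RE = re.compile(r"^(vmhba\d+):C(\d+):T(\d+):L(\d+)$")


def parse_runtime_name(name: str) -> RuntimeName:
    """Split a Runtime Name into adapter, channel, target and LUN."""
    match = _RUNTIME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a runtime name: {name!r}")
    adapter, channel, target, lun = match.groups()
    return RuntimeName(adapter, int(channel), int(target), int(lun))


if __name__ == "__main__":
    # Hypothetical example value, for illustration only.
    print(parse_runtime_name("vmhba37:C0:T1:L0"))
```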
Change to the Devices view and locate the Runtime Name. Right click on this device and select Manage Paths. In this example DELL_PSP_EQL_ROUTED was selected as default. Changing this to Most Recently Used (VMware) sends I/O only ever down one path. The change is immediate. As my volumes are offline I can safely make the changes. On a working production volume I wouldn’t be making path selection changes during business hours.
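With more than a handful of volumes, the same change can be made from the ESXi command line rather than the GUI. The following is a rough Python sketch, not something I ran as part of this fix: it reads the text printed by "esxcli storage nmp device list" and builds one "esxcli storage nmp device set" command for each device still using DELL_PSP_EQL_ROUTED. I am assuming the usual output layout (an unindented device ID followed by indented "Key: value" lines) and that VMW_PSP_MRU is the name of the Most Recently Used policy on your build. The device IDs in the sample are made up.

```python
from typing import Dict, List

# Made-up sample in the assumed layout of `esxcli storage nmp device list`.
SAMPLE = """\
naa.6090a0280000000000000000000000a1
   Device Display Name: EQLOGIC iSCSI Disk (naa.6090a0280000000000000000000000a1)
   Storage Array Type: VMW_SATP_EQL
   Path Selection Policy: DELL_PSP_EQL_ROUTED
   Working Paths: vmhba37:C0:T1:L0

mpx.vmhba0:C0:T0:L0
   Device Display Name: Local Disk (mpx.vmhba0:C0:T0:L0)
   Storage Array Type: VMW_SATP_LOCAL
   Path Selection Policy: VMW_PSP_FIXED
   Working Paths: vmhba0:C0:T0:L0
"""


def parse_nmp_device_list(output: str) -> Dict[str, Dict[str, str]]:
    """Map each device ID to its 'Key: value' fields."""
    devices: Dict[str, Dict[str, str]] = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue  # blank separator between devices
        if not line[0].isspace():
            # Unindented line: start of a new device section.
            current = line.strip()
            devices[current] = {}
        elif current is not None and ":" in line:
            # Indented line: split on the first colon only.
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices


def mru_commands(output: str, from_psp: str = "DELL_PSP_EQL_ROUTED") -> List[str]:
    """Build esxcli commands that switch matching devices to VMW_PSP_MRU."""
    commands = []
    for device, fields in parse_nmp_device_list(output).items():
        if fields.get("Path Selection Policy") == from_psp:
            commands.append(
                f"esxcli storage nmp device set --device {device} --psp VMW_PSP_MRU"
            )
    return commands


if __name__ == "__main__":
    for command in mru_commands(SAMPLE):
        print(command)
```

The sketch only prints the commands rather than running them, so each one can be checked against the cluster volumes before it goes anywhere near a host. The same caution applies as in the GUI: the change takes effect immediately.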
Back over on the Windows Cluster VM I can now restart the Clustering Service and have it correctly Online all the volumes.
MSCS is quite in depth and not for the faint hearted, nor something to configure just before you head home for the night. Virtualising MSCS requires extra planning and thought on top of the regular planning.
Appendix
VMware -- Setup for Failover Clustering and Microsoft Cluster Service