Thanks to a spate of upgrades to vSphere 5.1, I recently (re)discovered the following inconvenient result when applying an update to a DRS cluster from Update Manager (22.214.171.12471, using vCenter Server Appliance 5.1.0 build 947673):
Remediate entity ‘vm11.solori.labs’ Host has VMs ‘View-PSG’ , vUM5 with connected removable media devices. This prevents putting the host into maintenance mode. Disconnect the removable devices and try again.
Immediately I thought: “Great! I left a host-only ISO connected to these VMs.” However, that assumption was as flawed as Update Manager’s assumption that the workloads cannot be vMotion’d without disconnecting the removable media. In fact, the removable media indicated was connected to a shared ISO repository available to all hosts in the cluster. Still, I was to blame and not Update Manager: I had forgotten that Update Manager’s default response to removable media is to abort the process. Since cluster remediation is a powerful feature made possible by Distributed Resource Scheduler (DRS) in Enterprise (and above) vSphere editions, and one that may be new to many (especially uplifted “Advanced AK” users), it seemed like something worth reviewing and blogging about.
Why is this a big deal?
More to the point, why does this seem to run contrary to a “common sense” response?
First, the manual process for remediation of a host in a DRS cluster would include:
- Applying “Maintenance Mode” to the host,
- Selecting the appropriate action for “powered-off and suspended” workloads, and
- Allowing DRS to choose placement and finally vMotion those workloads to an alternate host.
In the case of VMs with removable media attached, this set of actions will result in the workloads being vMotion’d (without warning or hesitation) so long as the other hosts in the cluster have access to the removable media source (i.e. shared storage, not “Host Device”). However, in the case of Update Manager remediation, the following are documented road blocks to a successful remediation (without administrative override):
- A CD/DVD drive is attached (any method),
- A floppy drive is attached (any method),
- HA admission control prevents migration of the virtual machine,
- DPM is enabled on the cluster,
- EVC is disabled on the cluster,
- DRS is disabled on the cluster (preventing migration),
- Fault Tolerance (FT) is enabled for a VM on the host in the cluster.
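To make the list above concrete, here is a minimal Python sketch of a pre-check that reports the same road blocks. It is only an illustration of the rules as listed, not how Update Manager implements them: the `VM` and `Cluster` classes, their field names, and `remediation_blockers` are all hypothetical, and no vSphere API is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VM:
    # Hypothetical, simplified view of a VM; not a vSphere API object.
    name: str
    cdrom_attached: bool = False
    floppy_attached: bool = False
    ft_enabled: bool = False


@dataclass
class Cluster:
    # Hypothetical, simplified view of the cluster settings that matter here.
    drs_enabled: bool = True
    evc_enabled: bool = True
    dpm_enabled: bool = False
    ha_admission_control_blocks_migration: bool = False
    vms: List[VM] = field(default_factory=list)


def remediation_blockers(cluster: Cluster) -> List[str]:
    """Return the road blocks from the list above that apply to this cluster."""
    blockers = []
    for vm in cluster.vms:
        # Any attached CD/DVD or floppy counts, regardless of its backing.
        if vm.cdrom_attached:
            blockers.append(f"{vm.name}: CD/DVD drive attached")
        if vm.floppy_attached:
            blockers.append(f"{vm.name}: floppy drive attached")
        if vm.ft_enabled:
            blockers.append(f"{vm.name}: Fault Tolerance enabled")
    if cluster.ha_admission_control_blocks_migration:
        blockers.append("cluster: HA admission control prevents migration")
    if cluster.dpm_enabled:
        blockers.append("cluster: DPM enabled")
    if not cluster.evc_enabled:
        blockers.append("cluster: EVC disabled")
    if not cluster.drs_enabled:
        blockers.append("cluster: DRS disabled")
    return blockers


if __name__ == "__main__":
    lab = Cluster(vms=[VM("View-PSG", cdrom_attached=True),
                       VM("vUM5", cdrom_attached=True)])
    for reason in remediation_blockers(lab):
        print(reason)
```

Note that the check never asks where the ISO lives, which is exactly why a shared-repository ISO trips it.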
Therefore it is “by design” that a scheduled remediation would have failed – even if the removable media would be eligible for vMotion. To assist in the evaluation of “obstacles to successful deferred remediation” a cluster remediation report is available (see below).
In fact, the report will list all possible road blocks to remediation whether or not matching overrides are selected (potentially misleading, and certainly not useful for predicting the outcome of the remediation attempt). While this too is counterintuitive, it serves as a reminder of the show-stoppers to successful remediation. For the offending “removable media” override, the appropriate check-box can be found on the options page just prior to the remediation report:
The inclusion of this override allows Update Manager to slog through the remediation without respect to the attached status of removable media. Likewise, the other remediation overrides will enable successful completion of the remediation process; these overrides are:
- Maintenance Mode Settings:
  - VM Power State prior to remediation: Do not change, Power off, Suspend;
  - Temporarily disable any removable media devices;
  - Retry maintenance mode in case of failure (delay and attempts);
- Cluster Settings:
  - Temporarily Disable Distributed Power Management (forces “sleeping” hosts to power-on prior to next steps in remediation);
  - Temporarily Disable High Availability Admission Control (allows for host remediation to violate host-resource reservation margins);
  - Temporarily Disable Fault Tolerance (FT) (admonished to remediate all cluster hosts in the same update cycle to maintain FT compatibility);
  - Enable parallel remediation for hosts in cluster (will not violate DRS anti-affinity constraints);
    - Automatically determine the maximum number of concurrently remediated hosts, or
    - Limit the number of concurrent hosts (1-32);
  - Migrate powered off and suspended virtual machines to other hosts in the cluster (helpful when a remediation leaves a host in an unserviceable condition);
- PXE Booted ESXi Host Settings:
  - Allow installation of additional software on PXE booted ESXi 5.x hosts (requires the use of an updated PXE boot image – Update Manager will NOT reboot the PXE booted ESXi host.)
These settings are available at the time of remediation scheduling and as host/cluster defaults (Update Manager Admin View.)
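As a rough way to reason about which overrides clear which road blocks, the pairing can be written down as a small lookup. This is a sketch based only on the lists in this post: the short labels and the `remaining_blockers` helper are my own hypothetical names, not Update Manager identifiers. Notice that none of the overrides listed address a cluster with EVC or DRS disabled.

```python
# Hypothetical labels for the overrides and road blocks discussed in this post.
OVERRIDE_CLEARS = {
    "disable_removable_media": {"cdrom_attached", "floppy_attached"},
    "disable_dpm": {"dpm_enabled"},
    "disable_ha_admission_control": {"ha_admission_control"},
    "disable_ft": {"ft_enabled"},
}


def remaining_blockers(found, overrides):
    """Return the road blocks still standing after the chosen overrides."""
    cleared = set()
    for name in overrides:
        # Unknown override names clear nothing.
        cleared |= OVERRIDE_CLEARS.get(name, set())
    return sorted(set(found) - cleared)


if __name__ == "__main__":
    found = {"cdrom_attached", "dpm_enabled", "evc_disabled"}
    print(remaining_blockers(found, ["disable_removable_media"]))
```

With only the removable media override selected, the DPM and EVC road blocks in this made-up example would still be standing.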
SOLORI’s Take: So while it follows that the remediation process is NOT as similar to the manual process as one might think, it still can be made to function accordingly (almost.) There IS a big difference between disabling removable media and making vMotion-aware decisions about hosts. Perhaps VMware could take a few cycles to determine whether or not a host is bound to a removable media device (either through Host Device or local storage resource) and make a more intelligent decision about removable media.
vSphere already has the ability to identify point-resource dependencies, so it would be nice to see this information more intelligently correlated where cluster management is concerned. Currently, instead of “asking” DRS for a dependency list, it seems to just ask the hosts “do you have removable media plugged into any VMs?” – and if the answer is “yes” it stops right there… Still, not very intuitive for a feature (DRS) that’s been around since Virtual Infrastructure 3 and vCenter 2.
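For what it’s worth, the “more intelligent decision” wished for above might look something like the following Python sketch. It is purely illustrative and assumes a simplified model: `MediaDevice`, its `backing` values, and `host_bound_media` are hypothetical names, and nothing here reflects how Update Manager or DRS actually evaluates devices.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class MediaDevice:
    # Hypothetical description of a connected CD/DVD or floppy device.
    vm: str
    backing: str          # assumed values: "host_device" or "datastore_iso"
    datastore: str = ""   # only meaningful for "datastore_iso"


def host_bound_media(devices: Iterable[MediaDevice],
                     shared_datastores: Set[str]) -> List[MediaDevice]:
    """Return only the devices that tie a VM to its current host."""
    bound = []
    for dev in devices:
        on_shared_iso = (dev.backing == "datastore_iso"
                         and dev.datastore in shared_datastores)
        # Host Device, local-datastore ISOs, and any backing we do not
        # recognize are treated conservatively as host-bound.
        if not on_shared_iso:
            bound.append(dev)
    return bound


if __name__ == "__main__":
    devices = [
        MediaDevice("View-PSG", "datastore_iso", "iso-repo"),
        MediaDevice("vUM5", "host_device"),
    ]
    for dev in host_bound_media(devices, {"iso-repo"}):
        print(f"{dev.vm} would need its media disconnected")
```

In this made-up example only the VM using a Host Device is flagged; the one mounted from the shared ISO repository would be free to vMotion.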