After upgrading a View installation for a client this weekend from View 4.0 to View 5.0, all seemed well. The upgrade process took them from vSphere 4.0U2 to vSphere 5.0U1 in the bargain – about 15-20 hours of work including backups and staging. Testing and the first 24 hours of production went swimmingly with no negative reports or hiccups. (The upgrade process and the spectres of dead pilots-turned-production are an issue for another blog post.)
Then I got a call about vCenter 5.0 dying (and then magically working again, within a couple of minutes or so, before the local admin could get to it). Two mysteries: one easy, one VERY frustrating…
Mystery One – vCenter Dies and Comes Back to Life
This was the easy one: the VMware VirtualCenter Server service is set to a “300000 millisecond” recovery delay upon failure by default. The local site admin didn’t have his prayer answered; the system just recovered as planned. (Note to upgraders – set the recovery delay to whatever hold-down time your site needs, probably no less than 120000 milliseconds.)
The VMware VirtualCenter Server service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 300000 milliseconds: Restart the service.
– Service Control Manager
Why would five minutes (yep, 300000 milliseconds) be a good amount of recovery time? The Socratic answer is this: how long will it take for all of the vCenter log and dump files to be written in your environment? In this case, the dump file was about 500MB in size, with about another 150MB in various other logs. At a “leisurely pace” of 5 MB/sec (let’s assume the worst), writing those 650MB takes 130 seconds, or a little over two minutes of “hold time” before restart.
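For anyone who wants to redo that arithmetic for their own site, here is a minimal sketch. The function name and parameters are my own invention for illustration; the 500MB, 150MB and 5 MB/sec figures are the ones from this incident, and your numbers will differ.

```python
import math


def hold_time_seconds(dump_mb, other_logs_mb, write_mb_per_sec):
    """Seconds needed to write the dump plus logs at a sustained rate, rounded up."""
    if write_mb_per_sec <= 0:
        raise ValueError("write rate must be positive")
    return math.ceil((dump_mb + other_logs_mb) / write_mb_per_sec)


# Figures from this incident: 500MB dump, 150MB of other logs, 5 MB/sec worst case.
seconds = hold_time_seconds(500, 150, 5)
print(seconds)         # 130
print(seconds * 1000)  # 130000, the value in the units the service recovery delay uses
```

Anything you set below that figure risks restarting the service while the dump is still being written, so pad it rather than trim it.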
Mystery Two – vCenter Died. Why?
Here’s the problem: vCenter needs to be bulletproof. vCenter’s installer asks for your environment’s size during installation and sets parameters to accommodate the basic needs. Also, during the SQL upgrade process from vCenter 4.0 to 5.0, the database’s recovery model is changed from SIMPLE (the recommended setting for vCenter) to BULK_LOGGED, but just for the duration of the upgrade. After the upgrade it’s reset back to SIMPLE.
Fast forward 48 hours. vCenter is running with a couple of hundred virtual machines in a View environment and is tracking all of that lovely host and performance data we appreciate when dealing with complex enterprise systems. It’s happily responding to View Connection Server’s requests for power-ons and power-offs when all of a sudden the worst happens: it crashes!
Suddenly, tens of thousands of dollars’ worth of infrastructure is waiting on a 5 minute recovery interval, and View logins requiring VM power-ons won’t happen until then. All is not right in your virtual world now, buckaroo! Let’s see if Windows Event Viewer can elicit a solution:
The description for Event ID 1000 from source VMware VirtualCenter Server cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
Log directory: C:\ProgramData\VMware\VMware VirtualCenter\Logs.
the message resource is present but the message is not found in the string/message table
– Event Viewer, Application Log
Okay, Event ID 1000 – there’s got to be a KB on that one, but seriously, ID 1000 sounds pretty generic for me to have a ton of hope. But sure enough, VMware Knowledge Base immediately coughs up KB article 1015101, applicable to vCenter 5.0. Unfortunately, this vCenter Server is not installed on an IIS platform, so this is just an empty rabbit hole…
Next, let’s have a look at the vCenter Server logs (thoughtfully pointed to in the Event log, above) for vCenter at-or-around the time of failure. Sure enough, there is a gzipped log with the restart time stamp available. A quick glance at the end of the log shows the following “impending doom” quality message:
--> Panic: TerminateHandler called
--> backtrace rip 000000018013deba (no symbol)
--> backtrace rip 0000000180101518 (no symbol)
--> backtrace rip 00000000708f2fdf (no symbol)
--> backtrace rip 00000000708f3080 (no symbol)
– vCenter vpxd-X.log file
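If you would rather not unpack the log by hand, a few lines of Python will read the tail of a gzipped log directly. This is only a sketch: the function names are mine, the sample file is a stand-in written to a temp folder, and the “SQL execution failed” search string is my shorthand for the error rather than a guaranteed verbatim log line.

```python
import gzip
import os
import tempfile
from collections import deque


def tail_gzipped_log(path, count=20):
    """Return the last `count` lines of a gzip-compressed text log."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the newest lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def lines_mentioning(lines, needles=("Panic", "SQL execution failed")):
    """Keep only the lines containing one of the search strings (assumed wording)."""
    return [line for line in lines if any(needle in line for needle in needles)]


# Demo on a tiny stand-in file; a real run would point at a rotated vpxd log instead.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.log.gz")
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        handle.write("routine line\nSQL execution failed\n--> Panic: TerminateHandler called\n")
    print(lines_mentioning(tail_gzipped_log(sample, count=2)))
```

Raise `count` generously on a real log, since the interesting diagnostic usually sits well above the backtrace.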
But a sobering look above the doomsday report gives us a better idea as to the real culprit: SQL execution failed. What? Did I hear you whisper “kill your DBA?” Before walking down to the DBA and calling him out for leaving you in the lurch, let’s visit the SQL logs to find out (perhaps you will have to talk to the DBA after all if your vCenter admins don’t have access to SQL logs in the environment.) Here’s what my SQL log for the vCenter database said:
05/01/2012 08:05:21,spid62,Unknown,The transaction log for database 'VIM_VCDB' is full. To find out why space in the log cannot be reused<c/> see the log_reuse_wait_desc column in sys.databases
05/01/2012 08:05:21,spid62,Unknown,Error: 9002<c/> Severity: 17<c/> State: 4.
05/01/2012 08:00:04,spid75,Unknown,The transaction log for database 'VIM_VCDB' is full. To find out why space in the log cannot be reused<c/> see the log_reuse_wait_desc column in sys.databases
05/01/2012 08:00:04,spid75,Unknown,Error: 9002<c/> Severity: 17<c/> State: 4.
– Microsoft SQL Server Log for VIM_VCDB (vCenter)
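If you end up with an exported SQL log like the one above, a quick script can tally which databases reported a full transaction log. This is a rough sketch that assumes the export layout shown here (date, source, severity, message, with commas inside the message encoded as `<c/>`); the function name is hypothetical.

```python
import re
from collections import Counter

LOG_FULL = re.compile(r"The transaction log for database '([^']+)' is full")


def databases_with_full_log(lines):
    """Count 'transaction log is full' messages per database name."""
    counts = Counter()
    for line in lines:
        # Split on the first three commas only; the rest is the message text.
        parts = line.split(",", 3)
        if len(parts) < 4:
            continue
        message = parts[3].replace("<c/>", ",")
        match = LOG_FULL.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE = [
    "05/01/2012 08:05:21,spid62,Unknown,The transaction log for database 'VIM_VCDB' is full. To find out why space in the log cannot be reused<c/> see the log_reuse_wait_desc column in sys.databases",
    "05/01/2012 08:05:21,spid62,Unknown,Error: 9002<c/> Severity: 17<c/> State: 4.",
    "05/01/2012 08:00:04,spid75,Unknown,The transaction log for database 'VIM_VCDB' is full. To find out why space in the log cannot be reused<c/> see the log_reuse_wait_desc column in sys.databases",
    "05/01/2012 08:00:04,spid75,Unknown,Error: 9002<c/> Severity: 17<c/> State: 4.",
]

print(dict(databases_with_full_log(SAMPLE)))  # {'VIM_VCDB': 2}
```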
Note that something to this effect also shows up as a diagnostic message inside the vCenter log – reducing the number of times you need to traipse down to the DBA’s cubby for a chat. Okay, that clinches it, the DBA’s been meddling in my vCenter database again – probably with some unscheduled and undocumented maintenance. We’re definitely going to have that talk now, right? Nope.
Remember that upgrade we did 48 hours ago? As part of the upgrade process, the database is upgraded from the vCenter 4.0 format to the more information-rich vCenter 5.0 format. Along the way, the upgrade process changes the SQL database’s recovery model from the preferred “SIMPLE” to “BULK_LOGGED” so that a failed upgrade can be more easily rolled back.
The BULK_LOGGED recovery model can create a HUGE transaction log during a vCenter upgrade process. There are MANY posts about the TLOG filling up during these processes, with a consensus that the TLOG needs to be allowed to grow to at least 4x the size of your vCenter database or the process will not complete.
You’ve been warned.
In the case of this upgrade, I happen to know that the TLOG was allowed to grow to at least 4x the size of the vCenter database PRIOR to the upgrade process. In fact, during the final stage of this upgrade it grew to 1.5x the vCenter database size. What was unknown to me – until now – is that the TLOG’s maximum allowed growth was reset to 500MB when the database was returned to “SIMPLE” mode. During a time of high activity (perhaps processing the last 24 hours of data) the TLOG needed to exceed that amount, couldn’t, and vCenter crashed accordingly. The simple fix is to increase the TLOG limit back to the original setting that works well for the environment.
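To make that check repeatable, here is a small sketch that compares a TLOG cap against a multiple of the database size and prints the T-SQL to raise it. Treat it as illustration only: the 2000MB database size and the logical log file name `VIM_VCDB_log` are made up for the example (look up your own logical file name first), and the 4x factor is the upgrade-time rule of thumb mentioned above, not a VMware-documented requirement. Only the database name and the 500MB cap come from this incident.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def required_cap_mb(db_size_mb, factor=4):
    """TLOG cap implied by the 'N times the database size' rule of thumb."""
    return db_size_mb * factor


def is_cap_sufficient(current_cap_mb, db_size_mb, factor=4):
    """True when the current TLOG cap meets or exceeds the rule-of-thumb size."""
    return current_cap_mb >= required_cap_mb(db_size_mb, factor)


def alter_maxsize_sql(database, logical_log_name, max_mb):
    """Build an ALTER DATABASE ... MODIFY FILE statement that sets MAXSIZE in MB."""
    for name in (database, logical_log_name):
        # Refuse anything that is not a plain identifier rather than try to escape it.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        f"ALTER DATABASE [{database}] "
        f"MODIFY FILE (NAME = {logical_log_name}, MAXSIZE = {max_mb}MB);"
    )


db_size_mb = 2000  # hypothetical database size, not from this incident
print(is_cap_sufficient(500, db_size_mb))  # False
print(alter_maxsize_sql("VIM_VCDB", "VIM_VCDB_log", required_cap_mb(db_size_mb)))
```

The script only prints the statement; running it against the database is a job for you (or that DBA you were about to yell at) once the numbers have been sanity-checked.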
Ouch! Someone feels set up for failure. I never want to hear a customer say, “Gosh, everything was great until I logged into vCenter [with the vSphere Client] and then, all of a sudden, things went sideways” – especially when the cause is that SQL Server has been silently modified with a setting known to make it choke, bringing vCenter to a crashing halt.
VMware: if you’re modifying my database parameters POST INSTALL, you need to WARN ME or post it in the install or upgrade docs. I’ve combed them and can’t find it… Let’s get the upgrade process modified so that the database settings are restored after the database is returned to SIMPLE mode, okay?
Updated 05/02/2012: Corrected intro grammar. Link to TLOG upgrade issue added.