Virtualisation: Learning The Hard Way
Featured, GestaltIT, Virtualisation | By Chris Evans on January 20, 2010 at 10:37 AM
They say that you learn the most when you make mistakes and things go wrong. Well, last night I certainly must have learned a lot. What started as a simple physical re-organisation of my hardware turned into a rebuild of my production VMware ESXi server – finishing at 1am. Here’s what happened.
Failing Disk
I started by shutting down my production ESXi server, taking it out of the standard rack it occupies and putting it back in. On power up, the server failed to boot, claiming the boot disk was no longer present. A quick check inside showed that the SAS connector on the boot disk had come loose, so I plugged it back in and tried again (Oh, SAS specification guys – bad design, no retainers on the plugs). Unfortunately, the boot disk had somehow become corrupted and the server wouldn’t come up. No problem, I thought, just repair using the installation media. This is where things started to get complicated.
My ESXi server runs off a Seagate Savvio 2.5″ 15K 73GB drive, one of four Seagate generously loaned me last year for long term testing. More on that another day. The server has two disks installed, one of which has VMs on it. During the repair process I wasn’t sure which disk was the O/S and which was data. ESXi doesn’t help much, only indicating that both disks contained data in partitions, data that would be lost if I reinstalled.
Lesson 1 – Make sure you know exactly how your hardware is configured, down to the SAS ports each drive is plugged into.
Actually, having multiple drives of the same type is a pain. So rather than risk data loss, I removed both drives and re-installed the ESXi O/S on a third Savvio drive. All good. Now I needed to locate and import all my VMs; however, some were on the removed Savvio disks. This meant installing each disk independently and checking the contents to determine which contained VMs and which contained the broken O/S.
Lesson 2 – Wherever possible, place your VMs on disks separate from the server itself.
Yes, I do have most of my VMs on my Iomega ix4-200d, but, rather crucially, not my Windows 2008 AD Server, which needed to be moved from internal disk to the ix4 before I continued (schoolboy error there). The AD server was rather important for accessing my, ahem, ix4, which is configured to validate logins using AD. This creates a bit of a circular reference which could have been a disaster.
Lesson 3 – Place your Windows domain controller on a physical server, or have another independent backup elsewhere.
Having a physical server just for AD control isn’t part of my total virtualisation plan, so I’m looking at whether I can host a backup controller with Amazon AWS and use VPN to secure it into my private network. This way, if I ever have an issue, I can still authenticate. The issue of course is cost, which may make a dedicated server the cheaper option.
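The circular reference behind Lesson 3 is the sort of thing a quick dependency check can catch before it bites. Here is a minimal sketch in Python, assuming a hand-written map of "what needs what to start"; the names `ad_vm`, `ix4` and `esxi` are made up for illustration and are not read from any real configuration.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    deps maps each component to the components it needs in order to start.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            s = state.get(nxt, WHITE)
            if s == GREY:
                # nxt is already on the current path, so we have looped back
                return path[path.index(nxt):] + [nxt]
            if s == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(deps):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical model of the setup in this post: the AD server VM stored on
# the ix4, while the ix4 validates logins against AD.
deps = {
    "ad_vm": ["esxi", "ix4"],
    "ix4": ["ad_vm"],
    "esxi": [],
}
print(find_cycle(deps))
```

Giving the domain controller somewhere to live that does not depend on the ix4 (a physical box, or a second controller elsewhere) removes the `ix4` entry from the `ad_vm` list, and the cycle disappears.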
So, by 1am everything was back up and running. Did I learn anything else? Well yes…
Lesson 4 – After 22 years in IT, I should remember that adequate documentation and a DR plan are crucial. In fact, in a virtualised environment they are essential, because placing all systems on a single server concentrates the risk.
So what next for my virtual infrastructure? I have a few changes planned: I’ll create a backup ESXi server that can import and run the VMs in the event of a future server failure. I will also be investigating AWS with Windows 2008 and VPN to create a backup domain controller and see if I can continue to work if both servers’ hardware fails.
That leaves one Single Point of Failure… my ix4-200d. Anyone want to donate me a spare one?
Tags: ESXi, iomega, ix4-200d, SAS, Savvio, Seagate, VMware



6 Comments
Very entertaining story Chris. It’s especially entertaining when it happens to someone else!
Not to divert from the very good message/example of documentation for DR plans… but there are some ways you could have gotten things back relatively easily.
All you should have had to do was look at the partition table in whatever program you wanted. You should have seen different partition layouts depending on whether the entire drive was used or not. If you had a drive used just for VMs, it would only contain a single partition (partition type fb), whereas the boot disk would have more, as you can’t boot an OS directly from vmfs.
Additionally, you should have been able to either run esxi from a Live CD or install esxi to a usb thumbdrive and bring up your system. With the caveat that I haven’t done this myself to make sure.
Also related to redundancy: if you aren’t as concerned about offsite protection and are looking for the inexpensive option, you could run vmware server on an existing system locally (so you don’t have to give up your entire system to ESX) and run a second instance of AD server there. You’d be protected against the same type of physical failure, as you’d have separate physical pieces of hardware (assuming you were using some local drive on each system, not a shared drive).
Seems like a pretty good place to mention benefits of a RAID configuration.
-r
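The partition-table check suggested in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads the four primary entries of a classic MBR (type byte at offset 4 of each 16-byte entry, table starting at byte 446, `55 AA` signature at byte 510) and applies the commenter's rule of thumb that a lone type `fb` partition means a VMFS-only data disk. The function names and the `make_mbr` helper are made up for the example, and the classification is a heuristic, not a guarantee.

```python
VMFS_TYPE = 0xFB  # MBR partition type used for VMFS volumes


def partition_types(sector0):
    """Return the type bytes of the non-empty primary partitions in an MBR."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    types = []
    for i in range(4):
        entry = sector0[446 + 16 * i: 446 + 16 * (i + 1)]
        ptype = entry[4]          # partition type byte
        if ptype != 0:            # 0 marks an unused slot
            types.append(ptype)
    return types


def classify(sector0):
    """Heuristic from the comment: one VMFS partition = data, several = boot."""
    types = partition_types(sector0)
    if types == [VMFS_TYPE]:
        return "data"
    if len(types) > 1:
        return "boot"
    return "unknown"


def make_mbr(types):
    """Build a fake 512-byte MBR with the given type bytes (demo helper only)."""
    sector = bytearray(512)
    for i, ptype in enumerate(types):
        sector[446 + 16 * i + 4] = ptype
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


print(classify(make_mbr([0xFB])))
print(classify(make_mbr([0x05, 0x04, 0xFB])))
```

On a real system the first sector would come from the raw disk device, read with something like `open(path, "rb").read(512)` and the appropriate privileges; the device path depends on your setup and is left out here.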
wow, I guess I’m not alone when it comes to vmware homelab failures
my homelab is not as advanced as yours is [single server, 2 disks no raid, 2 nics]. I had my hard drive start acting weird [overall sluggish system, console kept spitting out sector read errors].
took me 12 hours to copy 300GB of data from bad disk to good disk, with a barely running esxi host, and get back on my feet. I lost 1 VM due to being careless, level 8 issue, nothing wrong with vmware.
In the end it turned out the failing disk was fine, just the controller port was screwed up, thank god I have 6 more ports