May 2020 - Virtual Ramblings

Rebuilding the vSAN Lab (Hands-on part 3).

So as i complete this series I wanted to include some screenshots, examples of the order of operations I used, and discuss some of the options or ways to make this faster.

Disclaimer: I didn’t end to end script this. I didn’t even leverage host profiles. This is an incredibly low automation rebuild (partly on purpose, as we were training someone relatively new to work with bare-metal servers as part of this project so they would understand things that we often automate). If using a standard switch (In which case SHAME use the vDS!) document what the port groups and VLANs for them are. Also, check the hosts for advanced settings.

First step Documentation

So before you rebuild hosts, make sure you document what the state was beforehand. RVtools is a pretty handy tool for capturing some of this if the previous regime in charge didn’t believe in documentation. Go swing by Robware.net and grab a copy and export a XLS of your vCenter so you can find what that vmk2 IP address was before, and what port group it went on (Note it doesn’t seem to capture opaque port groups).

Put the host into maintenance mode

Now, this sounds simple, but before we do this let’s understand what’s going on in the cluster and the UI gives us quite a few clues!

The banner at the top warns me that another host is already in maintenance mode. Checking on slack, I can see that Teodora and Myles who are several time zones ahead of me have patched a few hosts already. This warning is handy for operational awareness when multiple people manage a cluster!

Next up, I’m going to check on the Go To Pre-Check. I want to see if taking an additional host offline is going to have a significant negative impact (This is a fairly large cluster, this would not be advised on a 4 host cluster where 2 hosts would mean 50% of capacity offline, and an inability to re-protect to full FTT level).

I’m using Ensure Accessibility (The handful of critical VM’s are all running at FTT=2), and I can see I”m not going to be touching a high water mark by putting this host into maintenance mode. If I had more time and aggressively strict SLAs I could simulate a full evacuation. (Again,This is a lab). here’s an image of what this looks like when you are a bit closer to 70%.

Notice the “After” capacity bar is shorter representing the missing host

Now, after pressing enter maintenance mode I’m going to watch as DRS (which is enabled) automatically evacuates the virtual machines from this host. While a rather quick process I’m going to stop and daydream of what 100Gbps Ethernet would be like here and think back to the stone ages of 1Gbps ethernet where vMotions of large VMs was like watching paint dry…

Once the host is in Maintenance Mode I’m going to remove this host from the NSX-T Manager. If your not familiar with NSX-T this can be found under System → Fabric → Nodes and THEN use the drop down to select your vCenter (it defaults to stand alone ESXi hosts).

If you have any issues with uninstalling NSX-T, Rerun the removal with “Force Delete” only

Once the host is no longer in NSX-T, and out of maintenance mode you can go ahead and reboot the host.

Working out of band

As I don’t feel like flying to Washington State to do this rebuild, I’m going to be using my out of band tooling for this project. For Dell hosts this means iDRAC, for HPE hosts this means iLO. When buying servers always make sure these products are licensed to a level that allows full remote KVM, and for Dell and HPE hosts you have the additional licensing required for vLCM. (OMIVV and HPE iLO Amplify). Some odd quirks I’ve noticed is that while I personally hate Java Web start (JWS) as a technology, the JWS console has some nice functionality that the HTML5 does not. Being able to select what the next boot option should be, means I don’t have to pay quite as much attention, however the JWS is missing an on screen keybaord option so I did need to open this from the OS level to pass through some F11 commands.

while I’m at it, I’ll go ahead and mount the Virtual Media, and attach my ESXi ISO to the Virtual CDROM drive.

Firmware and BIOS patching

Now if you are not using vLCM it might be worth doing some firmware/bios update at this time. for Dell hosts, this can be done directly from the iDRAC by pointing them at the downloads.dell.com mirror or loading an ISO.

I’m always a bit terrified when I see the almost 2 hour task limit, then realize “ohh, yah someone put a lot of padding in”

BIOS and boot configuration

For whatever reason, my lab seems to have a high number of memory errors. To help protect against this I’m switching hosts by default to run in Advanced ECC mode. For 2 hosts that have had chronic DIMM issues that I haven’t had time to troubleshoot I’m taking a more aggressive stance and enabling fault resilient mode. This forces the Hypervisor and some core kernel processes to be mirrored between DIMMs so ESXi itself can survive the complete and total failure of a memory DIMM (vs Advanced ECC which will tolerate the loss of subunits within the DIMM). For more information on memory than you ever wanted to know, check out this blog.

Next up I noticed our servers were still set to a legacy boot. I’m not going to write an essay on why UEFI is superior for security and booting, but in my case I needed to update it, if I was going to use our newer iPXE infrastructure.

Huh, I wonder if ESX2 would deploy to my R630…

Note upon fixing this I was greeted by some slightly more modern versions.

Now, if you don’t have a fancy iPXE setup you can always mount the ISO to the virtual CDROM drive.

Note: after changing this you will need to go through a full boot cycle before you can reset the default boot devices within the BIOS boot manager.

Now I didn’t take a screenshot on my Dell of this, but here’s one of the HPE hosts, what it looks like to change the boot order. The key thing here is to make sure the new boot device is first in this list (As we will be using one-time boot selection to start the installer).

In this case I’m booting from a M.2 device as this fancy Gen10 supports them

Coffee Break

Around this time is a great time for some extra coffee. Some things to check on.

Make sure your ISO is attached (Or if using PXE/iPXE the TFTP directory has the image you want!).
make sure the next boot is set to your installer method (Virtual CD Room, or PXE).
Go back into NSX-T manager and make sure it doesn’t think this host is still provisioned or showing errors. If it’s still there unselect “uninstall” and select “Force delete”. This tends to work.
Collect your notes for the rest of the installation NTP servers, DNS servers, DNS suffix/lookup domains, IP and hostname for each host (If your management domain has DNS consider setting reservations for the MAC address of VMK0 which always steals from the first physical NIC so it will be consistent, unlike the other ones that generate from the VMware MAC range.)
Go into vCenter and click “remove from inventory” on the disconnected host. We can’t add a host back in with the same hostname (This will make vCenter angry).

What to look at and do (or not) when recovering from a cluster failure (Part 2)

In part one of this series, I highlighted a scenario where we lost quite a few hosts in a lab vSAN cluster caused by 3 failed boot devices and a power event that forced a reboot of the hosts. Before I get back into the step by step of the recovery I wanted to talk a bit about what we didn’t do.

What should you do?

If this is production please call GSS. They have unusually calm voices and can help validate decisions quickly and safely before you make them. They also have access to recovery tooling, and escalation engineers you do not have.
Try to get core services online first (DNS/NTP/vCenter). This makes restoring other services easier. In our case, we were lucky and had only partial service interruption here (1 of 2 DNS servers were impacted).

Cluster Health Checks

While, I much prefer to work in vCenter, in the event of vCenter having an outage, it is worth noting that vSAN health checks can be run without vCenter.

Run at the CLI
Run from the Native HTML5 client on each ESXi host. The cluster health is a distributed service that is independent of vCenter for core checks.

Solving the chicken egg monitoring problem since 2017!

When reviewing the impact on the vSAN cluster look at the Cluster Health Checks:

How many objects are re-syncing, and what is the progress.

Note, in this case I just captured a re-balance operation

2. How many Components are healthy vs. unhealthy

3. Drive status – How many drives and disk groups are offline. note, within the disk group monitoring you can see what virtual machine components were on the impacted disk groups.

4. Service Check. See how many hosts are reporting issues with vSAN related services. In my case this was the hint that one of my hosts had managed to partially boot, but something was wrong. Inversely if you may see a host that is showing disconnected from vCenter, but is still contributing storage. It is worth noting that vSAN can continue to run and process storage IO as long as the vSAN services start, and the vSAN network is functional. It’s partly for this reason that when you enable vSAN, the HA heartbeats move to the vSAN network, as it’s important to keep your HA fencing in line with storage.

5. Time is synchronized across the cluster. For security reasons, hosts will become isolated if clocks drift too far (Similar to active directory replication breaking, Kerberos authentication not working etc. Thankfully there is a handy health check for this.

Host 16 used a stratum 16 $10 rolex I bought for cheap while traveling.

What Not to do?

Don’t panic!

Ok, so we had to do a bit more than that…

Also, while you are at it, don’t reboot random hosts.

This advice isn’t even specifically vSAN advice, but unlike your training with Microsoft desktop operating systems, the solution to problems with ESXi is not always to “tactically reboot” a host by mashing reset from the iDRAC. You might end up rebooting a perfect health host that was in the middle of a resync, or HA operation. Rebooting more health hosts does a few things:

It causes more HA events. HA events trigger boot storms. large bursts of disk IO as an Operating system reboots, databases force log rechecks, in-memory databases rebuild their memory caches and other processes that are normally staggered.
Interrupt object rebuilds. In our case (3 hosts failures and FTT=1) we had some VM’s that we lost quorum on, but many more that only lost 1 of 3 pieces. Making sure all objects that can be repaired are repaired quickly was the first order of battle.
Rebooting hosts can dump logs or crash dumps that are not being written to persistent disk. GSS may want to scrape some data out of even a 1/2 dead host if possible.

Assemble the brain trust

Remember, always have a Irish guy named Myles ready to help fix things

A few other decisions came up as Myles, Teodora and I spoke about what we needed to do to recover the cluster. We also ruled out a few recovery methods and decided on a course of action to get the cluster stable, and then begin the process of proactively preventing this from impacting us with other hosts.

Salvage a boot device from a capacity device – We briefly discussed grabbing one of the capacity devices out of the dead hosts and using it as a boot device. Technically this would not be a supported configuration (or controller is not supported to act as both a boot device and a device hosting vSAN capacity devices). The challenge here is we wanted to get back 100% of our data and it would have been tedious to identify which disk group was safe to sacrifice in a host for this purpose. If we were completely unable to get remote hands to install boot devices or were only interested in the recovery of a single critical VM at all costs, this might have made sense to investigate.
Drive Switcharo– Another option for recovery has our remote hands pull the entire disk group out of the dead servers and shove them into free drive bays on existing healthy servers. Pete Koehler mentioned this is something GSS has had success and something I’d like to dedicate to its own blog topic at some point. Why does this work? Again, vSAN does not store metadata or file system structures on the boot devices, purposely to increase survivability in cases where the entire server must be replaced. This historically was not a common behavior in enterprise storage arrays that would often put this data on OS/vault drives (that might not be movable even, or embedded). Given we had adequate drive bays free to split the 6 impacted disk groups (2 per host) across the remaining 13 hosts in the cluster this was an option. In our case, we decided we didn’t want to deal with moving them back after this was done. My remote hand’s teams were busy enough with vSphere 7 launch tasks, and COVID related precautions were reducing the staffing levels.
Fancy boot devices – We decided to avoid trying to use SD cards going forward as our primary boot option (even mirrored). Once these impacted hosts were online and the cluster was healthy we had ops plug in all of our new boot devices so we could proactively one host at a time process a fresh install. In a perfect world we would have had M.2 boot devices, but adding a PCI-E riser for this purpose on 4-year-old lab hosts was a bit more than we wanted to spend.

What did we do?

In our case, we called our data center ops team and had them plug in some “random USB drives we have laying around” and began fresh installs to get the hosts online and restore access to all virtual machines. I ordered some high endurance Sandisk USB devices and as a backup some high endurance SD cards (Designed for 4K Dashcam video usage). Once these came in, we reinstalled ESXi to the USB devices allowing our ops teams to recover their USB devices. The fresh high-quality SD cards will be useful for staging ISOs inside the out of band, as well as serving as an emergency boot device in the event a USB device fails.

Next up in the series. A walk through of installing ESXi from bare metal, some changes we made to the hosts and I’ll answer the question of “what’s up withe snake hiding in our R&D datacenter”.

How to rebuild a VCF/vSAN cluster with multiple corrupt boot devices

Note: this is the first part of a series.

In my lab, I recently had an issue where a large number of hosts needed to be rebuilt. Why did they need to be rebuilt? If you’ve followed this blog for a while, you’ve seen the issues I’ve run into with SD cards being less than reliable boot devices.

Why didn’t I move to M.2 based boot devices? Unfortunately, these are rather old hosts and unlike modern hosts, there is not an option for something nice like a BOSS device. This is also an internal lab cluster used by the technical marketing group, so while important, it isn’t necessary “mission critical” by any means.

As a result of this, and a power hiccup I ended up with 3 hosts offline that could not restart. Given that many of my VM’s were set to only FTT=1 this means complete and total data loss right?

Wrong!

First off, the data was still safe on the disk groups of the 3 offline hosts. Once I can get the hosts back online the missing components will be detected and the objects will become healthy again (yah, data loss!). vSAN does not keep the metadata or data structures for the internal files systems and object layout on the boot devices. We do not use the boot device as a “Vault” (if your familiar with the old storage array term). If needed all of the drives in a dead host can be moved to a physically new host and recovery would be similar to the method I used of reinstalling the Hypervisor on each host.

What’s the damage look like?

Hopping into my out of band management (My datacenter is thousands of miles away) I discovered that 2 of the hosts could not detect their boot devices, and the 3rd failed to fully reboot after multiple attempts. I initially tried reinstalling ESXi on the existing devices to lifeboat them but this failed. As I noted in a previous blog, SD cards don’t always fully fail.

Live view of the SD cards that will soon be thrown into a Volcano

If vSAN was only configured to tolerate a single failure, wouldn’t all of the data at least be inaccessible with 3 hosts offline? It turns out this isn’t the case for a few reasons.

vSAN does not by default stripe data wide to every single capacity device in the cluster. Instead, it chunks data out into fresh components every 255GB (Note you are welcome to set strip width higher and force more sub-components being split out of objects if you need to).
Our cluster was large. 16 hosts and 104 physical Disks (8 disks in 2 disk groups per host).
Most VM’s are relatively small, so out of the 104 physical disks in the cluster, having 24 of them offline (8 per host in my case). still means that the odds of those 24 drives hosting 2 of the 3 components needed for a quorum is actually quite low.
A few of the more critical VM’s were moved to FTT=2 (vCenter, DNS/NTP servers) making their odds even better.

Even in the case of a few VM’s that were impacted (A domain Controller, some front end web servers), we were further lucky by the fact that these were redundant virtual machines already. Given both of the VMs providing these services didn’t fail, it became clear with the compounding ods in our favor that for a service to go offline was more in the odds of rolling boxcars twice, than a 100% guarantee.

This is actually something I blogged about quite a while ago. It’s worth noting that this was just an availability issue. In most cases of actual true device failure for a drive, there would normally be enough time between loss to allow for repair (and not 3 hosts at once) making my lab example quite extreme.

Lessons Learned and other takeaways:

Raise a few Small but important VM’s to a higher FTT level if you have enough hosts. Especially core management VMs.
vSAN clusters can become MORE resilient to loss of availability the larger they are, even keeping the same FTT level.
Use higher quality boot devices. M.2 32GB and above with “real endurance” are vastly superior to smaller SD cards and USB based boot devices.
Consider splitting HA service VM’s across clusters (IE 1 Domain Controller in one of our smaller secondary clusters).
For Mission-Critical deployments use of a management workload domain when using VMware Cloud Foundation, can help ensure the management is fully isolated from production workloads. Look at stretched clustering, and fault domains to take availability up to 11.
Patch and reboot your hosts often. Silently corrupt embedded boot devices may be lurking in your USB/SD powered hosts. You might not know it until someone trips a breaker and suddenly you need to power back on 10 hosts with dead SD devices. Regular patching will catch this one host at a time.
While vSAN is incredibly resilient always have BC/DR plans. Admins make mistakes and delete the wrong VMs. Datacenters are taken down by “Fire/Flood/Blood” all the time.

I’d like to thank Myles Grey and Teodora Todorova Hristov for helping me make sense of what happened and getting the action plan to put this back together and grinding through it.

Keeping track of VCF and vSAN cluster driver/firmware

Are you building out a new VMware Cloud Foundation cluster, and trying to make sure you stay up to date with your vSAN ReadyNodes driver/firmware updates? Good news, there are a few options for tracking new driver/firmware patches.

The first method is simple, try out the new vLCM functionality. This allows for seamless updates of firmware/drivers for drives and controllers as well as system BIOS and other devices. It also has integration to verify key driver/firmware levels for the vSAN VCG sub-components. For those of you looking into this go check, the VCG for compatible hardware check out this blog post.

What about for clusters where you can not use vLCM yet. Maybe, your servers are not yet supported?

The vSAN VCG notification service can help fill the gap. It allows you to subscribe to changes. Subscribing will set you up for email alerts that will show changes to driver and firmware versions, as well as when updates and major releases. You can sign up for individual components, as well as for an entire ReadyNode specification.

Changes are reflected in a clear color-coded view showing what has been removed and what has been added to replace the entry.

The ReadyLabs team keep continuing to make it easier to keep your VMware Cloud Foundation environment up to date. If you have any more questions about the service, be sure to check out the FAQ. If you have any questions on this or the vSAN VCG reach out by email to vsan-hcl@vmware.com