Let me preface this discussion of iSCSI with my own personal opinion about iSCSI in the year 2020. With the support of shared VMDKs for SCSI-3PR applications, NFS and SMB shares, the need for iSCSI has reduced quite a bit. If you are using iSCSI today I’d like to talk about some alternatives for delivering that shared access or external cluster storage requirement. That said, I know there are still some use cases for it, so let us go deeper on this topic.
Previously iSCSI on vSAN was only supported with stretched clusters by a limited RPQ. Why was this?
Normally vSAN stretched clusters implement a site locality construct to avoid unnecessary inter-site latency being added to the read IO path (they prefer all reads from the local site). The challenge came from the fact that the iSCSI service had no awareness of the two fault domains, and you could easily get into a situation where an iSCSI target would be placed on the secondary site while serving IO to virtual machines on the first site. As a result it would be possible for data at the preferred site where a virtual machine is being served to be sent to an iSCSI target on the remote site, and then come back as an iSCSI packet to a virtual machine running at the preferred site.
To prevent this problem, vSAN 7 Update 1 now supports setting a preferred site for an iSCSI target to live. Note, in the event of a complete site failure, the preference will be discarded and the service will cleanly fail over to the other side of the stretched cluster. This combined with other networking improvements and performance optimizations I mentioned in this blog, should help round out this new use case.
It is possible to relocate the preferred location. You will also receive a warning if something has caused a target to run at the non-preferred location.
Again, when clustering Windows applications I tend to prefer the native VMDKs these days, but for those of you using iSCSI today (or already under RPQ) this may be a useful setting to look at.
I have two on demand VMworld sessions that can be found at this link, and an additional roundtable for premium pass holders.
The first session is a return of the IO path deep dive, where we go below the hood on what has changed within the IO path. This year is going to feature quite a few new features and low level improvements, so I’d encourage everyone to come check it out.
Deconstructing vSAN: A Deep Dive into the Internals of vSAN [HCI1276] John Nicholson, Senior Technical Marketing Architect, VMware –
The other Session: vSAN File Services: Deep Dive [HCI1825] Is with my podcast co-host Pete Flecha. It is a great review of what’s improved in vSAN File Services as well as some details on how it works behind the scenes.
For anyone attending my session who has questions, my DMs are open for the week of VMworld and you can find me on Twitter as @Lost_signal.
I also have some live roundtables where Jason Massae and I will be talking about space reclaim (UNMAP, Zero Reclaim, and beyond). There have been some improvements lately in this space, as well as years of steady improvement to talk about in Storage Management – How to Reclaim Storage Today on vSAN, VMFS and vVols with John Nicholson [HCI2726].
Ok, I’ll agree I probably owe an explanation of why.
A quick history lesson of the 3 kinds of VMDKs used on VMFS.
thin: Space is not guaranteed, it’s consumed as the disk is written to.
thick (Sometimes called Lazy Thick): Space is reserved but the blocks are initialized lazily, so VMFS has a zeroing cost the first time a block is written.
EZT or Eager Zero Thick: The entire disk is zero’d so that VMFS never needs to write metadata. This is required for shared disk use cases, as VMFS can’t coordinate metadata updates.
These virtual disk types have no meaning on vSAN. From a vSAN perspective all objects are thin (thin vs thick vs EZT makes no difference). This is similar to how NFS is always sparse, and reservations are something you can do (with NFS VAAI) but it doesn’t change the fact that it doesn’t actually write out zeros or fill the space like it did on VMFS. On VMFS this could be mitigated largely using ATS and WRITE_SAME (with further improvements in the works). I’ve always been dubious about this benefit given how many databases and applications would often write out and pre-allocate file systems in advance. There are likely some corner cases, such as creating an ext4 file system being slower, but you can generally work around that if you really care (mkfs.ext4 -E nodiscard).
Having an EZT disk won’t necessarily improve vSAN performance the same way it does for VMFS. Pre-filling zeros on VMFS had an advantage of avoiding “burn in” bottlenecks tied to metadata allocation.
First off, vSAN has its own control of space reservation: the Object Space Reservation policy, OSR=100% (Thick) or OSR=0% (Thin). This is used in cases where you want to reserve capacity and prevent allocations that would allow you to run out of space on the cluster. In general I recommend the default of “Thin” as it offers the most capacity flexibility. Thick provisioning tends to be reserved for cases where it is impossible or incredibly difficult to ever add capacity to a cluster and you have very little active monitoring of a cluster.
What about FT and shared disk use cases (RAC, SQL, WFC, etc)? This requirement has been removed for vSAN. For vSAN you also do not need to configure object space reservation to thick to take advantage of these functionalities. On VMFS there may still be some benefits here due to how shared VMDK metadata updates are owned by a single owner (vSAN metadata updates are distributed, so this is not an issue).
What do I gain by using Thin VMDKs as my default?
You save a lot of space. Talking to others in the industry, I hear 20-30% capacity savings. If combined with TRIM/UNMAP automated reclaim, even more space can be clawed back! This can lead to huge savings on storage costs. As an added bonus, Deduplication and Compression work as intended only when OSR=0% and the default thin VMDK type is chosen.
Also, if configured to auto-reclaim you lower the chances of an out of space condition, so this can increase availability. Eager Zero Thick or Thick VMDKs on vSAN are not going to reserve capacity in a way that prevents over commitment on vSAN. Instead they will simply use more capacity.
So let’s talk about “maximum supported” and why it’s just a weird thing to focus on. As an example, take building the biggest vSAN cluster that could possibly be supported. Today that is ONLY 64 nodes, but let’s unpack what I can run in 64 nodes.
vSAN 7 supports 32TB capacity devices. With 5 disk groups x 7 devices that’s ~1.12PB per host of raw capacity (Please don’t do this without calling me, but hey it is there).
With AMD, 2 socket servers currently offer 128 cores. With Intel’s 56 core parts that’s 224 threads (yes, I’m aware the 9200 may require its own nuke plant to run and cool). I’m going to ignore quad socket for this conversation but yes, we support quad sockets like Synergy 660 Gen10 for SAP HANA.
BitFusion allows for remote CUDA calls to be served from remote hosts (and ones not even in the cluster). So GPU workload scaling potentially could get pretty nuts and I’m going to leave others to speculate on how many GPU cores could serve a cluster.
Maximum Memory is 16TB DRAM per host, and 12TB of PMEM per host. So a PB of DRAM, and 768TB of PMEM per cluster.
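For anyone who wants to check my napkin math, here is a minimal sketch of the arithmetic in Python. It uses only the figures quoted above; the variable names are mine, and the cluster-wide raw capacity is simply derived from the per-host number rather than anything published.

```python
# Back-of-the-envelope maximums for a 64-node vSAN 7 cluster,
# using only the per-host figures quoted in this post.
HOSTS = 64
DISK_GROUPS_PER_HOST = 5
CAPACITY_DEVICES_PER_GROUP = 7
DEVICE_TB = 32
DRAM_TB_PER_HOST = 16
PMEM_TB_PER_HOST = 12

# Raw capacity: disk groups x capacity devices x device size.
raw_tb_per_host = DISK_GROUPS_PER_HOST * CAPACITY_DEVICES_PER_GROUP * DEVICE_TB
raw_tb_per_cluster = raw_tb_per_host * HOSTS

# Memory: per-host maximum x host count.
dram_tb_per_cluster = DRAM_TB_PER_HOST * HOSTS
pmem_tb_per_cluster = PMEM_TB_PER_HOST * HOSTS

print(f"Raw capacity per host:    {raw_tb_per_host} TB")
print(f"Raw capacity per cluster: {raw_tb_per_cluster} TB")
print(f"DRAM per cluster:         {dram_tb_per_cluster} TB")
print(f"PMEM per cluster:         {pmem_tb_per_cluster} TB")
```

The per-host raw capacity comes out at 1120TB (the ~1.12PB above), and the DRAM total of 1024TB is where “a PB of DRAM” comes from.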
Now, addressable memory gets more fun as TPS, memory ballooning, and DRS (with new cool capabilities in 7) mean that the amount actually allocated could be a bit higher. And when you are spending what I assume a G5 jet costs, I assume you are going to use these features. Should you design to these maximums? In general, no. Most people have other reasons to split a cluster up (Remember we can always do shared nothing migrations between clusters). People will want to limit the blast zone of a management domain, etc.
Part of the benefit of HCI is you can easily scale it down to 2 node even… Also remember Hosts can be reclaimed and moved to other workload domains in VMware Cloud Foundation.
Lastly, just because you can, doesn’t mean you should. Most sane people don’t need or want 40 drives in a host, or want an HA event to result in 6TB of memory worth of VM’s rebooting at once. Respect operational reasons to limit blast zones and sizing. As VMware makes it easier to migrate or share resources between clusters these kinds of limits matter less and less.
This one’s a long time coming, and there will be more posts on this topic. Go ahead and start playing this playlist to hear the good, the bad and the ugly (It will only take a minute).
So I took some time today to do some testing using some microphones laying around.
Zoom (It’s common enough) and they use pretty good codecs that are rarely a bottleneck.
Local recording (I’ll do WAN impact simulations at another time).
I did use the Original microphone sound mode (Not all conference systems have noise suppression and it’s worth noting that noise suppression while good for many cases does reduce quality). I’ll test these capabilities with actual noise another day (Crying baby and firetruck simulator?).
I haven’t tested every microphone in this house yet (I’ve got a Shure SM58, a SteelSeries headset my wife uses, and I’m sure some other Bluetooth devices).
I tried to not adjust the gain on any devices, or the audio input volume (Which shows on the analog earbuds, which are way too quiet). This simulates someone joining a call late and in a hurry.
The AC came on at one point and I tried to note it, but it’s hard to hear it too much. There is a NAS and desktop (fairly quiet fans) and the laptop fan. I plan to do another test with music, firetruck/baby in background to simulate some of the COVID WFH lifestyles.
I ate dairy (tends to give me some sinus congestion) and drank a lot of Diet Dr Pepper. Both of these always negatively impact my throat etc for speaking. Drinking water, standing, and avoiding dairy would make me speak better, but I was going for something more realistic.
I use the following test phrases:
Oak is strong and also gives shade. Cats and dogs each hate the other. The pipe began to rust while new. Open the crate but don’t break the glass
Realistically it would be better to read something a lot more mundane, but these are known test phrases that have a diversity of sounds.
About what I tested:
The Heil PR-40 microphone was attached to a Blu USB-> XLR adapter with no XLR cable used. Gain was a bit high.
Razer Nari (Note I don’t have their crazy audio tools installed; like most software built by hardware companies I find it to be an absolute nightmare to use). These headphones use proprietary 2.4GHz wireless, not Bluetooth.
Apple AirPods Pro. Note while they are connected to a MacBook Pro, I had 2 other Bluetooth devices (keyboard, touch pad) connected, so a non-optimal codec was likely used. Again, I’m trying to simulate minimal effort real world use and what a worst case Bluetooth codec would sound like. Remember the audio quality on Bluetooth is often 10x worse for the microphone than for playback, so just because you hear “good” sound doesn’t mean you’re transmitting it.
PSTN bridge. I dialed into the zoom call (not the app) using an iPhone 11 Pro. I tested this twice, once holding up to my ear and another on speaker phone laying on my desk. I should probably test using the app, but I wanted to simulate the ugly truth of what it sounds like when people use the dial in code under semi-optimal conditions. (I get good phone service, and wasn’t in a car with the AC blasting).
I’ll post some more information later (gotta run to the bank), but here’s a playlist of my initial tests.
Do your own tests:
By no means accept my testing as authoritative. Do your own test and customize for
The room you use
The devices you want to use
Try adjusting with volume, gain and other settings.
Try standing (it helps some people speak more clearly).
Have someone else listen to them.
Audio quality is often the result of many factors that are out of our control like:
People who still use Lync/Skype4Business.
Accents (Myles, if I have 2 trees, and add 1 tree how many trees do I have?)
Still despite this it’s worth knowing what you sound like and doing some quick tests to see if you can make yourself heard more readily. Being clearer on calls leads to less repeating things, more understanding and hopefully shorter/faster and more productive phone calls.
Lastly, to the managers out there: talk to your reports about audio quality. Get people gear, get people an Eero WIFI bundle, reimburse better internet. A team that’s well heard is a team that is productive.
This is going to be a quick blog post, as someone asked whether the PERC 740 is going to be certified, or if VMware is going to try to certify it. A few things to consider….
Dell already sells a lower power, small form factor HBA330 (13Gen) and HBA 330+ (14 Gen servers) that has ultra deep queue depths, is simple to configure (no configuration), is cheap (less than 1/2 the street price of the PERC 740), and more importantly is brilliantly stable.
VMware does not unilaterally certify devices. It would require the OEM (Dell), often working with the ODM for the ASIC (in this case Broadcom/Avago), to submit the device to the ReadyLabs for testing. This has not happened, and it is my understanding that it is likely never to happen for a device that is frankly inferior in every way (cost, stability of pass through, performance, heat, power) for the use case of a pass through device.
NVMe devices do not need HBAs (the controller is built into the drive). Longer term as All Flash vSAN evolves, I expect low cost NVMe to “beat” the price of SAS/SATA Read Optimized flash drives plus the overhead of even a $250 cheap HBA.
But John, what about RAID for my boot devices?
I’m glad you asked! VMware has updated our boot device requirements with vSphere 7 and for Dell the BOSS mirrored M.2 solution provides a great blend of endurance, affordability, and fault isolation for boot, crash dumps, and log placement.
But what if my VAR rep said it’s ok to use?
Well beyond them being wrong, if you try using it you will encounter quite a few issues. vSAN health alarm checks will detect a non-supported controller and throw angry alarms. It’s not exactly easy to “sneak” into production with the wrong controller, vSAN Health will light up like a Christmas tree at you. On top of this you will not be able to life cycle the controller to a supported driver/firmware version using vLCM. VMware will not support the configuration (obviously). It’s worth noting that buying “ReadyNode” SKU’s from Dell (Chassis personality codes that end in -RN) will block this configuration entirely from being built in the factory.
If this happens to you feel free to reach out and I’ll happily introduce you to your vSAN account team, and the Dell ReadyNode teams who can help set the record straight.
I spend a lot of time on Zoom, and I’ve noticed a trend. Some individuals sound good (and a few even great) on a consistent basis. I have weekly calls with people who consistently don’t have issues. I also have calls with other individuals where I consistently hear the same few phrases.
“Zoom is having audio issues”
“My ISP is flaking out again”
“Let me try the WIFI in another room”
“I think the kids are streaming again”
“Let me try dialing in instead or using my cell phone to join the call”
“Can anyone hear George, I think he’s cutting out” (Followed by everyone in unison saying George’s audio is just fine).
One common refrain in all these statements is they make the assumption that this is normal and that nothing can be done to fix it. The reality is most of these situations can be fixed (some easily, some with difficulty). Note this blog is not meant to cover how to make your audio go from good to great (That will be another blog discussing audio gear and recording environment).
“Let me try the WIFI in another room”
This one can have a number of root causes, but the solutions are all fairly simple.
You are using the WIFI gateway that came from your ISP and is baked into the gateway device.
For the price of $10 a month many ISPs will lease you a modem/gateway and there’s a free WIFI access point thrown in! This is problematic for a few reasons.
Your gateway generally isn’t located in a central location of the house to provide even coverage. People often want to hide this ugly box, and so it’s shoved behind other dense items.
The WIFI radios in these devices are generally sub-par and low quality.
In some extreme cases they might only support 2.4GHz, which is incredibly saturated to the point of being largely useless in dense urban areas.
How do I know it’s wifi or my internet connection?
Open a console (Win+r and type “cmd” and enter on windows, open terminal.app on OS X).
Ping something on the internet (“ping 22.214.171.124” on Windows or “ping 126.96.36.199 -A” on Mac). The capital -A flag on Mac will cause it to make a “beep” noise on every dropped packet.
Ping something local (generally your edge router). For most people, the edge router will be either 192.168.0.1 or 192.168.1.1. Run both pings for 5-10 minutes and then use Ctrl+C to stop each command.
You will get a summary. Does the WIFI one show large spikes in latency (above 10ms)? Do you see a consistent amount of timeouts (dropped packets)? If the WIFI is bad, the internet will only be as bad or worse. If the WIFI delivers 100% of packets with low latency and the internet is all over the place, then the problem is with your internet connection and you can skip this section.
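If you would rather not eyeball the two summaries, the same decision can be sketched in a few lines of Python. This is only an illustration of the logic above, not a tool I ship: you would paste in the round trip times yourself (with None for each timeout), the function names are made up, the 10ms local threshold is the one from this post, and the 100ms internet threshold is an arbitrary assumption of mine. It also treats any dropped packet as “bad”, which is stricter than a real-world judgment call.

```python
from statistics import mean


def summarize(rtts_ms):
    """Summarize ping samples in milliseconds; None marks a timeout."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    lost = sent - len(replies)
    return {
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "avg_ms": mean(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }


def is_bad(summary, spike_ms):
    """A link is 'bad' if it dropped anything or spiked above spike_ms."""
    worst = summary["max_ms"]
    return summary["loss_pct"] > 0 or (worst is not None and worst > spike_ms)


def diagnose(router_rtts_ms, internet_rtts_ms,
             wifi_spike_ms=10.0, internet_spike_ms=100.0):
    """Return 'wifi', 'internet', or 'ok' following the reasoning above."""
    # If the local hop is bad, the internet path can only be as bad or worse,
    # so blame the WIFI first.
    if is_bad(summarize(router_rtts_ms), wifi_spike_ms):
        return "wifi"
    if is_bad(summarize(internet_rtts_ms), internet_spike_ms):
        return "internet"
    return "ok"


# Hypothetical samples: the router ping dropped one packet and spiked to 45ms.
print(diagnose([2.0, 3.1, None, 45.0], [20.0, None, 80.0]))
```

With the made-up samples in the last line, the local hop is already dropping packets, so the sketch points at the WIFI before it even looks at the internet numbers.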
Below is what a good connection to your local router should look like.
What can you do about it?
The simplest solution is don’t use WIFI. Run an ethernet cable all the way to your desk, and for your laptop have a USB-C dock. For under $100 you can basically ignore WIFI as a problem entirely.
If you can move your devices to a more central location in the home this might help. For fiber, this is generally difficult, but for cable, this may be as simple as moving the device to a more central drop.
If you are a cable customer you can stop wasting $10-15 a month on equipment rental and just buy your own cable modem. The Wirecutter has a great guide on shopping for cable modems. While the cable companies will try to hard-sell this solution, rented modems are rarely upgraded and you often end up with older more questionable devices over time.
Once you run your own cable modem you will need a router/firewall and an Access point. There are combination devices (WIFI Routers). The WIFI router offers simplicity (Single device) but these devices tend to have more issues with security, still limit you to a single location (commonly next to your ISP gateway), and tend to have a low tolerance for power quality issues (They tend to die a lot easier). Note AT&T Fiber customers will need to put their modem in “Gateway mode” and consider disabling the WIFI that is included (or just ignore it).
The next step up, and probably the best option for people in larger multi-walled houses, is a mesh system. This can cover a larger home effectively and benefits from not having to run cable (or repurpose cable to remote access points). The Wirecutter has some reviews here, but Eero seems to be a pretty popular high-end option that “just works”.
Lastly, if a mesh system is not a good solution (old house with chicken wire in the walls holding up the plaster, extremely noisy local RF environment, you have ethernet runs throughout the house, you need to extend WIFI to an external shed or garage apartment), a modular solution where you deploy standalone access points with dedicated ethernet runs is an option. Ubiquiti UniFi is what I use. Note, I didn’t have to run cable. Instead, I repurposed the phone wires in my house as they were already cat5e ethernet, and I deployed a PoE switch in the central closet to power remote access points. This solution is a bit more complex (central controller, dedicated firewall, etc). The UniFi Dream Machine might make for a simple start, and you can add APs as needed.
“I think the kids are streaming again”
This one is a bit more challenging as it’s a function of bandwidth availability and priority. There are still a few solutions to it.
Throw bandwidth at the problem. Use Speedtest.net first to make sure you are getting what you pay for, and then call your ISP and go up in package. Some things to note are:
Downstream bandwidth may not be your issue. Upload (what is used to send your voice or video) may be the problem. I recently left Comcast for AT&T because while Comcast could sell me 1Gbps down, they couldn’t go beyond 35Mbps up. AT&T’s gigabit product offered 1Gbps up and down.
Do the math: 3Mbps for SD quality (potato 480p), 5Mbps for HD (720P), and 25Mbps for 4K. Downgrading the streaming quality can help, but this math gets ugly in reverse when you have multiple people trying to stream video of themselves.
It’s worth noting that Zoom by default maxes out at 720P outside of webinars, and requires enterprise accounts for HD. For HD settings and info https://support.zoom.us/hc/en-us/articles/207347086-Group-HD
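To make the “do the math” point concrete, here is a rough sketch that adds up streams against a link. The per-stream bitrates are the ones quoted above; the function names and the household in the example are made up, and reusing the same bitrates for upload is my simplifying assumption, since real conferencing clients adapt their send rate.

```python
# Per-stream bitrates quoted in this post, in Mbps.
STREAM_MBPS = {"sd": 3, "hd": 5, "4k": 25}


def required_mbps(streams):
    """Total bandwidth for a list of quality labels, e.g. ["hd", "4k"]."""
    return sum(STREAM_MBPS[quality] for quality in streams)


def headroom_mbps(link_mbps, streams):
    """Bandwidth left after the streams; negative means oversubscribed."""
    return link_mbps - required_mbps(streams)


# Hypothetical household: two 4K streams and one HD stream on a 100Mbps downlink.
print(headroom_mbps(100, ["4k", "4k", "hd"]))

# The same idea in reverse: three people sending HD video on a 10Mbps uplink.
print(headroom_mbps(10, ["hd", "hd", "hd"]))
```

The second call goes negative, which is the “ugly in reverse” case: the downlink has plenty of room, but the uplink is oversubscribed before anyone even shares a screen.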
Get usage in shape. Some of the fancier firewalls can try to shape or block traffic by traffic type. This is increasingly harder to do in a world of TLS encryption hiding traffic and traffic routing through CDNs. It may be easier to have Zoom prioritize your laptop above all other clients. I’m not a fan of traffic shaping on cheaper firewalls, as it requires per-packet inspection, adds overhead, and slows everything down, and many firewalls cannot run at 1Gbps line rate while doing this.
Abandon the (local) network. While expensive, having a separate data plan (Use your phone as a hotspot) that you use for your conference will take you off the shared local network if it truly is a lost cause. This is my emergency plan. Note data caps may apply, but if the alternative is sounding terrible on a 500 person conference call, you have to do what you have to do. I would recommend tethering by a cable rather than wifi or Bluetooth.
Clean up your ISP – If your local ISP is having packet loss issues (When you plug your laptop directly into the modem) call them. Do troubleshooting. Get a tech out. Check the weather and see if it always happens when it rains (a common issue for DSL or Coax is exposed cable will get wet and cause intermittent issues). Physically inspect the connections outside the house. Upgrade to a business class plan. While expensive this allows enforcement of SLAs and gets you priority on a lineman to fix your issues. The squeaky wheel gets the cheese, and just keep calling. If this is a chronic issue that is not being fixed start filing regular complaints with your state public utility commission. Fines at this level can get ugly on business class connections.
Change ISPs – Ask neighbors who they use, and see if there are other options. In rural areas WISPs often offer an alternative. Check with the wireless providers for 5G service. In my neighborhood, Verizon is already running test gear, and Sprint/T-Mobile is deploying backhaul ahead of new towers.
Move – This sounds drastic, but it is the year 2020. I can get 50Mbps down and 10Mbps up at my ranch by the Devil’s Sinkhole that is hours from real civilization. When moving look for at LEAST 2 high-quality ISPs that can offer 100Mbps upload as well as 200Mbps download. ISPs know when they have competition and take it seriously with price drops, better service, and free speed upgrades to compete. Being in a market served only by DSL and DOCSIS 2 cable means a slow death. Beware apartment buildings with contracts to a single small ISP as speeds will largely remain frozen. Another dangerous option is communities with political actors who are fighting the 5G rollout and ban “ugly telephone poles”. Fiber is 10x more expensive to deliver by digging than by the air, and you will be tying your internet hopes to coax last updated in the 1990s. Expecting a magic internet fairy to fix this is the definition of insanity. Houston’s lack of zoning and willingness to allow the telcos to have the poles look like a combination of Bangkok and Manila does well for my work from home needs. Think about it this way. You wouldn’t live 300 miles from the office, and by moving somewhere with bad connectivity that is what you are effectively doing with your connectivity to the internet.
So as I complete this series I wanted to include some screenshots, examples of the order of operations I used, and discuss some of the options or ways to make this faster.
Disclaimer: I didn’t end to end script this. I didn’t even leverage host profiles. This is an incredibly low automation rebuild (partly on purpose, as we were training someone relatively new to work with bare-metal servers as part of this project so they would understand things that we often automate). If using a standard switch (In which case SHAME use the vDS!) document what the port groups and VLANs for them are. Also, check the hosts for advanced settings.
First step Documentation
So before you rebuild hosts, make sure you document what the state was beforehand. RVtools is a pretty handy tool for capturing some of this if the previous regime in charge didn’t believe in documentation. Go swing by Robware.net and grab a copy and export an XLS of your vCenter so you can find what that vmk2 IP address was before, and what port group it went on (Note it doesn’t seem to capture opaque port groups).
Put the host into maintenance mode
Now, this sounds simple, but before we do this let’s understand what’s going on in the cluster and the UI gives us quite a few clues!
The banner at the top warns me that another host is already in maintenance mode. Checking on slack, I can see that Teodora and Myles who are several time zones ahead of me have patched a few hosts already. This warning is handy for operational awareness when multiple people manage a cluster!
Next up, I’m going to check on the Go To Pre-Check. I want to see if taking an additional host offline is going to have a significant negative impact (This is a fairly large cluster, this would not be advised on a 4 host cluster where 2 hosts would mean 50% of capacity offline, and an inability to re-protect to full FTT level).
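The 4 host example is easy to sanity check yourself. Below is a deliberately simplified sketch of that reasoning, not what the pre-check actually runs: it assumes uniform hosts, assumes RAID-1 mirroring (which needs 2 x FTT + 1 hosts), and ignores free capacity entirely, which the real pre-check does evaluate. The function names and the 16 host cluster are hypothetical.

```python
def offline_fraction(total_hosts, hosts_offline):
    """Fraction of hosts (and capacity, if hosts are uniform) that is offline."""
    return hosts_offline / total_hosts


def can_reprotect_raid1(total_hosts, hosts_offline, ftt):
    """True if the remaining hosts can hold a full RAID-1 layout for this FTT.

    Assumption: RAID-1 mirroring, which needs 2 * FTT + 1 hosts.
    Free capacity on the remaining hosts is NOT considered here.
    """
    remaining = total_hosts - hosts_offline
    return remaining >= 2 * ftt + 1


# The 4 host cluster from the text with 2 hosts in maintenance mode.
print(offline_fraction(4, 2))            # half the cluster is offline
print(can_reprotect_raid1(4, 2, ftt=1))  # 2 hosts left, 3 needed

# A hypothetical 16 host cluster with 2 hosts offline and FTT=2 policies.
print(can_reprotect_raid1(16, 2, ftt=2))
```

This is why taking a second host down is a shrug on a large cluster and a bad idea on a small one.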
I’m using Ensure Accessibility (The handful of critical VM’s are all running at FTT=2), and I can see I’m not going to be touching a high water mark by putting this host into maintenance mode. If I had more time and aggressively strict SLAs I could simulate a full evacuation. (Again, this is a lab). Here’s an image of what this looks like when you are a bit closer to 70%.
Now, after pressing enter maintenance mode I’m going to watch as DRS (which is enabled) automatically evacuates the virtual machines from this host. While a rather quick process I’m going to stop and daydream of what 100Gbps Ethernet would be like here and think back to the stone ages of 1Gbps ethernet where vMotions of large VMs were like watching paint dry…
Once the host is in Maintenance Mode I’m going to remove this host from the NSX-T Manager. If you’re not familiar with NSX-T this can be found under System → Fabric → Nodes and THEN use the drop down to select your vCenter (it defaults to stand alone ESXi hosts).
Once the host is no longer in NSX-T, and out of maintenance mode you can go ahead and reboot the host.
Working out of band
As I don’t feel like flying to Washington State to do this rebuild, I’m going to be using my out of band tooling for this project. For Dell hosts this means iDRAC, for HPE hosts this means iLO. When buying servers always make sure these products are licensed to a level that allows full remote KVM, and for Dell and HPE hosts you have the additional licensing required for vLCM (OMIVV and HPE iLO Amplify). One odd quirk I’ve noticed is that while I personally hate Java Web Start (JWS) as a technology, the JWS console has some nice functionality that the HTML5 one does not. Being able to select what the next boot option should be means I don’t have to pay quite as much attention. However, the JWS console is missing an on screen keyboard option, so I did need to open this from the OS level to pass through some F11 commands.
While I’m at it, I’ll go ahead and mount the Virtual Media, and attach my ESXi ISO to the Virtual CDROM drive.
Firmware and BIOS patching
Now if you are not using vLCM it might be worth doing some firmware/BIOS updates at this time. For Dell hosts, this can be done directly from the iDRAC by pointing them at the downloads.dell.com mirror or loading an ISO.
BIOS and boot configuration
For whatever reason, my lab seems to have a high number of memory errors. To help protect against this I’m switching hosts by default to run in Advanced ECC mode. For 2 hosts that have had chronic DIMM issues that I haven’t had time to troubleshoot I’m taking a more aggressive stance and enabling fault resilient mode. This forces the Hypervisor and some core kernel processes to be mirrored between DIMMs so ESXi itself can survive the complete and total failure of a memory DIMM (vs Advanced ECC which will tolerate the loss of subunits within the DIMM). For more information on memory than you ever wanted to know, check out this blog.
Next up I noticed our servers were still set to a legacy boot. I’m not going to write an essay on why UEFI is superior for security and booting, but in my case I needed to update it, if I was going to use our newer iPXE infrastructure.
Note upon fixing this I was greeted by some slightly more modern versions.
Now, if you don’t have a fancy iPXE setup you can always mount the ISO to the virtual CDROM drive.
Note: after changing this you will need to go through a full boot cycle before you can reset the default boot devices within the BIOS boot manager.
Now I didn’t take a screenshot on my Dell of this, but here’s one of the HPE hosts, what it looks like to change the boot order. The key thing here is to make sure the new boot device is first in this list (As we will be using one-time boot selection to start the installer).
Around this time is a great time for some extra coffee. Some things to check on.
Make sure your ISO is attached (Or if using PXE/iPXE the TFTP directory has the image you want!).
Make sure the next boot is set to your installer method (Virtual CD-ROM, or PXE).
Go back into NSX-T manager and make sure it doesn’t think this host is still provisioned or showing errors. If it’s still there unselect “uninstall” and select “Force delete”. This tends to work.
Collect your notes for the rest of the installation: NTP servers, DNS servers, DNS suffix/lookup domains, IP and hostname for each host (If your management domain has DNS consider setting reservations for the MAC address of VMK0, which always steals its MAC address from the first physical NIC so it will be consistent, unlike the other ones that generate from the VMware MAC range).
Go into vCenter and click “remove from inventory” on the disconnected host. We can’t add a host back in with the same hostname (This will make vCenter angry).
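Since I didn’t script this rebuild, my “notes” were literally notes, but if you want something slightly more structured, a sketch like the one below can catch a typo before you are staring at the installer. Everything in it is hypothetical: the class name, the fields, and the example hostnames and documentation-range IP addresses are placeholders, not values from my lab.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address


@dataclass
class HostRebuildNotes:
    """Per-host notes to collect before reinstalling ESXi (hypothetical layout)."""
    hostname: str
    mgmt_ip: str
    ntp_servers: list = field(default_factory=list)
    dns_servers: list = field(default_factory=list)
    dns_search_domains: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable issues; empty means the notes look sane."""
        issues = []
        if not self.hostname:
            issues.append("hostname is missing")
        # Check the management IP and every DNS server parses as an IP address.
        to_check = [("mgmt_ip", self.mgmt_ip)]
        to_check += [("dns_server", server) for server in self.dns_servers]
        for label, value in to_check:
            try:
                ip_address(value)
            except ValueError:
                issues.append(f"{label} {value!r} is not a valid IP address")
        if not self.ntp_servers:
            issues.append("no NTP servers recorded")
        if not self.dns_servers:
            issues.append("no DNS servers recorded")
        return issues


# Placeholder values only; substitute what you documented from your own environment.
notes = HostRebuildNotes(
    hostname="esxi01.lab.example",
    mgmt_ip="192.0.2.11",
    ntp_servers=["ntp.lab.example"],
    dns_servers=["192.0.2.53"],
    dns_search_domains=["lab.example"],
)
print(notes.problems())
```

An empty list means nothing obvious is missing. It will not tell you the IP is the right one, only that it is an IP.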
In part one of this series, I highlighted a scenario where we lost quite a few hosts in a lab vSAN cluster caused by 3 failed boot devices and a power event that forced a reboot of the hosts. Before I get back into the step by step of the recovery I wanted to talk a bit about what we didn’t do.
What should you do?
If this is production please call GSS. They have unusually calm voices and can help validate decisions quickly and safely before you make them. They also have access to recovery tooling, and escalation engineers you do not have.
Try to get core services online first (DNS/NTP/vCenter). This makes restoring other services easier. In our case, we were lucky and had only partial service interruption here (1 of 2 DNS servers were impacted).
Cluster Health Checks
While I much prefer to work in vCenter, in the event of vCenter having an outage it is worth noting that vSAN health checks can be run without vCenter.
Run at the CLI
Run from the Native HTML5 client on each ESXi host. The cluster health is a distributed service that is independent of vCenter for core checks.
When reviewing the impact on the vSAN cluster look at the Cluster Health Checks:
1. How many objects are re-syncing, and what the progress is.
2. How many components are healthy vs. unhealthy.
3. Drive status – How many drives and disk groups are offline. Note that within the disk group monitoring you can see which virtual machine components were on the impacted disk groups.
4. Service check. See how many hosts are reporting issues with vSAN-related services. In my case this was the hint that one of my hosts had managed to partially boot, but something was wrong. Conversely, you may see a host that shows as disconnected from vCenter but is still contributing storage. It is worth noting that vSAN can continue to run and process storage IO as long as the vSAN services start and the vSAN network is functional. It’s partly for this reason that when you enable vSAN, the HA heartbeats move to the vSAN network, as it’s important to keep your HA fencing in line with storage.
5. Time is synchronized across the cluster. For security reasons, hosts will become isolated if clocks drift too far (similar to Active Directory replication breaking, Kerberos authentication not working, etc.). Thankfully there is a handy health check for this.
What Not to do?
Also, while you are at it, don’t reboot random hosts.
This advice isn’t even specifically vSAN advice, but unlike your training with Microsoft desktop operating systems, the solution to problems with ESXi is not always to “tactically reboot” a host by mashing reset from the iDRAC. You might end up rebooting a perfectly healthy host that was in the middle of a resync or HA operation. Rebooting more healthy hosts does a few things:
It causes more HA events. HA events trigger boot storms: large bursts of disk IO as operating systems reboot, databases force log rechecks, in-memory databases rebuild their memory caches, and other processes that are normally staggered all run at once.
It interrupts object rebuilds. In our case (3 host failures and FTT=1) we had some VMs that we lost quorum on, but many more that only lost 1 of 3 pieces. Making sure all objects that can be repaired are repaired quickly was the first order of battle.
It can dump logs or crash dumps that are not being written to persistent disk. GSS may want to scrape some data out of even a half-dead host if possible.
Assemble the brain trust
A few other decisions came up as Myles, Teodora and I spoke about what we needed to do to recover the cluster. We also ruled out a few recovery methods and decided on a course of action to get the cluster stable, and then begin the process of proactively preventing this from impacting us with other hosts.
Salvage a boot device from a capacity device – We briefly discussed grabbing one of the capacity devices out of the dead hosts and using it as a boot device. Technically this would not be a supported configuration (our controller is not supported to act as both a boot device and a device hosting vSAN capacity devices). The challenge here is that we wanted to get back 100% of our data, and it would have been tedious to identify which disk group was safe to sacrifice in a host for this purpose. If we were completely unable to get remote hands to install boot devices, or were only interested in the recovery of a single critical VM at all costs, this might have made sense to investigate.
Drive Switcheroo – Another option for recovery was to have our remote hands pull the entire disk group out of the dead servers and shove them into free drive bays on existing healthy servers. Pete Koehler mentioned this is something GSS has had success with, and it is something I’d like to dedicate its own blog post to at some point. Why does this work? Again, vSAN does not store metadata or file system structures on the boot devices, purposely to increase survivability in cases where the entire server must be replaced. This historically was not a common behavior in enterprise storage arrays, which would often put this data on OS/vault drives (that might not even be movable, or might be embedded). Given we had adequate drive bays free to split the 6 impacted disk groups (2 per host) across the remaining 13 hosts in the cluster, this was an option. In our case, we decided we didn’t want to deal with moving them back after this was done. My remote hands teams were busy enough with vSphere 7 launch tasks, and COVID-related precautions were reducing the staffing levels.
Fancy boot devices – We decided to avoid trying to use SD cards going forward as our primary boot option (even mirrored). Once these impacted hosts were online and the cluster was healthy, we had ops plug in all of our new boot devices so we could proactively process a fresh install one host at a time. In a perfect world we would have had M.2 boot devices, but adding a PCI-E riser for this purpose on 4-year-old lab hosts was a bit more than we wanted to spend.
What did we do?
In our case, we called our data center ops team and had them plug in some “random USB drives we have laying around” and began fresh installs to get the hosts online and restore access to all virtual machines. I ordered some high-endurance SanDisk USB devices and, as a backup, some high-endurance SD cards (designed for 4K dashcam video usage). Once these came in, we reinstalled ESXi to the new USB devices, allowing our ops teams to recover their USB devices. The fresh high-quality SD cards will be useful for staging ISOs inside the out-of-band management, as well as serving as an emergency boot device in the event a USB device fails.
Next up in the series: a walkthrough of installing ESXi from bare metal, some changes we made to the hosts, and I’ll answer the question of “what’s up with the snake hiding in our R&D datacenter”.
Why didn’t I move to M.2 based boot devices? Unfortunately, these are rather old hosts and, unlike modern hosts, there is not an option for something nice like a BOSS device. This is also an internal lab cluster used by the technical marketing group, so while important, it isn’t necessarily “mission critical” by any means.
As a result of this and a power hiccup, I ended up with 3 hosts offline that could not restart. Given that many of my VMs were set to only FTT=1, this means complete and total data loss, right?
First off, the data was still safe on the disk groups of the 3 offline hosts. Once I can get the hosts back online, the missing components will be detected and the objects will become healthy again (yay, no data loss!). vSAN does not keep the metadata or data structures for the internal file systems and object layout on the boot devices. We do not use the boot device as a “Vault” (if you’re familiar with the old storage array term). If needed, all of the drives in a dead host can be moved to a physically new host, and recovery would be similar to the method I used of reinstalling the hypervisor on each host.
What’s the damage look like?
Hopping into my out of band management (My datacenter is thousands of miles away) I discovered that 2 of the hosts could not detect their boot devices, and the 3rd failed to fully reboot after multiple attempts. I initially tried reinstalling ESXi on the existing devices to lifeboat them but this failed. As I noted in a previous blog, SD cards don’t always fully fail.
If vSAN was only configured to tolerate a single failure, wouldn’t all of the data at least be inaccessible with 3 hosts offline? It turns out this isn’t the case for a few reasons.
vSAN does not by default stripe data wide to every single capacity device in the cluster. Instead, it chunks data out into fresh components every 255GB (note that you are welcome to set the stripe width higher and force more sub-components to be split out of objects if you need to).
Our cluster was large: 16 hosts and 104 physical disks (8 disks in 2 disk groups per host).
Most VMs are relatively small, so out of the 104 physical disks in the cluster, having 24 of them offline (8 per host in my case) still means that the odds of those 24 drives hosting 2 of the 3 components needed for a quorum are actually quite low.
A few of the more critical VM’s were moved to FTT=2 (vCenter, DNS/NTP servers) making their odds even better.
Even in the case of the few VMs that were impacted (a domain controller, some front-end web servers), we were further lucky in that these were already redundant virtual machines. Given that both of the VMs providing each of these services didn’t fail, it became clear that, with the compounding odds in our favor, a service going offline was closer to the odds of rolling boxcars twice than a 100% guarantee.
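The odds argument above can be made concrete with a rough sketch. It assumes a simplified model that is not taken from vSAN itself: every object is a mirror with 2×FTT+1 equally weighted components, each on a different host chosen uniformly at random, and an object becomes inaccessible once more than half of its components sit on offline hosts. Real placement also depends on object size, disk groups and policies, so treat the numbers as illustrative only. The function name and parameters are made up for this example.

```python
from fractions import Fraction
from math import comb


def quorum_loss_probability(hosts: int, failed: int, ftt: int) -> Fraction:
    """Chance that one object loses quorum when `failed` hosts go offline.

    Simplified assumption: the object has 2*ftt + 1 components, each on a
    distinct host picked uniformly at random, and it is inaccessible when
    at least ftt + 1 of those components sit on failed hosts.
    """
    components = 2 * ftt + 1
    total = comb(hosts, components)
    bad = sum(
        comb(failed, lost) * comb(hosts - failed, components - lost)
        for lost in range(ftt + 1, min(components, failed) + 1)
    )
    return Fraction(bad, total)


# The lab cluster from this post: 16 hosts, 3 of them offline.
p_ftt1 = quorum_loss_probability(16, 3, ftt=1)
p_ftt2 = quorum_loss_probability(16, 3, ftt=2)

# Two redundant VMs (for example a pair of domain controllers), assuming
# their placements are independent of each other.
p_pair = p_ftt1 ** 2

print(f"FTT=1 object inaccessible: {float(p_ftt1):.1%}")
print(f"FTT=2 object inaccessible: {float(p_ftt2):.1%}")
print(f"Both VMs of an FTT=1 pair: {float(p_pair):.2%}")
```

Under these assumptions, roughly 1 in 14 FTT=1 objects and 1 in 56 FTT=2 objects would lose quorum with 3 of 16 hosts offline, and the chance of both halves of a redundant FTT=1 pair going down together is about 1 in 196.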
This is actually something I blogged about quite a while ago. It’s worth noting that this was just an availability issue. In most cases of actual true device failure for a drive, there would normally be enough time between losses to allow for repair (and not 3 hosts at once), making my lab example quite extreme.
Lessons Learned and other takeaways:
Raise a few small but important VMs to a higher FTT level if you have enough hosts, especially core management VMs.
vSAN clusters can become MORE resilient to loss of availability the larger they are, even keeping the same FTT level.
Use higher quality boot devices. M.2 32GB and above with “real endurance” are vastly superior to smaller SD cards and USB based boot devices.
Consider splitting HA service VMs across clusters (i.e., 1 domain controller in one of our smaller secondary clusters).
For mission-critical deployments, using a management workload domain with VMware Cloud Foundation can help ensure that management is fully isolated from production workloads. Look at stretched clustering and fault domains to take availability up to 11.
Patch and reboot your hosts often. Silently corrupt embedded boot devices may be lurking in your USB/SD powered hosts. You might not know it until someone trips a breaker and suddenly you need to power back on 10 hosts with dead SD devices. Regular patching will catch this one host at a time.
While vSAN is incredibly resilient, always have BC/DR plans. Admins make mistakes and delete the wrong VMs. Datacenters are taken down by “Fire/Flood/Blood” all the time.
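To illustrate the takeaway about larger clusters, here is a rough sketch under simplified assumptions (not vSAN’s actual placement logic): an FTT=1 object has 3 components on 3 distinct hosts chosen uniformly at random, and it goes inaccessible if 2 or more of those hosts are offline. The function name and the cluster sizes are picked only for illustration.

```python
from math import comb


def ftt1_inaccessible_odds(hosts: int, failed: int = 3) -> float:
    """Fraction of FTT=1 objects expected to lose 2+ of their 3 components."""
    placements = comb(hosts, 3)
    two_lost = comb(failed, 2) * comb(hosts - failed, 1)
    three_lost = comb(failed, 3)
    return (two_lost + three_lost) / placements


# Same number of failed hosts (3), same FTT level, growing cluster size.
for size in (4, 8, 16, 32, 64):
    odds = ftt1_inaccessible_odds(size)
    print(f"{size:>2} hosts, 3 offline: {odds:.1%} of FTT=1 objects inaccessible")
```

In this model, the same 3-host failure that would take every FTT=1 object offline in a 4-host cluster affects only a small fraction of objects as the host count grows.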
I’d like to thank Myles Grey and Teodora Todorova Hristov for helping me make sense of what happened, putting together the action plan to get this back together, and grinding through it.