It’s time for a talk on boot devices. No, we are not talking about SD cards; instead, we are going to talk about encryption and security of boot devices!
One trend lately has been to use PCI-E attached RAID controllers for a pair of M.2 SATA/NVMe devices that boot the server. One example is the Dell BOSS (a great option!). One challenge for some customers is that these controllers often lack encryption support.
So first off: do you even need to worry about this? What is the attack surface of an ESXi boot device?
Attestation – You may want to make sure someone didn’t meddle with the binaries, and you can trust the full chain of code used to boot the system including firmware. Secure boot and host attestation require a TPM and cover this. VMkernel.Boot.execInstalledOnly is a setting that will make sure arbitrarily uploaded binaries can’t be executed. Remember, you don’t actually have to encrypt the full boot device to protect binary integrity; that is handled by verifying signatures and UEFI secure boot.
Protecting the configuration file from tampering and/or being read – While I find it unlikely anyone is going to physically do anything interesting with my ESXi information (Oh no, they might learn I use time.vmware.com for NTP /s), there are some paranoid customers out there who have hosts in less than secure locations or consider the IP address of their DNS servers to be highly proprietary. Starting in vSphere 7 U2 the ESXi configuration is encrypted by default, and with a TPM the encryption keys will be securely sealed in the TPM. For more information on this see docs.vmware.com.
Summary of a secure boot chain
So with a TPM + Secure Boot + VMkernel.Boot.execInstalledOnly + TPM-sealed configuration encryption, a stolen or physically tampered-with boot device will not expose sensitive data, nor can it be used to compromise a host.
“Is this enough?”
Personally, I think the above techniques will cover 98% of customer requirements to secure their boot devices and encrypt and sign what matters in a way that someone can’t do anything useful even with physical access to a boot device… For the truly paranoid though, I would be remiss not to mention the following ways to 100% encrypt the entire boot device. Note: if you go down this path you would still likely want to implement the above steps anyway and will still need/want a TPM, so this is not an “or” option necessarily, as anyone this paranoid is going to need/want defense in depth.
Full Device Encryption
But what if my security team is demanding full volume encryption? Well for these cases there are some options.
Buy a RAID controller that supports SEDs.
Look at Intel Virtual RAID on CPU (VROC) for NVMe devices. Intel VMD is one system that can provide RAID 1 for NVMe boot devices without the need for an add-in card, and it can also manage encryption if SED NVMe devices are used. Note you will still need SEDs, as Intel VMD itself doesn’t do the encryption; it just passes off the keys to the out-of-band controller (iLO/iDRAC/CIMC etc.).
Generally, you will need an external KMIP-compliant KMS to make this work securely, but again talk to your server OEM.
Final Thoughts
I don’t claim to be the expert on vSphere Security or all compliance scenarios. I would love to hear your feedback and concerns. I’m on Twitter @Lost_signal.
I’m going to keep a blog of sessions and events I’m checking out and interested in for VeeamON 2022. This will get updated as the week goes on (and may serve as the basis for some podcast interviews).
Object First
I’ve been tracking out of the corner of my eye ObjectFirst.com as a stealth project. They seem to be building “the best backup optimized object storage system” (or something like it). I have few details and a few theories but am strongly looking forward to the announcement on Monday.
Lab Warz
I still remember the first time I sat down for Veeam’s quirky take on a competitive hands on lab competition. The quirky theming, practical skill testing, and adrenaline pumping “time to do this fast!” feeling was unlike anything I’d ever seen at a conference. I see this listed as virtual only so I look forward to seeing if I can barrel roll through it without too shameful of a score. Even if you don’t feel up to the challenge see if you can learn a thing or two about some features you might be able to find value in.
Veeam Plug-in for SAP Now and Later
In a former life I used to deal with application level recovery for various applications ranging from the usual suspects (Exchange, SQL, Oracle) to a few weirder ones. I like occasionally checking out the backup and recovery of virtualized applications that I never operated. It exposes me to challenges that are similar (distributed state concerns) but also to the uniqueness of metadata and the blending of VADP/CBT and native tooling. Way too many application backups end up maintained by scripts by DBAs with dubious alerting, and it’s good to see how Veeam is working with SAP and their Backupint framework to offer protection in a way that is supported and allows for consistent restores while still using fancier hypervisor and storage level snapshot offload. One unique workflow I wasn’t familiar with was a “restore license key” step on restore, which seemed like a pretty nice thing to include, as restoring state often includes small things people forget about.
Debanjan Banerjee does a great job walking through how the different pieces come together, and it serves as a good reminder of why virtualizing SAP is always a good idea.
I tweeted this out while in Prague for the Veeam Vanguard Summit, and I’m overdue on writing out my thoughts on the topic of how to quickly recover a 40TB database virtual machine.
When talking about having different SLAs for products or services you often hear “good, better, best” as the segmentation of options based on budgets and requirements. When it comes to architecture deployed for recovery from backup and replication, sadly I often see people instead debate between “bad, awful, flaming dumpster fire” as their 3 options. Often the worst offenders end up with data being backed up directly to backup appliances. How did we get here? I’d like to explore some of the architectural challenges facing data protection today and why dedupe appliances often fail to live up to their promise.
Disk-based dedupe appliances were not inherently bad on their own. When they first hit the market they were a great drop-in replacement for Tape. They reduced the need to manually swap tapes, and for backup workflows that sent highly duplicate data over they could optimize and compact this data. They added a significant amount of computing to these appliances so they could highly optimize data ingest speed as well as provide data reduction that previously backup software tended to not handle, or not handle well. If you wanted to stuff a lot of data into a box they were pretty useful.
The challenge of dedupe appliances is that the cost of being “good” at holding lots of data is that they tend to be fairly bad at recovering said data. Once you have stuffed hundreds of virtual machines into them, people often think “how am I going to get them out of this data center clown car?”
Backup vendors have long been asked to “perform magic” and deliver faster and faster restores, despite the “physics” of moving large amounts of data taking too much time. One way to “Cheat” that has become popular is to expose an NFS share as a datastore and allow a virtual machine to be “booted” from the backup repository. Veeam Instant Recovery was an early mover in this space, but other backup vendors and DRaaS solutions have adopted similar capabilities. This works great as it avoids the traditional bottlenecks of the source and target disk speeds and network and goes straight to a running VM… RIGHT? I’ll just power on the virtual machine and then storage vMotion it over later!
Bring on the Clowns
One of the challenges of trying to run a production virtual machine is it expects the same IO performance as your primary disk. 8 years ago when primary storage was 15K RPM drives, and backup appliances used 7.2K drives that were 1/3 as fast this might have been problematic, but doable especially for a single virtual machine. Today, application owners EXPECT flash-based primary storage that delivers 100K IOPS per host at low latency. Using 7.2K drives that deliver 100 IOPS each, at 30ms+ of latency is well… A clown show. Trying to run a database virtual machine off of this storage is a bit like trying to jump-start a 737 airplane using a motorcycle engine.
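To put rough numbers on that mismatch, here is a minimal back-of-the-envelope sketch in Python. It uses only the two figures quoted above (100K IOPS per host, 100 IOPS per 7.2K drive); the function name is made up for illustration, and it ignores RAID penalties, dedupe rehydration overhead, and latency, all of which only make the picture worse.

```python
import math


def drives_needed(target_iops: int, iops_per_drive: int) -> int:
    """Drives required to reach a target IOPS figure (raw, no RAID or dedupe overhead)."""
    return math.ceil(target_iops / iops_per_drive)


HOST_IOPS = 100_000       # flash-era expectation quoted above
NL_DRIVE_IOPS = 100       # per 7.2K drive figure quoted above

# Spindles needed just to match ONE host's expected IOPS.
print(drives_needed(HOST_IOPS, NL_DRIVE_IOPS))  # 1000
```

That is a thousand spindles to stand in for a single host, before you even look at the 30ms+ of latency.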
How do we solve this problem?
There are quite a few approaches I’ve seen to try to dig out of this hole once people realize this is not going to work:
Identify that the vendors never promised it would work or often had limited promises (Some vendors often will support a low single digit number of virtual machines).
Move to a 2 stage backup system, where backups land on an all-flash DAS system initially and then copy out to the appliance for long term retention (similar to Disk to Disk to Tape workflows of old). This allows you to keep using the appliance, but just use it for what it is best used for. Tiering this data out to an object storage bucket is increasingly the “right choice” vs. trying to have an all-in-one appliance.
Use caching to solve or partially mitigate this (Veeam can redirect writes, but even with this option a read heavy database on a slow dedupe target will suffer).
Look at all-flash dedupe appliances or ones with large flash caches (Personally I’m not sold on this idea vs. just deploying a set of DL380/Apollo servers full of flash as the primary landing zone).
Disaster Recovery to the rescue.
I’ve had chats with a few customers lately who’ve recognized that for large-scale recovery of anything important, the backup repository speed is unsalvageable. Instead they “punt” and split out those critical recovery workflows to be powered from replicas that sit on a primary storage solution somewhere else. They may choose a second data center, but increasingly a DRaaS option makes more sense, as maintaining a data center that sits idle often is not worth the effort. The other benefit of shifting to DRaaS is that it often can be tied to immutable retention and provide additional ransomware recovery capabilities.
The greatest storage and systems administrator of all time was Montgomery “Scotty” Scott. No matter how far outside of design the ship was pushed, he generally found a way, after saying “the ship can’t take anymore”, to find the capacity to prevent disaster. His key secret?
Expectation setting (He always looked good when he under promised and over delivered).
Hiding reserve capacity (A key talent in many storage management practices).
A magic ability to get limitless budget for repairs, replacement parts and ships.
The reality in storage management is we can not all be Scotty (nor should we need to be). Sometimes we end up in scenarios that the system was not designed for. Thankfully there are sometimes capabilities of storage systems that vendors can expose that allow us to opportunistically exceed design expectations and “win” in Kobayashi Maru “no-win” scenarios. When data has gone missing or is expected to be gone for good, what is involved in your plan?
When planning disaster recovery or business continuity should you include “Might be there” safety nets? When drafting target Recovery Point Objectives (RPO) or Recovery Time Objectives (RTO) should or can you count on these vs. properly investing in a good backup/disaster recovery solution?
Restore Accidentally Deleted LUN
A lot of data loss scenarios are murkier to plan for than you realize. Accidentally deleting a LUN is a shockingly common occurrence. Poorly updated LUN number abstraction maps, and separation of duties (3 people involved in identifying a volume to delete on a SQL cluster) can all lead to this.
Some storage arrays have magic un-delete buttons. This can range from a trashcan to an obscure command that requires support to invoke. This capability is generally contingent on free space being available to retain the data that was deleted. I’m always nervous about including this in an RPO/RTO promise. The problem is that in an out of space condition one of two things happens when counting on this capability:
1. The array will go read-only and crash every virtual machine (well abruptly pause if VAAI is working)
2. The snapshots will auto-delete
“But John I don’t have a high enough change rate, and I run my array at 20% usage!”
This may be true, but ransomware has a nasty habit of:
1. Re-writing all of your data.
2. Encrypting the data so that 4x dedupe and compression turn to a negative dedupe rate. Either of these activities can trigger an out-of-space condition.
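A quick sketch of why “I only run at 20%” is less comforting than it sounds. This is my own rough model, not any array vendor’s accounting: it assumes the 20% is physical usage after 4x data reduction, that the rewritten (encrypted) data gets roughly 1x reduction, and that snapshots keep the original blocks pinned. The function and parameter names are hypothetical.

```python
def physical_used_fraction(used_fraction: float,
                           old_reduction: float,
                           new_reduction: float,
                           snapshots_hold_old: bool = True) -> float:
    """Fraction of raw capacity consumed after all data is rewritten.

    used_fraction: physical usage before the rewrite (0.20 = 20%).
    old_reduction / new_reduction: dedupe+compression ratio before / after.
    snapshots_hold_old: whether snapshots keep the original blocks around.
    """
    logical = used_fraction * old_reduction    # logical data vs. raw capacity
    rewritten = logical / new_reduction        # space the rewritten copy needs
    retained = used_fraction if snapshots_hold_old else 0.0
    return rewritten + retained


# 20% used at 4x reduction, rewritten as incompressible data, snapshots kept.
print(f"{physical_used_fraction(0.20, 4.0, 1.0):.0%}")
```

Under those assumptions the “mostly empty” array is effectively full, which is exactly when the un-delete safety net disappears.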
You also need to be concerned with ransomware-like IO activities coming from your users/application owners:
DBA decides to turn on encryption on a database and doesn’t tell anyone.
Large batch process re-writes the data
Large data ingestion events
“But why would this problem happen at the same time I’m deleting a LUN?”
One of these things often causes the other. An out-of-space condition will often make all volumes on an array go read-only. This generally forces a storage admin to delete LUNs quickly. This outage often can happen at weird hours without proper caffeine, visibility, or communication.
Capacity Reservation Mitigation
Preventing out of space conditions (to prevent this scenario) can be done by “always provisioning thick” and reserving 110% capacity for snapshots, but practically, the cost of doing this with storage that doesn’t tier into cheap S3 makes it infeasible for all but the most deep-pocketed of datacenters. It may be tempting to “throw primary storage” at this problem, but that budget is often better invested in other mitigations.
Unplanned Data Loss
Other scenarios where “maybe I can recover your data” tools come into play are failures that exceed the design of the storage platform.
Force Rebuild
An example that would cause this: the rebuild of your 92 SATA disk RAID 5 hits a Latent Sector Error (LSE), causing an unrecoverable read error (URE). A single read failure in this situation causes the RAID rebuild to stall. In theory, your data is lost. Depending on your platform and the tooling of your storage partner though, you may be able to accept a small amount of data loss and force the rebuild to go forward anyway.
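For a feel of how likely that stall is, here is a hedged sketch of the usual URE math. Everything numeric is an assumption on my part: an error rate of 1 unrecoverable error per 10^14 bits read (a commonly quoted SATA spec sheet figure), 4TB drives, and independent errors. Real drives often do better or worse than the spec sheet.

```python
import math


def p_ure_during_rebuild(surviving_drives: int,
                         drive_tb: float,
                         ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one URE while reading every surviving drive once.

    Assumes independent errors at a constant per-bit rate (spec sheet model).
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # P(at least one) = 1 - (1 - p)^bits; log1p/expm1 keep precision for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# 92-disk RAID 5 with one failed drive: 91 drives must be read end to end.
print(f"{p_ure_during_rebuild(91, 4.0):.4f}")
# For comparison, reading a single 1TB drive once.
print(f"{p_ure_during_rebuild(1, 1.0):.4f}")
```

With those assumed inputs the wide RAID 5 rebuild is all but guaranteed to trip over a URE, which is why the “force it anyway” tooling exists at all.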
Luck based rebuilds on multi-drive failure
Some platforms limit the rebuild domain for an LSE impact by using per volume RAID/rebuilds (vSAN does this) to reduce the impact of a drive failure that exceeds tolerance. Depending on how the error works you could be accepting an unspecified corruption of a few files or you could be hoping for “luck” in where the error is to not lose data. The only thing I like to count on in design for these is the speed of recovery. Rather than need to invoke a disaster recovery plan on 3 of 100 drives failing simultaneously, knowing I only need to rehydrate 3% (or potentially much much less) of the data from backup helps with planning cache/simultaneous restore plans.
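Here is a small sketch of that planning math. The 3% comes from the 3-of-100-drives example above; the 500TB cluster size and 1000 MB/s sustained restore throughput are made-up inputs you would replace with your own measured numbers.

```python
def restore_hours(total_tb: float, fraction_lost: float, throughput_mb_s: float) -> float:
    """Hours needed to rehydrate the lost fraction of a dataset from backup."""
    bytes_to_restore = total_tb * 1e12 * fraction_lost
    seconds = bytes_to_restore / (throughput_mb_s * 1e6)
    return seconds / 3600


TOTAL_TB = 500          # hypothetical cluster capacity
THROUGHPUT = 1000       # hypothetical sustained restore rate in MB/s

print(f"partial (3%): {restore_hours(TOTAL_TB, 0.03, THROUGHPUT):.1f} h")
print(f"full restore: {restore_hours(TOTAL_TB, 1.00, THROUGHPUT):.1f} h")
```

The gap between a few hours and most of a week is the part I am willing to design around; the “luck” part I am not.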
Overriding split-brain protection
Specific to vSAN: if you had a thermal meltdown in the data center on your HCI cluster, and lost quorum and 1 copy of the data on a RAID 1 mirror from the cascading cooling failures, you would have data unavailability. You can call support and they can upload a recovery tool to attempt to defy the angry storage gods and clone a full copy.
All of these scenarios involve a few things:
1. Operational failures.
2. Design failures of some kind.
3. Require the equivalent of a D20 dice roll to get your data back.
If you needed one of these “might be there” recovery options to hit an RPO/RTO/SLA it generally can be solved by better design.
How to better prevent accidental deletions
VASA/vVols
If you live in a data center with highly siloed ITIL operations, miscommunications are a risk in all operational changes that involve storage volumes/LUNs. There are a few ways though to improve communications and reduce errors between the storage and virtualization teams.
vSphere Storage APIs for Storage Awareness (VASA) allows VMware administrators better visibility into the storage layer. This allows VMware administrators to have a vision into what the internal volume numbers are for a given virtual machine or datastore.
Virtual Volumes simplifies communication even further by offloading the deletion task entirely to the VMware administrator. Deleting a virtual machine automatically deletes the associated volumes with it, removing any miscommunication between the VMware and storage team.
Operational Methods To Prevent Accidental LUN deletion
The best operational advice for storage arrays I have is to train your staff to disconnect LUNs and then wait 48-72 hours before deleting LUNs. There shouldn’t be an urgent need to delete a LUN.
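If you want to bake that waiting period into a runbook or cleanup script, a minimal sketch could look like the following. The function name and the idea of recording a “disconnected at” timestamp are my own illustration, not a feature of any particular array; I default to the conservative 72 hour end of the range.

```python
from datetime import datetime, timedelta

HOLD_PERIOD = timedelta(hours=72)  # conservative end of the 48-72 hour range


def safe_to_delete(disconnected_at: datetime,
                   now: datetime,
                   hold: timedelta = HOLD_PERIOD) -> bool:
    """True only once the LUN has sat disconnected for the full hold period."""
    return now - disconnected_at >= hold


unmapped = datetime(2024, 1, 1, 8, 0)
print(safe_to_delete(unmapped, datetime(2024, 1, 2, 8, 0)))   # False, only 24h
print(safe_to_delete(unmapped, datetime(2024, 1, 4, 8, 0)))   # True, 72h elapsed
```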
“But John we urgently need that space back!”
Pretty much all modern storage arrays support TRIM/UNMAP/DEALLOCATE as a way to allow the operating system/hypervisor to perform deletions from a higher layer and push through those deleted blocks. Rather than blindly deleting an entire volume, making sure deletions of VMDKs are pushed through from VMFS is a much safer/easier alternative. Auto shrinking VMDKs also allow for deletions from guest OSs to be pushed through end to end. The closer to the application you delete data, the lower the risk of miscommunication.
Lastly, using vSAN or vVols simplifies this further. If you delete a VMDK the space is freed up, and vSAN supports thin volumes shrinking by UNMAP/TRIM from the guest OS in the virtual machine. vSAN and vVols pierce through layers of abstraction to make storage capacity management just a simpler way to handle things.
Final Thoughts?
These various “tricks” are great when they work. I still don’t think they play a primary role in planning your recovery speed, or the point of recovery for recovering from failure. The smartest thing Scotty ever did was keep his “might work” tools in his back pocket and promise only what the ship was designed for.
This blog came about from a conversation with some of the other Veeam Vanguards.
A while back I spoke to some customers who were trying to test VDI. They wanted to spend several months testing out multiple storage systems for a VDI system for 500 users. This was rather confusing to me, as the labor time spent validating the storage was likely going to cost more than just throwing a reasonably beefy all-flash cluster at the problem, and properly configuring Horizon for their use case. The first use case they were concerned about was copying an ISO from one desktop to another. It was slower than a test they ran in another VM. Upon further investigation, it was determined:
They were not testing an actual copy in both instances (One was being offloaded using Microsoft ODX).
Their test (if it was working) was a test of a low queue depth large block write operation. This wasn’t consistent with a review of vSCSI traces of their existing VDI use case.
It was still fairly fast when comparing against someone’s laptop.
After interviewing the users (doctors in a hospital) and consulting with my wife (MD), it was determined that doctors do not copy large ISO files as part of their daily activities.
Normally the best testing of VDI is:
Spin up a test pool and redirect some users on the pool (taking care to select users who will be using the same applications and workflows as the users that will be scaled later).
Use a VDI benchmarking application taking great pains to properly configure it. I will note on LoginVSI published benchmarks you sometimes see some hilariously non-realistic desktop testing done to publish “hero numbers”.
Pull a vSCSI trace and use an automated scaled testing system to “replay” an amplified synthetic copy of the storage requirements (note this doesn’t test CPU in the same way).
Upon further discussion they decided to just put some users on the cluster, perform a proper pilot test, and scale at user densities they were able to achieve on the pilot going forward. Here is a review of some of the mitigations and discussions we had that helped cool off the storage team’s fears.
Why is VDI perceived as demanding on storage?
Virtual Desktops in the past were known to be a “scary” storage heavy workload that put fear into storage admins and brought disk arrays to dust. Why was this?
Boot Storms – Recompose actions or under-provisioned pools needing to catch up with demand would lead to OS boot events. While the steady-state IOPS per desktop might be in the single or two digits, this could result in a spike of 800 IOPS or more per desktop.
Login Storms – Roaming profiles with hundreds or thousands of users who all log in at the same time resulted in huge amounts of data being copied into desktops.
Antivirus Scan Storms – Copy pasting the security posture of your existing desktops often leads to the security team trying to scan every desktop at noon at the same time.
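To see why boot storms in particular scared people, here is a rough sketch of the aggregate math. The 800 IOPS per booting desktop is the figure above; the 10 IOPS steady state (I picked a number from the “single or two digits” range) and the 500 desktop pool are assumptions for illustration.

```python
def pool_iops(desktops: int, booting: int,
              steady_iops: int = 10, boot_iops: int = 800) -> int:
    """Aggregate IOPS for a pool where `booting` desktops are mid-boot."""
    if not 0 <= booting <= desktops:
        raise ValueError("booting must be between 0 and desktops")
    return booting * boot_iops + (desktops - booting) * steady_iops


print(pool_iops(500, 0))     # 5000   -> everyone at steady state
print(pool_iops(500, 100))   # 84000  -> 20% of the pool booting at once
print(pool_iops(500, 500))   # 400000 -> full recompose, everything boots
```

An array sized for the first number had a very bad morning when it met the third.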
The reality is these problems have been largely solved (for years), but have sometimes been perpetuated as still issues by storage vendors trying to sell some feature or solution. *Disclaimer, I work for a storage product and while I’d love you to buy vSAN and think it is frankly awesome for VDI, I’m not going to pretend that the above problems can’t be largely mitigated in other ways*.
Boot Storm Mitigations
Use Instant Clones – Instant Clones are “born running”. They use VMFork technology to create writable snapshots of the memory of a running virtual machine. This has the advantage of insanely fast (seconds) desktop creation times.
Pre-stage desktops/rolling recompose – At some scale you can always just schedule recompose operations. A popular trick I used back in the stone ages of linked clones was to create a new pool and set it to auto-scale. I would disable net new connections to the old pool and set the users to only see the new pool. This allowed for a slower transition to the new pool. Combined with throttling new desktop creations to a manageable speed, this new pool could slowly grow to the needed capacity. This required a few slack resources, but the vSphere scheduler and memory compaction technologies were generally good for it if you were not running absurd vCPU ratios to begin with. Note, other methods largely solve this from a resourcing standpoint, but this method can still be used as a means of slowly testing a new image and allowing for rapid “roll back” if the new image has issues (re-enable the old pool and direct new connections back to it).
Cache the blocks used for OS boot – This has been discussed before, but OS boot only needs to call up a few hundred MB of blocks into RAM. Various VDI solutions that provide a DRAM cache to hold these blocks have existed for years (Horizon Content-Based Read Cache, or CBRC). This allows multi-GB read caches to be deployed for the base OS disks to accelerate them. Citrix has similar capabilities with PVS. Beyond this, modern storage arrays with dedupe and multi-hundred GB DRAM caches will make short work of these bits. Remember even for “full clones” any solution with dedupe (or dedupe cache like CBRC) can handle the fact that it is 300MB of hot blocks X 2000 desktops. vSAN even goes so far as to put DRAM cache local to the hosts where VMs are running to reduce even storage network traffic hits.
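The “300MB of hot blocks X 2000 desktops” point is easy to sanity check. This sketch is my own simplification: it treats some fraction of each desktop’s hot boot blocks as identical across the pool (and therefore cached once) and the rest as unique per desktop. The function name and the 90% case are illustrative.

```python
def hot_set_mb(hot_mb_per_desktop: float, desktops: int,
               shared_fraction: float = 1.0) -> float:
    """Cache footprint in MB once identical hot blocks are stored only once."""
    shared = hot_mb_per_desktop * shared_fraction
    unique = hot_mb_per_desktop * (1.0 - shared_fraction) * desktops
    return shared + unique


print(hot_set_mb(300, 2000, shared_fraction=0.0))  # no dedupe: 600000 MB (~600 GB)
print(hot_set_mb(300, 2000, shared_fraction=1.0))  # identical images: 300 MB
print(round(hot_set_mb(300, 2000, shared_fraction=0.9)))  # 90% shared
```

Even the pessimistic 90% case lands in the tens of GB, which is well inside the DRAM caches mentioned above.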
Login Storm Mitigations
Profile Virtualization – Technologies to cache and optimize profile load through various mechanisms have been around for a while. While I was cutting my teeth on Persona years ago (which worked, it just required you to know which folders to exclude from the stubbing system), VMware Dynamic Environment Manager is a fantastic solution today. FSLogix and other solutions also exist that can even deal with some of the more annoying elements of profile virtualization *GLARE INTENSIFIES AT OUTLOOK OST FILES THAT DROVE ME CRAZY*. It’s true we used to have to do weird/stupid things with application customization to make profile virtualization work (make sure Exchange was colocated 1ms from the VDI pool) but those days are long gone.
Antivirus Storm Mitigations
I’ll leave others to speak more in the comments to this one, but a blend of on-access scanning policies and agentless and network-based introspection has largely calmed the challenge of virus scans taking out a cluster. Security is about many layers of an onion providing security here.
Other Minor VDI Resource Issues to think about
Windows Search – This and other services we used to disable to better optimize desktops. I’ll call out that disabling this also breaks Outlook email search, and even if this leads to a 3% increase in density I would argue you don’t need to go to these extremes to optimize desktops. While there are certain things you should optimize, breaking user experience to get an extra 10 users in a cluster likely isn’t worth it anymore. Hardware is cheaper at this point than the emotional cost of annoying users.
Hardware refreshes need to be at way more than 1:1 – I advised a bank recently that was replacing an ancient 5.5 environment with Windows XP desktops. They were expecting that by buying hosts with 5x the resources they would get 5x the host density. They were disappointed to learn that:
The 3 anti-virus solutions they had installed were at war with each other for the 1 vCPU they were allocating to each desktop, while oversubscribing 15:1
1GB of RAM wasn’t enough to make users happy
Their base images were now 6x larger
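A hedged sketch of why 5x the hardware does not mean 5x the desktops. Only the 1 vCPU, 1GB of RAM, 15:1 oversubscription, and “5x the resources” come from the story above; the host sizes and the “right-sized” desktop (2 vCPU, 4GB, 6:1) are numbers I made up to illustrate the shape of the problem.

```python
def desktops_per_host(cores: int, ram_gb: int,
                      vcpus_per_desktop: int, ram_gb_per_desktop: int,
                      vcpu_to_core_ratio: int) -> int:
    """Density is capped by whichever runs out first: CPU or RAM."""
    cpu_limit = (cores * vcpu_to_core_ratio) // vcpus_per_desktop
    ram_limit = ram_gb // ram_gb_per_desktop
    return min(cpu_limit, ram_limit)


# Old host: starved desktops (1 vCPU, 1GB) oversubscribed 15:1.
old = desktops_per_host(16, 128, 1, 1, 15)
# New host with 5x the cores and RAM, but desktops users can actually tolerate.
new = desktops_per_host(80, 640, 2, 4, 6)

print(old, new, f"{new / old:.2f}x")  # 128 160 1.25x
```

Under those assumptions the density gain is closer to 1.25x than 5x, and that is before the 6x larger base images show up in the storage sizing.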
The reality is we used to make some awful compromises on VDI usability and user experience to make the numbers “work”. Make sure when sizing solutions to understand that with lowered resource cost comes options to do more than save capital costs.
But John? What if I Can’t do X,Y,Z?
Just throw a little more all-flash storage at the problem. We used to get excited about getting the cost of storage down to $100 per user for VDI. Now with all-flash, instant clones and dedupe the storage costs have kind of become a rounding error on the total VDI solution. There used to be an entire field of “VDI storage-specific vendors”, and you’ll find that most of them have completely disappeared. This is because the problem of VDI and storage has largely gone away.
Having been around the industry I’ve noticed there are a lot of changes but a few guarantees when it comes to benchmarking shared storage and HCI clusters:
Benchmarking is generally poorly representative of what the production workload will look like.
Benchmarking is about trade offs. There are “easy” ways to do them, but often these are so far from accurate for what production will look like they might as well be skipped.
Real benchmarking is hard. There are shortcuts to easier benchmarking. Some are good, some are bad. It’s critical either way that you understand what trade offs you make when you choose one.
There are good easy buttons for testing a cluster (HCI Bench is a personal favorite) and there are bad easy buttons (Crystal Disk, ATTO Disk, IO meter, and other synthetic workload desktop-focused testing tools). Today we are going to talk about why single workload tests are normally poorly done.
It’s often poorly executed – The single workload test
A lot of people can spin up a single virtual machine, fire up a synthetic disk testing application like CrystalDisk or IOmeter and push “Test run”. While this does generate IO, it doesn’t necessarily generate a workload against an HCI cluster that looks anything like what a customer would run. Breaking down some quick fundamentals.
In your typical VMware cluster, you will find multiple virtual machines with different numbers of drives processing different block sizes, read-write mixtures, different overlaps when they send data (Some bursty, some constant).
Even clusters with homogeneous dense workloads don’t look like this single VMDK test. Even monster scale-out in-memory databases like SAP HANA and Cassandra and container platforms recommend more than 1 virtual machine. Amongst these applications, you still will always see more than 1 virtual hard drive (VMDK) processing disk IO, possibly with multiple vHBAs attached.
Other common mistakes that go along with using these tools:
The default Crystal Disk only uses a relatively small working set size (below 5GB). In any tiered/cached system, there is a strong chance you end up testing IO that largely is served from DRAM caches (either inside the SSDs or within caching of the system). A 24/7 production environment with large data flows will result in wildly different outcomes.
IO Meter can be configured for multiple workers, but doing so at scale with a diverse set of workloads is going to be problematic vs. using something that has better synthetic engines with more options and easier control and reporting like HCI Bench. It’s worth noting that IOmeter has seen 1 release since 2008 when Intel made it abandonware. VDBench and FIO that are used by HCIBench have seen a lot more development attention.
Fixed QD or block sizes. Crystal Disk tests 4 different blends of block size and queue depth but:
There’s a strong correlation between people fretting about large block throughput, and people who are running workloads that don’t actually send large blocks.
The tests are run sequentially, and not in parallel. Again, real storage systems handle what is thrown at them and can’t ask applications to nicely wait 30 seconds for their turn to run a homogeneous workload.
These workloads tend to generate high entropy data (so no dedupe/compression). It could be argued that setting the workload to use too low of an entropy is cheating, but using real data sets (or tuning synthetic data to mirror the entropy of the real data) is going to give you a more accurate idea of what production will look like.
Not reporting latency is a bit like reporting horsepower and top speed of a car but ignoring torque when people want to tow a boat…
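On the entropy point in that list, you can see the effect on data reduction with nothing more than the Python standard library. This is a toy illustration using zlib, not how any particular array or benchmark tool measures it: random bytes stand in for high entropy benchmark data, and a repeated record stands in for (overly) friendly real data.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (higher = more reducible)."""
    return len(data) / len(zlib.compress(data))


random_block = os.urandom(1 << 20)                  # ~1 MiB of high entropy data
repetitive_block = b"customer_record_0001;" * 50_000  # ~1 MB of repeated records

print(f"random:     {compression_ratio(random_block):.2f}x")
print(f"repetitive: {compression_ratio(repetitive_block):.0f}x")
```

Real data sits somewhere between those two extremes, which is why the entropy of your test data needs to be a deliberate choice rather than whatever the tool defaults to.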
There is also a fatal flaw in CrystalDisk’s presentation of data. It’s a simple average summary for each benchmark that fails to show a time series of data. Without understanding what a system looks like at the beginning of a test (when cache may be less warm, but write buffers less full) vs. the end of the test (when cache hits may increase, or buffers may be exhausted), it’s very hard to understand what steady state under load performance may look like. This is magnified further in that Crystal Disk and the like are short tests. For systems that will run under load for hours/days you want tools that can sustain testing to better emulate your production duty cycle for IO (not that it would make a good synthetic workload generator if you could run it for longer). Often things like tail latency, jitter or 99th percentile latency can have disastrous impacts on systems that users have to interact with.
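A tiny sketch of how an average hides the tail. The latency samples are invented (980 IOs at 1ms and 20 IOs that stalled for 250ms), and the percentile helper uses the simple nearest-rank method rather than whatever your monitoring tool implements.

```python
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical run: 98% of IOs are fast, 2% hit an exhausted write buffer.
latencies_ms = [1.0] * 980 + [250.0] * 20

print(f"mean: {statistics.mean(latencies_ms):.2f} ms")   # 5.98 ms
print(f"p50:  {percentile(latencies_ms, 50):.1f} ms")     # 1.0 ms
print(f"p99:  {percentile(latencies_ms, 99):.1f} ms")     # 250.0 ms
```

A summary screen showing “about 6ms” looks fine; the users who hit the 250ms IOs would disagree.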
A good storage system has to handle a wide variety of workloads simultaneously. The single workload/disk test is a bit like testing the effectiveness of an air traffic controller at an airfield that sees 1 airplane a day. You might see the different variations in his communication quality to that one airplane but any serious test is going to stress tracking different planes on different trajectories.
Next up, bad VDI testing – No, copying an ISO is not benchmarking VDI…
There is more to discuss here now that 7 Update 3 is out on where things are going:
A few points of clarification:
The deprecation of SD/USB devices as the sole boot and OS-related storage for ESXi was announced, but to be clear: this does NOT mean that support for these configurations was pulled from vSphere 7 Update 3. I put this in bold because I’ve heard this misconception quite a few times.
For people who are not in a position to upgrade their boot device, we will continue to support SD/USB boot for the 7.x release. I will caveat this with PLEASE upgrade to 7 Update 3 (or at least 7 U2c at a minimum), as a number of fixes have been applied to lower the chances of premature device failure.
What was fixed?
See this KB and the release notes here. Additionally, 7 Update 3 does a better job of making customers aware they are running in a degraded state where only a low endurance boot device exists for system usage. The limitations of using a RAM disk for redirection are noted below.
What are my paths forward? (Greenfield)
For net-new host purchases, I ask you to move away from USB/SD card boot devices. It will make life simpler, and the additive cost for a 128GB boot device vs a pair of larger capacity SD cards and the controller for them is less than you would think. For those that can, this also will work for brownfield.
What is my path forward? (Brownfield)
There are a few options.
Replace the boot devices – Note this requires a reinstallation of ESXi. Configurations can be moved using various methods. To speed up this process you can use this KB to perform a backup and restore. Note you will need to restore the exact same ESXi build.
Legacy configuration but still supported – This allows you to keep operating with the existing boot install on the device without having to perform a reinstall. This KB outlines a new boot flag that will automatically format a RAW (i.e. no partition tables) device that is 128GB or larger, and consume it for OSDATA usage. This will allow you to move forward with the existing install on SD/USB in a supported manner. Simply adding a properly sized M.2 SSD to your host and using the autoPartition=TRUE boot flag should create and redirect the necessary bits to keep running in a non-degraded and non-deprecated configuration. Note this configuration will be supported on future releases, but given the added complexity/cost vs. just using a proper boot device to begin with, it is not something I recommend for greenfield (hence why it’s called Legacy/supported).
AutoDeploy – For forward compatibility with new features, I would ask that you start moving in the direction of Stateful Installs for AutoDeploy.
Boot from SAN – Keep on rocking, just make those LUNs a bit larger please. VMware wants to see 32GB at a minimum.
What is this warning about Degraded Mode?
Degraded mode is a state where logs and state might not be persistent (get lost when the host is rebooted), with a side effect that it can cause boot up to be slower.
The /scratch partition will be created on a RAMDisk under a /tmp folder with a limited space of 250 MB. This is not recommended, and it will impact the ESXi host performance once /tmp runs out of capacity.
Why is this bad? Why prefer local storage for logging?
There are a lot of advantages to redirecting locally: consistency of performance, as well as the ability to collect logs on issues that impact the availability of the storage network or HBA (for example, the NIC or FC HBA firmware crashing). Note Boot from SAN is still completely an option here, but (by virtue of physics) a quality local device will always be in a superior position to collect logs in these specific situations.
Ehhhh, this isn’t a long-term solution. See the bottom of this KB for this discussion. Beyond the cost of RAM the bigger issue is volatility. 99% of customers I talk to want support and engineering to be able to identify the source of problems and this becomes incredibly hard when all logs and crash dumps are destroyed on host restart.
What about NVMe SD cards (SDExpress)?
This is something I’ve honestly asked engineering PM about. They are shipping in small quantities right now. My biggest concern looking at the hardware itself is thermal throttling causing complete yoyos in performance. For logs and crash dumps they look alright, but future demands on OSDATA may require more performance. This is partly why vSphere 7 at GA introduced higher endurance and performance requirements for boot devices, as preparation for future demands. Technically they will look like an NVMe device, so I assume at least for home lab usage they should work. If anyone has any samples lying around and wants to test them, shoot me a message on Twitter (@Lost_Signal).
I have a home lab, I’m out of drive bays, and I’m curious about cheap/low cost non-supported options?
Personally, I went and bought a $12 PCI-E to M.2 (SATA) adapter. They also make NVMe compatible brackets. Just make sure the bracket you get supports your drive type. No need to spend hundreds of dollars upgrading your hosts in the lab.
Where can I find this information on an official VMware.com page?
The challenge in giving nuanced guidance is people tend to read “It’s supported” and ignore the rest of the sentence about why something is a bad idea. Given that the blog post explaining this, the KBs, and the changes in U2c and U3 were still in the works, I wanted people looking to buy a new host to get a no-nonsense response on hardware selection.
This is going to (hopefully) be a short post dismissing some common VMware backup myths.
Myth: We should not use virtual machine backups because they will take longer to process.
Reality: Changed block tracking reduces the need to scan for differences between backup jobs. VMware keeps a block map of exactly what has changed, reducing the need for backup agents to read blocks and look for changes.
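As a rough illustration of why a change map helps, here is a toy Python model. This is not the vSphere API or how CBT is implemented internally; the class and function names are hypothetical. It only shows the difference between asking "which blocks changed?" and reading every block to find out.

```python
class ChangeTrackedDisk:
    """Toy disk that records which block indexes were written (hypothetical model)."""

    def __init__(self, num_blocks, block_size=4):
        self.blocks = [bytes(block_size)] * num_blocks
        self.changed = set()  # the "block map" of writes since the last backup

    def write(self, index, data):
        self.blocks[index] = data
        self.changed.add(index)

    def incremental_backup(self):
        """Return only the blocks written since the last backup, then reset the map."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta


def full_scan_diff(old_blocks, new_blocks):
    """Without a change map: read and compare every block to find differences."""
    return {
        i: new
        for i, (old, new) in enumerate(zip(old_blocks, new_blocks))
        if old != new
    }


if __name__ == "__main__":
    disk = ChangeTrackedDisk(num_blocks=1000)
    previous = list(disk.blocks)      # what the last full backup captured
    disk.write(3, b"abcd")
    disk.write(900, b"wxyz")
    print(disk.incremental_backup())  # touches 2 blocks
    print(len(previous))              # a full scan would have read all 1000
```

Both approaches find the same two changed blocks, but the tracked version never has to read the other 998.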
Myth: Virtual machine backups are not useful for file or app level recovery. As a result you may need to back up the same data multiple times, once as a virtual machine, and once with file or application agents.
Reality: A number of backup vendors can recover files or even application-level details from a single virtual machine backup.
Myth: HotAdd requires 1 virtual machine per host in the cluster, and will slow down backups.
Reality: HotAdd requires 1 virtual machine per cluster. HotAdd backup mode is a powerful way to manage LAN overhead, by allowing initial backup processing to happen directly on a host in the cluster. There is a slight additional fixed overhead in time to mount the snapshot. For network congested backups or larger virtual machines, this is easily compensated for with faster jobs.
Myth: Agent-based backups are “lighter-weight” than hypervisor-assisted backups.
Reality: Agent-based backups tend to slam the CPU, and generally have poor awareness of shared resources. vSphere sits in a position where it can better manage throttling of concurrent jobs, Network IO Control can throttle backup traffic itself, and host-based transport avoids unnecessary overhead.
Myth: To make virtual machine backups run faster, always Eager Zero Thick VMDKs.
Reality: That “EZT4Life” tattoo was a bad idea. UNMAP/TRIM inside a VM can delete blocks no longer used, and make backup jobs shorter as the backup software will no longer need to process “dead space”.
Myth: SAN Mode Transport backups are “LAN free” and superior to all other methods
Reality: SAN mode backups that allow the backup software to directly mount VMFS and bypass the host for the sake of moving data have helped save many an 8Gbps Fibre Channel user from the pain of slow 1Gbps networking. Still, with modern networking (25/100Gbps Ethernet) the LAN is far less likely to be the bottleneck. Also, even when SAN Transport is used for backups, restores will often (based on restore settings and your vendor) flow over the network, so highly asymmetric network speeds can lead to less than satisfactory restore times.
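For a back-of-the-envelope feel for that asymmetry, here is a small Python sketch. The 10 TB restore size is a number I picked for illustration, and the math assumes ideal line rate with no protocol overhead, so real restores will be slower.

```python
def transfer_hours(size_tb, link_gbps):
    """Hours to move size_tb (decimal terabytes) at an ideal line rate of link_gbps."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    restore_tb = 10  # hypothetical restore size
    for gbps in (1, 8, 25, 100):
        print(f"{gbps:>3} Gbps: {transfer_hours(restore_tb, gbps):.1f} hours")
```

At these assumptions the same restore takes roughly 22 hours over 1Gbps versus under 3 hours at 8Gbps, which is why a backup that ran over the SAN but restores over a slow LAN can be an unpleasant surprise.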
Myth: Virtual Machine backups are less secure
Reality: Virtual Machine backups can offer significantly enhanced security. NBDSSL offers the ability to encrypt the networking transport of the backup jobs. HotAdd allows the backup vendor to “own” security of the transport of the backup data. SAN Transport backups allow you to avoid the LAN entirely for the backup job itself.
Myth: Network based backups will bottleneck on the vCenter
Reality: Network block device (NBD) backups NEVER flow through the vCenter. vCenter is simply a control plane.
Myth: vCenter can limit scaling of parallel backups vs. agent based backups
Reality: For environments doing large amounts of parallel backups, per vCenter limitations could previously have become a problem. Using VPXA did not allow for customization of the memory buffer for jobs initiated by connecting through vCenter. vCenter Server 7.0 U1 now uses the hostd service on the ESXi host and allows for tunable memory configurations to enable scaling of the number of backup streams. For 50 concurrent backups per host, 96MB would be the recommended setting.
Myth: Snapshots Suck.
Reality: Snapshots and/or data protection don’t have to suck. Modern vSphere uses a mirror driver and avoids the need for a helper snapshot. This reduces IO on snapshot merge and reduces stun a good deal. vSAN uses vSAN sparseSE snapshots that leverage a memory cache for reads, and vVols can offload snapshots to the array. Beyond all of this, vSphere APIs for I/O Filtering (VAIO) offer the ability to do data protection without the need for snapshots. Check out the VAIO VCG for supporting products.
Let me preface this discussion of iSCSI with my own personal opinion about iSCSI in the year 2020. With the support of shared VMDKs for SCSI-3PR applications, NFS and SMB shares, the need for iSCSI has reduced quite a bit. If you are using iSCSI today I’d like to talk about some alternatives to delivering that shared access or external cluster storage requirement. That said, I know there are still some use cases for it so let us go deeper on this topic.
Previously iSCSI on vSAN was only supported with stretched clusters by a limited RPQ. Why was this?
Normally vSAN stretched clusters implement a site locality construct to avoid unnecessary inter-site latency being added to the read IO path (they prefer all reads from the local site). The challenge came from the fact that the iSCSI service had no awareness of the two fault domains, and you could easily get into a situation where an iSCSI target would be placed on the secondary site while serving IO to virtual machines on the first site. As a result, data stored at the preferred site, where the virtual machine is running, could be sent to an iSCSI target on the remote site and then come back as an iSCSI packet to that same virtual machine at the preferred site.
To prevent this problem, vSAN 7 Update 1 now supports setting a preferred site for an iSCSI target to live. Note, in the event of a complete site failure, the preference will be discarded and the service will cleanly fail over to the other side of the stretched cluster. This combined with other networking improvements and performance optimizations I mentioned in this blog, should help round out this new use case.
It is possible to relocate the preferred location. You will also receive a warning if something has caused a target to run at the non-preferred location.
Again, when clustering Windows applications I tend to prefer the native VMDKs these days, but for those of you using iSCSI today (or already under RPQ) this may be a useful setting to look at.
I figured I’d cover in a blog some of the less obvious changes in vSAN 7 Update 1.
Simplified Layer 3 – vSAN has supported layer 3 (hosts within a cluster being on different subnets) since the early days. This is a popular topology when using stretched clustering and 2 node configurations. vSAN VMkernel ports share the same gateway setting specified for the management network. As the vSAN network is (ideally) often on a completely different subnet, this means that a static route would need to be set on each host. To simplify alternative gateway configuration, the vCenter Server UI now supports overriding the default gateway for a VMkernel port. ESXCLI or PowerCLI can still configure a gateway (there’s even now an ESXCLI -g flag to set a default gateway).
Data-In-Transit encryption – Historically, storage transport security focused on restricting access to the storage networks (dedicated VLANs for Ethernet, or hard zoning for Fibre Channel) or on limited authentication and access filtering (NFS IP ACL, IQN filtering, CHAP, soft zoning). If an adversary could capture the frames in transit on the storage network, none of these technologies (or even data at rest encryption) protected you from data exfiltration. To address this, vSAN now supports data in transit encryption. This leverages the FIPS 140-2 validated cryptographic modules to encrypt vSAN network traffic in flight. It allows custom rekey windows (the default is 1 day). No KMS is required for this solution to be deployed, and this feature complements other VMware in flight encryption technology (encrypted vMotion, encrypted HCX/NSX tunnels etc.) so you can now encrypt all the things.
General Performance and monitoring improvements
As customers move to 25Gbps and 100Gbps switching, further optimizations have been made to the networking stack to increase parallelization of the CPU threads used for networking transport, balance that parallelization more efficiently, and reduce overall CPU consumption per thread. These benefits will be most pronounced with RAID 5/6 usage and multiple disk groups.
Networking monitoring improvements have been made to the vSAN network health checks. This will result in faster, more accurate automated network testing.