Should “Luck” factor into your DR/BC plans?

This guy always has another backup copy job running

The greatest storage and systems administrator of all time was Montgomery “Scotty” Scott. No matter how far outside of design the ship was pushed, after saying “the ship can’t take any more” he generally found the capacity to prevent disaster. His key secrets?

Expectation setting (He always looked good when he under promised and over delivered).
Hiding reserve capacity (A key talent in many storage management practices).
A magic ability to get limitless budget for repairs, replacement parts and ships.

The reality in storage management is we can not all be Scotty (nor should we need to be). Sometimes we end up in scenarios that the system was not designed for. Thankfully there are sometimes capabilities of storage systems that vendors can expose that allow us to opportunistically exceed design expectations and “win” in Kobayashi Maru “no-win” scenarios. When data has gone missing or is expected to be gone for good, what is involved in your plan?

When planning disaster recovery or business continuity should you include “Might be there” safety nets? When drafting target Recovery Point Objectives (RPO) or Recovery Time Objectives (RTO) should or can you count on these vs. properly investing in a good backup/disaster recovery solution?

Restore Accidentally Deleted LUN

A lot of data loss scenarios are murkier to plan for than you realize. Accidentally deleting a LUN is a shockingly common occurrence. Poorly updated LUN number abstraction maps and separation of duties (3 people involved in identifying a volume to delete on a SQL cluster) can both lead to this.

Some storage arrays have magic un-delete buttons. This can range from a trashcan to an obscure command that requires support to invoke. This capability is generally contingent on free space being available to retain the data that was deleted. I’m always nervous about including this in an RPO/RTO promise. The problem is that in an out-of-space condition one of two things happens when counting on this capability:

1. The array will go read-only and crash every virtual machine (well abruptly pause if VAAI is working)

2. The snapshots will auto-delete

“But John I don’t have a high enough change rate, and I run my array at 20% usage!”

This may be true, but ransomware has a nasty habit of:

1. Re-writing all of your data.

2. Encrypting the data so that 4x dedupe and compression turn to a negative dedupe rate. Either of these activities can trigger an out-of-space condition.
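
To put rough numbers on the “20% usage” objection, here is a minimal sketch of the capacity math. It assumes the array was getting 4x data reduction, that encrypted data gets no reduction at all (a ratio of 1.0, which is optimistic compared to a truly negative rate), and that snapshots or an un-delete feature pin the old blocks. The function name and parameters are hypothetical, for illustration only.

```python
def physical_after_rewrite(physical_used: float, reduction_ratio: float,
                           new_ratio: float = 1.0,
                           keep_old_blocks: bool = True) -> float:
    """Projected physical usage (fraction of array capacity) after all
    logical data is rewritten at a new data reduction ratio."""
    logical = physical_used * reduction_ratio   # data as the hosts see it
    rewritten = logical / new_ratio             # space the rewritten copy needs
    # Old blocks stay on disk if snapshots or an un-delete bin still reference them.
    retained = physical_used if keep_old_blocks else 0.0
    return rewritten + retained


# Array at 20% physical usage with 4x dedupe/compression, then everything
# is rewritten as incompressible data while snapshots hold the old copy.
usage = physical_after_rewrite(0.20, 4.0)
print(f"Projected physical usage: {usage:.0%}")
```

Under those assumptions the “mostly empty” array lands at full capacity, which is exactly the condition where the un-delete safety net stops being there.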

You also need to be concerned with ransomware-like IO activities coming from your users/application owners:

DBA decides to turn on encryption on a database and doesn’t tell anyone.
Large batch process re-writes the data
Large data ingestion events

“But why would this problem happen at the same time I’m deleting a LUN?”

One of these things often causes the other. An out-of-space condition will often make all volumes on an array go read-only. This generally forces a storage admin to delete LUNs quickly. This outage often can happen at weird hours without proper caffeine, visibility, or communication.

Capacity Reservation Mitigation

Preventing out-of-space conditions (to prevent this scenario) can be done by “always provisioning thick” and reserving 110% capacity for snapshots, but practically the cost of doing this with storage that doesn’t tier into cheap S3 is feasible only for the most deep-pocketed of datacenters. It may be tempting to “throw primary storage” at this problem, but that budget is often better invested in other mitigations.

Unplanned Data Loss

Other scenarios where “maybe I can recover your data” tools come into play are failures that exceed the design of the storage platform.

Force Rebuild

An example that would cause this: the rebuild of your 92 SATA disk RAID 5 hits a Latent Sector Error (LSE), causing an unrecoverable read error. A single read failure in this situation causes the RAID rebuild to stall. In theory, your data is lost. Depending on your platform and the tooling of your storage partner though, you may be able to accept a small amount of data loss and force the rebuild to go forward anyway.
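
For a feel of how likely that read error is, here is a rough sketch. It assumes errors are independent and uses the commonly quoted datasheet figure of 1 unrecoverable read error per 10^14 bits for consumer SATA drives; real-world rates vary and the 4 TB drive size is a hypothetical number, not something from a specific array.

```python
import math


def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at a fixed per-bit rate."""
    bits = bytes_read * 8
    # 1 - (1 - rate) ** bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


drive_bytes = 4e12                  # hypothetical 4 TB SATA drive
surviving = 91 * drive_bytes        # a RAID 5 rebuild reads every surviving disk

print(f"Reading one drive end to end: {p_read_error(drive_bytes):.1%}")
print(f"Reading 91 drives for the rebuild: {p_read_error(surviving):.1%}")
```

With numbers like these, a clean rebuild of a very wide RAID 5 is the unlikely outcome, which is why the “force rebuild” option exists at all.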

Luck based rebuilds on multi-drive failure

Some platforms limit the rebuild domain for an LSE impact by using per volume RAID/rebuilds (vSAN does this) to reduce the impact of a drive failure that exceeds tolerance. Depending on how the error works you could be accepting an unspecified corruption of a few files or you could be hoping for “luck” in where the error is to not lose data. The only thing I like to count on in design for these is the speed of recovery. Rather than need to invoke a disaster recovery plan on 3 of 100 drives failing simultaneously, knowing I only need to rehydrate 3% (or potentially much much less) of the data from backup helps with planning cache/simultaneous restore plans.

Overriding split-brain protection

Specific to vSAN: if a thermal meltdown in the data center hit your HCI cluster, and the cascading cooling failures cost you quorum and 1 copy of the data on a RAID 1 mirror, you would have data unavailability. You can call support and they can upload a recovery tool to attempt to defy the angry storage gods and clone a full copy.

All of these scenarios involve a few things:

1. Operational failures.

2. Design failures of some kind.

3. Require the equivalent of a D20 dice roll to get your data back.

If you need one of these “might be there” recovery options to hit an RPO/RTO/SLA, the problem can generally be solved by better design.

How to better prevent accidental deletions

VASA/vVols

If you live in a data center with highly siloed ITIL operations, miscommunications are a risk in all operational changes that involve storage volumes/LUNs. There are a few ways though to improve communications and reduce errors between the storage and virtualization teams.

vSphere Storage APIs for Storage Awareness (VASA) gives VMware administrators better visibility into the storage layer. This lets them see what the internal volume numbers are for a given virtual machine or datastore.

Virtual Volumes simplifies communication even further by offloading the deletion task entirely to the VMware administrator. Deleting a virtual machine automatically deletes the associated volumes with it, removing any miscommunication between the VMware and storage team.

Operational Methods To Prevent Accidental LUN deletion

The best operational advice for storage arrays I have is to train your staff to disconnect LUNs and then wait 48-72 hours before deleting LUNs. There shouldn’t be an urgent need to delete a LUN.

“But John we urgently need that space back!”

Pretty much all modern storage arrays support TRIM/UNMAP/DEALLOCATE as a way to allow the operating system/hypervisor to perform deletions from a higher layer and push through those deleted blocks. Rather than blindly deleting an entire volume, making sure deletions of VMDKs are pushed through from VMFS is a much safer/easier alternative. Auto shrinking VMDKs also allow for deletions from guest OSs to be pushed through end to end. The closer to the application you can delete data, the less you risk miscommunication.

Lastly, using vSAN or vVols simplifies this further. If you delete a VMDK the space is freed up, and vSAN supports thin volumes shrinking by UNMAP/TRIM from the guest OS in the virtual machine. vSAN and vVols pierce through layers of abstraction to make storage capacity management simpler to handle.

Final Thoughts?

These various “tricks” are great when they work. I still don’t think they play a primary role in planning your recovery speed, or the point of recovery for recovering from failure. The smartest thing Scotty ever did was keep his “might work” tools in his back pocket and promise only what the ship was designed for.

This blog came about from a conversation with some of the other Veeam Vanguards.

Benchmarking Badly (Part 2) Bad VDI testing (and some thoughts on how much VDI resourcing has changed)

A while back I spoke to some customers who were trying to test VDI. They wanted to spend several months testing out multiple storage systems for a VDI system for 500 users. This was rather confusing to me, as the labor time spent validating the storage was likely going to cost more than just throwing a reasonably beefy all-flash cluster at the problem, and properly configuring Horizon for their use case. The first use case they were concerned about was copying an ISO from one desktop to another, which was slower than a test they ran in another VM. Upon further investigation, it was determined:

They were not testing an actual copy in both instances (One was being offloaded using Microsoft ODX).
Their test (if it was working) was a test of a low queue depth large block write operation. This wasn’t consistent with a review of vSCSI traces of their existing VDI use case.
It was still fairly fast when comparing against someone’s laptop.
After interviewing the users (doctors in a hospital) and consulting my wife (MD), it was determined that doctors do not copy large ISO files as part of their daily activities.

Normally the best testing of VDI is:

Spin up a test pool and redirect some users on the pool (taking care to select users who will be using the same applications and workflows as the users that will be scaled later).
Use a VDI benchmarking application taking great pains to properly configure it. I will note on LoginVSI published benchmarks you sometimes see some hilariously non-realistic desktop testing done to publish “hero numbers”.
Pull a vSCSI trace and use an automated scaled testing system to “replay” an amplified synthetic copy of the storage requirements (note this doesn’t test CPU in the same way).

Upon further discussion they decided to just put some users on the cluster, perform a proper pilot test, and scale at user densities they were able to achieve on the pilot going forward. Here is a review of some of the mitigations and discussions we had that helped cool off the storage team’s fears.

Why is VDI perceived as demanding on storage?

Virtual Desktops in the past were known to be a “scary” storage-heavy workload that put fear into storage admins and brought disk arrays to dust. Why was this?

Boot Storms – Recompose actions or under-provisioned pools needing to catch up with demand would lead to OS boot events. While the steady-state IOPS per desktop might be in the single or double digits, this could result in a spike of 800 IOPS or more per desktop.

Login Storms – Roaming profiles with hundreds or thousands of users who all log in at the same time resulted in huge amounts of data being copied into desktops.

Antivirus Scan Storms – Copy-pasting the security posture of your existing desktops often leads to the security team trying to scan every desktop at noon at the same time.
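
To see why boot storms in particular scared people, here is a back-of-the-envelope sketch. The 800 IOPS boot figure and the 500 desktop pool come from the discussion above; the steady-state value of 10 IOPS and the 25% of desktops booting at once are assumptions picked for illustration.

```python
def pool_iops(desktops: int, steady_iops: float, boot_iops: float,
              booting_fraction: float) -> float:
    """Aggregate IOPS for a pool where some fraction of desktops is booting
    and the rest sit at steady state."""
    booting = desktops * booting_fraction
    return booting * boot_iops + (desktops - booting) * steady_iops


steady = pool_iops(500, steady_iops=10, boot_iops=800, booting_fraction=0.0)
storm = pool_iops(500, steady_iops=10, boot_iops=800, booting_fraction=0.25)

print(f"Steady state: {steady:,.0f} IOPS")
print(f"25% of the pool booting: {storm:,.0f} IOPS ({storm / steady:.1f}x)")
```

A spinning disk array sized for the steady-state number had no chance against the storm number, which is where the reputation came from.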

The reality is these problems have been largely solved (for years), but storage vendors trying to sell some feature or solution have sometimes kept presenting them as live issues. *Disclaimer, I work on a storage product and while I’d love you to buy vSAN and think it is frankly awesome for VDI, I’m not going to pretend that the above problems can’t be largely mitigated in other ways*.

Boot Storm Mitigations

Use Instant Clones – Instant Clones are “born running”. They use VMFork technology to create writable snapshots of the memory of a running virtual machine. This has the advantage of insanely fast (seconds) desktop creation times.

Pre-stage desktops/rolling recompose – At some scale you can always just schedule recompose operations. A popular trick I used back in the stone ages of linked clones was to create a new pool and set it to auto-scale. I would disable net new connections to the old pool and set the users to only see the new pool. This allowed for a slower transition to the new pool. Combined with throttling new desktop creations to a manageable speed, this new pool could slowly grow to the needed capacity. This required a few slack resources, but the vSphere scheduler and memory compaction technologies were generally good for it if you were not running absurd vCPU ratios to begin with. Note, other methods largely solve this from a resourcing standpoint, but this method can still be used as a means of slowly testing a new image and allowing for rapid “roll back” if the new image has issues (re-enable the old pool and direct new connections back to it).

Cache the blocks used for OS boot – This has been discussed before, but OS boot only needs to call up a few hundred MB of blocks into RAM. Various VDI solutions that provide a DRAM cache to hold these blocks have existed for years (Horizon Content-Based Read Cache, or CBRC). This allows multi-GB read caches to be deployed for the base OS disks to accelerate them. Citrix has similar capabilities with PVS. Beyond this, modern storage arrays with dedupe and multi-hundred GB DRAM caches will make short work of these bits. Remember, even for “full clones” any solution with dedupe (or a dedupe cache like CBRC) can handle the fact that it is 300MB of hot blocks X 2000 desktops. vSAN even goes so far as to put DRAM cache local to the hosts where VMs are running to reduce even storage network traffic hits.
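
The “300MB X 2000 desktops” point is easy to sanity check. The sketch below assumes the hot boot blocks are identical across desktops unless you say otherwise; `unique_fraction` is a hypothetical knob for the share of each desktop's hot blocks that differs from the shared image, not a setting in any product.

```python
def boot_cache_mb(hot_mb_per_desktop: float, desktops: int,
                  unique_fraction: float = 0.0) -> float:
    """Cache (MB) needed to hold every desktop's hot boot blocks once
    identical blocks are deduplicated."""
    shared = hot_mb_per_desktop * (1 - unique_fraction)          # stored once
    unique = hot_mb_per_desktop * unique_fraction * desktops     # stored per desktop
    return shared + unique


no_dedupe_mb = 300 * 2000
identical = boot_cache_mb(300, 2000)
mostly_same = boot_cache_mb(300, 2000, unique_fraction=0.05)

print(f"No dedupe:              {no_dedupe_mb / 1000:,.0f} GB")
print(f"Identical hot blocks:   {identical:,.0f} MB")
print(f"5% unique per desktop:  {mostly_same / 1000:,.1f} GB")
```

Even with some per-desktop drift, the working set fits comfortably in the kind of DRAM cache a modern host or array already has.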

Login Storm Mitigations

Profile Virtualization – Technologies to cache and optimize profile load through various mechanisms have been around for a while. While I was cutting my teeth on Persona years ago (which worked, it just required you to know which folders to exclude from the stubbing system), VMware Dynamic Environment Manager is a fantastic solution today. FSLogix and other solutions also exist that can even deal with some of the more annoying elements of profile virtualization *GLARE INTENSIFIES AT OUTLOOK OST FILES THAT DROVE ME CRAZY*. It’s true we used to have to do weird/stupid things with application customization to make profile virtualization work (make sure Exchange was colocated 1ms from the VDI pool), but those days are long gone.

Antivirus Storm Mitigations

I’ll leave others to speak more in the comments to this one, but a blend of on-access scanning policies and agentless and network-based introspection has largely calmed the challenge of virus scans taking out a cluster. Security here is about many layers of an onion.

Other Minor VDI Resource Issues to think about

Windows Search – This is one of the services we used to disable to better optimize desktops. I’ll call out that disabling it also breaks Outlook email search, and even if this leads to a 3% increase in density I would argue you don’t need to go to these extremes to optimize desktops. While there are certain things you should optimize, breaking user experience to get an extra 10 users in a cluster likely isn’t worth it anymore. Hardware is cheaper at this point than the emotional cost of annoying users.

Hardware refreshes need to be at way more than 1:1 – I advised a bank recently that was replacing an ancient 5.5 environment with Windows XP desktops. They were expecting that by buying hosts with 5x the resources they would get 5x the host density. They were disappointed to learn that:

The 3 antivirus solutions they had installed were at war with each other for the 1 vCPU they were allocating to each desktop, while oversubscribing 15:1
1GB of RAM wasn’t enough to make users happy
Their base images were now 6x larger

The reality is we used to make some awful compromises on VDI usability and user experience to make the numbers “work”. Make sure when sizing solutions to understand that with lowered resource cost comes options to do more than save capital costs.
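
As a rough illustration of why 5x the hardware does not mean 5x the density, here is a minimal sizing sketch. The 1 vCPU, 1GB and 15:1 figures for the old desktops come from the list above; the host sizes and the new desktop profile (2 vCPU, 4GB, 6:1) are assumptions made up for the example.

```python
def desktops_per_host(cores: int, ram_gb: int, vcpu_per_desktop: int,
                      ram_gb_per_desktop: int, vcpu_per_core: int) -> int:
    """Desktops a host can hold, limited by whichever runs out first:
    oversubscribed vCPUs or RAM."""
    by_cpu = cores * vcpu_per_core // vcpu_per_desktop
    by_ram = ram_gb // ram_gb_per_desktop
    return min(by_cpu, by_ram)


# Old host, old desktop profile (1 vCPU, 1GB, 15:1 oversubscription).
old = desktops_per_host(cores=8, ram_gb=128, vcpu_per_desktop=1,
                        ram_gb_per_desktop=1, vcpu_per_core=15)
# New host with 5x the cores and RAM, but a desktop users can live with.
new = desktops_per_host(cores=40, ram_gb=640, vcpu_per_desktop=2,
                        ram_gb_per_desktop=4, vcpu_per_core=6)

print(f"Old host: {old} desktops, new host: {new} desktops")
```

With these made-up but plausible numbers the density does not move at all. The extra hardware went into user experience instead of consolidation.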

But John? What if I Can’t do X,Y,Z?

Just throw a little more all-flash storage at the problem. We used to get excited about getting the cost of storage down to $100 per user for VDI. Now with all-flash, instant clones and dedupe the storage costs have kind of become a rounding error on the total VDI solution. There used to be an entire field of “VDI storage-specific vendors”, and you’ll find that most of them have completely disappeared. This is because the problem of VDI and storage has largely gone away.

Benchmarking Badly (Part 1) The Single workload test

Having been around the industry I’ve noticed there are a lot of changes but a few guarantees when it comes to benchmarking shared storage and HCI clusters:

Benchmarking is generally poorly representative of what the production workload will look like.
Benchmarking is about trade-offs. There are “easy” ways to do them, but often these are so far from accurate for what production will look like they might as well be skipped.
Real benchmarking is hard. There are shortcuts to easier benchmarking. Some are good, some are bad. It’s critical either way you understand what trade-offs you make when you choose one.

For anyone who wants a history lesson on why Benchmarking is bad @VMpete has some old but great blogs on this topic.

There are good easy buttons for testing a cluster (HCI Bench is a personal favorite) and there are bad easy buttons (Crystal Disk, ATTO Disk, IO meter, and other synthetic workload desktop-focused testing tools). Today we are going to talk about why single workload tests are normally poorly done.

It’s often poorly executed – The single workload test

A lot of people can spin up a single virtual machine, fire up a synthetic disk testing application like CrystalDisk or IOmeter and push “Test run”. While this does generate IO, it doesn’t necessarily generate a workload against an HCI cluster that looks anything like what a customer would run. Breaking down some quick fundamentals.

In your typical VMware cluster, you will find multiple virtual machines with different numbers of drives processing different block sizes, read-write mixtures, different overlaps when they send data (Some bursty, some constant).

Even clusters with homogeneous dense workloads don’t look like this single VMDK test. Even monster scale-out in-memory databases like SAP HANA and Cassandra and container platforms recommend more than 1 virtual machine. Amongst these applications, you still will always see more than 1 virtual hard drive (VMDK) processing disk IO, possibly with multiple vHBAs attached.

Other common mistakes that go along with using these tools:

The default Crystal Disk only uses a relatively small working set size (below 5GB). In any tiered/cached system, there is a strong chance you end up testing IO that largely is served from DRAM caches (either inside the SSDs or within caching of the system). A 24/7 production environment with large data flows will result in wildly different outcomes.

IO Meter can be configured for multiple workers, but doing so at scale with a diverse set of workloads is going to be problematic vs. using something that has better synthetic engines with more options and easier control and reporting like HCI Bench. It’s worth noting that IOmeter has seen 1 release since 2008 when Intel made it abandonware. VDBench and FIO that are used by HCIBench have seen a lot more development attention.

Fixed QD or block sizes. Crystal Disk tests 4 different blends of block size and queue depth but:

There’s a strong correlation between people fretting about large block throughput, and people who are running workloads that don’t actually send large blocks.
The tests are run sequentially, and not in parallel. Again, real storage systems handle what is thrown at them and can’t ask applications to nicely wait 30 seconds for their turn to run a homogeneous workload.
These workloads tend to generate high entropy data (So no dedupe/compression). It could be argued that setting the workload to include too low of entropy is cheating, but using real data sets (or tuning synthetic to mirror entropy of the real data) is going to give you a more accurate idea of what production will look like.
Not reporting latency is a bit like reporting horsepower and top speed of a car but ignoring torque when people want to tow a boat…
Synthetic engines don’t look like real traffic for a lot of other reasons.
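
The entropy point in the list above is easy to demonstrate. This is a small sketch that uses Python's `zlib` as a stand-in for whatever compression a storage system actually uses (an assumption; no array uses exactly this), comparing random bytes against a made-up repetitive record.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))


# High entropy: what many synthetic tools write by default.
random_like = os.urandom(1 << 20)
# Low entropy: a hypothetical repeated record, far more compressible than real data.
repetitive = b"customer_record,2020-01-01,OK\n" * 35000

print(f"Random data ratio:     {compression_ratio(random_like):.2f}")
print(f"Repetitive data ratio: {compression_ratio(repetitive):.0f}")
```

Real data sits somewhere between these two extremes, which is why a benchmark run at either end tells you little about production.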

Short Benchmark, Simple Summary

There also is a fatal flaw in CrystalDisk’s presentation of data. It’s a simple average summary for each benchmark that fails to show a time series of data. Without understanding what a system looks like at the beginning of a test (when cache may be less warm, but write buffers less full) vs. the end of the test (when cache hits may increase, or buffers may be exhausted) it’s very hard to understand what steady state under load performance may look like. This is magnified further in that Crystal Disk and the like are short tests. For systems that will run under load for hours/days you want tools that can sustain testing to better emulate your production duty cycle for IO (Not that it would make a good synthetic workload generator if you could run it for longer). Often things like tail latency, jitter or 99% latency can have disastrous impacts on systems that users have to interact with.
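
Here is a small sketch of how much a single average can hide. The latency samples are invented for illustration: a run that sits at 1 ms until a write buffer fills near the end, then jumps to 50 ms. The helper names are hypothetical.

```python
import statistics


def p99(samples: list[float]) -> float:
    """99th percentile by nearest rank (samples must be non-empty)."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def windowed_means(samples: list[float], window: int) -> list[float]:
    """Mean latency per consecutive window, i.e. a crude time series."""
    return [statistics.mean(samples[i:i + window])
            for i in range(0, len(samples), window)]


# Hypothetical run: 980 IOs at 1 ms, then the buffer is exhausted.
latencies_ms = [1.0] * 980 + [50.0] * 20

print(f"Average: {statistics.mean(latencies_ms):.2f} ms")
print(f"p99:     {p99(latencies_ms):.0f} ms")
print(f"Per-window means: {windowed_means(latencies_ms, 250)}")
```

The average looks healthy, while the p99 and the last window show what users would actually feel once the system is under sustained load.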

Who wouldn’t test their 10 Million Dollar ERP solution with this “serious” testing tool!

A good storage system has to handle a wide variety of workloads simultaneously. The single workload/disk test is a bit like testing the effectiveness of an air traffic controller at an airfield that sees 1 airplane a day. You might see the different variations in his communication quality to that one airplane but any serious test is going to stress tracking different planes on different trajectories.

Next up, Bad VDI testing – No, copying an ISO is not benchmarking VDI…

Is my SD card resilient enough for production ESXi usage?

If you are trying to buy a new host: “No.”

If you have an existing install: It depends…

There is more to discuss here now that 7 Update 3 is out on where things are going:

A few points of clarification:

The deprecation of SD/USB devices as the sole boot and OS-related storage for ESXi was announced, but to be clear: this does NOT mean that support for these configurations was pulled from vSphere 7 Update 3. I put this in bold because I’ve heard this misconception quite a few times.
For people who are not in a position to upgrade their boot device, we will continue to support SD/USB boot for the 7.x release. I will caveat this with PLEASE upgrade to 7 Update 3 (or 7 U2c at a minimum), as a number of fixes have been applied that lower the chances of premature device failure.

What was fixed?

See this KB and the release notes here. Additionally, 7 Update 3 does a better job of making customers aware they are running in a degraded state where only a low endurance boot device exists for system usage. The limitations of using a RAM disk for redirection are noted below.

What are my paths forward? (Greenfield)

For net-new host purchases, I ask you to move away from USB/SD card boot devices. It will make life simpler, and the additive cost for a 128GB boot device vs a pair of larger capacity SD cards and the controller for them is less than you would think. For those that can, this also will work for brownfield.

What is my path forward? (Brownfield)

There are a few options.

Replace the boot devices – Note this requires a reinstallation of ESXi. Configurations can be moved using various methods. To speed up this process you can use this KB to perform a backup and restore. Note you will need to restore the exact same ESXi build.
Legacy configuration but still supported – This allows you to keep operating with the existing boot install on the device without having to perform a reinstall. This KB outlines a new boot flag that will automatically format a RAW (i.e. no partition tables) device that is 128GB or larger, and consume it for OSDATA usage. This will allow you to move forward with the existing install on SD/USB in a supported manner. Simply adding a properly sized M.2 SSD to your host and using the autoPartition=TRUE boot flag should create and redirect the necessary bits to keep running in a non-degraded, non-deprecated configuration. Note this configuration will be supported on future releases, but given the added complexity/cost vs. just using a proper boot device to begin with, it is not something I recommend for greenfield (hence why it’s called Legacy/supported).
AutoDeploy – For forward compatibility and support of new features, I would ask that you start moving in the direction of Stateful Installs for AutoDeploy.
Boot from SAN – Keep on rocking, just make those LUNs a bit larger please. VMware wants to see 32GB at a minimum.

What is this warning about Degraded Mode?

Degraded mode is a state where logs and state might not be persistent (get lost when the host is rebooted), with a side effect that it can cause boot up to be slower.

The /scratch partition will be created on a RAMDisk under a /tmp folder with a limited space of 250 MB. This is not recommended, and it will impact the ESXi host performance once /tmp runs out of capacity.

Why is this bad? Why Prefer local storage for logging?

There are a lot of advantages to redirecting locally: consistency of performance, as well as the ability to collect logs on issues that impact the availability of the storage network or HBA (for example, the NIC or FC HBA firmware crashing). Note Boot from SAN is still completely an option here, but (by virtue of physics) a quality local device will always be in a superior position to collect logs in these specific situations.

How do I move my install to a new device?

See this KB

Couldn’t a Big RAMDISK just fix this?

Ehhhh, this isn’t a long-term solution. See the bottom of this KB for this discussion. Beyond the cost of RAM the bigger issue is volatility. 99% of customers I talk to want support and engineering to be able to identify the source of problems and this becomes incredibly hard when all logs and crash dumps are destroyed on host restart.

What about NVMe SD cards (SDExpress)?

This is something I’ve honestly asked engineering PM about. They are shipping in small quantities right now. My biggest concern looking at the hardware itself is thermal throttling causing complete yo-yos in performance. For logs and crash dumps they look alright, but future demands on the OSDATA may require more performance. This is partly why vSphere 7 at GA raised the endurance and performance requirements for boot devices, as preparation for future demands. Technically they will look like an NVMe device, so I assume at least for home lab usage they should work. If anyone has any samples lying around and wants to test them, shoot me a message on twitter (@Lost_Signal).

I have a home lab, and I’m out of drive bays and curious about cheap/low-cost non-supported options?

Personally, I went and bought a $12 PCI-E to M.2 (SATA) adapter. They also make NVMe compatible brackets. Just make sure the bracket you get supports your drive type. No need to spend hundreds of dollars upgrading your hosts in the lab.

Where can I find this information on an official VMware.com page?

https://core.vmware.com/resource/esxi-system-storage-faq

Why did this blogpost just say “No” before?

The challenge in giving nuanced guidance is people tend to read “It’s supported” and ignore the rest of the sentence on why something is a bad idea. Given that the blog post explaining this, the KBs, and the changes in U2c and U3 were still in the works, I wanted people looking to buy a new host to get a no-nonsense response on hardware selection.

Backup Myths

This is going to (hopefully) be a short post dismissing some common VMware backup myths.

Myth: We should not use virtual machine backups because they will take longer to process.

Reality: Changed block tracking reduces the need to scan for differences between backup jobs. VMware keeps a block map of exactly what has changed, reducing the need for backup agents to read blocks and look for changes.
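
To make the idea concrete, here is a toy sketch of changed block tracking. It is a conceptual illustration only, not VMware's CBT implementation or the backup APIs; the class and method names are hypothetical.

```python
class TrackedDisk:
    """Toy disk that remembers which blocks changed since the last backup."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [b""] * num_blocks
        self.changed: set[int] = set()   # the "block map" of dirty blocks

    def write(self, index: int, data: bytes) -> None:
        self.blocks[index] = data
        self.changed.add(index)          # record the change at write time

    def incremental_backup(self) -> dict[int, bytes]:
        """Return only the changed blocks, then reset the map."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta


disk = TrackedDisk(num_blocks=1_000_000)
disk.write(3, b"new data")
disk.write(70_000, b"more data")

delta = disk.incremental_backup()
print(f"Blocks read for this backup: {len(delta)} of {len(disk.blocks)}")
```

The backup reads two blocks instead of scanning a million to find out what changed, which is where the time savings come from.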

Myth: Virtual machine backups are not useful for file or app level recovery. As a result you may need to back up the same data multiple times, once as a virtual machine, and once with file or application agents.

Reality: A number of backup vendors can recover files or even application-level details from a single virtual machine backup.

Myth: HotAdd requires 1 virtual machine per host in the cluster, and will slow down backups.

Reality: HotAdd requires 1 virtual machine per cluster. HotAdd backup mode is a powerful way to manage LAN overhead, by allowing initial backup processing to happen directly on a host in the cluster. There is a slight additional fixed overhead in time to mount the snapshot. For network congested backups or larger virtual machines, this is easily compensated for with faster jobs.

Myth: Agent-based backups are “lighter-weight” than hypervisor-assisted backups.

Reality: Agent-based backups tend to slam the CPU, and generally have poor awareness of shared resources. vSphere sits in a position where it can better manage throttling of concurrent jobs, Network IO Control can throttle backup traffic itself, and host-based transport avoids unnecessary overhead.

Myth: To make virtual machine backups run faster, always use Eager Zero Thick VMDKs

Reality: That “EZT4Life” tattoo was a bad idea. UNMAP/TRIM inside a VM can delete blocks no longer used, and make backup jobs shorter as the backup software will no longer need to process “dead space”.

Myth: SAN Mode Transport backups are “LAN free” and superior to all other methods

Reality: SAN mode backups that allow the backup software to directly mount VMFS and bypass the host for the sake of moving data have helped save many an 8Gbps Fibre Channel user from the pain of slow 1Gbps networking. Still, with modern networking (25/100Gbps Ethernet) this advantage has largely eroded. Also, even when SAN Transport is used for backups, restores will often (based on restore settings and your vendor) flow over the network, so highly asymmetric network speeds can lead to less than satisfactory restore times.
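
The asymmetry is easier to see with numbers. This sketch estimates transfer time from link speed alone; the 10 TB data set and the 80% usable link efficiency are assumptions, and it ignores everything else that slows a real job (source reads, dedupe, target writes).

```python
def transfer_hours(terabytes: float, gbps: float, efficiency: float = 0.8) -> float:
    """Hours to move a data set over a link, assuming a fixed usable
    fraction of the nominal line rate."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbps * 1e9 * efficiency)
    return seconds / 3600


for label, speed in [("1Gbps Ethernet", 1), ("8Gbps Fibre Channel", 8),
                     ("25Gbps Ethernet", 25), ("100Gbps Ethernet", 100)]:
    print(f"{label:>20}: {transfer_hours(10, speed):6.2f} hours for 10 TB")
```

A backup that runs over 8Gbps Fibre Channel but restores over 1Gbps Ethernet takes roughly eight times longer on the way back, which is usually the direction that matters.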

Myth: Virtual Machine backups are less secure

Reality: Virtual Machine backups can offer significantly enhanced security. NBDSSL offers the ability to encrypt the networking transport of the backup jobs. HotAdd allows the backup vendor to “own” security of the transport of the backup data. SAN Transport backups allow you to avoid the LAN entirely for the backup job itself.

Myth: Network based backups will bottleneck on the vCenter

Reality: Network block device (NBD) backups NEVER flow through the vCenter. vCenter is simply a control plane.

Myth: vCenter can limit scaling of parallel backups vs. agent based backups

Reality: For environments doing large amounts of parallel backups, per-vCenter limitations could previously have become a problem. Using VPXA did not allow for customization of the memory buffer for jobs initiated by connecting through vCenter. vCenter Server 7.0 U1 now uses the hostd service on the ESXi host and allows for tunable memory configurations to enable scaling of the number of backup streams. For 50 concurrent backups per host, 96MB would be the recommended setting.

Myth: Snapshots Suck.

Reality: Snapshots and/or data protection don’t have to suck. Modern vSphere uses a mirror driver and avoids the need for a helper snapshot. This reduces IO on snapshot merge and reduces stun a good deal. vSAN uses vSAN sparseSE snapshots that leverage a memory cache for reads, and vVols can offload snapshots to the array. Beyond all of this, vSphere APIs for I/O Filtering (VAIO) offer the ability to do data protection without the need for snapshots. Check out the VAIO VCG for supporting products.

Myth: Disaster recovery is hard.

Reality: DRaaS can help make it easy.

vSAN 7 Update 1 What (Else) is new – iSCSI Stretched Cluster Support

Let me preface this discussion of iSCSI with my own personal opinion about iSCSI in the year 2020. With the support of shared VMDKs for SCSI-3PR applications, NFS and SMB shares, the need for iSCSI has reduced quite a bit. If you are using iSCSI today I’d like to talk about some alternatives to delivering that shared access or external cluster storage requirement. That said, I know there are still some use cases for it, so let us go deeper on this topic.

Previously iSCSI on vSAN was only supported with stretched clusters by a limited RPQ. Why was this?

Normally vSAN stretched clusters implement a site locality construct to avoid unnecessary inter-site latency being added to the read IO path (they prefer all reads from the local site). The challenge came from the fact that the iSCSI service had no awareness of the two fault domains, and you could easily get into a situation where an iSCSI target would be placed on the secondary site while serving IO to virtual machines on the preferred site. As a result, data held at the preferred site could be sent to an iSCSI target on the remote site, and then come back as iSCSI traffic to a virtual machine running at the preferred site.

To prevent this problem, vSAN 7 Update 1 now supports setting a preferred site for an iSCSI target to live. Note, in the event of a complete site failure, the preference will be discarded and the service will cleanly fail over to the other side of the stretched cluster. This combined with other networking improvements and performance optimizations I mentioned in this blog, should help round out this new use case.

It is possible to relocate the preferred location. You will also receive a warning if something has caused a target to run at the non-preferred location.
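The placement behavior described above can be sketched as a toy model. This is not vSAN's actual implementation; the function name, the site labels and the tie-breaking rule for the fallback site are all assumptions made for illustration.

```python
def place_target(preferred_site, healthy_sites):
    """Pick where an iSCSI target should run (illustrative model only).

    Returns (site, warning). warning is None when the target runs at its
    preferred site, otherwise a short message like the UI warning described.
    """
    if not healthy_sites:
        raise RuntimeError("no healthy site available")
    if preferred_site in healthy_sites:
        # Normal case: honor the preference so IO stays local.
        return preferred_site, None
    # Complete failure of the preferred site: discard the preference and
    # fail over. Sorting is just a deterministic choice for this sketch.
    fallback = sorted(healthy_sites)[0]
    return fallback, f"target running at non-preferred site {fallback}"


print(place_target("site-a", {"site-a", "site-b"}))
print(place_target("site-a", {"site-b"}))
```

The point of the model is simply that the preference is a hint, not a hard pin: availability wins over locality when a whole site is gone.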

Again, when clustering Windows applications I tend to prefer the native VMDKs these days, but for those of you using iSCSI today (or already under RPQ) this may be a useful setting to look at.

vSAN 7 Update 1 What (Else) is new – Networking

I figured I’d cover in a blog some of the less obvious changes in vSAN 7 Update 1.

Simplified Layer 3 – vSAN has supported layer 3 (hosts within a cluster being on different subnets) since the early days. This is a popular topology when using stretched clustering and 2 node configurations. vSAN VMkernel ports share the same gateway setting specified for the management network. As the vSAN network is (ideally) often on a completely different subnet, this means that a static route would need to be set on each host. To simplify alternative gateway configuration, the vCenter Server UI now supports overriding the default gateway for a VMkernel port. ESXCLI or PowerCLI can still configure a gateway (there’s even now an ESXCLI -g flag to set a default gateway).
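A quick way to reason about when a route or gateway override is needed: if a peer's vSAN address falls outside the local VMkernel port's subnet, traffic to it has to go through a gateway. Here is a minimal sketch using Python's standard ipaddress module. The addresses are made up for illustration and the helper name is hypothetical.

```python
import ipaddress


def needs_gateway(local_interface, peer_ip):
    """Return True when peer_ip is outside the subnet of local_interface.

    local_interface is an address with prefix length, e.g. "192.168.10.11/24".
    """
    iface = ipaddress.ip_interface(local_interface)
    return ipaddress.ip_address(peer_ip) not in iface.network


# Hypothetical vSAN VMkernel addresses at two sites of a stretched cluster.
print(needs_gateway("192.168.10.11/24", "192.168.10.12"))  # same subnet
print(needs_gateway("192.168.10.11/24", "192.168.20.11"))  # routed (layer 3)
```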

Data-In-Transit encryption – historically, storage transport security focused on restricting access to the storage networks (dedicated VLANs for Ethernet, or hard zoning for Fibre Channel) or on limited authentication and access filtering (NFS IP ACL, IQN filtering, CHAP, soft zoning). If an adversary could capture the frames in transit on the storage network, none of these technologies (or even data at rest encryption) protected you from data exfiltration. To address this, vSAN now supports data in transit encryption. This leverages the FIPS 140-2 validated cryptographic modules to encrypt vSAN network traffic in flight. It allows custom rekey windows (the default is 1 day). No KMS is required for this solution to be deployed, and this feature complements other VMware in flight encryption technology (encrypted vMotion, encrypted HCX/NSX tunnels etc) so you can now encrypt all the things.

Data-In-Transit Encryption is a single click to enable

General Performance and monitoring improvements

As customers move to 25Gbps and 100Gbps switching, further optimizations have been made to the networking stack to increase parallelization of the CPU threads used for networking transport, balance that parallel work more efficiently, and reduce overall CPU consumption per thread. These benefits will be most pronounced with RAID 5/6 usage and multiple disk groups.

Networking monitoring improvements have been made to the vSAN network health checks. This will result in faster, more accurate automated network testing.

John’s 2020 vSAN VMworld Sessions

I have two on demand VMworld sessions that can be found at this link, and an additional roundtable for premium pass holders.

The first session is a return of the IO path deep dive, where we go under the hood on what has changed within the IO path. This year is going to feature quite a few new features and low level improvements, so I’d encourage everyone to come check it out.

Deconstructing vSAN: A Deep Dive into the Internals of vSAN [HCI1276]
John Nicholson, Senior Technical Marketing Architect, VMware –

The other session, vSAN File Services: Deep Dive [HCI1825], is with my podcast co-host Pete Flecha. It is a great review of what’s improved in vSAN File Services as well as some details on how it works behind the scenes.

For anyone attending my session who has questions, my DMs are open for the week of VMworld and you can find me on twitter as @Lost_signal.

I also have some live roundtables where Jason Massae and I will be talking about space reclaim (UNMAP, Zero Reclaim, and beyond). There have been some improvements lately in this space, as well as years of steady improvement to talk about, in Storage Management – How to Reclaim Storage Today on vSAN, VMFS and vVols with John Nicholson [HCI2726].

Should you be using Eager Zero Thick on vSAN (or VMFS)?

For vSAN No.

For VMFS, maybe?, but probably not.

Ok, I’ll agree I probably owe an explanation of why.

Asparagus thick vs. thin — Thick vs Thin 2020 Edition!

A quick history lesson of the 3 kinds of VMDKs used on VMFS.

thin: Space is not guaranteed, it’s consumed as the disk is written to.

thick (Sometimes called Lazy Thick): Space is reserved but the blocks are initialized lazily, so VMFS has a zero’ing cost the first time a block is written.

EZT or Eager Zero Thick: The entire disk is zero’d so that VMFS never needs to write metadata. This is required for shared disk use cases, as VMFS can’t coordinate metadata updates.

These virtual disk types have no meaning on vSAN. From a vSAN perspective all objects are thin (thin vs thick vs EZT makes no difference). This is similar to how NFS is always sparse, and reservations are something you can do (with NFS VAAI) but it doesn’t change the fact that it doesn’t actually write out zeros or fill the space like it did on VMFS. On VMFS the first-write zeroing cost could be mitigated largely using ATS and WRITE_SAME (with further improvements in the works). I’ve always been dubious of the EZT benefit given how many databases and applications would often write out and pre-allocate file systems in advance. There are likely some corner cases, such as creating an ext4 file system being slower, but you can generally work around that if you really care (mkfs.ext4 -E nodiscard).

Having an EZT disk won’t necessarily improve vSAN performance the same way it does for VMFS. Pre-filling zeros on VMFS had an advantage of avoiding “burn in” bottlenecks tied to metadata allocation.

First off, vSAN has its own control of space reservation: the Object Space Reservation policy, OSR=100% (Thick) or OSR=0% (Thin). This is used in cases where you want to reserve capacity and prevent allocations that would allow you to run out of space on the cluster. In general I recommend the default of “Thin” as it offers the most capacity flexibility. Thick provisioning tends to be reserved for cases where it is impossible or incredibly difficult to ever add capacity to a cluster and you have very little active monitoring of a cluster.

What about FT and Shared Disk use cases (RAC, SQL, WFC, etc)? This requirement has been removed for vSAN. For vSAN you also do not need to configure object space reservation to thick to take advantage of these functionalities. On VMFS there may still be some benefits here due to how shared VMDK metadata updates are owned by a single owner (vSAN metadata updates are distributed, so this is not an issue).

What do I gain by using Thin VMDKs as my default?

You save a lot of space. Talking to others in the industry, I hear 20-30% capacity savings. If combined with TRIM/UNMAP automated reclaim, even more space can be clawed back! This can lead to huge savings on storage costs. As an added bonus, Deduplication and Compression work as intended only when OSR=0% and the default thin VMDK type is chosen.
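To put a number on that for your own environment, compare what you have provisioned against what has actually been written. A minimal sketch follows. The 100TB and 75TB figures are invented for illustration, and this ignores replication overhead, deduplication and compression.

```python
def thin_savings(provisioned_tb, written_tb):
    """Return (tb_saved, fraction_saved) for thin vs. fully reserved disks.

    Simplified model: a thick or OSR=100% disk holds its full provisioned
    size, while a thin disk only consumes what has been written.
    """
    if provisioned_tb <= 0:
        raise ValueError("provisioned_tb must be positive")
    if not 0 <= written_tb <= provisioned_tb:
        raise ValueError("written_tb must be between 0 and provisioned_tb")
    saved = provisioned_tb - written_tb
    return saved, saved / provisioned_tb


# Hypothetical cluster: 100TB provisioned, 75TB actually written by guests.
saved_tb, fraction = thin_savings(100, 75)
print(f"{saved_tb}TB saved ({fraction:.0%})")
```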

Also if configured to auto-reclaim you lower the chances of an out of space condition, so this can increase availability. Eager Zero Thick or Thick VMDKs on vSAN are not going to reserve capacity in a way that prevents over commitment on vSAN, instead it will simply use more capacity.

Expect a health alarm if you have set EZT or thick VMDKs stored on a vSAN datastore. It is considered a faulty configuration. KB 66758 includes more information about this. If you need to reserve capacity (for now) use the SPBM capacity reservation policy. Additionally William Lam has a blog on this topic as well.

Have a different opinion? More questions? I’m Lost_signal on twitter.

Maximum supported vSAN/vSphere/VCF Cluster

So let’s talk about “maximum supported” and why it’s just a weird thing to focus on. As an example, take building the biggest vSAN cluster that could possibly be supported. It is today ONLY 64 nodes, but let’s unpack what I can run in 64 nodes.

vSAN 7 supports 32TB capacity devices. With 5 disk groups * 7 devices, that’s ~1.12PB per host of raw capacity (Please don’t do this without calling me, but hey it is there).
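For anyone who wants to check the math, here is the same arithmetic as a small sketch, extended to the 64 node cluster maximum. It assumes decimal units (1PB = 1000TB) and counts raw capacity only, before any RAID or replication overhead.

```python
# Maximums quoted in the text for vSAN 7.
DEVICE_TB = 32                   # largest supported capacity device
DISK_GROUPS_PER_HOST = 5
CAPACITY_DEVICES_PER_GROUP = 7
HOSTS_PER_CLUSTER = 64


def raw_tb_per_host():
    """Raw capacity of one fully loaded host, in TB."""
    return DISK_GROUPS_PER_HOST * CAPACITY_DEVICES_PER_GROUP * DEVICE_TB


def raw_pb_per_cluster():
    """Raw capacity of a full 64 node cluster, in PB (1PB = 1000TB)."""
    return raw_tb_per_host() * HOSTS_PER_CLUSTER / 1000


print(raw_tb_per_host())      # 1120 TB, i.e. ~1.12PB per host
print(raw_pb_per_cluster())   # 71.68 PB raw across 64 hosts
```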

With AMD, 2 socket servers currently reach 128 cores. With Intel’s 56 core parts, that’s 224 threads in 2 sockets (yes, I’m aware the 9200 may require its own nuke plant to run and cool). I’m going to ignore quad socket for this conversation but yes, we support quad sockets like Synergy 660 Gen10 for SAP HANA.

Maximum GPUs served for a vSAN cluster is actually a fun one. 16 GPUs per host is going to need interesting cooling solutions… or will it? https://configmax.vmware.com/guest?vmwareproduct=vSphere&release=vSphere%207.0&categories=2-0

BitFusion allows for remote CUDA calls to be served from remote hosts (and ones not even in the cluster). So GPU workload scaling potentially could get pretty nuts and I’m going to leave others to speculate on how many GPU cores could serve a cluster.

Maximum Memory is 16TB DRAM per host, and 12TB of PMEM per host. So a PB of DRAM, and 768TB of PMEM per cluster.
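The cluster totals are just the per-host maximums multiplied across 64 hosts. A tiny sketch, where "a PB" for DRAM means 1024TB:

```python
HOSTS_PER_CLUSTER = 64  # current cluster maximum quoted earlier


def cluster_memory_tb(per_host_tb, hosts=HOSTS_PER_CLUSTER):
    """Total memory across the cluster, in TB."""
    return per_host_tb * hosts


print(cluster_memory_tb(16))  # DRAM: 1024 TB
print(cluster_memory_tb(12))  # PMEM: 768 TB
```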

Now, addressable memory gets more fun as TPS, memory ballooning, and DRS (with new cool capabilities in 7) mean that the amount actually allocated could be a bit higher. And when you are spending what I assume a G5 jet costs, you are going to use these features. Should you design to these maximums? In general, no. Most people have other reasons to split a cluster up (Remember we can always do shared nothing migrations between clusters). People will want to limit the blast zone of a management domain, etc.

Part of the benefit of HCI is you can easily scale it down to 2 node even… Also remember Hosts can be reclaimed and moved to other workload domains in VMware Cloud Foundation.

Lastly, just because you can doesn’t mean you should. Most sane people don’t need or want 40 drives in a host, or want an HA event to result in 6TB of memory worth of VMs rebooting at once. Respect operational reasons to limit blast zones and sizing. As VMware makes it easier to migrate or share resources between clusters, these kinds of limits matter less and less.