Benchmarking Badly (Part 2) Bad VDI testing (and some thoughts on how much VDI resourcing has changed)
A while back I spoke to some customers who were trying to test VDI. They wanted to spend several months testing out multiple storage systems for a VDI system for 500 users. This was rather confusing to me, as the labor time spent validating the storage was likely going to cost more than just throwing a reasonably beefy all-flash cluster at the problem, and properly configuring Horizon for their use case. The first use case they were concerned about, as they were testing copying an ISO from one desktop to another. It was slower than a test they ran in another VM. Upon further investigation, it was determined:
- They were not testing an actual copy in both instances (One was being offloaded using Microsoft ODX).
- Their test (if it was working) was a test of a low queue depth large block write operation. This wasn’t consistent with a review of vSCSI traces of their existing VDI use case.
- It was still fairly fast when comparing against someone’s laptop.
- Interviewing the use case (Doctors in a hospital) and having a consult with my wife (MD) it was determined that doctors do not copy large ISO files as part of their daily acivities
Normally the best testing of VDI is:
- Spin up a test pool and redirect some users on the pool (taking care to select users who will be using the same applications and workflows as the users that will be scaled later).
- Use a VDI benchmarking application taking great pains to properly configure it. I will note on LoginVSI published benchmarks you sometimes see some hilariously non-realistic desktop testing done to publish “hero numbers”.
- Pull a vSCSI trace and use a automated scaled stesting system to “replay” an amplified synthentic copy of the storage requirements (note this doesn’t test CPU in the same way).
Upon further discussion they decided to just put some users on the cluster, perform a proper pilot test, and scale at user densities they were able to achieve on the pilot going forward. Here is a review of some of the mitigations and discussions we had that helped cool off the storage team’s fears.
Why is VDI percieved as demanding on storage?
Virtual Desktops in the past were known to be a “Scarry” storage heavy workload that put fear into storage admins and brought disk arrays to dust. Why was this?
Boot Storms – Recompose actions or under-provisioned pools needing to catch up with demand would lead to OS boot events. While the steady-state IOPS per desktop might be in the single or two digits, this could result in a spike of 800 IOPS or more per desktop.
Login Storms – Roaming profiles with hundreds or thousands of users who all log in at the same time resulted in huge amounts of data being copied into desktops.
Antivirus Scan Storms – Copy pasting the security posture of your existing desktops, often leads to the security team trying to scan every desktop at noon at the same time.
The reality is these problems have been largely solved (for years), but have sometimes been perpetuated as still issues by storage vendors trying to sell some feature or solution. *Disclaimer, I work for a storage product and while I’d love you to buy vSAN and think it is frankly awesome for VDI, I’m not going to pretend that the above problems can’t be largely mitigated in other ways*.
Boot Storm Mitigations
Use Instant Clones – Instant Clones are “born running”. They use VMFork technology to create writable snapshots of the memory of a running virutal machine. This has the advantage of insanely fast (seconds) desktop creation times.
Pre-stage desktops/rolling recompose – At some scale you can always just schedule recompose operations. A popular trick I used back in the stone ages of lined clones was to create a new pool and set it to auto-scale. I would disable net new connections to the old pool and set the users to only see the new pool. This allowed for a slower transition to the new pool. Combined with throttling new desktop creations to a manageable speed this new pool could slowly grow to the needed capacity. This required a few slack resources but the vSphere scheduler and memory compaction technoligies was generally good for it if you were not running absurd vCPU rations, to begin with. Note, other methods largely solve this from a resourcing method but this method can still be used as a means of slowing testing a new image and allow for rapid “roll back” if the new image has issues (re-enable the old pool and direct new connections back to it).
Cache the blocks used for OS boot – This has been discussed before, but OS boot only needs to call up a few hundred MB of blocks into RAM. Various VDI solutions to provide a DRAM cache to hold these blocks have existed for years (Horizon Content-Based Read Cache, or CBRC). This allows multi-GB read caches to be deployed for the base OS disks to accelerate them. Citrix also with PVS has similar capabilities. Beyond this modern storage arrays with dedupe and multi-hundred GB DRAM caches will make short work of these bits. Remember even for “full clones” any solution with dedupe (or dedupe cache like CBRC) can handle the fact that is it 300MB of hot blocks X 2000 Desktops. vSAN even goes so far as to put DRAM cache local to the hosts where VMs are running to reduce even storage network traffic hits.
Login Storm Mitigations
Profile Virtualization – Technology to cache, and optimize profile load through various mechanisms have been around for a while. While I was cutting my teeth on Persona years ago (which worked, it just required you to know which folders to exclude from the stubbing system) VMware Dynamic Environment Manager is a fantastic solution today. FXLogix and other solutions also exist that can even deal with some of the more annoying elements of profile virtualization *GLARE INTENSIFIES AT OUTLOOK OST FILES THAT DROVE ME CRAZY *. It’s true we used to have to do weird/stupid things with application customization to make profile virtualization work (Make sure Exchange was colocated 1ms from the VDI pool) but those days are long gone.
Antivirus Storm Mitigations
I’ll leave others to speak more in the comments to this one, but a blend of on-access scanning policies and agentless and network-based introspection has largely calmed the challenge of virus scans taking out a cluster. Security is about many layers of an onion providing security here.
Other Minor VDI Resource Issues to think about
Windows Search – This and other services we used to disable to better optimize desktops. I’ll call out that disabling this also breaks outlook email search and even if this leads to 3% increase in density I would argue you don’t need to go to these extremes to optimize desktops. While there are certain things you should optimize, breaking user experience to get an extra 10 users in a cluster likely isn’t worth it anymore. Hardware is cheaper at this point than the emotional cost of annoying users.
Hardware refreshes need to be at way more than 1:1 – I advised a bank recently that was replacing an ancient 5.5 environment with windows XP desktops. They were expecting that by buying hosts with 5x the resources they would get 5x the host density. They were disappointed to learn that:
- The 3 anti-virus solutions they had installed were at war with each other for the 1 vCPU’s they were allocating to each desktops and over subscribing 15:1
- 1GB of RAM wasn’t enough to make users happy
- Their base images were now 6x larger
The reality is we used to make some awful compromises on VDI usability and user experience to make the numbers “work”. Make sure when sizing solutions to understand that with lowered resource cost comes options to do more than save capital costs.
But John? What if I Can’t do X,Y,Z?
Just throw a little more all-flash storage at the problem. We used to get excited about getting the cost of storage down to $100 per user for VDI. Now with all-flash, instant clones and dedupe the storage costs have kind of become a rounding error on the total VDI solution. There used to be an entire field of “VDI storage-specific vendors”, and you’ll find that most of them have completely disappeared. This is because the problem of VDI and storage has largely gone away.