Benchmarking Badly (Part 2): Bad VDI Testing (and some thoughts on how much VDI resourcing has changed)

A while back I spoke to some customers who were trying to test VDI. They wanted to spend several months testing multiple storage systems for a VDI deployment of 500 users. This was rather confusing to me, as the labor spent validating the storage was likely going to cost more than just throwing a reasonably beefy all-flash cluster at the problem and properly configuring Horizon for their use case. The first use case they were concerned about was copying an ISO from one desktop to another; it was slower than the same test run in another VM. Upon further investigation, it was determined:

  • They were not testing an actual copy in both instances (one copy was being offloaded using Microsoft ODX).
  • Their test (had it been working) was a test of a low-queue-depth, large-block write operation. This wasn’t consistent with a review of vSCSI traces from their existing VDI use case.
  • It was still fairly fast when compared against someone’s laptop.
  • Interviewing the users (doctors in a hospital) and consulting with my wife (an MD), it was determined that doctors do not copy large ISO files as part of their daily activities.

Normally the best ways to test VDI are:

  • Spin up a test pool and redirect some users onto it (taking care to select users who will be using the same applications and workflows as the users that will be scaled later).
  • Use a VDI benchmarking application, taking great pains to properly configure it. I will note that in published LoginVSI benchmarks you sometimes see some hilariously non-realistic desktop testing done to publish “hero numbers”.
  • Pull a vSCSI trace and use an automated, scaled testing system to “replay” an amplified synthetic copy of the storage requirements (note this doesn’t test CPU in the same way); see the sketch just after this list for what summarizing such a trace can look like.
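
If you go the trace route, the first step is usually just characterizing what the trace contains. Below is a minimal Python sketch of that idea; the CSV column names (io_size, is_read) and the file name are hypothetical, so adapt them to whatever your capture tool (vscsiStats or otherwise) actually exports.

```python
# Minimal sketch: summarize a per-IO trace export so a synthetic or replay test
# can be shaped to match it. Column names and file path are hypothetical --
# adapt them to whatever your trace tool actually emits.
import csv
from collections import Counter

def summarize(trace_csv):
    sizes = Counter()
    reads = writes = 0
    with open(trace_csv, newline="") as f:
        for row in csv.DictReader(f):
            sizes[int(row["io_size"])] += 1      # bytes per IO
            if row["is_read"] == "1":
                reads += 1
            else:
                writes += 1
    total = reads + writes
    print(f"IOs: {total}, read mix: {100.0 * reads / total:.1f}%")
    print("Top block sizes (bytes, share of IOs):")
    for size, count in sizes.most_common(5):
        print(f"  {size:>8}  {100.0 * count / total:.1f}%")

if __name__ == "__main__":
    summarize("vscsi_trace.csv")  # hypothetical export path
```

The point isn’t the script itself; it’s that the block sizes, read/write mix, and queue depths you test with should come from something like this rather than from a benchmark tool’s defaults.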

Upon further discussion they decided to just put some users on the cluster, perform a proper pilot test, and scale at user densities they were able to achieve on the pilot going forward. Here is a review of some of the mitigations and discussions we had that helped cool off the storage team’s fears.

Why is VDI perceived as demanding on storage?

Virtual desktops were historically known as a “scary” storage-heavy workload that put fear into storage admins and ground disk arrays to dust. Why was this?

Boot Storms – Recompose actions, or under-provisioned pools needing to catch up with demand, would lead to mass OS boot events. While the steady-state IOPS per desktop might be in the single or low double digits, a boot could spike to 800 IOPS or more per desktop; multiply that by a few hundred desktops booting at once and the spike was enormous.

Login Storms – Roaming profiles with hundreds or thousands of users who all log in at the same time resulted in huge amounts of data being copied into desktops.

Antivirus Scan Storms – Copy-pasting the security posture of your existing desktops often led to the security team trying to scan every desktop at noon, all at the same time.

The reality is these problems have been largely solved (for years), but they are sometimes perpetuated as ongoing issues by storage vendors trying to sell some feature or solution. *Disclaimer: I work on a storage product, and while I’d love you to buy vSAN and think it is frankly awesome for VDI, I’m not going to pretend that the above problems can’t be largely mitigated in other ways.*

Boot Storm Mitigations

Use Instant Clones – Instant Clones are “born running”. They use VMFork technology to create writable snapshots of the memory of a running virtual machine. This has the advantage of insanely fast (seconds) desktop creation times.

Pre-stage desktops/rolling recompose – At some scale you can always just schedule recompose operations. A popular trick I used back in the stone ages of linked clones was to create a new pool and set it to auto-scale. I would disable net-new connections to the old pool and set the users to only see the new pool, which allowed for a slower transition. Combined with throttling new desktop creations to a manageable rate, the new pool could slowly grow to the needed capacity. This required some slack resources, but the vSphere scheduler and memory reclamation technologies were generally good for it if you were not running absurd vCPU ratios to begin with. Note that other methods largely solve this from a resourcing standpoint, but this one can still be used as a way to slowly test a new image and allow for rapid “roll back” if the new image has issues (re-enable the old pool and direct new connections back to it).

Cache the blocks used for OS boot – This has been discussed before, but an OS boot only needs to call up a few hundred MB of blocks into RAM. Various VDI solutions that provide a DRAM cache to hold these blocks have existed for years (Horizon Content-Based Read Cache, or CBRC). This allows multi-GB read caches to be deployed for the base OS disks to accelerate them. Citrix has similar capabilities with PVS. Beyond this, modern storage arrays with dedupe and multi-hundred-GB DRAM caches will make short work of these bits. Remember, even for “full clones”, any solution with dedupe (or a dedupe-aware cache like CBRC) can handle 300MB of hot blocks x 2,000 desktops, because those blocks are largely identical and collapse down to a single cached copy. vSAN even goes so far as to put DRAM cache local to the hosts where the VMs are running, avoiding even the storage network hop.
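
As a back-of-envelope illustration (the 300MB and 2,000-desktop figures come from the paragraph above; the assumption that the hot boot blocks are effectively identical across clones is what makes the dedupe math work):

```python
# Back-of-envelope: why a dedupe-aware cache (CBRC, array DRAM cache, etc.)
# shrugs off a boot storm. Figures are illustrative, taken from the text above.
hot_blocks_per_desktop_mb = 300      # OS blocks touched during boot
desktops = 2000

naive_footprint_gb = hot_blocks_per_desktop_mb * desktops / 1024
deduped_footprint_mb = hot_blocks_per_desktop_mb   # identical blocks collapse to ~1 copy

print(f"Without dedupe: ~{naive_footprint_gb:.0f} GB of boot reads hitting backing storage")
print(f"With a dedupe-aware cache: ~{deduped_footprint_mb} MB of unique blocks to keep hot")
```

In other words, the scary part of a boot storm is the multiplier, and dedupe-aware caching removes the multiplier.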

Login Storm Mitigations

Profile Virtualization – Technologies to cache and optimize profile loads through various mechanisms have been around for a while. While I cut my teeth on Persona years ago (which worked, it just required you to know which folders to exclude from the stubbing system), VMware Dynamic Environment Manager is a fantastic solution today. FSLogix and other solutions also exist that can deal with some of the more annoying elements of profile virtualization *GLARE INTENSIFIES AT OUTLOOK OST FILES THAT DROVE ME CRAZY*. It’s true we used to have to do weird/stupid things with application customization to make profile virtualization work (make sure Exchange was colocated within 1ms of the VDI pool), but those days are long gone.

Antivirus Storm Mitigations

I’ll let others speak more to this one in the comments, but a blend of on-access scanning policies and agentless, network-based introspection has largely calmed the challenge of virus scans taking out a cluster. Security here, as always, is about many layers of the onion.

Other Minor VDI Resource Issues to think about

Windows Search – This is one of several services we used to disable to squeeze more density out of desktops. I’ll call out that disabling it also breaks Outlook email search, and even if that yields a 3% increase in density, I would argue you don’t need to go to these extremes to optimize desktops. While there are certain things you should optimize, breaking the user experience to get an extra 10 users in a cluster likely isn’t worth it anymore. Hardware is cheaper at this point than the emotional cost of annoying users.

Hardware refreshes need to be way more than 1:1 – I recently advised a bank that was replacing an ancient vSphere 5.5 environment running Windows XP desktops. They were expecting that by buying hosts with 5x the resources they would get 5x the desktop density per host. They were disappointed to learn that:

  • The three antivirus solutions they had installed were at war with each other over the single vCPU allocated to each desktop, oversubscribed 15:1.
  • 1GB of RAM wasn’t enough to make users happy
  • Their base images were now 6x larger

The reality is we used to make some awful compromises on VDI usability and user experience to make the numbers “work”. When sizing solutions, understand that lower resource costs open up options to do more than just save on capital.

But John, what if I can’t do X, Y, Z?

Just throw a little more all-flash storage at the problem. We used to get excited about getting the cost of storage down to $100 per user for VDI. Now with all-flash, instant clones and dedupe the storage costs have kind of become a rounding error on the total VDI solution. There used to be an entire field of “VDI storage-specific vendors”, and you’ll find that most of them have completely disappeared. This is because the problem of VDI and storage has largely gone away.

Benchmarking Badly (Part 1): The Single Workload Test

Having been around the industry a while, I’ve noticed there are a lot of changes but a few guarantees when it comes to benchmarking shared storage and HCI clusters:

  • Benchmarking is generally poorly representative of what the production workload will look like.
  • Benchmarking is about trade-offs. There are “easy” ways to do it, but often these are so far from what production will look like that they might as well be skipped.
  • Real benchmarking is hard. There are shortcuts to easier benchmarking. Some are good, some are bad. Either way, it’s critical you understand what trade-offs you make when you choose one.

For anyone who wants a history lesson on why benchmarking is bad, @VMpete has some old but great blogs on this topic.

There are good easy buttons for testing a cluster (HCIBench is a personal favorite) and there are bad easy buttons (CrystalDiskMark, ATTO Disk Benchmark, IOmeter, and other synthetic, desktop-focused testing tools). Today we are going to talk about why single-workload tests are normally poorly done.


It’s often poorly executed – The single workload test

A lot of people can spin up a single virtual machine, fire up a synthetic disk testing application like CrystalDiskMark or IOmeter, and push “Test run”. While this does generate IO, it doesn’t necessarily generate a workload against an HCI cluster that looks anything like what a customer would run. Let’s break down some quick fundamentals.

In your typical VMware cluster, you will find multiple virtual machines with different numbers of drives, processing different block sizes and read/write mixtures, with different overlap in when they send data (some bursty, some constant).

Even clusters with homogeneous, dense workloads don’t look like this single-VMDK test. Even monster scale-out in-memory databases like SAP HANA and Cassandra, and container platforms, recommend more than one virtual machine. Among these applications you will still always see more than one virtual hard drive (VMDK) processing disk IO, possibly with multiple vHBAs attached.

Other common mistakes that go along with using these tools:

The default CrystalDiskMark run only uses a relatively small working set (below 5GB). In any tiered/cached system, there is a strong chance you end up testing IO that is largely served from DRAM caches (either inside the SSDs or within the system’s own caching layer). A 24/7 production environment with large data flows will produce wildly different outcomes.

IOmeter can be configured with multiple workers, but doing so at scale with a diverse set of workloads is going to be painful compared to using something with better synthetic engines, more options, and easier control and reporting, like HCIBench. It’s worth noting that IOmeter has seen one release since 2008, when Intel made it abandonware. VDBench and fio, which are used by HCIBench, have seen a lot more development attention.
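
To make the contrast concrete, here is a rough sketch of driving fio (one of the engines HCIBench uses) with several dissimilar workers in parallel instead of one homogeneous stream. It assumes fio is installed and that /mnt/testds is a datastore-backed mount with room for the test files; the block sizes, mixes, and working-set sizes are placeholders that should ideally come from your own vSCSI traces.

```python
# Hedged sketch: several dissimilar fio jobs running in parallel against one
# datastore, rather than a single homogeneous stream. Paths and job profiles
# are illustrative assumptions, not recommendations.
import subprocess

GLOBAL = [
    "fio",
    "--ioengine=libaio", "--direct=1",
    "--time_based", "--runtime=1800",   # 30 minutes: long enough to get past freshly warmed caches
    "--group_reporting",
    "--output-format=json", "--output=mixed_result.json",
]

JOBS = [
    # (name, pattern, read%, block size, queue depth, working set)
    ("oltp-like", "randrw", "70",  "8k",   "16", "100g"),
    ("logs-like", "write",  None,  "64k",  "4",  "50g"),
    ("scan-like", "read",   None,  "256k", "8",  "200g"),
]

cmd = list(GLOBAL)
for name, rw, mixread, bs, qd, size in JOBS:
    cmd += [f"--name={name}", f"--rw={rw}", f"--bs={bs}",
            f"--iodepth={qd}", f"--size={size}",
            f"--filename=/mnt/testds/{name}.dat"]
    if mixread:
        cmd.append(f"--rwmixread={mixread}")

subprocess.run(cmd, check=True)
```

Even this is still a synthetic approximation; the point is simply that mixed, concurrent, long-running workers are closer to reality than one worker running four short tests back to back.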

Fixed queue depths or block sizes. CrystalDiskMark tests four different blends of block size and queue depth, but:

  • There’s a strong correlation between people fretting about large-block throughput and people who are running workloads that don’t actually send large blocks.
  • The tests are run sequentially, and not in parallel. Again, real storage systems handle what is thrown at them and can’t ask applications to nicely wait 30 seconds for their turn to run a homogeneous workload.
  • These workloads tend to generate high-entropy data (so no dedupe/compression). It could be argued that setting the workload to too low an entropy is cheating, but using real data sets (or tuning the synthetic data to mirror the entropy of your real data) is going to give you a more accurate idea of what production will look like.
  • Not reporting latency is a bit like reporting the horsepower and top speed of a car but ignoring torque when people want to tow a boat…
  • Synthetic engines don’t look like real traffic for a lot of other reasons.

Short Benchmark, Simple Summary

There is also a fatal flaw in CrystalDiskMark’s presentation of data. It’s a simple average summary for each benchmark that fails to show a time series. Without understanding what a system looks like at the beginning of a test (when caches may be less warm but write buffers less full) vs. the end of the test (when cache hits may increase, or buffers may be exhausted), it’s very hard to understand what steady-state performance under load may look like. This is magnified further in that CrystalDiskMark and the like are short tests. For systems that will run under load for hours or days, you want tools that can sustain testing to better emulate your production IO duty cycle (not that CrystalDiskMark would make a good synthetic workload generator even if you could run it longer). Often things like tail latency, jitter, or 99th percentile latency can have disastrous impacts on systems that users have to interact with.
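
A quick illustration of why a one-number average hides the pain. The latency series below is synthetic and purely for illustration; in practice you would feed in a real per-IO latency log from whatever tool you are using.

```python
# Mostly-fast IOs with occasional stalls: the mean looks pleasant, the tail is ugly.
import random

random.seed(42)
latencies_ms = [random.uniform(0.3, 1.0) for _ in range(10_000)]
# Simulate periodic cache misses / exhausted write buffers: 1% of IOs stall badly.
for i in range(0, len(latencies_ms), 100):
    latencies_ms[i] = random.uniform(30.0, 80.0)

def pct(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean  : {mean:6.2f} ms   <- what a one-number summary shows")
print(f"p95   : {pct(latencies_ms, 95):6.2f} ms")
print(f"p99   : {pct(latencies_ms, 99):6.2f} ms   <- what your users actually feel")
print(f"p99.9 : {pct(latencies_ms, 99.9):6.2f} ms")
```

The average here comes out at just over a millisecond, while the 99th percentile is tens of milliseconds; a single summary number would never show you that.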

Who wouldn’t test their 10 Million Dollar ERP solution with this “serious” testing tool!

A good storage system has to handle a wide variety of workloads simultaneously. The single workload/disk test is a bit like testing the effectiveness of an air traffic controller at an airfield that sees one airplane a day. You might see variations in their communication quality with that one airplane, but any serious test is going to stress tracking different planes on different trajectories.

Next up, bad VDI testing – no, copying an ISO is not benchmarking VDI…