
Is VDI really not “serious” production?

This post is in response to a tweet by Chris Evans (whom I have MUCH respect for, and who is one of the people I follow daily across all forms of internet media). The discussion on Twitter was unrelated (it was about the failings of XtremIO), and the point that triggered this post was his statement that VDI is “not serious production”.

While I might have agreed 2-3 years ago, when VDI was often stuck in POC or was a plaything of remote road warriors or a CEO, VDI has come a long way in adoption. I’m working at a company this week with 500 users, and ALL users outside of a handful of IT staff work in VDI at all times. I’m helping them update their service desk operations, and a minor issue with VDI (profile server problems) is a critical full stoppage of the business. Even all 3 of their critical LOB apps going down would have less of an impact: at least people could still access email, Jabber and some local files.

I take two perspectives from this.

1. Some people are actually dependent on VDI to access all those 99.99999% uptime SLA apps, so it’s part of the dependency tree.

2. We need to quit using 99.9% uptime SLA systems and processes to keep VDI up. It needs real systems, change control, monitoring and budget. Two years ago I viewed vCOPS for View as an expensive necessity; now I view it as a must-have solution. I’m deploying tools like LogInsight to get better information and telemetry on what’s going on, and training service desks on the fundamentals of VDI management (which used to be the task of a handful of sysadmins). While it may not replace the traditional PC, and in many ways is a middle ground on the way to some SaaS web/mobile app future, it’s a lot more serious today than a lot of people realize.
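To put numbers on the dependency tree argument, here is a rough sketch of the arithmetic. It assumes availability is measured over a 365 day year and that the components fail independently in series (the app is only reachable if the desktop in front of it is up). Both are simplifying assumptions of mine, and the function names are just for illustration.

```python
# Minutes in a non-leap year; a simplifying assumption for SLA math.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def chained_availability(*availability_pcts: float) -> float:
    """Combined availability (in percent) of components in series.

    Assumes independent failures: the chain is up only when every
    component in it is up, so the individual availabilities multiply.
    """
    combined = 1.0
    for pct in availability_pcts:
        combined *= pct / 100
    return combined * 100


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.99999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows {minutes:.2f} minutes of downtime a year")

    # A "seven nines" app reached through a "three nines" desktop.
    print(f"Chained: {chained_availability(99.9, 99.99999):.5f}%")
```

Under those assumptions a 99.9% desktop layer allows roughly 8.8 hours of downtime a year while the 99.99999% app behind it allows about 3 seconds, so the user effectively gets the desktop's number no matter how good the app is.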

I’ve often joked that VDI is the technology of last resort when no other reasonable offering makes sense (keep data in the datacenter, solve apps that don’t work under RDS, organizations that can’t figure out patch/app distribution, a highly mobile but poorly secured workforce). For better or for worse it’s become the best tool for a lot of shops, and it’s time to give it the respect it deserves.

At least the tools we use to make VDI serious today (VSAN/vCOPS/LogInsight/Horizon View 6) are a lot more serious than the stuff I was using 4 years ago.

My apologies for calling out Chris (which wasn’t really the point of this article), but I will thank him for giving me cause to reflect on the state of VDI “seriousness” today.

How does your organization view and depend on VDI today, and is there a gap in perception?

Why you shouldn’t run BCA off a Synology (or QNAP, or other cheap Linux NAS)…

In my life as a VMware consultant I run into the following Mad Lib when trying to solve storage problems for Business Critical Applications.

A customer discovers they have run out of (IOPS/Capacity/Throughput/HCL) with their existing (EMC/Dell/HP/NetApp) array. They sized only for capacity without understanding that (RAID 6 with NL-SAS is slow, 2GB of cache doesn’t deliver 250K IOPS). They have spent all their (Budget/Rackspace/Power/Political Power/Moxie). There is also an awkward quiet moment where it’s realized that (thick provisioning on thick provisioning is wasteful, I can’t conjure IOPS out of a hat, dedupe is only 6%, snapshots are wasting 1/2 of their array and are still not real backups, they can’t use COW so SRM can’t test failover). Searching for solutions, they hear from a junior tech that there is this new (home-made/SOHO appliance) that can meet their (Capacity/IOPS) needs at a cheap price point. And if they buy it, it probably will work… For a while.
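For anyone wondering why “RAID 6 with NL-SAS is slow”, a back-of-the-napkin sketch helps. It uses the common rule-of-thumb write penalties (2 for RAID 10, 4 for RAID 5, 6 for RAID 6). The drive count, the 75 IOPS per NL-SAS spindle and the 30% write mix are illustrative assumptions of mine, not measurements from any customer.

```python
# Rule-of-thumb back-end I/Os generated per front-end write.
RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def usable_iops(drive_count: int, iops_per_drive: float,
                write_fraction: float, raid_level: str) -> float:
    """Estimate front-end IOPS a RAID group can sustain.

    Each read costs one back-end I/O and each write costs `penalty`
    back-end I/Os, so the raw spindle IOPS are divided by the blended
    cost per front-end I/O. Cache effects are ignored.
    """
    raw_iops = drive_count * iops_per_drive
    penalty = RAID_WRITE_PENALTY[raid_level]
    blended_cost = (1 - write_fraction) + write_fraction * penalty
    return raw_iops / blended_cost


if __name__ == "__main__":
    # Hypothetical shelf: 12 NL-SAS drives at ~75 IOPS each, 30% writes.
    for level in ("raid10", "raid5", "raid6"):
        print(level, round(usable_iops(12, 75, 0.30, level)))
```

With those assumed inputs the same 12 spindles deliver well under half of their raw IOPS once they sit behind RAID 6, which is the gap that sizing only for capacity hides.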

Here’s what’s missing from the discussion.

1. The business needs better than a 3-5 day turnaround on parts replacement or ticket responses. (Real experiences with these devices.)

2. The business needs something not based on desktop class non-ECC RAM motherboards.

3. The business needs REAL HCLs that are verified and not tested on customers. (For quite a while QNAP was saying that Green drives, which lack proper TLER and are not designed for RAID, would be fine to use.)

4. The business needs systems that are actually secured.

Now I’ve heard the other argument “but John I’ll have 2 of them and just replicate!”

This is fine (once you realize that RSYNC and VMDKs don’t play nice) until you get bitten by a code bug that hits both platforms. While technically on the VMware HCL, these guys are using open source targets (iSCSI and NFS) and are so far removed from the upstream developers that they can’t get anything fixed or verified quickly. Two systems that share a nasty iSCSI MPIO bug or an NFS timeout problem are worse than one system that “just works”. Also, as these boxes are black boxes, they often miss out on the benefits of open source (you patch and update on their schedule, which is why my QNAP at one point had a version of OpenSSL that was 4 years old despite being on the newest release). If both systems have hardware problems because of a power surge, thermal problems, user error or a bad batch, you’re still stuck waiting days for a fix. If it’s software, you may be holding your breath for quite a while. With a normal server OEM or Tier 1 storage provider you have parts in 4 hours, and reliability and freedom that these boxes can’t match.

Now at this point you’re probably saying “but John, I need 40K IOPS and I don’t have 70K to shovel into an array.”

And that’s where Software Defined Storage bridges the gap. Now with SuperMicro you can get solid off-the-shelf servers with 4 hour support agreements without breaking the bank (this new parts support program is global, BTW). For storage software you can use VMware VSAN, a platform that reduces costs and complexity and delivers great performance. You massively reduce your support footprint (one company for hardware, one for software), reducing both operational and capital costs.

Nothing against the Synology, QNAP and Drobo of the world, but let’s stick to the right tool for the right job!

KeyNote Part1

VMware: EVO:RAIL – It looks like our shift to SuperMicro for VSAN was the right choice. Will be be looking for
EVO:Rack – A vBlock without limits? We will see.

OpenStack – VMware is pushing a massive amount of code to OpenStack so that OpenStack can control vSphere, NSX etc., allowing people to run VMware APIs and OpenStack APIs for higher level functionality.

Containers – Docker, Google, Pivotal are allowing very clean and consistent operational deployments.

NSX – Moving security from the edge to Layer 2. Get ready to hear “Zero Trust networking”. The biggest challenge in enterprise shops is that they are going to have to define and understand their networking needs on a granular level. For once, network security ability will outrun operational understanding. If you’re a sysadmin today, get ready to have to understand and defend every TCP connection your application makes, but take comfort in that policy engines will allow this discussion to only have to happen once.

Cloud Volumes – While I’m most excited about this as a replacement for Persona, there are so many use cases (Physical, Servers, ThinApp, Profiles, VDI) that I know it’s going to take some serious lab time to understand where all we can use this.

vCloud Air – In a final attempt to get SEs everywhere to quit calling it “Veee – Cheese”, VMware is re-branding it. I was skeptical last year, but have found a lot of interest from clients in recent months as hurricane season closes in on Houston.

VMworld Day 1

I’m looking forward to this week and here are a few highlights of what I’ll be looking into.

On the tactical

1. Settling on a primary load balancing partner for VMware View. (Eyeing Kemp, anyone have any thoughts?) I’ve got a number of smaller deployments (a few hundred users) that need non-disruptive maintenance operations and patching on the infrastructure, and are looking to take their smaller pilots or deployments forward.

2. Learn more about VDP-A designs, and best practices. I’ve seen some issues in the lab with snapshots not getting removed from the appliance and need to understand the scaling and design considerations better.

3. Check out some of the HOL updates. Find out if a VVOL lab in the office is worth the investment.

On the more general strategic goals.

Check out cutting edge vendors, and technologies from VMware.

CloudVolumes – More than just application layering. Server application delivery, profile abstraction that’s fast and portable, and a serious uplift to Persona and ThinApp. I’m really interested in use cases where it is used as a delivery method for ThinApp.

DataGravity – In the era of software defined storage, this is a company making the case that an array can still provide a lot of value. Very interesting technology, but the questions remain. Does it work? Does it scale? Will they add more file systems, and how soon will EMC/HP buy them to bolt this logic into their Tier 1 arrays? Martin Glassborow has made a lot of statements that new vendors don’t do enough to differentiate, or that we’ve reached peak features (snaps, replication, cache/tier flash, data reduction etc.), but it’s interesting to see someone potentially breaking out of this mold of just doing the same thing a little better or cheaper.

VSAN – Who is Marvin? What happened to Virsto? I’ve got questions and I hope someone has answers!

LSI 2008 Dell H310 VSAN rebuild performance concerns

Just a quick note for anyone seeing VSAN performance issues with Dell H310 or LSI 2008 controllers. It’s not a secret that the LSI 2008 and Dell H310 with stock firmware have a very shallow queue depth of 25 (an LSI 2208 in comparison is 600, and a Dell H710 is 975). These are some of the weakest cards certified on the HCL, and they should be fine for a small ROBO deployment or a low VM count with low contention. Remember, part of the benefit of SDS is that you can scale down.
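As a rough way to see why queue depth matters, Little's Law says the number of I/Os in flight equals throughput times latency, so a controller that can only hold N outstanding I/Os tops out at about N divided by the per-I/O service time. The sketch below applies that. The 5 ms service time is a made-up illustrative figure, and the model ignores everything else in the stack (per-device queues, cache, the network).

```python
def max_iops(queue_depth: int, service_time_ms: float) -> float:
    """Upper bound on IOPS from Little's Law.

    outstanding_ios = iops * latency, so with at most `queue_depth`
    I/Os in flight the ceiling is queue_depth / latency (in seconds).
    """
    return queue_depth / (service_time_ms / 1000.0)


if __name__ == "__main__":
    # Queue depths quoted in the post; 5 ms per I/O is an assumption.
    for name, depth in (("LSI 2008 / H310 stock", 25),
                        ("LSI 2208", 600),
                        ("Dell H710", 975)):
        print(f"{name}: ~{max_iops(depth, 5.0):,.0f} IOPS ceiling")
```

The absolute numbers depend entirely on the assumed service time, but the ratio does not: at the same latency, a 600 deep queue has 24 times the headroom of a 25 deep one, and that headroom is what a rebuild eats.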

For a quick look at firmware queue depths check out Duncan’s article on this.

Now one user tried to push things a little too far, running 5 hosts (only 3 of them with storage) and 70+ VMs on Dell H310s. Performance and experience were fine, until he lost one node and a rebuild kicked off. Running 70 VMs on 2 hosts combined with the replication overhead was too much and caused an interruption of service (but no data corruption). VMware support tied it back to poor performance of the H310 and heavy load on a degraded 2 node system trying to rebuild.

In Synchronet’s labs, one of my earlier VSAN lab builds ran into some odd write latency that I had initially suspected was the result of an excessively cheap 10Gbps switch. While building out a validation of a solidly performing SMB bundle this week, I set out to get to the root cause. A quick test last Friday (running the vCenter Appliance install wizard) showed a perfectly good Intel S3700 200GB SSD spiking over 30ms of write latency at ~250 IOPS and continuing upward. This test was performed against a regular SSD datastore and not a VSAN datastore, isolating the issue. Previous IOMeter workloads showed thousands of read IOPS, but write IOPS choking pretty quickly with high latency and low cache rates. Subsequent lab equipment did not have this issue, but had used newer switches, muddling the root cause.

We have upgraded our LSI cards to the current firmware, and as of this evening confirmed that the queue depth on the 2008 is increased to 600. While technically not supported, the H310 is an LSI 2008 controller, and if this is for a lab or a POC, or you’re feeling brave, you can follow this guide here. (Note this is not endorsed or supported by VMware/Dell.) I’ll see how far I can push writes this afternoon, but I expect this should make our similar issues go away. Alternatively, upgrading to something with a queue depth of at least 600 should help fix this (it’s generally ~$150 per host for a mediocre HBA/RAID controller that supports decent queue depths).

This is a quick flash (and reminder) of why it’s a good idea to work with a VSAN partner who validates their builds, understands the technology, and is ready to support you from architecture to implementation (OK, that’s my quick advertisement). One thing I am offering right now to anyone interested is a quick architecture call and a setup of the VSAN assessment tool (TM). I can hopefully help get you the raw information you need to make intelligent decisions on things like queue depth and cache size, and to understand things like the data skew that is important to setting a foundation for a solid VSAN design.

As with any new technology I encourage everyone to do their homework. There’s a lot of FUD going around (and just a lack of training and knowledge from a lot of vendors), and issues like this are part of why I’ve been happy to work with VMware’s software defined storage team and the hardware OEMs for our customer builds. We get greater flexibility and quicker responses with helpful information (like JBOD mode on LSI 2208s in a previous post, adjustments to the HCL, or documentation and firmware to fix the queue depth issue). For those of you looking for a quick summary: VSAN is a great product and very powerful. Remember that a configuration that would be fine for 5 VMs in a branch isn’t going to look the same as one for 100 VMs in a datacenter, or 1000 in a VDI farm. Building the cheapest VSAN configuration that has enough capacity should not be the goal, and cutting corners in the wrong areas can sneak up on you. I do expect some updated guidance from support and the HCL on this. I am hearing a queue depth of 256, but realistically, given how cheap 600 queue depth controllers are and how annoying it is to swap out an HBA and re-cable things, I’d encourage everyone to start at 600 if it makes sense.

Special thanks to…

VMware – for getting a quick Root Cause Analysis, for providing a steady voice of reason on support from the first read of the story so that nothing further drastic was done, and for staying involved until the situation was resolved.
JasonGill – for providing us a good read and not jumping to conclusions that this was a software bug.
Brandon Wardlaw – for braving the controller upgrades and somehow not bricking any of my old lab gear.
LSI – For making an updated firmware for their old controllers and not hiding it.

Note: expect follow up posts and edits. As I get more benchmarks and numbers about various queue depths I’ll post them (or if anyone has any, send them to me!). I’m also going to be benchmarking some different switches (Cisco, Juniper, Brocade, Netgear) over the following weeks, and hopefully, if I have time, I’ll publish some of our results on things like a 140mpps 1Gbps Brocade vs. a terrifyingly cheap Netgear 10Gbps switch.

*UPDATE*

We did some testing in the lab without VSAN, just doing a basic vCenter Server install to SSDs (Intel 3500) on this datastore. With the default firmware (queue depth 25) we saw 30ms of write latency at 352 write IOPS. Using an LSI-based upgrade firmware with a queue depth of 600 we pushed 1500 IOPS at sub-1ms latency (0.17 ms). It’s pretty clear that outside of light/ROBO usage the H310 controller is unsuitable for VSAN usage until Dell supports the LSI code upgrade. What is very concerning is that 4 out of the 8 Dell VSAN nodes are based on this underpowered HBA. This includes one with 15K drives that I’m guessing is meant to be a performance option. This raises a few questions.

1. Does Dell have a firmware upgrade from LSI to push out (we know it exists) to help resolve this?
2. Did Dell run any benchmarks or have any storage architects look at these configs before they submitted them as a VSAN ready build?
3. I’m hearing reports from the field that Dell reps are saying that VSAN isn’t supported for production workloads. Is this underpowered config partly to blame for this?


PSA: Developers and SQL admins do not understand storage

Thin Provisioning is one of my favorite technologies, but with all great technology comes great responsibility.

This afternoon I got a call from a customer having an issue with a SQL backup. They were preparing a major code push and were running a scripted full SQL backup to have a quick restore point if something went wrong.
I was sent the following:


10 percent processed.
20 percent processed.
30 percent processed.
40 percent processed.
50 percent processed.
60 percent processed.
70 percent processed.
80 percent processed.
90 percent processed.
Msg 64, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)

The server had frozen from a thin provisioning issue, but tracing through the workflow that caused this highlighted a common problem with SQL administrators everywhere. Backups were being done to the same volume/VMDK as the actual database. For every 1GB of SQL database there was another 10GB of backups, wasting expensive tier 1 storage.

The Problem:

SQL developers LOVE to make backups at the application level that they can touch/see/understand. They do not trust your magical Veeam/VDP-A. Combine that with NTFS being a relatively thin-unfriendly file system (always writing to new LBAs when possible), and even if a database isn’t growing much, placing backups on the same volume means any attempt at staying thin, even on the back end array, is going to require extra effort to reclaim space. They also do not understand the concept of a shared failure domain, or data locality. If left to their own devices they will put the backups on the same RAID group of expensive 15K or flash drives, and will go so far as to put them on the same volume/VMDK if possible. Outside of the obvious problems for performance, risk, cost, and management overhead, this also means that your Changed Block Tracking and backup software is going to be backing up (or at least having to scan) all of these full backups every day.

The Solution:

Give up on arguing with them that your managed backups are good enough. Let them have their cake, but at least pick where the cake comes from and goes.
Create a VMDK on a separate array (in a small shop something as cheap as a SATA backed Synology can provide a really cheap NFS/iSCSI target for this). Exclude this drive from your backups (or adjust when it runs so it doesn’t impact your backup windows).
Carefully explain to them that this new VMDK (name the volume backups) is where backups go.
Now accept that they will ignore this and keep doing what they have been doing.
In Windows turn on file screens and block the file extensions for SQL backup files.
Next turn on reporting alerts to email you anytime someone tries to write such a file, so you’ll be able to preemptively offer to help them set up the maintenance jobs so they will work.
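Before turning on the file screens it can help to know how bad things already are. Here is a small sketch that walks a data volume and lists anything that looks like a SQL backup. It assumes backups use the conventional .bak and .trn extensions (SQL Server does not enforce these, so adjust to whatever your developers actually use), and the path in the example is hypothetical.

```python
import os

# Conventional SQL Server backup extensions; an assumption, since
# SQL Server lets people name backup files anything they like.
BACKUP_EXTENSIONS = {".bak", ".trn"}


def find_backup_files(root):
    """Return (path, size_in_bytes) for every backup-looking file under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in BACKUP_EXTENSIONS:
                path = os.path.join(dirpath, name)
                hits.append((path, os.path.getsize(path)))
    return hits


def total_gib(hits):
    """Sum the sizes from find_backup_files() and convert to GiB."""
    return sum(size for _path, size in hits) / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical data volume; point this at the drive holding the MDF/LDF files.
    found = find_backup_files(r"D:\SQLData")
    for path, size in found:
        print(f"{size:>15,d}  {path}")
    print(f"Total backup data on the database volume: {total_gib(found):.1f} GiB")
```

The total is a handy number to bring to the conversation about where backups should live.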

Why VDI?

I was reading Justin Paul’s “Justifying the Cost of Virtual Desktops: Take 2”
(http://www.jpaul.me/?p=6597) and had some thoughts on where he sees the cost model of VDI. I know Brian Madden has talked at great length about all the false cost models for VDI that exist (and I’ve seen them in the field).

1. I agree with Justin on power, with some narrow changes. Unless it’s a massive deployment, another 4 hosts in the data center isn’t going to break the bank. Unless you’re forcing people to use thin clients, you’re also not saving anything real on the client side (and certain things (Lync, MMR etc.) require Windows Embedded clients at a minimum anyway). The only case where I’ve successfully made this argument was a call center that was 24/7 and handled disaster operations in Houston. After Ike everyone learned how hard it is to find fuel, so anything that reduces the generator and battery backup budget has real implications.

2. Justin does make good points about SA and keeping up with the Windows OS releases on physical machines is just as expensive as VDA. Sadly this is only true if companies are not just standardizing on Windows 7 and running it into the ground for the next 5-7 years. Hey it worked for XP right?

3. While I agree a ticket system helps track time spent restoring machines etc., no one makes non-billable IT resources track time with the level of detail and meta tags/search needed to make building an in-house ROI model possible. The best luck I’ve had is having people do a week-long survey broken down into 15 minute intervals, which is as close as you’ll get in-house IT to go. It’s painful to get even that done. Unless your desktop support is outsourced (and you have access to their reports!), this is sadly always going to be a fuzzy, poorly tracked cost. I’d argue VMware Mirage (or an equally good application streaming/imaging system) can provide a lot of the opex benefits without the consolidation and other pros/cons of VDI. VDI extends beyond imaging and break/fix. It’s about mobility, security, and

4. People work from home today with VPN and Shadow IT (LogMeIn etc.). The ability to do this isn’t what you sell; it’s the execution and polish (give a sales person a well maintained PCoIP desktop and they will grab their iPad and never come back to the office). It’s the little things (like ThinPrint letting them print to their home printer). Ultimately it isn’t the “occasional” or snow day remote users that sell VDI. It’s the road warriors and branch offices (who are practically the same thing, given how little attention they typically get from central IT).

Why software storage is far less risky to your business

I was talking to a customer who was worried about the risks of a software based storage system, but I keep thinking back to all of the risks of buying “hardware” defined storage systems. Here are a few situations I’ve seen over the years (I’m not picking on any of these vendors here, just explaining situations with context).

1. Customer buys IBM N-Series. Customer’s FAS unit hits year 4 of operation. Customer discovers support renewal for 1 year will cost 3x the price of buying a new system. Drives have custom firmware and cannot be purchased second hand in the event the system needs emergency life support as a tier 2 system.

Solution: Customer can extend support on HP/Dell servers without ridiculous markups. StarWind, VMware VSAN and other software solutions don’t care that you’re in “year 4”.

2. Customer has an old VNXe/VNX kit. Customer would like to use flash or scale up the device with lots and lots of drives. Sadly, the FLARE code running on this was not multi-threaded. Customer discovers that this critical feature is coming out but will require a forklift upgrade. Customer wonders why they were sold an array with multi-core processors that were bragged about when the core storage platform couldn’t actually use them. Flash storage pool is pegging out a CPU core and causing issues with the database.

Solution: Software companies want everyone on the new version. Most storage software companies (VMware VSAN, StarWind etc.) include new features in the new version. Occasionally there will be something crazy good that’s an added-cost feature, but at least you’re not looking at throwing away all the disks (and investments in controllers) you’ve made just for a single much needed feature.

3. Customer bought MD3000i. One year later VMware puts out a new version, and failover quits working on the MD3000i. Dell points out the device is end of support and LSI isn’t updating it. Customer gets sick of all-paths-down situations and keeps their environment on an old ESXi release, realizing that their 2 year old array is an albatross.
Discussions of a sketchy NFS front end kludge come up, but in the end the customer is stuck.

Solution: Had another customer have this happen (with Datacore), but this customer was running it on COTS (DL180 + MSAs stacked). Customer could easily switch to a different software/storage vendor (StarWind etc.). In this case they were coming up on a refresh, so we just threw on CentOS and turned the thing into a giant Veeam target.

Software based storage fundamentally protects you from the #1 unpredictable element in storage: the vendor.

The VSAN build 2 (Watch out for partitions!)

A quick post. My disks had been burn-in tested by AcmeMicro, so they had partitions on them. To protect you from yourself, VSAN will refuse to install on disks with existing partitions.

First, run a quick check for the disk IDs (naa.##############).

~ # esxcli storage core device list
naa.5000c500583aeb05
Display Name: Local SEAGATE Disk (naa.5000c500583aeb05)

Once you have the IDs, check for partition tables. (Note that after the first two lines, which give the label type and the disk geometry, the first number on each line is the partition number, so in this case I have partitions 1 and 2.)

~ # partedUtil getptbl /dev/disks/naa.5000c500583aeb05
msdos
121601 255 63 1953525168
1 2048 718847 7 128
2 718848 1951168511 7 0

Last, we have to delete the partition info.

~ # partedUtil delete /vmfs/devices/disks/naa.5000c500583aeb05 1
~ # partedUtil delete /vmfs/devices/disks/naa.5000c500583aeb05 2
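If you have more than a couple of disks to clean, typing these by hand gets old. Here is a small sketch that takes the text printed by partedUtil getptbl and builds the matching delete commands. It only generates the command strings (it does not run anything), it assumes the output layout shown above (label type, geometry, then one line per partition), and the function names are my own.

```python
from typing import List


def partition_numbers(getptbl_output: str) -> List[int]:
    """Extract partition numbers from `partedUtil getptbl` output.

    Assumes line 1 is the label type (e.g. msdos or gpt), line 2 is the
    disk geometry, and every following non-empty line starts with the
    partition number.
    """
    lines = getptbl_output.strip().splitlines()
    numbers = []
    for line in lines[2:]:
        fields = line.split()
        if fields:
            numbers.append(int(fields[0]))
    return numbers


def delete_commands(device_path: str, getptbl_output: str) -> List[str]:
    """Build one `partedUtil delete` command per existing partition."""
    return [f"partedUtil delete {device_path} {number}"
            for number in partition_numbers(getptbl_output)]


if __name__ == "__main__":
    sample = """msdos
121601 255 63 1953525168
1 2048 718847 7 128
2 718848 1951168511 7 0
"""
    device = "/vmfs/devices/disks/naa.5000c500583aeb05"
    for command in delete_commands(device, sample):
        print(command)
```

Review the generated commands before pasting them into a host; deleting partitions on the wrong device is not something VSAN can protect you from.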

At this point we can now install VSAN and eat cake 🙂

Migrate from Windows to Linux Appliance vCenter Server

A quick post here for people migrating to the VCSA. I just wanted to point out that the Inventory Snapshot tool at VMware Flings is a great way to ease the migration from a Windows to a Linux vCenter Server, or to help “back up” the configuration of a vCenter Server. It doesn’t get everything. You’ll still want to back up and restore distributed switching especially, and be aware you’ll lose historical performance information, but this does simplify a lot of other re-work that would normally need to be done for the migration. The following does need to be redone or migrated separately, but at least this can help quite a bit.

– Cluster rules
– Cluster DRS groups
– Cluster EVC mode setting
– Customization Specifications
– Scheduled tasks
– vDS

https://labs.vmware.com/flings/inventorysnapshot