This is a quick post in reaction to Alex Benik’s post at Gigaom. While I like Gigaom’s commentary on the industry at large, they really don’t seem to understand infastruture always. Alex starts out by stating that the current industry practice of separating out applications with their own dedicated OS instances, and having low utilization is a terrible problem. He almost paints hypervisors as part of the problem. he cites 7% CPU usage on EC2 instances as a key example of what is wrong with virtualization and usage.
I’ve got a few quick thoughts on this.
1. The reason Amazon EC2 can be so cheap is because Amazon can over subscribe instances heavily. Low average CPU workloads is the foundation for virtualization and all kinds of other industries (Shared web hosting etc). He’s turned the reason for virtualization being a great cost saving technology into a problem that needs to be solved. If everyone was running them 100% all the time then there would be a problem.
2. He’s assuming that CPU is the primary bottleneck. As others (Jonathan Frappier) have pointed out that storage is often the bottleneck. There comes a point where you can only get so much disk IO to a virtual machine. In large enterprises with Shared Storage Arrays, eventually bottlenecks in storage IO (Queue Depths on HBA’s, LUN’s etc) start to crop up, and eventually it becomes easier to scale out to more hosts, than try to scale deep. VMware and others have created technologies (CBRC, vfrc, vSAN) that will help this. Also memory is helping and hurting this density problem.
3. Until the recent era of large memory hosts, memory was often a bottleneck. As 64 bit databases and applications became ever hungrier to cache data locally this waged a 2 factor war on CPU utilization. Hosts with VM’s with 16GB of ram quickly ran out of RAM before they ran out of CPU. Also Memory and disk IO subtly influence CPU in ways you don’t factor. Once memory is exhausted on a host, and over subscription is occurring. CPU usage can spike, as process’s take longer to finish. In memory workloads allow CPU’s to process data quicker, and jobs finish sooner. Vendor recommendations for ridiculous memory allocations don’t help the solution either (I’m looking at you Sage). When Vendors recommend 64GB of RAM for a database server serving 150 users, its become clear that SQL monkeys everywhere have given up on actually doing proper indexing or archiving and instead are relying on memory cache. This demand on memory causes hosts to fill up long before CPU usage can become a problem unless managers are willing to trust a balloon driver to intelligently swap out the “memory bloat”. (Internally and with customer vCloud deployments I’ve seen much better utilization by oversubscribing memory 2 or 3 times). This is not a bad thing unless an application has scale. (its now cheaper to throw hardware at the problem than write proper code/index/optimizations).
4. He’s also forgetting the reason we went with virtualization in the first place. To separate out applications so that we could update them independently from each other. No longer running into issues where rebooting a server to fix one application caused another application to go down. Anyone who’s worked in shared tenant container hosting can tell you that its not really that great. Comparability matrix’s, larger failure zones and all kinds of problems can come up. For homogenous web hosting its a fine solution. For the enterprise trying to mix diverse workloads it can be a nightmare. We use BSD containers internally for some websites, but beyond that we stick to the hypervisors, as a more general use, stable and easier to support platform. I’d argue JEOS, vFabric and other stripped down VM approaches are a better solution as they enforce instance isolation while giving us massive efficiency gains from the kitchen sync (I’m going to call out websphere on this)deployments of old.
Getting the ratio of CPU to Memory to Disk IO and capacity is hard. Painfully hard. Given that CPU is often one of the cheapest components (and most annoying to try to upgrade) its no wonder that IT managers everywhere who come from a history of CPU’s being the bottlenecks often get a little out of hand in overkill with CPU purchasing. I’ve been in a lot of meetings where I’ve had to argue with even internal IT staff that more CPU isn’t the solution (The graphs don’t lie!) while disk latency is out the roof. I’d strangely argue a current move to scale out (Nutanix/VSAN etc) might fix a lot of broken purchasing decisions (LOTS of CPU, low memory, disk IO).