How big should my vSAN or vSphere cluster be?
This is a topic that comes up quite a bit. A lot has been written previously about how big your vSphere clusters should be, and Duncan’s musings on this topic are still very valid.
It generally starts with:
“I have 1PB in my storage frame today, can I build a 1PB vSAN cluster?”
The short response is yes, you can certainly build a 1PB vSAN cluster, and build 64 node clusters (there are customers who have broken 2 PB within a cluster, and customers with 64 node clusters), but you should stop and think about whether you should.
We have to stop and think about things beyond cost control when designing for availability. I always chuckle when people talk about arrays having seven 9’s of availability. The question to ask yourself is: if the storage is up but the network is down, does anyone care? Once we include things “outside of storage”, we find that real-world uptime is more limited. The environmentals (power, cooling) of a datacenter are rated at best 99.98% by the Uptime Institute. Traditionally, we tried to make the floor tile that our gear sat on as resilient as possible.
James Hamilton of Amazon has pointed to WAN connectivity as another key bottleneck to uptime.
“The way most customers work is that an application runs in a single data center, and you work as hard as you can to make the data center as reliable as you can, and in the end you realize that about three nines (99.9 percent uptime) is all you’re going to get,”
The Uptime Institute has done a fair amount of research in this space, and historically their definition of a Tier IV facility involved providing only up to 99.99% uptime (4 nines).
Getting beyond 4 nines of uptime for remote users (who are at the mercy of half-finished internet standards like BGP) is possible but difficult.
Availability must account for the infrastructure it rests on, and resiliency in storage and applications must account for the physical infrastructure.
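To make the "storage is up, but the network is down" point concrete, here is a minimal sketch of how availability compounds across a chain of dependencies. It assumes the components fail independently and that all of them must be up for users to be served. The seven 9's and 99.98% figures come from the discussion above; the network figure is purely an assumed value for illustration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(components):
    """Availability of a chain where every component must be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(components.values())


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a (non-leap) year."""
    return (1 - availability) * MINUTES_PER_YEAR


stack = {
    "storage array": 0.9999999,           # the "seven 9's" claim
    "facility (power, cooling)": 0.9998,  # figure quoted above
    "network": 0.9999,                    # assumed, for illustration only
}

combined = serial_availability(stack)
print(f"combined availability: {combined:.5%}")
print(f"expected downtime: {downtime_minutes_per_year(combined):.0f} minutes/year")
```

Under these assumptions the combined figure can never be better than the weakest link, so the extra nines on the array barely move the result.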
Let’s review traditional storage cost and operational concepts, and why we have reached a point today where customers are putting over 1PB into a storage pool.
- Capital Costs – Some features may be licensed per frame, and significant discounts may be given if large purchases are made up front rather than as capacity is needed. Spare capacity and overhead, as a % of a storage pool, become smaller if your growth rate is fixed.
- Opex – While many storage frames may have federation tools, there are still processes that are often done manually, particularly for change control reasons, because of the scale of an outage of a frame (I talked to a customer who had one array fail and take out 4000 VMs, including their management virtual machines).
- Performance – Wide striping, or on hybrid systems aggregating cache, controllers, and ports, reduces the chance of a bottleneck being reached.
- Patching/Change Control – Talking to a lot of customers, I find they are often running the same firmware that their storage array came with. The 15 second “gap” in IO as controllers are upgraded is often viewed as a huge risk. This is made worse by the fact that the most risk-averse application on the cluster effectively dictates patching and change control windows. No one enjoys late night, all-hands-on-deck patching windows for storage arrays.
- Parallel remediation in patch windows – Deploying more storage systems means more manual intervention. Traditional arrays often lack good tools for management and monitoring of parallel remediation. Often, more storage arrays means more change control windows.
- Aligning the planets on the HCL – To upgrade a Fibre Channel array, you must upgrade ESXi, the array, the fabric version, the Fibre Channel HBA firmware, and the server BIOS to align with the ESXi upgrade. This is a lot of moving parts, all of which carry the risk of a corner case being hit.
Let’s review how vSAN addresses these costs without driving you to put everything in one giant cluster.
- Capital Costs – vSAN licensing is per socket, and hosts can be deployed with empty drive bays. Drives for regular servers regularly fall in price, making it cheaper to purchase what you need now and add drives to hosts as needed to meet capacity growth. Overhead for spare capacity for rebuilds does reduce as you add hosts, but nothing forces you to fill each host with capacity up front, and no additional licensing costs are incurred by having partially full servers.
- Opex – vSAN’s normal management plane (vCenter) is easily federated, and storage policies span clusters without any additional work. Lifecycle management tasks, such as controller updates from Config Assist, and health monitoring alerts easily roll up to a single pane of glass.
- Performance – All-flash has changed the game. You no longer need 1000 spindles and wide striping to get fast or consistent performance. Pooling workloads on a 3-tier storage architecture with storage arrays actually increases the chance that you might saturate throughput or buffers on Fibre Channel switching.
- Patching – vSAN patching can be done simply using existing tools for updating ESXi (VMware Update Manager), and lifecycle updates for storage controllers can be pushed by a simple click from the UI in vSAN 6.6. Customers already have ESXi patching windows and processes deployed, and maintenance mode with vMotion is a trusted and battle-tested means to evacuate a host.
- VMware Update Manager (VUM) can remediate multiple clusters in parallel. This means you can patch as many (or as few) clusters as you like, and when used with DRS this is fully automated, including placement of virtual machines.
- Additional intelligence has been deployed for vSAN to include remediation of firmware. Given that vSAN does not use proprietary Fibre Channel fabrics, is integrated into ESXi, and lacks the need for proprietary fabric HBAs, this significantly reduces the number of planets to align when planning an upgrade window.
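The trade-off between rebuild overhead and failure scope can be put into rough numbers. The sketch below is a simplified model, not VMware sizing guidance: it assumes equally sized hosts, assumes you keep one host's worth of capacity free per cluster for rebuilds, and spreads VMs evenly across clusters. The 64 hosts and 4000 VMs are borrowed from the figures mentioned earlier, purely as illustrative inputs; the function names are hypothetical.

```python
def rebuild_reserve_fraction(hosts, host_failures=1):
    """Fraction of a cluster's raw capacity kept free to absorb host failures.

    Simplifying assumption: all hosts contribute equal capacity.
    """
    if hosts <= host_failures:
        raise ValueError("need more hosts than failures to absorb")
    return host_failures / hosts


def layout(total_hosts, total_vms, cluster_count):
    """Compare reserve overhead and outage scope for an even split."""
    if total_hosts % cluster_count:
        raise ValueError("hosts must divide evenly across clusters in this sketch")
    hosts_per_cluster = total_hosts // cluster_count
    return {
        "clusters": cluster_count,
        "hosts_per_cluster": hosts_per_cluster,
        "reserve_fraction": rebuild_reserve_fraction(hosts_per_cluster),
        "vms_per_cluster_outage": total_vms / cluster_count,
    }


for clusters in (1, 2, 4):
    row = layout(total_hosts=64, total_vms=4000, cluster_count=clusters)
    print(
        f"{row['clusters']} x {row['hosts_per_cluster']} hosts: "
        f"reserve {row['reserve_fraction']:.2%}, "
        f"VMs hit by one cluster outage: {row['vms_per_cluster_outage']:.0f}"
    )
```

In this model, splitting one large cluster into several costs a few extra percent of reserved capacity, while the number of VMs exposed to a single cluster-wide event drops in proportion to the number of clusters.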
In summary: while vSAN can certainly scale to multi-PB cluster sizes, you should consider whether you actually need to scale a single cluster that far. In many cases, you would be better served by running multiple clusters at scale.