So it’s time to stand up your new VMware cluster. You have reviewed your compute and storage requirements, picked hosts with 1-2TB of RAM, 100-300TB of storage, and dual-socket 32-core CPUs, and you are ready to begin that important consolidation project. You will be consolidating 3:1 from older hosts, and just before you deploy you get one additional requirement.
Networking Team: “We can only provision 2 x 10Gbps to each host”
You ask why, and get a number of $REASONS:
- Looking at average utilization for the month, it was below 10Gbps.
- 25G/100Gbps cables and optics sound expensive.
- Faster speeds seem unnatural and scary.
- Networking speed is a luxury for people who have Tigers on gold leashes, and we needed to save money somewhere.
- There is no benefit to operations.
- We are not due to replace our top of rack switches until 2034.
Now all of these are bad reasons, but we will walk through them starting with the first one today.
What is the impact of slow networking on my host?
Now you may think that slow networking is a storage team problem, but undersized networking can impact a lot of different things. Other issues to expect from undersized networking:
1. Slower vMotions, higher stun times, and longer host evacuations. The more bandwidth-intensive traffic you stuff onto the same link, the greater the contention during host evacuations. This impacts maintenance mode operations and data resynchronization times.
2. Slow backup and restore. While backups may be slower, we can somewhat cheat slow networking by using CBT (Changed Block Tracking) and only doing forever-incrementals. Slow large data restore operations are the bigger concern with undersized networking. After a large-scale failure or ransomware attack you may discover that rehydrating large amounts of data over 10Gbps is a lot slower than over 100Gbps. There is always a bottleneck in backup and restore speed, but the network is generally the cheapest resource to fix. You can try to mitigate this with scale-out backup repositories, more data movers/proxies, and more hosts and SAN ports, but in the end this is far less cost effective than upgrading the network to 25/50/100Gbps.
3. Slower networking for storage manifests itself as worse storage performance, specifically on large throughput operations, but also in short microbursts where latency creeps up. Keep in mind that 10Gbps sounds like a lot, but that is *per second*. If you are trying to move a large block of data in under 5ms, a single 10Gbps port can only move 6.25MB in that window (see the quick math sketch after this list). As we try to pull average latencies down lower we need to be cognizant of what that link speed means for burst requests. Overtaxed network storage will often mask the true peak demand as back pressure and latency creep in. Pete has a great blog on this topic.
4. Slower large batch operations. Migrations, database transform-and-load operations, and other batch jobs are often bandwidth constrained. You the operator may just see this as a 1-2 minute “blip”, but turning a 1-2 minute response in an end user application into a 10-20 second response can significantly improve the user experience of your application.
5. Tail latency. Applications with complicated chains of requests are often fundamentally bound by the one outlier in response times. Faster networking reduces the chance of contention somewhere in that 14-layer micro-service application the devops team has built.
6. Limitations on storage density. For HCI or any scale-out storage system you will want adequate network bandwidth to handle node failure gracefully. vSAN has a number of tricks to reduce this impact (ESA compresses resync traffic, durability components), but at the end of the day you will not want 300TB in a vSAN/Ceph/Gluster/MinIO node on a 10Gbps connection. There is an insidious feedback loop here: slow networking forces expensive design decisions (lower-density hosts, and more of them) that often mask the need for faster networking. Even non-scale-out platforms will eventually hit walls on density. A monolithic storage array can scale to a lot more density and run wider fan-out ratios using 100Gbps ethernet than 10Gbps ethernet.
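To put rough numbers behind items 2 and 3 above, here is a back-of-the-envelope sketch in Python. It assumes a single port running at ideal line rate with zero protocol overhead, and the 100TB restore is just an illustrative dataset size, not a recommendation.

```python
# Back-of-the-envelope math for what a link speed actually buys you.
# Assumes ideal line rate with zero protocol overhead, which is optimistic.

def bytes_per_window(link_gbps: float, window_ms: float) -> float:
    """Bytes a single port can move in a given time window."""
    return link_gbps * 1e9 * (window_ms / 1000.0) / 8

def restore_hours(dataset_tb: float, link_gbps: float) -> float:
    """Hours to rehydrate a dataset over a single link at line rate."""
    return dataset_tb * 1e12 * 8 / (link_gbps * 1e9) / 3600

for gbps in (10, 25, 100):
    burst_mb = bytes_per_window(gbps, 5) / 1e6
    hours = restore_hours(100, gbps)
    print(f"{gbps:>3} Gbps: {burst_mb:6.2f} MB in a 5 ms window, "
          f"~{hours:5.1f} hours to restore 100 TB")
```

At 10Gbps that 100TB restore takes roughly a full day of wall-clock time, before protocol overhead, rehydration from deduplicated storage, or contention with production traffic slows it down further.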
Let us dig into the first and most common objection to upgrading the network.
“Looking at average utilization for the month it was below 10Gbps”
How do we as architects respond to this statement?
Networks are bursty is my short response to this. Pete Koehler calls this “the curse of averages”. Most of the tooling people use to make this statement is SNMP monitoring that polls every few minutes. This approach is fine for slowly changing things like temperature, or binary health events like “is the power supply dead?”. Unfortunately for networking, a packet buffer can fill up and cause back pressure and congestion in as little as 100ms, and SNMP polling every 5 minutes is not going to cut it. Conversely, context around WHEN a network is saturated is important. If the network is saturated in the middle of the night when backups, database maintenance, or ETL runs, I might not actually care. Using an average with a poor sampling frequency, across times when I both do and do not care about congestion, is about the worst possible way to make a design decision.
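As a toy illustration of the curse of averages (the traffic numbers below are made up for illustration, not real telemetry), here is how a one-second saturation event disappears into a comfortable-looking 5-minute SNMP average on a 10Gbps link:

```python
# Toy example: a 10 Gbps link sampled at 100 ms resolution over a 5 minute
# SNMP polling interval. The link is saturated for one second and nearly
# idle the rest of the time.

LINK_GBPS = 10
samples = [1.0] * 2990 + [10.0] * 10   # Gbps per 100 ms slice, 300 s total

average = sum(samples) / len(samples)
peak = max(samples)

print(f"5-minute average: {average:.2f} Gbps "
      f"({average / LINK_GBPS:.0%} of line rate)")
print(f"Worst 100 ms slice: {peak:.2f} Gbps (saturated, buffers filling)")
```

The average says the link is barely 10% utilized; the application that hit the saturated second would disagree.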
There are ways to understand congestion and its impacts. You may notice that outliers in storage latency polling correspond with high network utilization at the same time. You can also get smarter about monitoring and have switches deliver syslog information about buffer exhaustion to your operations tool, and overlay this with other metrics like high CPU usage or high storage latency to understand the impact of slow, undersized networking. (Screenshot of LogInsight generating an alarm).
Why is observability on networking often bad?
Operations teams are often a lot more blind to networking limitations than they realize. Now it’s true this tooling will never be perfect, as there are real challenges in getting to 100% complete network monitoring.
Why not just SNMP poll every 100ms?
The more frequent the polling, the more likely the monitoring itself starts to create overhead that impacts the networking devices or the hosts themselves. Anyone who has turned on debug logging on a switch and crashed it should understand this. Modern efforts to reduce this overhead (dedicated ASIC functions for observability, separation of observability from the data plane in switches) do exist. It is worth noting vSAN has a network diagnostic mode that polls down to 1-second intervals, which is pretty good for acute troubleshooting.
Can we just monitor links smarter?
Physical fiber taps that sit in line and sniff/process the size/shape/function/latency of every packet do exist. Virtual Instruments was a company that did this. People who worked there told me “storage arrays and networks lie a lot”, but the cost of deploying fiber taps and dedicated monitoring appliances per rack often exceeds just throwing more merchant silicon at the problem and upgrading the network to 100Gbps.
What tooling exists today?
Event-driven tooling is often going to be the best way to detect network saturation. Newer ASICs and APIs exist, and even simply having the switch shoot a syslog event when congestion is happening can help you overlay networking problems with application issues. VMware Cloud Foundation’s built-in log analytics tooling can help with this, and you can overlay it with the VCF Operations performance graphs to get a better understanding of when the network is causing issues.
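As a rough sketch of what that overlay can look like (the event timestamps and latency samples below are invented for illustration, and the formats are assumptions rather than the output of any specific product), correlating congestion events from switch syslog with storage latency samples does not take much code:

```python
# Hypothetical sketch: flag storage latency samples that land within a short
# window of a switch buffer-exhaustion syslog event. Timestamps are invented.

from datetime import datetime, timedelta

congestion_events = [                      # parsed from switch syslog
    datetime(2024, 5, 1, 2, 15, 7),
    datetime(2024, 5, 1, 14, 3, 42),
]

latency_samples = [                        # (timestamp, latency in ms)
    (datetime(2024, 5, 1, 2, 15, 10), 38.0),
    (datetime(2024, 5, 1, 9, 0, 0), 1.2),
    (datetime(2024, 5, 1, 14, 3, 45), 41.5),
]

WINDOW = timedelta(seconds=30)

for ts, latency_ms in latency_samples:
    congested = any(abs(ts - event) <= WINDOW for event in congestion_events)
    note = "<- coincides with buffer exhaustion" if congested else ""
    print(f"{ts}  {latency_ms:5.1f} ms  {note}")
```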
Can we just squeeze traffic down the 10Gbps pipe better?
A few attempts have been made to “make 10Gbps work”. The reality is I have seen hosts that could deliver 120K IOPS of storage performance crippled down to 30K IOPS because of slow networking, but we can review the ways people try to make 10Gbps better…
Clever QoS to make slower networks viable?
CoS/DSCP were commonly used in the past to protect voice traffic over LANs or MPLS, and while they do exist in the datacenter, most customers rarely use them at the top of rack. Segmenting traffic per VLAN, making sure you don’t discover bugs in implementations, and making sure tags are honored end to end is a lot of operational work. While the vDS supports this, and people may apply it on a per-port-group basis for storage, NIOC traffic shaping is generally about as far as most people operationally want to go down this path.
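For a sense of what DSCP marking involves at the endpoint, here is a minimal sketch (assuming a Linux host, with DSCP 46/EF used purely as an example class) of an application tagging its own traffic. Every switch hop then has to be configured to honor rather than rewrite that value, which is exactly the end-to-end operational burden described above.

```python
# Minimal sketch of application-side DSCP marking (assumes Linux).
# DSCP occupies the upper 6 bits of the IP TOS byte.

import socket

DSCP_EF = 46                       # Expedited Forwarding, example class only
tos = DSCP_EF << 2                 # shift DSCP into the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)

# Packets sent on this socket now carry the DSCP marking, assuming no
# switch or router along the path strips or rewrites it.
sock.sendto(b"marked packet", ("192.0.2.10", 5000))
```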
Smarter Switch ASICs
Clever buffer management: “elephant traps” (dropping large packets to speed up smaller mice packets) and shared buffer management often worked to prevent one bursty flow, or one large packet, from hogging all the resources. This was common on some of the earlier Nexus switches, and I’m sure it was great if you had a mix of real-time voice and buffered streaming video on your switch, but frankly it is highly problematic for storage flows that NEED to arrive in order.
Deeper Buffer Switches?
The other side of this coin was moving from switch ASICs with 12 or 32MB of buffer to multi-GB buffers. These “ultra deep buffer” switches could help mitigate some port overruns and reduce the need for drops. VMware and others advocated for them for storage traffic and vSAN. With 10Gbps, moving from the lower-end Trident to the higher-end Jericho ASICs did show much better handling of micro-bursts and even sustained workloads, and TCP incast was mitigated. As 25Gbps came out, though, we saw only a few niche switches configured this way, and their pricing was frankly so close to 100Gbps that just deploying a faster pipe from point A to point B has proven more cost effective than trying to put a bigger bucket under the leak in the roof.
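A rough, idealized model (ignoring shared-buffer carving, per-port limits, and drops) shows why buffer depth matters for incast. The 12MB and 32MB figures are the shallow ASIC buffers mentioned above; the multi-GB figure stands in for the ultra deep buffer class:

```python
# Rough model: several hosts burst at line rate toward one 10 Gbps egress
# port. How long can the switch buffer absorb the overload before it fills?

def absorb_ms(buffer_mb: float, ingress_gbps: float, egress_gbps: float) -> float:
    """Milliseconds until the buffer fills when ingress exceeds egress."""
    overload_gbps = ingress_gbps - egress_gbps
    if overload_gbps <= 0:
        return float("inf")
    return buffer_mb * 8e6 / (overload_gbps * 1e9) * 1000

# Four hosts bursting at 10 Gbps into a single 10 Gbps egress port (4:1 incast).
for buffer_mb in (12, 32, 4000):
    ms = absorb_ms(buffer_mb, ingress_gbps=40, egress_gbps=10)
    print(f"{buffer_mb:>5} MB buffer absorbs roughly {ms:7.2f} ms of 4:1 incast")
```

A shallow buffer rides out only a few milliseconds of incast before dropping or back-pressuring, while the multi-GB version rides out around a second of overload, which is why deep buffers were attractive for 10Gbps storage traffic.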
What does faster networking cost?
While some of us may remember 100Gbps ports costing $1000+ a port, networking has gotten a lot cheaper. The same commodity ASICs (Trident 3, Jericho, Tomahawk) power the most common top-of-rack leaf and spine switches in the datacenter today. Interestingly enough, with SONiC you can now even buy your hardware from one vendor and your switch OS or SDN management overlay from another.
While vendors will try to charge large amounts for branded optics, active optical cables (AOCs) and passive Twinax copper cables can often be purchased for $15-100 depending on length and temperature tolerance requirements. These cables remove the need to purchase a separate optic, and reduce issues with dust and port errors because the cable is “welded shut” to its SFP28/QSFP transceiver ends.
$15 – $30 for 25Gbps passive cables
TINA – There is no Alternative (to faster networking)
The future is increasingly moving core datacenter performance-intensive workloads to 100Gbps, with 25Gbps for smaller stacks (and possibly 50Gbps even replacing that soon). The cost economics are shifting, and the various tricks to squeeze more out of 10Gbps feel a bit like squeezing a single lemon to try to make 10 gallons of lemonade. “The juice isn’t worth the squeeze.” While many of the above problems of slow networking can be mitigated with more hosts, lower performance expectations, and longer operational windows, eventually it becomes clear that upgrading the network is more cost effective than throwing server hardware and time at a bad network.