So it’s time to stand up your new VMware cluster. You have reviewed your compute and storage requirements, and have picked hosts with 1-2TB of RAM, 100-300TB of storage, 32 core x 2 socket CPUs and are ready to begin that important consolidation project. You will be consolidating 3:1 from older hosts and before you deploy you get one additional requirement.

Networking Team: “We can only provision 2 x 10Gbps to each host”

You ask why? and get a number of $REASONS.

Looking at average utilization for the month it was below 10Gbps.
25G/100Gbps cables and optics sounds expensive.
Faster speeds seem unnatural and scary.
Networking speed is a luxury for people who have Tigers on gold leashes, and we needed to save money somewhere.
There is no benefit to operations.
We are not due to replace our top of rack switches until 2034.

Now all of these are bad reasons, but we will walk through them starting with the first one today.

What is the impact of slow networking on my host?

Now you may think that slow networking is a storage team problem, but the impacts of undersized networking can impact a lot of differnt things. Other issues to expect to run into from undersized networking:

1. Slower vMotions, higher stun times, and longer host evacuations will be caused by slower networking. As you stuff more and more bandwidth intensive traffic on the same link the greater contention for host evacuations. This impacts maintenance mode operations and data resynchronization times.

2. Slow Backup and restore. While backups may be slower, we can somewhat cheat slow networking using CBT (Changed Block Tracking) and only doing forever incremental. Slow large data restore operations are the biggest concern for undersized networking. After a large scale failure or ransomware attack you may discover that rehydrated large amounts of data over 10Gbps is a lot slower than over 100Gbps. There is always a bottleneck in backup and restore speed, but the network is generally the cheapest resource to fix. You can try to mitigate this with scale out backup repositories, and using more data movers/proxys’, and more hosts and SAN ports, but in the end this ends up being far less cost effective than upgrading the network to 25/50/100Gbps.

3. Slower networking for storage, manifests itself in worse storage performance, specifically on large throughput operations, but also short microbursts where latency will creep up. Keep in mind that 10Gbps sounds like a lot but that is *per second*. If you are trying to get a large block of data in under 5ms that time window a single port can only move 6.25MB. As we try to pull average latencies down lower we need to be cognizant of what that link speed means for burst requests. Overtaxed network storage will often mask the true peak demand as back preasure and latency creep in. Pete has a great blog on this topic.

4. Slower large batch operations. Migrations, Database transform and load operations, and other batch jobs are often bandwidth constrained. You the operator may just see this as a 1-2 minute “bip” but making that 1-2 minute reponse in an end user application turn into a 10-20 second response can significantly improve the user experience of your application.

5. Tail latency. Applications with complicated chains of requests often are fundamentally bound by the one outlier in response times. Faster networking reduces the chance of contention somewhere in that 14 layer micro-service application the devops team has built.

6. Limitations on storage density. For HCI or any scale out storage system you will want adequate network bandwidth to handle node failure gracefully. vSAN has a number of tricks to reduce this impact (ESA compresses network resync, durability components) but at the end of the day you will not want 300TB in a vSAN/Ceph/Gluster/Minio node on a 10Gbps connection. There is a insidious feedback loop of slow networking is that it forces expensive, design decisions (lower density hosts and more of them), that often masks the need for faster networking. Even non-scale out platforms eventually will hit walls on density. a Monolithic storage array can scale to a lot more density and run wider fan out ratios using 100Gbps ethernet than 10Gbps ethernet.

Let us first dig into the first and most common objection to upgrading the network.

“Looking at average utilization for the month it was below 10Gbps”

How do you we as architects respond to this statement?

Networks are bursty is my short response to this. Pete Koehler calls this “The curse of averages”. Most of the tooling people use to make this statement is SNMP monitoring tooling that polls every few minutes. This apprach is find for slowly changing things like temperature, or binary health events like “is power supply dead?”. Unfortuently for networking, a packet buffer can fill up and cause back preasure and congestion in as short as 100ms, and SNMP polling every 5 minutes is not going to cut it for this. Inversely context around WHEN a network is saturated is important. If the network is saturated in the middle of the night when backups or databse maintenance or ETL runs I might not actually care. Using an average with a poor samplying frequency of times when I do and do not care about congestion is about the worst way to make a design decision possibly.

There are ways to understand congestion and it’s impacts. You may notice on the outliers of storage latency polling that there is corresponding high networking utilization at the same time. You can also get smarter about monitoring and have switches deliver syslog information about buffer exhaustion to your operations tool and overlay this with other metrics like high CPU usage, or high storage latency to understand the impact on slow undersized networking. (Screenshot of LogInsight generating an alarm).

Why is observability on networking often bad?

Operations teams are often a lot more blind to networking limitations than they realize. Now it’s true this tooling will never be perfect as there becomes some challenges trying to get a 100% perfect network monitoring.

Why not Just SNMP poll every 100ms?

The more frequent the polling on monitoring the more likely the monitoring itself starts to create overhead that impacts the networking devices or hosts themselves. Anyone who has turned on debug logging on a switch and crashed it should understand this. Modern efforts to reduce it (dedicated ASIC functions for observability, seperation of observability from the data plane in switches) do exist. It is worth noting vSAN hsa a network diagnostic mode that goes down to 1 second, which is pretty good for acute troubleshooting.

Can we just monitor links smarter?

Physical fiber taps that sit in line and sniff/process the size/shape/function/latency of every packet do exist. Virtual instruments was a company who did this. People who worked there told me “Storage arrays and networks lie a lot” but the cost of deploying fiber taps, and dedicated monitoring appliances per rack often exceeds just throwing more merchant silicon at the problem and upgrading the network to 100Gbps.

What tooling exists today?

Even driven tooling is often going to be the best way to detect network saturation. Newer ASICs and APIs exist, as well as siply having the switch shoot a syslog event when congestion is happening can help you overlay networking problems with application issues. VMware Cloud Foundation’s built in Log analytics tooling can help with this, and can overal the VCF Operations performance graphs to get a better understanding of when the network is causing issues.

Can we Just squeeze traffic down the 10Gbps better?

A few attempts have been made to “make 10Gbps work”. The reality is I have seen hosts that could deliver 120K IOPS of storage performance crippled down to 30K IOPS and so forth because of slow networking but we can review ways to make 10Gbps better…

Clever QoS to make slower networks viable?

Years ago CoS/DSCP were commonly used in the past to protect voice traffic over LANs or MPLS, and while they do exist in the datacenter most customers rarely use them in top of rack. Segmenting traffic per VLAN, making sure you don’t discover bugs in implementations, making sure tags are honored end to end is a lot of operational work. While the vDS supports this, and people may perform it on a per port group basis for storage, generally NIOC shaping traffic is about as far as most people operationally want to get involved in going down this path.

Smarter Switch ASICS

Clever buffer mangagement: “Elephant traps” (dropping of large packets to speed up smaller mice packets), and shared buffer management often worked to prevent one bursty flow, or one large packet from hogging all the resources. This was common on some of the earlier Nexus switches, and I’m sure was great if you had mixes of real time voice and buffered streaming video on your switch but frankly is highly problematic for storage flows that NEED to arrive in order.

Deeper Buffers Switches?

The other side of this coin was moving from swith ASICS with 12 or 32MB to multi-GB buffers. These “ultra deep buffer switches” could help mitigate some port over-runs and reduce the need for drops. VMware and others advocated for them for storage traffic and vSAN. With 10Gbps moving from the lower end Trident to the higher end Jericho ASICs we did see much better handling of micro-bursts, and even sustained workloads. TCP incast was mitigated. As 25Gbps came out though, we saw only a few niche switches configured this way and the pricing on them frankly was so close to 100Gbps that just deploying a faster pipe from point A to point B has proven to be more cost effective than trying to put a bigger bucket under the leak in the roof.

What does faster networking cost?

While some of us may remember 100Gbps ports costing $1000+ a port, networking has gotten a lot cheaper. The same commodity ASICs (Trident 3, Jericho, Tomahawk) power the most common top of rack leaf and spine switches in the datacenter today. Interestingly enough you can even now buy your hardware from one vendor, and switch OS or SDN management overlay for SONIC.

While vendors will try to charge large amounts for branded optics, All in one cables (called AIO) and passive TwinAx copper cables can often be purchased for $15-100 depending on length, and temperature tolerance requirements. These cables remove the need to purchase an optic, and reduce issues with dust and port errors by being “welded shut” against the SFP28/QSFP copper transceiver.

Passive TwinAx, or All In One Optical cables are not that expensive. This is a cheap passive TwinAx cable. At larger runs you will want to consider all in one optical. This image came from fs.com

$15 – $30 for 25Gbps passive cables

TINA – There is no Alternative (to faster networking)

The future is increasingly moving core datacenter performance intensive workloads to 100Gbps, with 25Gbps for smaller stacks (and possible 50Gbps even replacing that soon). The cost economics are shifting there, and the various tricks to squeeze more out of 10Gbps feels a bit like squeezing a single lemon to try to make 10 gallons of lemonade. “The Juice isn’t worth the squeeze.” While many of the above problems of slow networking can be mitigated with more hosts, lower performance expectations, longer operational windows, eventually it becomes clear that upgrading the network is more cost effective than throwing server hardware and time at a bad network.

From time to time I get oddball questions where someone asks about how to do something that is not supported or a bad idea. I’ll often fire back a simple “No” and then we get into a discussion about why VMware does not have a KB for this specific corner case or situation. There are a host of reasons why this may or may not be documented but here is my monthly list of “No/That is a bad idea (TM)!”.

How do I use VMware Cloud Foundation (VCF) with a VSA/Virtual Machine that can not be vMotion’d to another host?

This one has come up quite a lot recently with some partners, and storage vendors who use VSA’s (A virtual machine that locally consumes storage to replicate it) incorrectly claiming this is supported. The issue is that SDDC Manager automates upgrade and patch management. In order to patch a host, all running virtual machines must be removed. This process is triggered when a host is placed into maintenance mode and DRS carefully vMotions VMs off of the host. If there is a virtual machine on the host that can not be powered off or moved, this will cause lifecycle to fail.

What about if I use the VSA’s external lifecycle management to patch ESXi?

The issue comes in, running multiple host patching systems is a “very bad idea” (TM). You’ll have issues with SDDC Manager not understanding the state of the hosts, but also coordination of non-ESXi elements (NSX perhaps using a VIB) would also be problematic. The only exception to using SDDC manager with external lifecycle tooling tools are select vendor LCM solutions that done customization and interop (Examples include VxRAIL Manager, the Redfish to HPE Synergy integration, and packaged VCF appliance solutions like UCP-RS and VxRACK SDDC). Note these solutions all use vSAN and avoid the VSA problem and have done the engineering work to make things play nice.

Should I use a Nexus 2000K (or other low performing network switch) with vSAN?

While vSAN does not currently have a switch HCL (Watch this space!) I have written some guidance specifically about FEXs on this personal blog. The reality is there are politics to getting a KB written saying “not to use something”, and it would require cooperation from the switch vendors. If anyone at Cisco wants to work with me on a joint KB saying “don’t use a FEX for vSAN/HCI in 2019” please reach out to me! Before anyone accuses me of not liking Cisco, I’ll say I’m a big fan of the C36180YC-R (ultra deep buffers RAWR!), and have seen some amazing performance out of this switch recently when paired with Intel Optane.

Beyond the FEX, I’ve written some neutral switch guidance on buffers on our official blog. I do plan to merge this into the vSAN Networking Guide this quarter.

I’d like to use RSPAN against the vDS and mirror all vSAN traffic, I’d like to run all vSAN traffic through a ASA Firewall or Palo Alto or IDS, Cisco ISR, I’d like to route vSAN traffic through a F5 or similar requests…

There’s a trend of security people wanting to inspect “all the things!”. There are a lot of misconceptions about vSAN routing or flowing or going places.

Good Ideas! – There is some false assumptions you can’t do the following. While they may add complexity or not be supported on VCF or VxRAIL in certain configurations, they certainly are just fine with vSAN from a feasibility standpoint.

Routing storage traffic is just fine. Modern enterprise switches can route OSPF/Static routes etc at wire-speed just fine all in the ASIC offloads. vSAN is supported over layer 3 (may need to configure static routes!) and this is a “Good idea” on stretched clusters so spanning tree issues don’t crash both datacenters!
vSAN over VxLAN/VTEP in hardware is supported.
VSAN over VLAN backed port groups on NSX-T is supported.

Bad Ideas!

Frank Escaros-Buechsel with VMware support once told someone “While we do not document that as not supported, it’s a bit like putting peanut butter in a server. Some things we assume are such bad idea’s no one would try them, and there is only so much time to document all bad ideas.

Trying to mirror high throughput flows of storage or vMotion from a VDS is likely to cause performance problems. While I”m not sure of a specific support statement, i’m going to kindly ask you not to do this. If you want to know how much traffic is flowing and where, consider turning on SFLOW/JFLOW/NetFlow on the physical switches and monitoring from that point. vRNI can help quite a bit here!
Sending iSCSI/NFS/FCoE/vSAN storage traffic to an IDS/Firewall/Load balancer. These devices do not know how to inspect this traffic (trust me, they are not designed to look at SCSI or NVMe packets!) so you’ll get zero security value out of this process. If you are looking for virus binaries, your better off using NSX guest introspection and regular antivirus software. Because of the volume, you will hit the wire-speed limits of these devices. Outside of path latency, you will quickly introduce drops and re-transmits and murder storage traffic performance. Outside of some old Niche inline FC encryption blades (that I think Netapp used to make), inline storage security devices are a bad idea. While there are some carrier-grade routers that can push 40+ Gbps of encryption (MLXe’s I vaguely remember did this) the costs are going to be enormous, and you’ll likely be better off just encrypting at the vSCSI layer using the VM Encryption VAIO filter. You’ll get better security that IPSEC/MACSEC without massive costs.

Did I get something wrong?

Is there an Exception?

Feel free to reach out and lets talk about why your environment is a snowflake from these general rules of things “not to do!”

Tag: vSAN Networking

The Problem with 10Gbps