Improving NIC and switch performance for vSAN (and other IP storage)

This is a short post collecting a few tricks for unlocking bottlenecks in storage networking; it may grow over time.

Unfortunately, a lot of network performance troubleshooting stops earlier than it should. Two common incomplete troubleshooting workflows I’ve seen:

  1. Someone checks that network utilization on a host isn’t near the link speed and declares “the network is not the bottleneck.”
  2. Someone calls the networking team, who look at switchport utilization from SNMP polling, or do a quick “show interface” and, not seeing obvious port errors (CRC errors, drops, giants, etc.), proudly close the ticket as “Switches are fine!”

Buffer Configuration Considerations

In one of my labs with a Nexus 9000 series switch, performance looked a bit limited. Seeing higher than expected retransmits, we dug deeper into buffer utilization. We discovered that the default mesh configuration was limiting buffer access to 500 KB per port, so we adjusted the buffers using the qos ns-buffer-profile ultra-burst command. This significantly opened up performance, reduced TCP incast issues (which cause retransmits), and brought performance more in line with what we would expect for the cluster. For anyone looking for more information on this command (and how to look at buffers), see the QoS Guide. Note that for solving buffer contention, different switches offer different options for configuring buffers, prioritizing which flows to drop first, and allocating buffer to ports. In other cases it may be simpler to buy switches with deeper buffers to begin with: rather than trying to chop apart a 12 MB-40 MB buffer, purchasing a switch with an 8 GB buffer can avoid much of the need for buffer management.
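For reference, here is roughly what that change looks like from global configuration on a Nexus 9300 (a hedged sketch from my lab; on the releases I’ve used the full command carries a hardware prefix, so confirm the exact syntax against the QoS Guide for your platform and NX-OS version):

switch# configure terminal
switch(config)# hardware qos ns-buffer-profile ultra-burst
switch(config)# end
switch# show hardware qos ns-buffer-profile

The show command at the end confirms which buffer profile is active after the change.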

I’ve been asked about the HPE 5950 series switch. Digging into its CoS/QoS guide, I found a few things:

You can detect how often you exceed a buffer with the display buffer usage interface command.

<switchname> display buffer usage interface hundredgige 1/0/1

This command is more useful than plain display buffer usage, as that only tracks usage over a 5-second rolling window, whereas the per-interface violation counter will catch very short microbursts that may be causing buffer-full conditions, retransmits, and latency. Note the default buffer threshold is 70%.
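If you want that per-interface counter to flag congestion earlier or later than the 70% default, the threshold can be adjusted from system view (a hedged sketch based on the Comware buffer usage threshold command; I haven’t run this on a 5950 myself, so verify the syntax against the switch documentation; the slot number and 90% ratio here are just illustrative values):

<switchname> system-view
[switchname] buffer usage threshold slot 1 ratio 90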

burst-mode enable appears to be a similar command to the ultra-burst buffer configuration and is recommended for cases that include “Traffic enters a device from multiple same-rate interfaces and goes out of an interface with the same rate.” Given this scenario is exactly what we see with TCP incast (multiple vSAN hosts trying to talk to the same host and filling a buffer), this is likely something you would want to turn on. As I don’t have one of these switches in my lab, I’d love any feedback from anyone who has tried this command. If anyone from HPE Networking is reading this, feel free to reach out.
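For reference, it appears to be a one-line change from system view (again a hedged sketch I haven’t been able to validate on real hardware):

<switchname> system-view
[switchname] burst-mode enable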

TCP RX dispatch queue tuning

In another lab example, a test of raw throughput was coming up short. A review of the back-end disk groups showed a lack of congestion (latency was low, write cache fill was low). A review of network utilization showed only 30% of the link speed in use, but high latency (20 ms+ between the nodes).

Investigation showed that throughput was bottlenecking on a single-threaded TCP process (the CPU for the attached world was at 100%). Raising the TCP RX dispatch queues from the default of 1 to 4 eliminated this bottleneck and returned performance to expected levels.

Steps to set this are:

Set the advanced setting on the host

$ esxcfg-advcfg -s 4 /Net/TcpipRxDispatchQueues

Or for PowerCLI:

Get-AdvancedSetting -Entity <esxi host> -Name Net.TcpipRxDispatchQueues | Set-AdvancedSetting -Value '4'

Reboot the host once this is set.

To validate this setting:

$ esxcfg-advcfg -g /Net/TcpipRxDispatchQueues
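The same value can also be read with esxcli, which additionally shows the default, minimum, and maximum values for the option:

$ esxcli system settings advanced list -o /Net/TcpipRxDispatchQueues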

It’s worth noting that Niels’ blog on vMotion tuning reported higher throughput per stream than I saw (his blog reports 15 Gbps per stream). This may be a result of my lab hosts using inexpensive Intel 5xx series NICs that lack the advanced offloads of the Intel 700 or 800 series cards (Mellanox CX series cards have similar capabilities). Without these offloads, more CPU is needed to push the same throughput, which compounds to bring the performance ceiling even lower on the cheaper NICs.

Summary

For anyone seeing bottlenecks on lower-cost NICs, or wanting to push more than 15 Gbps per host of vSAN traffic, keep an eye on this setting, and talk to GSS if you are concerned this default may be causing a bottleneck. For new hosts I’d strongly consider smarter NICs that offer hardware LRO/TSO, RSS, and VXLAN/Geneve offload capabilities, and make sure that your driver and firmware are both up to date. Note that this default may change in a future release.
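As a quick sanity check on that last point, the driver and firmware versions an uplink is running can be pulled straight from the host shell (vmnic0 here is just a placeholder for whichever vmnic carries your vSAN traffic):

$ esxcli network nic get -n vmnic0

The Driver Info section of the output lists the driver name, driver version, and firmware version to compare against the VMware Compatibility Guide.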

If you have any feedback on these commands (or questions about other commands or switches!) reach out to me on Twitter @Lost_Signal.

Should I use a Nexus 2000 series (Cisco FEX) with VMware vSAN?

This question has come up a few times with customer networking teams, and I must admit it confuses me that we are still having it in 2019.

The short response is no. You should avoid using these devices with vSAN, and in general with virtualization or storage traffic.

They were designed for a time when low utilization of physical servers or low-density virtualization was the norm, and when 10 Gbps ports on fast switches were incredibly expensive.

Cisco’s troubleshooting notes on Cisco FEX make a few statements.

Move any servers with bursty traffic flows such as storage arrays and video endpoints off of the FEX and connect them directly to the base ports of the parent switch.

Common questions that have come up at VMworld and in other discussions:

Q: Why should I listen to a guy who does storage and virtualization about networking?

A: I don’t disagree. How about one of the co-founders of the company that built the FEX?

Q: What is VMware doing to fix this with vSAN?

A: This isn’t really a VMware problem. Storage and other large traffic flows like vMotion suffer on Cisco FEX devices, and other east/west-heavy traffic flows suffer in lightly buffered, oversubscribed environments. vMotion and NSX are also not going to perform their best without real switch ports.

Q: What are some model numbers for these devices?

A: Devices include:

N2K-C2148T-1GE

N2K-C2224TP-1GE

N2K-C2248TP-1GE

N2K-C2232PP-10GE

N2K-C2232TM-10GE

N2K-C2248TP-E-1GE

N2K-C2232TM-E-10GE

N2K-C2248PQ-10GE

N2K-C2348UPQ-10GE

B22

Q: My networking team told me they are just like an external line card for a switch chassis?

A: Your networking team is incorrect. A real switch port can send traffic to another port without hair-pinning through another device. It’s arguable that a hub would provide a more direct route for packets from one port to another than what the FEX product line offers. Modern switches also offer much larger buffers that can help mitigate TCP incast and other issues that you will see at scale.

Q: How do I determine if my networking teams have deployed Cisco FEX devices?

A: This can be difficult without physical inspection, due to known issues with Cisco Discovery Protocol not working correctly with some configurations of these devices. One sign is the port designation on the switch: if it is unusually long (100/1/1, for example), you may be looking at a FEX. It’s best to have your data center operations teams inspect the racks and take note of model numbers, in the same way you would have them physically check for cardboard or other things you don’t want in your datacenter. Ultimately the best solution is preventative: talk to your networking teams about the risks of using FEX devices before they are deployed.
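If you do have login access to the parent switch, a quick check from the NX-OS side is to ask it directly whether any fabric extenders are registered (a hedged sketch; an empty result means no FEX modules are attached to that switch):

switch# show fex
switch# show interface fex-fabric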

Q: What are some alternatives to look at?

A: I’m happy to take comments from other networking people about this, but I’ve seen two general choices that customers use instead.

For Cisco customers who need FCoE, the Nexus 56xx, 6000, and 7000 offer real switch ports as well as larger buffers. Note: the older Nexus 50xx and 55xx have relatively small VoQ buffers that tend not to scale well with larger clusters.

For customers not needing FCoE support (which should be most customers in 2019), the C36180YC-R offers:

  • 10/25Gbps access ports
  • A massive 8 GB of port buffer
  • A fast modern multi-core ASIC