This question has come up a few times with customer networking teams, and I must admit it confuses me that we are still having this conversation in 2019.
It’s 2019. FEXs are not switches and you realllllly should stop buying/deploying/using them.— John Nicholson (@Lost_Signal) May 2, 2019
The short response is no. You should avoid using these devices with vSAN, and in general with virtualization or storage traffic.
They were designed for a time when low utilization of physical servers or low-density virtualization was the norm. At the same time, 10Gbps ports on fast switches were incredibly expensive.
Cisco’s troubleshooting notes on Cisco FEX make a few statements.
Move any servers with bursty traffic flows such as storage arrays and video endpoints off of the FEX and connect them directly to the base ports of the parent switch.
Common questions that have come up at VMworld and other discussions:
Q: Why should I listen to a guy who does storage and virtualization about networking?
A: I don’t disagree. How about one of the Co-Flounders of the company that built the FEX?
As a co-founder of Nuova Systems, where we invented FEX, I heartily agree with this sentiment.— Tom Lyon (@aka_pugs) May 4, 2019
Q: What is VMware doing to fix this with vSAN?
A: This isn’t really a VMware problem. Storage and other large traffic flows like vMotion suffer on Cisco FEX devices. Note that other east/west-heavy traffic flows also suffer in lightly buffered, oversubscribed environments. vMotion and NSX are also not going to perform their best without real switch ports.
Q: What are some model numbers for the device?
Q: My networking team told me they are just like an external line card for a switch chassis?
A: Your networking team is incorrect. A real switch port can send traffic to another port without hair-pinning through another device. It’s arguable that a hub would provide a more direct route for packets from one port to another than what the FEX product line offers. Modern switches also offer much larger buffers that can help mitigate TCP incast and other issues that you will see at scale.
Q: How do I determine if my networking teams have deployed Cisco FEX devices?
A: This can be difficult without physical inspection, due to known issues with Cisco Discovery Protocol (CDP) not working correctly with some configurations of the devices. One sign: if a port on the switch has an incredibly high designation such as 100/1/1, you may be looking at a FEX. It’s best to have your data center operations teams inspect the racks and take note of model numbers, in the same way you would have them physically inspect for cardboard or other things you don’t want in your datacenter. Ultimately the best solution is preventative. Talk to your networking teams about the risks of using FEX devices before they are deployed.
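As a rough illustration of the port-designation check, here is a minimal Python sketch. It assumes you already have a list of interface names (for example, pasted from switch output), that FEX host ports show up in a three-part `Ethernet<id>/<slot>/<port>` form, and that the leading ID is 100 or higher. The function name and threshold are my own for illustration; treat a match as a hint to go look at the rack, not as proof.

```python
import re

# Assumption: FEX host ports look like Ethernet<fex-id>/<slot>/<port>
# and the FEX ID sits in a high range (100 and up). Tune for your site.
FEX_ID_MIN = 100

_IFACE = re.compile(r"^(?:Ethernet|Eth)\s*(\d+)/(\d+)/(\d+)$", re.IGNORECASE)


def looks_like_fex_port(name: str) -> bool:
    """Return True if an interface name has a suspiciously high leading ID."""
    match = _IFACE.match(name.strip())
    if match is None:
        # Two-part names such as Ethernet1/12 are ordinary switch ports.
        return False
    return int(match.group(1)) >= FEX_ID_MIN


# Hypothetical interface names for illustration only.
ports = ["Ethernet1/12", "Eth100/1/1", "Ethernet101/1/48", "Ethernet2/1/3"]
suspects = [p for p in ports if looks_like_fex_port(p)]
print(suspects)
```

Anything this flags is worth a physical check of the model number.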
Q: What are some alternatives to look at?
I’m happy to take comments from other networking people about this but I’ve seen two general choices that customers use instead.
For Cisco customers who need FCoE, the Nexus 56xx, 6000, and 7000 offer real switch ports as well as larger buffers. Note: older Nexus 50xx and 55xx have relatively small VoQ buffers that tend to not scale well with larger clusters.
For customers not needing FCoE support (which should be most customers in 2019), the C36180YC-R offers:
- 10/25Gbps access ports
- A massive 8 GB of port buffer
- A fast modern multi-core ASIC
| Session ID | Session Title |
|---|---|
| HCI1473BU | The vSAN I/O Path Deconstructed: A Deep Dive into the Internals of vSAN |
| ??? | Mystery Session: 7/27 at 3:30PM |
| HCI1769BU | We Got You Covered: Top Operational Tips from vSAN Support Insight |
| HCI3331BU | Better Storage Utilization with Space Reclamation/UNMAP |
The vSAN I/O Path Deconstructed is an interesting inside look at the IO path of vSAN and the reasoning behind it.
We Got You Covered: Top Operational Tips from vSAN Support Insight shows off the phone home capabilities of vSAN and can help address your questions about what and how this data is used. We are also going to discuss how you can leverage similar views of performance as GSS and engineering to identify how to get the most out of vSAN.
HCI3331BU is a session that has been years in the making for me. “Where did my space go” is a question I get often. We will explain where that missing PB of storage went and how to reclaim it. The savings from implementing UNMAP should be able to fund your next VMworld trip!
Lastly, I’ve got a mystery session that should be unveiled later. Follow me on Twitter @Lost_Signal, and I’ll talk about what it will be when the time comes.
Pete and I will be recording the vSpeakingPodcast LIVE at the HCI Zone (found near the VMware booth). We’ve got some new guests as well as some favorites lined up.
This is a topic that comes up quite a bit. A lot has been written previously about how big your vSphere clusters should be, and Duncan’s musings on this topic are still very valid.
It generally starts with:
“I have 1PB in my storage frame today, can I build a 1PB vSAN cluster?”
The short response is yes, you can certainly build a PB vSAN cluster, and build 64 node clusters (there are customers who have broken 2 PB within a cluster, and customers with 64 node clusters), but you should stop and think about whether you should.
We have to stop and think about things beyond cost control when designing for availability. I always chuckle when people talk about arrays having seven 9’s of availability. The question to ask yourself is: if the storage is up but the network is down, does anyone care? Once we include things “outside of storage”, we often find that real-world uptime is more limited. The actual environmentals (power, cooling) of a datacenter are rated at best 99.98% by the Uptime Institute. Traditionally we tried to make the floor tile that our gear sat on as resilient as possible.
James Hamilton of Amazon has pointed to WAN connectivity as being another key bottleneck to uptime.
“The way most customers work is that an application runs in a single data center, and you work as hard as you can to make the data center as reliable as you can, and in the end you realize that about three nines (99.9 percent uptime) is all you’re going to get,”
Getting beyond 4 nines of uptime for remote users (who are at the mercy of half-finished internet standards like BGP) is possible but difficult.
Availability must account for the infrastructure it rests on, and resiliency in storage and applications must account for the physical infrastructure.
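To put numbers on this, here is a small Python sketch of how availability compounds across layers. It assumes the layers fail independently and that every layer must be up for users to be served (a simple series model, which is my simplification). The seven-nines and 99.98% figures come from the text above; the function names are mine.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*layers: float) -> float:
    """Availability of a stack where every layer must be up (series model)."""
    return prod(layers)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


storage = 0.9999999   # "seven 9's" array
facility = 0.9998     # 99.98% power/cooling figure cited above

combined = serial_availability(storage, facility)

print(f"storage alone : {downtime_minutes_per_year(storage) * 60:.1f} seconds/year")
print(f"facility alone: {downtime_minutes_per_year(facility):.1f} minutes/year")
print(f"combined      : {downtime_minutes_per_year(combined):.1f} minutes/year")
```

Under these assumptions the facility dominates the result: the array contributes seconds per year, the building contributes well over an hour.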
Let’s review traditional storage cost and operational concepts and why we have today reached a point where customers are putting over 1PB into a storage pool.
- Capital Costs – Some features may be licensed per frame, and significant discounts may be given if large purchases are made up front rather than as capacity is needed. Sparing capacity and overhead as a % of a storage pool become smaller if your growth rate is fixed.
- Opex – While many storage frames may have federation tools, there are still processes that are often done manually, particularly for change control reasons because of the scale of an outage of a frame (I talked to a customer who had one array fail and take out 4000 VMs, including their management virtual machines).
- Performance – Wide striping, or on hybrid systems aggregating cache, controllers and ports, reduced the chance of a bottleneck being reached.
- Patching/Change Control – Talking to a lot of customers, I find they are often running the same firmware that their storage array came with. The 15 second “gap” in IO as controllers are upgraded is often viewed as a huge risk. This is made worse by the fact that the most risk-averse application on the cluster effectively dictates patching and change control windows. No one enjoys late night, all hands on deck patching windows for storage arrays.
- Parallel remediation in patch windows – Deploying more storage systems means more manual intervention. Traditional arrays often lack good tools for management and monitoring of parallel remediation. Often times more storage arrays means more change control windows.
- Aligning the planets on the HCL – To upgrade a Fibre Channel array, you must upgrade ESXi, the array, the fabric version, the Fibre Channel HBA firmware, and the server BIOS to align with the ESXi upgrade. This is a lot of moving parts, all of which carry the risk of hitting a corner case.
Let’s review how vSAN addresses these costs without driving you to put everything in one giant cluster.
- Capital Costs – vSAN licensing is per socket and hosts can be deployed with empty drive bays. Drives for regular servers regularly fall in price, making it cheaper to purchase what you need now and add drives to hosts as needed to meet capacity growth. Overhead for spare capacity for rebuilds does reduce as you add hosts, but nothing forces you to fill each host with capacity up front and no additional licensing costs will be invoked by having partially full servers.
- Opex – vSAN’s normal management plane (vCenter) is easily federated and storage policies span clusters without any additional work. Lifecycle management like controller updates from the Config assist, and health monitoring alerts easily roll up to a single pane of glass.
- Performance – All Flash has changed the game. You no longer need 1000 spindles and wide striping to get fast or consistent performance. Pooling workloads with 3 tier storage architecture and storage arrays actually increases the chance that you might saturate throughput, or buffers on fibre channel switching.
- Patching – vSAN patching can be done simply using existing tools for updating ESXi (VMware Update Manager), and lifecycle updates for storage controllers can be pushed by a simple click from the UI in vSAN 6.6. Customers already have ESXi patching windows and processes deployed, and maintenance mode with vMotion is a trusted and battle-tested means to evacuate a host.
- VMware Update Manager (VUM) can remediate multiple clusters in parallel. This means you can patch as many (or as few) clusters as you like, and when used with DRS this is fully automated, including placement of virtual machines.
- Additional intelligence has been deployed for vSAN to include remediation of firmware. Because vSAN does not use proprietary Fibre Channel fabrics, is integrated into ESXi, and has no need for proprietary fabric HBAs, the number of planets to align when planning an upgrade window is significantly reduced.
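The Capital Costs point about rebuild overhead shrinking as hosts are added can be sketched in Python as below. This is a simplified model of my own, not a vSAN sizing formula: it assumes identical hosts and that you hold back enough free capacity to re-protect the data from a chosen number of failed hosts.

```python
def rebuild_reserve_fraction(hosts: int, failures_to_absorb: int = 1) -> float:
    """Fraction of raw capacity held back to absorb host failures.

    Simplified assumption: all hosts contribute equal capacity, so keeping
    one host's worth of space free costs 1/N of the cluster.
    """
    if hosts <= failures_to_absorb:
        raise ValueError("need more hosts than failures to absorb")
    return failures_to_absorb / hosts


for n in (4, 8, 16, 32, 64):
    print(f"{n:>2} hosts: {rebuild_reserve_fraction(n):.1%} of raw capacity reserved")
```

In this model most of the benefit arrives early. Going from 4 to 16 hosts cuts the reserve from a quarter to about 6%, which is one reason several mid-sized clusters can be almost as efficient as one giant one.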
In summary: while vSAN can certainly scale to multi-PB cluster sizes, you should consider whether you actually need to scale up this much. In many cases you would be better served by running multiple clusters at scale.
We’ve all been there…
Maybe it’s the streets of NYC, or a corner stall in a mall in Bangkok, or even Harwin St here in Houston. Someone tried to sell you a cut-rate watch or sunglasses. Maybe the lettering was off, or the gold looked a bit flaky, but you passed on that possibly non-genuine watch or sunglasses. It might have even been made in the same factory, but it is clear the QC might have issues. You would not expect the same outcome as getting the real thing. The same thing can happen in ReadyNodes.
Real ReadyNodes for VMware vSAN have a couple key points.
They are tested. All of the components have been tested together and certified. Beware anyone in software-defined storage who doesn’t have some type of certification program, as this opens the door to lower quality components, or hardware/driver/firmware compatibility issues. VMware has validated satisfactory performance with the ReadyNode configurations. A real ReadyNode looks beyond “will these components physically connect” and asks whether they will actually deliver.
vSAN ReadyNodes offer choice. ReadyNodes are available from over a dozen different server OEMs. The VMware vSAN Compatibility Guide also offers over a thousand verified hardware components to supplement these ReadyNodes for further customization. ReadyNodes are not limited to a single server or component vendor.
They are 100% supported by VMware. Real VMware ReadyNodes don’t require virtual machines to mount, present or consume storage, and don’t require non-VMware-supported VIBs to be installed.
They are Mature. They run a 7th release, battle-tested, mature hypervisor integrated storage stack.
So what do you do if you’ve ended up with a fake ReadyNode? Unlike the fake watch I had to throw away, you can check the vSAN compatibility list and see if, with minimal controller or storage device changes, you can convert your system in place over to vSAN. Remember, if you’re running ESXi 5.5 Update 1 or newer, you already have the vSAN software installed. You just need to license and enable it!
One of the newest exciting features of Virtual SAN 6.2 is the new performance service. This is an ESXi native performance monitoring system with API, as well as UI access.
One misconception I wanted to be clear on is that it does not require the use of vCenter Operations Manager, or the vCenter database. Instead, Virtual SAN performance service uses the Virtual SAN object store to store its data in a distributed and protected fashion. For smaller environments who do not want the overhead of VSOM this is a great solution, and will complement the existing tools.
Now why would you want to deploy VSOM if this turnkey simple, low overhead performance system is native? Quite a few reasons:
- VSOM offers longer term granular performance tracking. The native Virtual SAN performance service uses the same roll up schedule as vCenter’s normal performance graphs.
- VSOM allows for forecasting and capacity planning as it analyzes trends.
- VSOM allows overlaying performance from multiple areas and systems (including things like switching and application KPIs) to do root cause and anomaly analysis and correlation.
- VSOM offers powerful integration with LogInsight, allowing event correlation with performance graphs.
- VSOM allows for rolling up performance information across hundreds (or thousands) of sites into larger dashboards.
- In heterogeneous environments using traditional storage, VSOM allows collecting fabric and array performance information.
So if I don’t enable this service (or deploy VSOM) what do I get? You still get basic Latency, IOPS, throughput information from the normal vCenter performance graphs by looking at the vDisk layer. You miss out on back end component views (things like internal SSD queues and latency) as well as datastore/cluster wide metrics, but you can still troubleshoot basic issues with the built in performance graphs.
What about VSAN Observer? For those of you who remember, this information was previously only available by using the Ruby vSphere Console (RVC). VSAN Observer provided powerful visibility, but it had a number of limits:
- It was designed originally for internal troubleshooting and lacks consistency with the vCenter UI.
- It ran on its own web service separately and was not integrated into the existing vCenter graphs.
- It was manually enabled from the RVC CLI
- It could not be accessed by API
- It was not recommended to run it continuously, or to deploy a separate Virtual machine/Container to run it from.
All of these limitations have been addressed with the Virtual SAN performance service.
I expect the performance service will largely replace VSAN Observer. VSAN Observer will still be useful for customers who have not upgraded to VSAN 6.2, or where you do not have capacity available for the performance database.
There is an extensive set of metrics that can be reviewed. It offers “top down” visibility of cluster-wide performance, and virtual machine IOPS and latency.
Virtual SAN performance service also offers “bottom up” visibility into device latency and queues on individual capacity and cache devices. For quick troubleshooting of issues, or verification of performance, it is a great and simple tool that can be turned on with a single checkbox.
vCenter 6.0u2 (For UI)
Up to 255GB of capacity on the Virtual SAN datastore (You can choose the storage policy it uses).
I would like to say that this post was inspired by Chad’s guide to storage architectures. When talking to customers over the years, a recurring problem surfaced. Storage in smaller enterprises historically tended towards people going “all in” on one big array. The idea was that by consolidating the purchasing of all of the different application groups and teams, they could get the most “bang for buck”. The upsides are obvious (fewer silos and consolidation of resources and platforms means lower capex/opex costs). The performance downsides (normally noisy neighbor issues) were annoying but could be mitigated. That said, the real downsides to having one (or a few) big arrays are often found hidden on the operational side.
- Many customers trying to stretch their budget often ended up putting Test/Dev/QA and production on the same array (I’ve seen Fortune 100 companies do this with business critical workloads). This leads to one team demanding 2 year old firmware for stability, and the teams needing agility trying to get upgrades. The battle between stability and agility gets fought regularly in the change control committee meetings further wasting more people’s time.
- Audit/regime change/regulatory/customer demands require an air gap be established for a new or existing workload. Array partitioning features are nice, but the demands often extend beyond this.
- In some cases, organizations that had previously shared resources would part ways. (divestment, operational restructuring, budgetary firewalls).
Some storage workloads just need more performance than everyone else, and often the cost of the upgrade is increased by the other workloads on the array that will gain no material benefit. Database administrators often point to a lack of dedicated resources when performance problems arise. Providing isolation for these workloads historically involved buying an exotic non-x86 processor, and a “black box” appliance that required expensive specialty skills on top of significant capex cost. I like to call these boxes “cloaking devices” as they are often completely hidden from the normal infrastructure monitoring teams.
A benefit to using a scale-out (Type III) approach is that the storage can be scaled down (or even divided). VMware VSAN can evacuate data from a host, and allow you to shift its resources to another cluster. Hybrid nodes can push up to 40K IOPS (and all flash over 100K), allowing even smaller clusters to hold their own on disk performance. It is worth noting that the reverse action is also possible. When a legacy application is retired, the cluster that served it can be upgraded and merged into other clusters. In this way the isolation is really just a resource silo (the least threatening of all IT silos). You can still use the same software stack, and leverage the same skill set, while keeping change control, auditors and developers happy. Even the database administrators will be happy to learn that they can push millions of orders per minute with a simple 4 node cluster.
In principle I still like to avoid silos. If they must exist, I would suggest trying to find a way to make the hardware behind them highly portable and re-usable, and VSAN and vSphere can help with that quite a bit.
Ok, I’ll admit this is an incredibly misleading click bait title. I wanted to demonstrate how the economics of cheaper flash make VMware Virtual SAN (and really any SDS product that is not licensed by capacity) cheaper over time. I also wanted to share a story of how older slower flash became more expensive.
Let’s talk about a tale of two cities who had storage problems and faced radically different cost economics. One was a large city with lots of purchasing power and size, and the other was a small little bedroom community. Who do you think got the better deal on flash?
Just a small town data center….
A 100 user pilot VDI project was kicking off. They knew they wanted great storage performance, but they could not invest in a big storage array with a lot of flash up front. They did not want to have to pay more tomorrow for flash, and wanted great management and integration. VSAN and Horizon View were quickly chosen. They used the per concurrent user licensing for VSAN so their costs would cleanly and predictably scale. Modern fast enterprise flash was chosen that cost ~$2.50 per GB and had great performance. This summer they went to expand the wildly successful project, and discovered that the new version of the drives they had purchased last year now cost $1.40 per GB, and that other new drives on the HCL from their same vendor were available for ~$1 per GB. Looking at other vendors they found even lower cost options available. They upgraded to the latest version of VSAN and found improved snapshot performance, write performance and management. Procurement could be done cost effectively at small scale, and small projects could be added without much risk. They could even adopt the newest generation (NVMe) without having to forklift controllers or pay anyone but the hardware vendor.
Meanwhile in the big city…..
The second city was quite a bit larger. After a year-long procurement process and dozens of meetings, they chose a traditional storage array/blade system from a Tier 1 vendor. They spent millions and bought years’ worth of capacity to leverage the deepest purchasing discounts they could. A year after deployment, they experienced performance issues and wanted to add flash. Upon discussing with the vendor, the only option was older, slower, small SLC drives. They had bought their array at the end of its sale window and were stuck with technology two generations old. It was also discovered the array would only support a very small number of them (the controllers and code were not designed to handle flash). The vendor politely explained that since this was not a part of the original purchase, the 75% discount off list that had been on the original purchase would not apply and they would need to pay $30 per GB. Somehow older, slower flash had become 4x more expensive in the span of a year. They were told they should have “locked in savings” and bought the flash up front. In reality though, they would have been locking in a high price for a commodity that they did not yet need. The final problem they faced was an order to move out of the data center into 2-3 smaller facilities and split up the hardware accordingly. That big storage array could not easily be cut into parts.
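The arithmetic behind both stories is easy to check. The sketch below uses only the per-GB prices quoted above. It reads the $30 per GB as the list price the 75% discount would otherwise have applied to, which is my interpretation of the story; the function names are mine.

```python
def discounted(list_price_per_gb: float, discount: float) -> float:
    """Price per GB after a fractional discount off list."""
    return list_price_per_gb * (1.0 - discount)


def price_drop(old: float, new: float) -> float:
    """Fractional price decrease from old to new."""
    return (old - new) / old


# Small town: commodity flash bought as needed.
first_year = 2.50          # $/GB at pilot
same_drive_now = 1.40      # $/GB, newer version of the same drive
cheapest_same_vendor = 1.00

# Big city: flash added to the array after the original purchase.
list_price = 30.00         # $/GB, no discount applied
original_discount = 0.75
would_have_paid = discounted(list_price, original_discount)

print(f"small town price drop  : {price_drop(first_year, same_drive_now):.0%}")
print(f"big city discounted    : ${would_have_paid:.2f}/GB")
print(f"big city penalty       : {list_price / would_have_paid:.0f}x")
print(f"big city vs small town : {list_price / cheapest_same_vendor:.0f}x")
```

Read this way, losing the discount alone accounts for the 4x, before even comparing against what the small town was paying.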
There are a few lessons to take away from these environments.
- Storage should become cheaper to purchase as time goes on. Discounts should be consistent and pricing should not feel like a game show. Software licensing should not be directly tied to capacity or physical hardware, and should “live” through a refresh.
- Adding new generations of flash and compute should not require disruption and “throwing away” your existing investment.
- Storage products that scale down and up without compromise lead to fewer meetings, lower costs, and better outcomes. Large purchases often lead to the trap of spending a lot of time and money on avoiding failure, rather than focusing on delivering excellence.
Parts are starting to roll in for this week’s and next week’s new project: a VSAN to take over our old lab. The SMS was getting long in the tooth, and the remaining servers were either too old, or had been hijacked for the internal VDI environment. We have been aware of this project for a few years now and have been partly sandbagging a major lab overhaul while waiting on a firm release date. VMware has put out a call to arms on testing the new product, and we really wanted to put it through its paces before it’s available to our customers.
Here are the initial hardware specs (subject to change based on things not working, or Ingram sending me the wrong part).
For Server I have three of the following
ASUS RS720-X7/RS8 2U
Intel Ethernet Converged Network Adapter X540-T1
ASUS PIKE 2008 (8 port LSI)
3 x SAMSUNG 16GB 240-Pin DDR3 SDRAM ECC Registered DDR3 1333
Intel Xeon E5-2620 Sandy Bridge-EP 2.0GHz (2.5GHz Turbo Boost) 15MB L3 Cache LGA 2011 95W Six-Core
6 x WD Red 2TB 5400RPM SATA drives.
1 x Intel 240GB DC S3500 SSD flash drive.
For switching I have one of the following
NetGear XS712T 12 x 10Gbps RJ-45 SmartSwitch
Here’s the justification for the parts chosen, and thoughts on where to upgrade if this were to be more than a lab.
1. The Server. This was pretty much one of the cheapest 8 drive servers money could buy. Honestly Supermicro would have been a consideration except their HBA was more expensive. LFF was also a design requirement (Lab has a high capacity low IOPS need), and 8 drives was the target. 4 x 1Gbps on-board NIC’s (and a 5th for IPKVM) isn’t a bad thing to be bundled. 2RU was a requirement as it opened up additional options for FC/PCI-Flash-SAS expansion trays etc. My only complaint is the lack of an internal SD card slot. Personally I don’t enjoy 1RU pizza box servers in the lab as the fans spin a lot louder. If this was a production system wanting tier 1 hardware, a Cisco C240M3 or a Dell 720XD would be good options.
2. The Memory – It’s cheap, and 144GB of RAM across the cluster should be enough to get me started. Down the road we may add more. If this was a production setup I likely wouldn’t consider anything less than 128GB or 192GB per host.
3. The CPU – Our lab loads are relatively light, but I wanted something modern so I would have a baseline for VSAN CPU usage. As we scale up and need more memory slots I suspect we’ll end up putting a second CPU in. I wanted something that, under reasonable VDI Composer and other testing, could give me a baseline so I will know how to scale CPU/memory/IOPS ratios going forward.
Drives piling up!
4. The Drives – Our lab generally has a LOT of VMs sitting around doing nothing. Because of our low IOPS/GB ratio I’m violating the recommendation of 1:10 flash to normal spinning disk. WD Reds were chosen for the cheapest price possible, while still having proper TLER settings that will not cause them to drop out randomly and cause rebuild issues. They are basically prosumer grade drives, and if this lab had anything important I would upgrade to at least WD RE4, Hitachi UltraStar, or Seagate Constellation NL-SAS drives. If this was production I’d likely be using 10K 900GB drives, as the IOPS/capacity ratio is quite a bit better. A huge part of VSAN, CBRC, vFlash and VMware’s storage policy engine is separating performance from capacity, so I’m likely going to push flash reservations and other technologies to their limits. The flash drives chosen were Intel DC S3500, as Intel has a strong pedigree for reliability, and the DC series introduces a new standard in consistency. Even full and under load they maintain consistent IOPS. While the 3500’s endurance is decent, it’s not really designed for large scale production write logging. If building a production system, the 3700 or even the PCI-based 910 Intel drives would be a much better selection for more than just the obvious jump in performance.
5. The Network – I’m sure everyone looking at the model numbers is supremely confused. The selection really boiled down to me wanting to test more than just VSAN, and do it on a budget. I wanted to test 10Gbps RJ-45, SR-IOV, and Intel CNAs without spending 10K on NICs, switches and cables. Even going to eBay for used Arista switches wasn’t going to keep the budget low enough. Netgear’s $1500 switch delivers $125 ports with no need for GBICs, and Intel’s CNAs pack a lot of features for a third the price of their optical cousins. I’ll admit the lack of published PPS specs and the anemic buffers may come back to haunt me. I can fall back on the 5 GigE NICs and my old GigE switching if I need to, and this was all too cheap to not take a pass at. For a production upgrade (and possibly to replace this thing) I would look at at least a Brocade 6650 (billion PPS) switch, or maybe even a VDX 6720 if I’m wanting something a little more exciting.
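For anyone wanting to sanity check the numbers in this build, here is a quick Python sketch using the figures from the parts list. Two things are my assumptions: that a 2TB drive is 2000GB, and that the 1:10 recommendation means flash sized at 10% of spinning capacity. The variable names are mine.

```python
# Figures taken from the parts list above.
HOSTS = 3
DIMMS_PER_HOST = 3
DIMM_GB = 16
HDDS_PER_HOST = 6
HDD_GB = 2000            # assumption: 2TB treated as 2000GB
SSD_GB_PER_HOST = 240
SWITCH_PRICE_USD = 1500
SWITCH_PORTS = 12

# Assumption: "1:10" read as flash = 10% of spinning capacity.
RECOMMENDED_FLASH_RATIO = 0.10

cluster_ram_gb = HOSTS * DIMMS_PER_HOST * DIMM_GB
hdd_gb_per_host = HDDS_PER_HOST * HDD_GB
flash_ratio = SSD_GB_PER_HOST / hdd_gb_per_host
flash_gb_for_guideline = RECOMMENDED_FLASH_RATIO * hdd_gb_per_host
cost_per_port = SWITCH_PRICE_USD / SWITCH_PORTS

print(f"cluster RAM          : {cluster_ram_gb} GB")
print(f"flash:disk per host  : {flash_ratio:.1%} (guideline {RECOMMENDED_FLASH_RATIO:.0%})")
print(f"flash to hit 1:10    : {flash_gb_for_guideline:.0f} GB per host")
print(f"10Gbps cost per port : ${cost_per_port:.0f}")
```

On these assumptions the lab runs at about a fifth of the recommended flash ratio, which is the trade-off I’m knowingly making for a capacity-heavy, low-IOPS environment.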