
Auto-Policy Remediation Enhancements for the ESA in vSAN 8 U2

vSAN 8 U1 introduced a new Auto-Policy Management feature that helps administrators run their ESA clusters with the optimal level of resilience and efficiency. It takes the guesswork and documentation-hunting out of determining the most appropriate policy configuration after deploying or expanding a cluster. In vSAN 8 U2, we’ve made this feature even more capable.

Background

Data housed in a vSAN datastore is always stored in accordance with an assigned storage policy, which prescribes a level of data resilience and other settings. The assigned storage policy could be manually created, or a default storage policy created by vSAN. Past versions of vSAN used a single “vSAN Default Storage Policy” stored on the managing vCenter Server to serve as the policy to use if another policy wasn’t defined and applied by an administrator. Since this single policy was set as the default policy for all vSAN clusters managed by the vCenter Server, it used settings such as a failures to tolerate of 1 (FTT=1) using simple RAID-1 mirroring to be as compatible as possible with the size and capabilities of any cluster. This meant that the default storage policy wasn’t always optimally configured for a given cluster. The type, size, and other characteristics of each cluster can be very different, and a policy rule optimized for one cluster may not be ideal for, or even compatible with, another. We wanted to address this, especially since the ESA eliminates the performance compromises between RAID-1 and RAID-5/6.

Auto-Policy Management for ESA

Configuration of the policy is covered in the vSAN 8 U1 feature blog here. Once configured, the feature automatically creates the relevant SPBM policy for the cluster.



Upon the addition or removal of a host from a cluster, the Auto-Policy Management feature evaluates whether the optimized default storage policy needs to be adjusted. If vSAN identifies the need for a change, it presents a simple button in the triggered health finding; clicking it reconfigures the cluster-specific default storage policy with the new optimized policy settings and renames the policy to reflect the newly suggested settings. This guided approach is intuitive, and makes it simple for administrators to know their VM storage policies are optimally configured for their cluster. This change specifically improves the behavior for ongoing adjustments in the cluster: upon a change to the cluster size, instead of creating a new policy (as it did in vSAN 8 U1), the Auto-Policy Management feature changes the existing, cluster-specific storage policy.

Upon a reconfiguration of the Auto-Policy generated storage policy, the automatically generated name will also be adjusted. For example, in a 5-host standard vSAN cluster without host rebuild reserve enabled, the Auto-Policy Management feature will create a RAID-5 storage policy and use the name: “cluster-name – Optimal Datastore Default Policy – RAID5”



If an additional host is added to the cluster, after a 24-hour period, the following events will occur:

The administrator will be prompted with an optional button, “Update Cluster DS Policy.”

Clicking it triggers two changes: the existing policy is changed to RAID-6, and the existing policy’s name is changed to “cluster-name – Optimal Datastore Default Policy – RAID6”.

As described in the steps above, vSAN 8 U2 still does not automatically change the policy without the administrator’s knowledge. The difference in vSAN 8 U2 is that upon a change of the host count in a cluster, we not only suggest the change, but once an administrator clicks the “Update Cluster DS Policy” button, we make the adjustment for them. A host in maintenance mode does not impact this health finding; the number of hosts in a cluster is defined by the hosts that have joined the cluster.


Configuration Logic for Optimized Storage Policy for Cluster

The policy settings the optimized storage policy uses are based on the type of cluster, the number of hosts in a cluster, and if the Host Rebuild Reserve (HRR) capacity management feature is enabled on the cluster. A change to any one of the three will result in vSAN making a suggested adjustment to the cluster-specific, optimized storage policy. Note that the Auto-Policy Management feature is currently not supported when using the vSAN Fault Domains feature.

Standard vSAN clusters (with Host Rebuild Reserve turned off):

  • 3 hosts without HRR : FTT=1 using RAID-1
  • 4 hosts without HRR: FTT=1 using RAID-5 (2+1)
  • 5 hosts without HRR: FTT=1 using RAID-5 (2+1)
  • 6 or more hosts without HRR: FTT=2 using RAID-6 (4+2)

Standard vSAN clusters (with Host Rebuild Reserve enabled):

  • 3 hosts with HRR: (HRR not supported with 3 hosts)
  • 4 hosts with HRR: FTT=1 using RAID-1
  • 5 hosts with HRR: FTT=1 using RAID-5 (2+1)
  • 6 hosts with HRR: FTT=1 using RAID-5 (4+1)
  • 7 or more hosts with HRR: FTT=2 using RAID-6 (4+2)

vSAN Stretched clusters

  • 3 data hosts at each site: Site level mirroring with FTT=1 using RAID-1 mirroring for a secondary level of resilience
  • 4 hosts at each site: Site level mirroring with FTT=1 using RAID-5 (2+1) for secondary level of resilience.
  • 5 hosts at each site: Site level mirroring with FTT=1 using RAID-5 (2+1) for secondary level of resilience.
  • 6 or more hosts at each site: Site level mirroring with FTT=2 using RAID-6 (4+2) for a secondary level of resilience.

vSAN 2-Node clusters:

2 data hosts: Host level mirroring using RAID-1
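
If it helps to see the mapping as logic, here is a minimal, illustrative Python sketch of how the optimized default policy could be derived from cluster type, host count, and whether HRR is enabled. The function name and return strings are my own shorthand, not VMware code.

```python
def optimal_default_policy(cluster_type: str, hosts: int, hrr_enabled: bool = False) -> str:
    """Illustrative mapping of cluster shape to the suggested ESA default policy.

    cluster_type: "standard", "stretched", or "2-node"
    hosts: host count (per site for stretched clusters)
    """
    if cluster_type == "2-node":
        return "Host mirroring, FTT=1 using RAID-1"
    if cluster_type == "stretched":
        # Secondary (per-site) level of resilience under site mirroring
        if hosts <= 3:
            return "Site mirroring + FTT=1 RAID-1"
        if hosts <= 5:
            return "Site mirroring + FTT=1 RAID-5 (2+1)"
        return "Site mirroring + FTT=2 RAID-6 (4+2)"
    # Standard clusters
    if hrr_enabled:
        if hosts < 4:
            raise ValueError("Host Rebuild Reserve is not supported with 3 hosts")
        if hosts == 4:
            return "FTT=1 using RAID-1"
        if hosts == 5:
            return "FTT=1 using RAID-5 (2+1)"
        if hosts == 6:
            return "FTT=1 using RAID-5 (4+1)"
        return "FTT=2 using RAID-6 (4+2)"
    if hosts == 3:
        return "FTT=1 using RAID-1"
    if hosts <= 5:
        return "FTT=1 using RAID-5 (2+1)"
    return "FTT=2 using RAID-6 (4+2)"

print(optimal_default_policy("standard", 5))  # FTT=1 using RAID-5 (2+1)
print(optimal_default_policy("standard", 6))  # FTT=2 using RAID-6 (4+2) -- the change described above
```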

Summary

The newly improved Auto-Policy Management feature in vSAN 8 U2 serves as a building block to make vSAN ESA clusters even more intelligent and easier to use. It gives our customers confidence that the resilience settings for their environment are optimally configured.

What should I be paying for NVMe drives for vSAN ESA? (October 2024)

It’s come to my attention that a lot of people shopping for storage really don’t know what to expect for NVMe server drives. Looking at some quotes recently, I can say some of you are getting great prices, and some of you are getting… well, a quote…

I’m seeing discounted prices in the 12 cents per GB range (Read Intensive, datacenter-class drives) to closer to 30 cents per GB (Mixed Use, fancier enterprise-class drives) depending on volume and order. I’m also seeing some outliers (OEMs charging 60 cents per GB?!?!). Seeing better/worse pricing? Message me on twitter @Lost_Signal.

I did look around the ecosystem and see one seller closing in on 10 cents per GB for one of the Samsung drives in an OEM caddy.



While DRAM and other component costs matter, vSAN storage-only clusters with dense nodes (200-300 TiB of NVMe) will typically see over 80% of the hardware BOM go to the NVMe drives. This is driving a lot of focus on drive pricing, and some awkward questions with server sales/accounting teams trying to explain charging 4-5x the going rate for drives.

So why is there such a difference in drive prices?

Drive Types


First off, there are a number of criteria that can influence the price of a drive:

  1. What’s the endurance? Mixed-Use (3 drive writes per day) drives are what vSAN ESA started with, but it is worth noting they cost more. How much more? ~20% more than Read Intensive drives that only support 1 DWPD. Do I need Mixed Use? In short, most of you do not, but you should check your change rate, or write rate (a quick back-of-the-envelope check follows this list). Very high throughput data warehouses doing tons of ETLs or large automation farms may see the need to pay for the fancier drive that will last longer and likely have better high-end write throughput. I would expect 90% of clusters can use Read Intensive drives at this point.
  2. Enterprise or Datacenter class TLC drives – Much like “value SAS” before it, a cheaper, slightly less featured (single port vs. dual port, which does NOT matter inside a server), slightly less performant class of NVMe drive is showing up on quotes. I’m a fan so far; for anything but ultra-high write throughput workloads it should save you some money. It’s positioned well to replace SATA, and it furthers the argument that vSAN OSA is a legacy platform and ESA should be all new builds. Speaking to one vendor recently, they were skeptical of the need for QLC NAND when the cheaper “Datacenter class” TLC can hit pretty solid price points without some of the performance and endurance limits that QLC currently faces (to be fair, we all said the same thing about SLC, MLC, and TLC before, so in the long run I’m sure we will end up on QLC and PLC eventually).
  3. SAS/SATA are not supported by vSAN ESA, but frankly I’m seeing prices that are the same or worse for similar SAS drives. I don’t expect SAS/SATA to show up in the datacenter much going forward, beyond maybe M.2 boot devices.
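
As a rough sanity check for point 1 above, you can compare your cluster’s daily write rate with its raw capacity to see which endurance class you need. This is a simplification (it ignores write amplification from RAID/erasure coding, and endurance warranties are rated per drive), and the numbers below are placeholders rather than a sizing recommendation.

```python
def required_dwpd(daily_writes_tb: float, raw_capacity_tb: float) -> float:
    """Rough drive-writes-per-day needed: total data written per day divided by raw capacity."""
    return daily_writes_tb / raw_capacity_tb

# Example: 20 TB written per day across 153.6 TB of raw NVMe (24 x 6.4 TB drives)
dwpd = required_dwpd(daily_writes_tb=20, raw_capacity_tb=24 * 6.4)
print(f"~{dwpd:.2f} DWPD needed")  # ~0.13 -- comfortably within a 1 DWPD Read Intensive drive
```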

Price List and Discounting

  1. Price list price. Note there are two factors at play here. A vendor will have a list price that is HIGHLY inflated (think 10-12x the component cost to them, or even what a normal person would pay for that device). These price lists are not consistent vendor to vendor. Price lists are not always universal; they might vary per country, by quarter, by contract vehicle, and by company. Negotiated price lists can do some weird things. Contract vehicles that are not updated quarterly effectively mean you have committed to worse prices over time (as market prices go down). Also, older price lists will not include newer drives or SKUs that are cheaper, sometimes forcing customers to purchase older servers/drives at higher cost.
  2. Discount % – When I ask people what they pay for drives or servers, they often reply with a discount percent, with a slight bit of excitement and zero context. This is a bit like me telling people I paid 30% off for an air filter yesterday. (30% off of WHAT?) Discussing discount without knowing the price list markup is a bit like buying a car without knowing what currency you are negotiating in. Different OEMs have different blends of markup and base discounts (a quick worked example follows the tier list below). One Tier 1 OEM’s expected discount tiers look something like this:

    55% – Anyone with a pulse should get this discount.
    65% – If you found a partner and they felt like making 20% off of you, this is your normal pricing for a small order from a small company.
    75% – A reasonable normal discount
    85% – A large order, or an order from a large company who does a lot of purchasing.
    90%+ You bought a railcar sized order.

    Note Tier 2/3 OEMs tend to have much more “Street ready pricing” by default.
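
To see why a discount percentage means nothing without the price list behind it, here is a trivial worked example; the 10x markup and the discount tiers are illustrative only, not any specific vendor’s numbers.

```python
component_cost = 1.0               # normalize the "fair" street cost of a drive to 1.0
list_price = component_cost * 10   # a roughly 10x inflated price-list price

for discount in (0.55, 0.75, 0.90):
    effective = list_price * (1 - discount)
    print(f"{discount:.0%} off list -> you pay {effective:.1f}x the fair street cost")
# 55% off a 10x list price still means paying 4.5x street; even 90% off only gets you to 1x.
```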

Some factors that can influence discount size

  • Size of deal – Larger orders can discount more.
  • Financial Shenanigans – Some server vendors are currently trying to present themselves as SaaS companies in their financial reporting to Wall Street. As part of this cosplaying as a subscription service, they will only quote sane discounts/prices if you structure the deal as a subscription. They may require the deal to have a cloud-connected component that in reality has no real value, but I assure you is required by auditors to comply with ASC 606 accounting regulations and totally isn’t dubiously stretching the line on the unique value requirements of the cloud bits. If you do not want a quote that costs 3x what it should, and would like servers delivered this year instead of 2026, I suggest you roll your eyes and ask for that new cloud thing!
  • Competitive pressure – Competitive deals (meaning there is another vendor quoting servers or drives) typically unlock 10-30% better pricing from the sales team. If you NEVER quote anyone else (even as a benchmark), you will discover your pricing power, even at scale, slowly atrophies over time. Seriously, go invite Lenovo, Hitachi, Fujitsu, or some other vendor to throw a quote at the wall. Even if you plan to stick with your existing OEM, you will find this helps keep pricing a bit more honest.

Common factors for higher prices

Vendor doesn’t want to sell you the drives (because they want to sell something else!) – This one is weird, but if you are asking a VAR/server vendor who also sells storage to quote you NVMe drives for vSAN… they may have a perverse incentive to mis-price them so they can sell you a higher-margin external storage array. Server components (especially to partners, I used to work for one!) tend to offer less margin, and vendor sales reps may have quota buckets they need to fill in storage. This reminds me of the wise words of Eric:

“The customer gets ONE of the votes on what they get to buy” – Enterprise Storage sales rep who I saw make 700K in commission.

You specified a very specific drive they don’t have in stock – Vendors have gotten increasingly annoyed with being forced to stock like-for-like parts for replacement, and the supply chain management of 40 different NVMe drive SKUs (performance, encryption, endurance, and capacity variables) has led their supply chain teams to offer discounts for “agnostic SKUs” (where you get something that meets the spec). While I am partial to some specific drive SKUs, insisting on them can cost you anywhere from 20% to 100% more, as well as delays in shipping. By discounting drives they have in stock and want to sell, they can make sure the server gets sold THIS quarter so they can book revenue now.

Sandbagging, SPIFFs, and other odd sales behaviors – People who sell, most of the time, want to help the customer solve a problem. That said, they are also driven by a long list of various incentives to sell specific things at specific times. This is referred to as “coin-operated” behavior. Sandbagging is a term used when a sales team purposely slows down a deal. This could be because they have hit a ceiling on how much commission they can earn, or are waiting on accelerators to their commission. SPIFFs are one-off payments for selling specific things, often paid not by the sales team’s employer but by a manufacturer or partner directly. It frankly always felt strange to have a storage vendor trying to pay me in Visa gift cards on the side (I generally refused these, as it felt like an illicit transaction), but it does happen.

#vSAN #ESA #NVMe #TCO #Price

Is HPE Tri-Mode Supported for ESA?

No.

Now, the real details are a bit more complicated than that. It’s possible to use the 8SFF 4x U.3 Tri-Mode (not x1) backplane kit, but only if the server was built out with only NVMe drives and no RAID controller/Smart Array. Personally, I’d build off of E.3 drives. For a full BOM review and a bit more detail, check out this twitter thread on the topic where I go step by step through the BOM, outlining what’s on it, why, and what’s missing.

How to configure a fast end to end NVMe I/O path for vSAN

A quick blog post, as this came up recently. Someone who was looking at NVMe-oF with their storage was asking how to configure a similar end-to-end vSAN NVMe I/O path that avoids SCSI or serial I/O queues.

Why would you want this? NVMe in general uses significantly less CPU per I/O compared to SCSI, commonly has simpler hardware requirements (no HBA needed), and can deliver higher throughput and IOPS at lower latency using parallel queuing.

This is simple:

  1. Start with vSAN certified NVMe drives.
  2. Use vSAN ESA instead of OSA (it was designed with NVMe and parallel queues in mind, with additional threading at the DOM layer, etc.).
  3. Start with 25Gbps ethernet, but consider 50 or 100Gbps if performance is your top concern.
  4. Configure the virtual NVMe (vNVMe) controller instead of the vSCSI or LSI BusLogic etc. controllers.
  5. (Optional) Want to shed the bonds of TCP and lower networking overhead? Consider configuring vSAN RDMA (RoCE). This does require some specific configuration to implement, and is not required, but for customers pushing the limits of 100Gbps in throughput it is something to consider.
  6. Deploy the newest vSAN version. The vSAN I/O path has seen a number of improvements even since 8.0GA that make it important to upgrade to maximize performance.

To get started, add an NVMe controller to your virtual machines, and make sure VMware Tools is installed in the guest OS of your templates.

Note you can migrate existing VMDKs to vNVMe (I recommend doing this with the VM powered off). Also, before you do this you will want to install VMware Tools (so you have the VMware paravirtual NVMe controller driver installed).
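
If you prefer to script step 4 rather than click through the vSphere Client, here is a rough pyVmomi sketch of adding a virtual NVMe controller to a powered-off VM. The connection details and the `get_vm_by_name` helper are placeholders you would supply yourself, and moving existing VMDKs onto the new controller is a separate reconfigure step (again, best done powered off).

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ssl._create_unverified_context())
vm = get_vm_by_name(si, "my-vm")   # hypothetical helper -- look the VM up however you prefer

# Build a reconfigure spec that adds a virtual NVMe controller on bus 0
nvme = vim.vm.device.VirtualNVMEController()
nvme.busNumber = 0
nvme.key = -101                    # temporary negative key for a device being added

dev_change = vim.vm.device.VirtualDeviceSpec()
dev_change.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
dev_change.device = nvme

spec = vim.vm.ConfigSpec(deviceChange=[dev_change])
task = vm.ReconfigVM_Task(spec=spec)  # wait on the task, then re-home disks to the new controller
Disconnect(si)
```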

The Problem with 10Gbps

So it’s time to stand up your new VMware cluster. You have reviewed your compute and storage requirements, and have picked hosts with 1-2TB of RAM, 100-300TB of storage, 32 core x 2 socket CPUs and are ready to begin that important consolidation project. You will be consolidating 3:1 from older hosts and before you deploy you get one additional requirement.

Networking Team: “We can only provision 2 x 10Gbps to each host”

You ask why? and get a number of $REASONS.

  • Looking at average utilization for the month it was below 10Gbps.
  • 25G/100Gbps cables and optics sounds expensive.
  • Faster speeds seem unnatural and scary.
  • Networking speed is a luxury for people who have Tigers on gold leashes, and we needed to save money somewhere.
  • There is no benefit to operations.
  • We are not due to replace our top of rack switches until 2034.

Now all of these are bad reasons, but we will walk through them starting with the first one today.

What is the impact of slow networking on my host?

Now you may think that slow networking is a storage team problem, but undersized networking can impact a lot of different things. Other issues to expect from undersized networking:

1. Slower vMotions, higher stun times, and longer host evacuations. As you stuff more and more bandwidth-intensive traffic onto the same link, the greater the contention during host evacuations. This impacts maintenance mode operations and data resynchronization times.

2. Slow backup and restore. While backups may be slower, we can somewhat cheat slow networking using CBT (Changed Block Tracking) and only doing forever-incrementals. Slow large data restore operations are the bigger concern for undersized networking. After a large-scale failure or ransomware attack, you may discover that rehydrating large amounts of data over 10Gbps is a lot slower than over 100Gbps. There is always a bottleneck in backup and restore speed, but the network is generally the cheapest resource to fix. You can try to mitigate this with scale-out backup repositories, more data movers/proxies, and more hosts and SAN ports, but in the end this ends up being far less cost effective than upgrading the network to 25/50/100Gbps.

3. Slower networking for storage manifests itself as worse storage performance, specifically on large throughput operations, but also in short microbursts where latency will creep up. Keep in mind that 10Gbps sounds like a lot, but that is *per second*. If you are trying to move a large block of data in under 5ms, in that time window a single 10Gbps port can only move 6.25MB (the arithmetic is sketched after this list). As we try to pull average latencies down lower, we need to be cognizant of what that link speed means for burst requests. An overtaxed network will often mask the true peak storage demand as back pressure and latency creep in. Pete has a great blog on this topic.

4. Slower large batch operations. Migrations, database transform-and-load operations, and other batch jobs are often bandwidth constrained. You, the operator, may just see this as a 1-2 minute “blip,” but turning a 1-2 minute response in an end-user application into a 10-20 second response can significantly improve the user experience of your application.

5. Tail latency. Applications with complicated chains of requests often are fundamentally bound by the one outlier in response times. Faster networking reduces the chance of contention somewhere in that 14 layer micro-service application the devops team has built.

6. Limitations on storage density. For HCI or any scale-out storage system, you will want adequate network bandwidth to handle node failure gracefully. vSAN has a number of tricks to reduce this impact (the ESA compresses network resyncs, durability components), but at the end of the day you will not want 300TB in a vSAN/Ceph/Gluster/MinIO node on a 10Gbps connection. An insidious feedback loop of slow networking is that it forces expensive design decisions (lower-density hosts, and more of them) that often mask the need for faster networking. Even non-scale-out platforms eventually hit walls on density; a monolithic storage array can scale to a lot more density and run wider fan-out ratios using 100Gbps ethernet than 10Gbps ethernet.
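
To put points 2 and 3 above in perspective, here is the simple arithmetic behind the 6.25MB figure and behind restore times at different link speeds; this ignores protocol overhead, so real numbers will be a bit lower.

```python
def window_mb(link_gbps: float, window_ms: float) -> float:
    """Maximum data (MB) one port can move in a given time window, ignoring overhead."""
    return (link_gbps * 1e9 / 8) * (window_ms / 1000) / 1e6

for speed in (10, 25, 100):
    print(f"{speed:>3} Gbps moves {window_mb(speed, 5):6.2f} MB in a 5 ms window")
# 10 Gbps: 6.25 MB, 25 Gbps: 15.63 MB, 100 Gbps: 62.50 MB

# Rehydrating 50 TB from backup over a single saturated link:
for speed in (10, 100):
    hours = 50e12 / (speed * 1e9 / 8) / 3600
    print(f"{speed:>3} Gbps: ~{hours:.1f} hours to restore 50 TB")
# ~11.1 hours at 10 Gbps vs ~1.1 hours at 100 Gbps
```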

Let us first dig into the first and most common objection to upgrading the network.

“Looking at average utilization for the month it was below 10Gbps”

How do we as architects respond to this statement?

Networks are bursty is my short response. Pete Koehler calls this “the curse of averages.” Most of the tooling people use to make this statement is SNMP monitoring that polls every few minutes. This approach is fine for slowly changing things like temperature, or binary health events like “is the power supply dead?” Unfortunately for networking, a packet buffer can fill up and cause back pressure and congestion in as little as 100ms, and SNMP polling every 5 minutes is not going to cut it. Context around WHEN a network is saturated is also important. If the network is saturated in the middle of the night when backups, database maintenance, or ETL jobs run, I might not actually care. Using an average with a poor sampling frequency, across times when I do and do not care about congestion, is about the worst possible way to make a design decision.
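
Here is a tiny, made-up illustration of that “curse of averages”: one second of utilization sampled at 100ms resolution, where a single saturated burst all but disappears once you average it.

```python
# One second of link utilization, sampled every 100 ms: mostly idle, one saturated burst.
samples_gbps = [0.5, 0.5, 10.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

average = sum(samples_gbps) / len(samples_gbps)
print(f"peak: {max(samples_gbps):.1f} Gbps, average: {average:.2f} Gbps")
# peak: 10.0 Gbps, average: 1.45 Gbps -- and a 5-minute SNMP poll averages over 3,000 of
# these 100 ms windows, so the burst that filled buffers and caused back pressure never shows up.
```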


There are ways to understand congestion and its impacts. You may notice that the outliers in storage latency polling correspond to high network utilization at the same time. You can also get smarter about monitoring and have switches deliver syslog information about buffer exhaustion to your operations tool, then overlay this with other metrics like high CPU usage or high storage latency to understand the impact of slow, undersized networking. (Screenshot: Log Insight generating an alarm.)

Why is observability on networking often bad?

Operations teams are often a lot more blind to networking limitations than they realize. Now, it’s true this tooling will never be perfect, as there are real challenges in getting 100% complete network monitoring.


Why not just SNMP poll every 100ms?

The more frequent the polling, the more likely the monitoring itself starts to create overhead that impacts the networking devices or hosts themselves. Anyone who has turned on debug logging on a switch and crashed it should understand this. Modern efforts to reduce this overhead (dedicated ASIC functions for observability, separation of observability from the data plane in switches) do exist. It is worth noting vSAN has a network diagnostic mode that polls down to 1 second, which is pretty good for acute troubleshooting.

Can we just monitor links smarter?

Physical fiber taps that sit inline and sniff/process the size/shape/function/latency of every packet do exist. Virtual Instruments was a company that did this. People who worked there told me “storage arrays and networks lie a lot,” but the cost of deploying fiber taps and dedicated monitoring appliances per rack often exceeds just throwing more merchant silicon at the problem and upgrading the network to 100Gbps.

What tooling exists today?

Event-driven tooling is often going to be the best way to detect network saturation. Newer ASICs and APIs exist, and even simply having the switch fire a syslog event when congestion is happening can help you overlay networking problems with application issues. VMware Cloud Foundation’s built-in log analytics tooling can help with this, and can be overlaid on the VCF Operations performance graphs to give a better understanding of when the network is causing issues.

Can we just squeeze traffic down the 10Gbps better?

A few attempts have been made to “make 10Gbps work.” The reality is I have seen hosts that could deliver 120K IOPS of storage performance crippled down to 30K IOPS because of slow networking, but we can review ways to make 10Gbps better…

Clever QoS to make slower networks viable?

Years ago, CoS/DSCP were commonly used to protect voice traffic over LANs or MPLS, and while they do exist in the datacenter, most customers rarely use them at the top of rack. Segmenting traffic per VLAN, making sure you don’t discover bugs in implementations, and making sure tags are honored end to end is a lot of operational work. While the vDS supports this, and people may apply it on a per-port-group basis for storage, NIOC traffic shaping is generally about as far as most people operationally want to go down this path.

Smarter Switch ASICS


Clever buffer management: “Elephant traps” (dropping large packets to speed up smaller mice packets) and shared buffer management often worked to prevent one bursty flow, or one large packet, from hogging all the resources. This was common on some of the earlier Nexus switches, and I’m sure it was great if you had a mix of real-time voice and buffered streaming video on your switch, but frankly it is highly problematic for storage flows that NEED to arrive in order.

Deeper Buffer Switches?

The other side of this coin was moving from switch ASICs with 12 or 32MB of buffer to multi-GB buffers. These “ultra deep buffer switches” could help mitigate some port overruns and reduce the need for drops. VMware and others advocated for them for storage traffic and vSAN. With 10Gbps, moving from the lower-end Trident to the higher-end Jericho ASICs we did see much better handling of microbursts, and even sustained workloads; TCP incast was mitigated. As 25Gbps came out, though, we saw only a few niche switches configured this way, and the pricing on them was frankly so close to 100Gbps that just deploying a faster pipe from point A to point B has proven more cost effective than trying to put a bigger bucket under the leak in the roof.

What does faster networking cost?

While some of us may remember 100Gbps ports costing $1000+ a port, networking has gotten a lot cheaper. The same commodity ASICs (Trident 3, Jericho, Tomahawk) power the most common top-of-rack leaf and spine switches in the datacenter today. Interestingly enough, thanks to SONiC you can now even buy your hardware from one vendor and your switch OS or SDN management overlay from another.

While vendors will try to charge large amounts for branded optics, all-in-one cables (AIO) and passive TwinAx copper cables can often be purchased for $15-100 depending on length and temperature tolerance requirements. These cables remove the need to purchase an optic, and reduce issues with dust and port errors by being “welded shut” against the SFP28/QSFP copper transceiver.

(Image from fs.com: a cheap passive TwinAx cable. Passive TwinAx or all-in-one optical cables are not that expensive; for longer runs you will want to consider all-in-one optical.)

$15 – $30 for 25Gbps passive cables

TINA – There is no Alternative (to faster networking)

The future is increasingly moving core datacenter performance-intensive workloads to 100Gbps, with 25Gbps for smaller stacks (and possibly 50Gbps replacing even that soon). The cost economics are shifting there, and the various tricks to squeeze more out of 10Gbps feel a bit like squeezing a single lemon to try to make 10 gallons of lemonade: “the juice isn’t worth the squeeze.” While many of the above problems of slow networking can be mitigated with more hosts, lower performance expectations, and longer operational windows, eventually it becomes clear that upgrading the network is more cost effective than throwing server hardware and time at a bad network.

Peanut Butter is Not Supported with vSphere/Storage Networking/vSAN/VCF

 From time to time I get oddball questions where someone asks about how to do something that is not supported or a bad idea. I’ll often fire back a simple “No” and then we get into a discussion about why VMware does not have a KB for this specific corner case or situation. There are a host of reasons why this may or may not be documented but here is my monthly list of “No/That is a bad idea (TM)!”.

How do I use VMware Cloud Foundation (VCF) with a VSA/Virtual Machine that can not be vMotion’d to another host?

This one has come up quite a lot recently with some partners and storage vendors who use VSAs (a virtual machine that locally consumes storage to replicate it) incorrectly claiming this is supported. The issue is that SDDC Manager automates upgrade and patch management. In order to patch a host, all running virtual machines must be removed. This process is triggered when a host is placed into maintenance mode and DRS carefully vMotions VMs off of the host. If there is a virtual machine on the host that cannot be powered off or moved, this will cause the lifecycle operation to fail.

What about if I use the VSA’s external lifecycle management to patch ESXi?

The issue is that running multiple host patching systems is a “very bad idea” (TM). You’ll have issues with SDDC Manager not understanding the state of the hosts, and coordination of non-ESXi elements (NSX perhaps using a VIB) would also be problematic. The only exceptions to using SDDC Manager with external lifecycle tooling are select vendor LCM solutions that have done the customization and interop work (examples include VxRail Manager, the Redfish to HPE Synergy integration, and packaged VCF appliance solutions like UCP-RS and VxRack SDDC). Note these solutions all use vSAN, avoid the VSA problem, and have done the engineering work to make things play nice.

JAM also not supported!

Should I use a Nexus 2000 FEX (or other low-performing network switch) with vSAN?

While vSAN does not currently have a switch HCL (watch this space!), I have written some guidance specifically about FEXs on this personal blog. The reality is there are politics to getting a KB written saying “do not use something,” and it would require cooperation from the switch vendors. If anyone at Cisco wants to work with me on a joint KB saying “don’t use a FEX for vSAN/HCI in 2019,” please reach out to me! Before anyone accuses me of not liking Cisco, I’ll say I’m a big fan of the C36180YC-R (ultra deep buffers, RAWR!), and have seen some amazing performance out of this switch recently when paired with Intel Optane.

Beyond the FEX, I’ve written some neutral switch guidance on buffers on our official blog. I do plan to merge this into the vSAN Networking Guide this quarter. 

“I’d like to use RSPAN against the vDS and mirror all vSAN traffic,” “I’d like to run all vSAN traffic through an ASA firewall, Palo Alto, IDS, or Cisco ISR,” “I’d like to route vSAN traffic through an F5,” and similar requests…

There’s a trend of security people wanting to inspect “all the things!”.  There are a lot of misconceptions about vSAN routing or flowing or going places.

Good Ideas! – There are some false assumptions that you can’t do the following. While they may add complexity, or not be supported on VCF or VxRail in certain configurations, they are just fine with vSAN from a feasibility standpoint.

  1. Routing storage traffic is just fine. Modern enterprise switches can route (OSPF, static routes, etc.) at wire speed in ASIC offloads. vSAN is supported over layer 3 (you may need to configure static routes!), and this is a “good idea” on stretched clusters so spanning tree issues don’t crash both datacenters!
  2. vSAN over VXLAN/VTEP in hardware is supported.
  3. vSAN over VLAN-backed port groups on NSX-T is supported.

Bad Ideas!

Frank Escaros-Buechsel with VMware support once told someone, “While we do not document that as not supported, it’s a bit like putting peanut butter in a server. Some things we assume are such bad ideas no one would try them, and there is only so much time to document all bad ideas.”

  1. Trying to mirror high-throughput flows of storage or vMotion from a vDS is likely to cause performance problems. While I’m not sure of a specific support statement, I’m going to kindly ask you not to do this. If you want to know how much traffic is flowing and where, consider turning on sFlow/jFlow/NetFlow on the physical switches and monitoring from that point. vRNI can help quite a bit here!
  2. Sending iSCSI/NFS/FCoE/vSAN storage traffic to an IDS/firewall/load balancer. These devices do not know how to inspect this traffic (trust me, they are not designed to look at SCSI or NVMe packets!), so you’ll get zero security value out of this process. If you are looking for virus binaries, you’re better off using NSX guest introspection and regular antivirus software. Because of the volume, you will hit the wire-speed limits of these devices; beyond the added path latency, you will quickly introduce drops and re-transmits and murder storage traffic performance. Outside of some old niche inline FC encryption blades (that I think NetApp used to make), inline storage security devices are a bad idea. While there are some carrier-grade routers that can push 40+ Gbps of encryption (MLXe’s, I vaguely remember, did this), the costs are going to be enormous, and you’ll likely be better off just encrypting at the vSCSI layer using the VM Encryption VAIO filter. You’ll get better security than IPsec/MACsec without massive costs.

Did I get something wrong?

Is there an Exception?

Feel free to reach out and let’s talk about why your environment is a snowflake exception to these general rules of things “not to do!”

VMworld 2018

Another year another VMworld. I’ve got a few sessions I will be presenting:

 

HCI1473BU The vSAN I/O Path Deconstructed: A Deep Dive into the Internals of vSAN
??? Mystery Session: 7/27 at 3:30PM
HCI1769BU We Got You Covered: Top Operational Tips from vSAN Support Insight
HCI3331BU Better Storage Utilization with Space Reclamation/UNMAP

 

The vSAN I/O Path Deconstructed is an interesting inside look at the IO path of vSAN and the reasoning behind it.

We Got You Covered: Top Operational Tips from vSAN Support Insight shows off the phone home capabilities of vSAN and can help address your questions about what and how this data is used. We are also going to discuss how you can leverage similar views of performance as GSS and engineering to identify how to get the most out of vSAN.

HCI3331BU is a session that has been years in the making for me. “Where did my space go” is a question I get often. We will explain where that missing PB of storage went and how to reclaim it. The savings from implementing UNMAP should be able to fund your next VMworld trip!

Lastly, I’ve got a mystery session that should be unveiled later. Follow me on Twitter @Lost_Signal, and I’ll talk about what it will be when the time comes.

Pete and I will be recording the vSpeaking Podcast LIVE at the HCI Zone (found near the VMware booth). We’ve got some new guests as well as some favorites lined up.

vSAN Backup and SPBM policies.

I get asked a lot of questions about how backup works with vSAN. For the most part it’s a simple request for a vendor support statement and VADP/CBT documentation. The benefit of native vSAN snapshots (better performance!) does come up, but I will point out there is more to backup and restore than just the basics. Let’s look at how one vendor (Veeam) integrates SPBM into their backup workflow.

 

Storage-based policies can tie into availability and restore planning. When setting up your backup or replication software, make sure it supports restoring a VM to its SPBM policy, as well as the ability to do custom mapping. You do not want to finish a large restore job and then have to re-align block placement again to apply a policy because only the default cluster policy was used for the restore. This could result in a 2x or longer restore time. Check out this video for an example of what backup and restore SPBM integration looks like.

While some questions are often around how to customize SPBM policies to increase the speed of backups (on Hybrid possibly increase a stripe policy), I occasionally get questions about how to make restores happen more quickly.

A common situation for restores is that a volume needs to be recovered and attached to a VM simply to recover a few files, or to allow temporary access to a retired virtual machine. In a perfect world you can use application- or file-level recovery tools from the backup vendor, but in some situations an attached volume is required. Unlike a normal restore, this copy of data being recovered and presented is often ephemeral. In other cases, the speed of recovery of a service is more important than the protection of its running state (maybe a web application server that does not contain the database). In both these cases I thought it worth looking at creating a custom SPBM policy that favors speed of recovery over actual protection.

 

In this example I’m using a Failures To Tolerate (FTT) of 0. The reason for this is twofold.

  1. Reduce the capacity used by the recovered virtual machine or volume.
  2. Reduce the time it takes to hydrate the copy.

In addition I’m adding a stripe width of 4. This policy will increase the recovery speed by splitting the data across multiple disk groups.
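
The capacity and I/O math behind choosing FTT=0 for a throwaway restore is straightforward. A minimal sketch, assuming simple RAID-1 mirroring (erasure-coded policies have different overheads, e.g. roughly 1.33x for RAID-5):

```python
def hydrated_write_tb(vm_size_tb: float, ftt: int) -> float:
    """Data written while hydrating a restored VM, assuming mirroring (copies = FTT + 1)."""
    return vm_size_tb * (ftt + 1)

print(hydrated_write_tb(2.0, ftt=0))  # 2.0 TB written -- the temporary recovery policy
print(hydrated_write_tb(2.0, ftt=1))  # 4.0 TB written -- roughly double the restore I/O for the same VM
```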

Now it should be noted that some backup software allows you to run a copy directly from the backup storage itself (Veeam’s vPower NFS server is an example). At larger scale this can often tax the performance of the backup storage. This temporary recovery policy could be used for some VMs to speed up recovery of services when protection of the data can be waived for the short term.

Now what if I decide I want to keep this data long term? In this case I could simply change the policy attached to the disk or VM to a safer FTT=1 or FTT=2 setting.

How to make a vSAN storage only node? (and not buy a mainframe!)

I get asked on occasion, “can I buy a vSAN storage-only node?” It’s generally followed by a conversation about how they were told that storage-only nodes are the only way to “control costs in an HCI future.” Generally they were told this by someone who doesn’t support external storage, doesn’t support easy expansion of existing hosts with more drives, and has management tools that are hostile to external storage and in some cases don’t support entire protocols.

It puzzled me at first, as it’s been a long time since someone has tried to spin only being able to buy expansion storage from a single vendor, in large chunks, as a good thing. You would think it’s 1976 and we are talking about storage for mainframes.

 

 

By default vSAN allows you to use all hosts in a cluster for both storage and compute and encourages you to scale both out as you grow.

First off, this is something that can be avoided with a few quick tricks.

  1. If you are concerned about growing storage asymmetrically, I encourage you to design some empty drive bays into your hosts so that you can add additional disk groups in place (it’s not uncommon to see customers double their storage by just purchasing drives and not having to pay more for VMware licensing!). I see customers put 80TB in a host, and with all-flash RAID-5 and deduplication and compression you can get a LOT of data in a host! I’ve seen a customer buy an R730XD, use only 8 drive bays to start, and triple their storage capacity in place by simply buying more (cheaper, as it was a year later) drives!
  2. If this request is because of HIGHLY asymmetric growth of cold data (I have 50TB of data for hot VMs, and 600TB per host worth of cold data growth), I’d encourage you to use vSAN for the hot data and look at vVols for the cold data. VMware is the only HCI platform that gives you a seamless management framework (SPBM) for managing both HCI storage and external storage. vSAN is great for 80% of total use cases (and more than often enough for 100% of many customers), but for corner cases we have a great way to use both. I’ve personally run a host with vSAN, iSCSI, FC, and NFS, and it works and is supported just fine. Having vVols to ease the management overhead of those other profiles can make things a lot better! If you’re growing bulk cold data with NL-SAS drives at large scale like this, JBODs on a modular array are going to be the low-cost option.

Now back to the question at hand. What if the above approaches don’t work? I just need a little more storage (maybe another host or three’s worth), my storage I/O profile is growing with my data so it’s not a hot/cold problem, and I’d rather keep it all on vSAN. You might also have a concern about licensing, as you have workloads that will need to license the host if they use a CPU for compute (Oracle, Windows, etc.). In this case you have two options for a vSAN storage-only node.

First, let’s define what a storage-only node is.

  1. A storage-only node is a node that does not provide compute to the cluster. It cannot run virtual machines without configuration changes.
  2. A storage-only node, while not providing compute, adds storage performance and capacity to the cluster.

The first thing is to determine what licensing you are using.

If you are using vSphere Enterprise Plus, here is how to make a storage node.

Let’s assume we are using all flash and purchase a 2RU host with 24 drive bays of 2.5” drives and fill it full of storage (~80TB of SSD can be put into a host today, but as bigger drives are certified in the future this could easily be a lot more!). Now, to keep licensing costs down we are going to get a single-socket CPU with fewer cores (but keep the clock speed high). This should also help control power consumption.

You can leverage DRS VM/Host rules to keep virtual machines from running on a host. Make sure to use the “MUST” rules, and define that virtual machines will never run on that host.

Deploy Log Insight. It can track vMotions and power-on events and give you a log that shows a host was never used, for licensing/auditing purposes.

At this point we just need a single CPU license for vSphere, and a single vSAN socket license and we are ready to roll. If down the road we decide we want to allow other workloads (maybe something that is not licensed per socket) we can simply tune our DRS rules and allow that host to be used for those virtual machines (maybe carve out a management DRS pool and put vROPS, LI, and the vCSA on those storage hosts?).

Next up, if you are using a licensing tier that does NOT have access to DRS you can still make a storage only node.  

Again, we buy our 2RU server with a single CPU and a token amount of RAM to keep licensing costs down, and stuff it full of 3.84TB drives!

Now since we don’t have DRS we are going to have to find other ways to prevent a VM from being powered onto a host, or vMotioned to a host.

Don’t extend the Virtual Machine port groups to that host!

Deploy a separate vDS for the storage hosts, and do not set up virtual machine port groups. A virtual machine will not power up on a host where it cannot find its port group.

What if I’m worried someone might create a port group?

Just take away their permissions to create port groups, or to change them on virtual machines!

In this case you’re looking at a single socket of vSphere and a single socket of vSAN licensing. Looking at current drive prices, the “premium” in software for this storage-only node would be less than 10% of the cost of the drives. As someone who used to sell storage arrays, I’d put the licensing costs as comparable to what I’d pay for an empty JBOD shelf. There’s a slight premium here for the server, but as you’re adding additional controller capacity, for workloads that are growing I/O along with capacity this isn’t really a bad thing; the alternative was overbuying controller capacity up front to handle this expansion.

The other thing to note is that your investment in vSAN and vSphere licensing is a perpetual one. In three years, when 16TB drives are low cost, nothing stops you from upgrading some disk groups and using your existing licensing. In this way your perpetual license for vSAN is getting cheaper every year.

If you want to control storage and licensing costs, VMware gives you a lot of great options. You can expand vSAN in place, you can add storage-only nodes at a low cost with perpetual licenses, and you can serve wildly diverse storage needs with vVols and the half-dozen protocols we support. Buying into a platform that can only be expanded by a single vendor runs counter to the promise of a software-defined datacenter. It leads us back to the dark ages of mainframes.

Using SD cards for embedded ESXi and vSAN?

*Updated to include a corruption detection script, a better KB on endurance and size requirements for boot devices, and vSphere 7 guidance.*

I get a lot of questions about embedded installations of VMware vSAN.

Cormac has written some great advice on this already.

This KB explains how to increase the crash dump partition size for customers with over 512GB of RAM.

vSAN trace file placement is discussed by Cormac here.

Given that vSAN does not support running VMFS on the same RAID controller used for pass-through, this often causes customers to look at embedded ESXi installs. Today a lot of deployments are done using embedded SD cards because they support a basic RAID-1 mirror.

The issue

While not directly a vSAN issue, this can impact vSAN customers. We have identified this issue on non-vSAN hosts as well.

GSS has seen challenges with lower-quality SD cards exhibiting significantly higher failure rates, as bad batches in the supply chain have caused cascading failures in clusters. VMware has researched the issue and found that an amplification of reads is making the substandard parts fail quicker. Note the devices will not always outright fail, but the problem can be detected by hashing the first 20MB repeatedly and getting different results. This issue is commonly discovered on a reboot. As a result, in 6.0 U3 we have a method of redirecting VMware Tools to a RAM disk, as this was found to be the largest source of reads to the embedded install. The process for setting this is as follows.

Prevention

Log into each host using an SSH connection and set the ToolsRamdisk option to “1”:

1. esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
2. Reboot the ESXi host
3. Repeat for remaining hosts in the cluster.

Thanks to GSS/Engineering for hunting this issue down and getting this workaround out. More information can be found in the KB here. As a proactive measure I would recommend all embedded SD card and USB device deployments use this flag, as well as any environment that seeks faster VMware Tools performance.

Detection

Knowing is half the battle!
This host will likely not survive a reboot!

What if you do not know whether you are impacted by this issue? William Lam has written a great script that checks the MD5 hash of the first 20MB in 3 passes to detect the problem. (Thanks to Dan Barr for testing.)
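
The idea behind such a check is simply to read the first 20MB of the boot device several times and compare the hashes; a healthy device returns the same hash on every pass. Below is my own minimal illustration of that idea (not William Lam’s script), with the device path left as a placeholder.

```python
import hashlib

DEVICE = "/dev/disks/your-boot-device"   # placeholder -- substitute your actual boot device path
CHECK_BYTES = 20 * 1024 * 1024           # first 20 MB
PASSES = 3

hashes = []
for _ in range(PASSES):
    with open(DEVICE, "rb") as dev:
        hashes.append(hashlib.md5(dev.read(CHECK_BYTES)).hexdigest())

if len(set(hashes)) == 1:
    print("Consistent reads:", hashes[0])
else:
    print("WARNING: reads differ across passes -- the device is returning unstable data:", hashes)
```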

Going forward I expect to see more deployments with high-endurance SATADOM devices, and, in future server designs, embedded M.2 slots for boot devices becoming more common, with SD cards retired as the default option. While these devices may lack redundancy, I would expect a higher MTBF from one of them than from a pair of low-quality, low-cost SD cards. The lack of end-to-end nexus checking on embedded devices versus a full drive also contributes to this. Host profiles and configuration backups can mitigate a lot of the challenges of rebuilding one in the event of a failure.

Mitigation

Check out this KB for how to back up your ESXi configuration (somewhere other than the local device).

Evacuate the host, swap in the new device with a fresh install, and restore the configuration.

Looking for a new Boot Device?

Although a 1GB USB or SD device suffices for a minimal installation, you should use a 4GB or larger device; the extra space will be used for an expanded coredump partition on the USB/SD device. Better still, use a high-quality USB flash drive of 16GB or larger so that the extra flash cells can prolong the life of the boot media, though high-quality drives of 4GB or larger are sufficient to hold the extended coredump partition. See Knowledge Base article http://kb.vmware.com/kb/2004784.

Read the new vSphere 7 boot device guidance. Embedded SD/USB installs should be viewed as a legacy option, and larger devices with more performance and endurance capability should be considered.

Looking for guidance on the endurance and size you need for an embedded boot device (as well as vSAN advice)? Check out KB2145210, which breaks out what different use cases need.