
Is HPE Tri-Mode Supported for ESA?

No.

Now, the real details are a bit more complicated than that. It’s possible to use the 8SFF 4x U.3 Tri-Mode (not x1) backplane kit, but only if the server was built out with only NVMe drives and no RAID controller/Smart Array. Personally, I’d build it out of E.3 drives. For a full BOM review and a bit more detail, check out this Twitter thread on the topic, where I go step by step through the BOM outlining what’s on it, why, and what’s missing.

How to configure a fast end to end NVMe I/O path for vSAN

A quick blog post, as this came up recently. Someone who was looking at NVMe-oF with their storage asked how to configure a similar end-to-end vSAN NVMe I/O path that avoids SCSI and its serial I/O queues.

Why would you want this? NVMe in general uses significantly less CPU per I/O compared to SCSI, commonly has simpler hardware requirements (no HBA needed), and can deliver higher throughput and IOPS at lower latency using parallel queuing.

This is simple:

  1. Start with vSAN certified NVMe drives.
  2. Use vSAN ESA instead of OSA (it was designed with NVMe and parallel queues in mind, with additional threading at the DOM layer, etc.).
  3. Start with 25Gbps Ethernet, but consider 50 or 100Gbps if performance is your top concern.
  4. Configure the vNVMe adapter instead of the vSCSI or LSI BusLogic etc. controllers.
  5. (Optional) – Want to shed the bonds of TCP and lower networking overhead? Consider configuring vSAN RDMA (RoCE). This does require some specific configuration to implement and is not required, but for customers pushing the limits of 100Gbps in throughput it is something to consider (a quick host-side check follows this list).
  6. Deploy the newest vSAN version. The vSAN I/O path has seen a number of improvements even since 8.0 GA that make it important to upgrade to maximize performance.
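
If you want to sanity check the host side of this list, a few ESXi shell commands can confirm what the host actually sees. This is just a rough sketch; command availability and output formatting vary a bit by ESXi release.

# List the NVMe devices the host sees (these should all be vSAN certified models)
esxcli nvme device list
# List RDMA-capable (RoCE) devices, if you plan on using vSAN RDMA
esxcli rdma device list
# Show which VMkernel interfaces are tagged for vSAN traffic
esxcli vsan network list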

To get started, add an NVMe controller to your virtual machines, and make sure VMware Tools is installed in the guest OS of your templates.

Note you can migrate existing VMDKs to vNVMe (I recommend doing this with the VM powered off). Before you do this, you will want to install VMware Tools so the VMware paravirtual NVMe controller driver is present in the guest.
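
As a quick sanity check after moving a VMDK to the vNVMe controller (a hedged example assuming a Linux guest; other guest operating systems have their own equivalents), confirm the disks now show up as NVMe devices inside the guest:

# Inside the Linux guest: disks behind the vNVMe controller appear as nvme devices
lsblk
ls /dev/nvme*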

RDTBench – Testing vSAN, RDMA and TCP between hosts

A while back I was asking engineering how they tested RDMA between hosts and stumbled upon RDTBench. This is a traffic generator where you configure one host to act as a “server” and one to several hosts to act as clients communicating with it. This is a great tool for testing network throughput before production use of a host, as well as for validating RDMA configurations, as it can be configured to generate vSAN RDMA traffic. Pings and iperf are great, but being able to simulate RDT (vSAN protocol) traffic has its advantages.

RDTBench does manifest itself as traffic on the vSAN performance service host networking graphs.
A few quick questions about it:

Where is it?

/usr/lib/vmware/vsan/bin

How do I run it?
You need to run it on two different hosts. One host will need to be configured to act as a client (by default it runs as a server). For the server I commonly use the -b flag to make it run bidirectionally on the transport. -p rdma will run it in RDMA mode to test RDMA.
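
As a rough sketch of what a run looks like (only the -b, -p, and -h flags mentioned above are shown here; confirm the exact client-side flags for pointing at the server’s vSAN IP with ./rdtbench -h on your build):

cd /usr/lib/vmware/vsan/bin
# On the "server" host: listen (the default role), bidirectional (-b), RDMA transport (-p rdma)
./rdtbench -b -p rdma
# On the client host(s): check the built-in help for the client/target options, then point rdtbench at the server
./rdtbench -h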

If RDMA is not working, go ahead and turn vSAN RDMA on (cluster, config, networking).

vSAN will “fail safe” back to TCP, but tell you what is missing from your configuration.

./rdtbench -h will provide the full command help.

For now this tool primarily exists for engineering (used for RDMA NIC validation) as well as support (as a more realistic alternative to iperf), but I’m curious how we can incorporate it into other workflows for testing the health of a cluster.

Where to run your vCenter Server? (On a vSAN Stretched Cluster)

In a perfect world, you have a management cluster that hosts your vCenter Server, and the management of every other cluster lives somewhere else. Unfortunately, the real world happens and:

  • Something has to manage the management cluster.
  • Sometimes you need a cluster to be completely standalone. 

Can I run the vCenter server on the cluster it manages?

It is FULLY supported to run the vCenter Server on the cluster that it is managing. HA will still work. If you want a deeper dive on this issue, this short video covers the question. 

So what is the best advice when doing this?

  1. Use ephemeral port groups for all management networks. This prevents vDS chicken-and-egg issues that are annoying but not impossible to work around. 
  2. I prefer to use DRS SHOULD rules so the vCenter Server will “normally” live on the lowest host number/IP address in the cluster. This is useful in a situation where vCenter is unhealthy and the management services are failing to start, as it makes it easy to find which host is running it (a quick way to check from the host shell is shown below). Make sure to avoid using “MUST” rules for this, as they would prevent vCenter from running anywhere else in the event that host fails. 
You can attach VMK ports to an ephemeral port group even if the VCSA is offline.
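
If vCenter is down and you need to find which host is currently running the VCSA, a quick approach (a hedged sketch; the grep pattern assumes your vCenter VM’s name contains “vcenter”) is to SSH to each host and check its registered VMs:

# On each ESXi host: list registered VMs and look for the vCenter appliance
vim-cmd vmsvc/getallvms | grep -i vcenter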

But what about a stretched cluster? I have a standalone host to run the witness server; should I put vCenter there? 

No, I would not recommend this design. It is always preferable to run the vCenter Server somewhere it will enjoy HA protection and will not need to be powered off to patch a host. vSAN stretched clusters always support active/active operations, but many customers configure them with most workloads running in the preferred datacenter location. If you use this configuration, I recommend you run the vCenter Server in the secondary location for a few reasons:

  1. In the event the primary datacenter fails, you will not be “operationally blind” while HA is firing off and recovering workloads. This avoids the operational blind spot that would otherwise exist for a few minutes while the vCenter Server fails over. 
  2. It will act as a weathervane to the health of the secondary datacenter. It is generally good to have SOME sort of workload running at the secondary site to provide some understanding of how those hosts will perform, even if it is a relatively light load.

Disable Intel VMD for drives being used for VMware vSAN

My recommendation is to please disable Intel VMD (Volume Management Device) and use the native NVMe inbox driver to mount devices for VMware vSAN going forward. To be clear, Intel VMD is NOT a bad technology, but we do not need or want it in the I/O path for VMware vSAN going forward. It can be useful for RAID-on-chip on NVMe boot devices. In addition, it was the only method to reliably get hotplug and serviceability (blink lights) prior to the NVMe spec being “finished,” which is why it was sometimes used for some older, early NVMe vSAN configurations.

Looking at the VCG, a number of drives are only being certified using the inbox driver and not the Intel driver.
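
You can also check from the ESXi shell which driver has claimed your NVMe adapters. This is a rough check (driver names vary by release): with Intel VMD in the path you will typically see the Intel VMD driver listed rather than the inbox nvme_pcie driver.

# Show storage adapters and the driver that claimed each one
esxcli storage core adapter list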

To disable this you need to configure the BIOS/UEFI. Here’s an example for Lenovo (who I believe defaults to having it enabled).

Jason Massae has a great blog that covers how to use Intel VMD in more detail, and Intel has its own documentation for non-vSAN use cases.

Yes, you can change things on a vSAN ESA ReadyNode

First, I’m going to ask you to go check out the following KB and take 2-3 minutes to read it: https://kb.vmware.com/s/article/90343

Pay extra attention to the table in the document it links to.

Also go read Pete’s new blog explaining read intensive drive support.

So what does this KB mean in practice?

You can start with the smallest ReadyNode (currently this is an AF-2, but I’m seeing some smaller configs in the pipeline) and add capacity, drives, or bigger NICs, and make other changes based on the KB.

Should I change it?

The biggest thing to watch for is that adding TONS of capacity without increasing NIC sizes could result in longer-than-expected rebuilds. Putting 300TB into a host with 2 x 10Gbps NICs is probably not the greatest idea, while adding extra RAM or cores (or changing the CPU frequency 5%) is unlikely to yield any unexpected behaviors. In general, balanced designs are preferred (that’s why the ReadyNode profiles exist as a template), but we do understand sometimes customers need some flexibility, and that is why the KB above was created to support it.
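
A hedged back-of-envelope illustration of why this matters: 2 x 10Gbps is roughly 2.5 GB/s of combined line rate at best. Streaming 300TB (about 300,000GB) of resync traffic at that rate works out to roughly 300,000 GB / 2.5 GB/s ≈ 120,000 seconds, or around 33 hours, before any protocol overhead, competing workload traffic, or drive limits are factored in. The same capacity behind 2 x 100Gbps NICs (about 25 GB/s) drops that floor to roughly 3-4 hours.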

What can I change?

I’ve taken the original list and converted it to text, as well as added (in italics) some of my own commentary on what and how to change ESA ReadyNodes. I will be updating this blog as new hardware comes onto the ReadyNode certification list.

CPU

  • Same or higher core count with similar or higher base clock speed is recommended.
  • Each vSAN ESA ReadyNode™ is certified against a prescriptive BOM.
  • Adding more memory than what is listed is supported by vSAN, provided vSphere supports it. Please maintain a balanced memory population configuration when possible.
  • If you want to scale storage performance with additional drives, consider more cores. While vSAN OSA was more sensitive to clock speed for scaling aggregate performance, vSAN ESA’s additional threading makes more cores particularly useful for scaling performance.
  • As of the time of this writing the minimum number of cores is 32. Please check the vSAN ESA VCG profile page for updates to see if smaller nodes have been certified.

Storage Devices (NVMe drives today)

  • Device needs to be same or higher performance/endurance class.
  • Storage device models can be changed to vSAN ESA certified disks. Please confirm with the server vendor for storage device support on the server.
  • We recommend balancing drive types and sizes (homogeneous configurations) across nodes in a cluster.
  • We allow changing the number of drives and using drives at different capacity points (the change should be contained within the same cluster) as long as it meets the capacity requirement of the profile selected and does not exceed the max drives certified for the ReadyNode™. Please note that performance is dependent on the quantity of the drives.
  • Mixed-use NVMe (typically 3DWPD) endurance drives are best for large-block, steady-state workloads. Lower endurance drives that are certified for vSAN ESA may make more sense for read-heavy, shorter duty cycle, storage-dense, cost-conscious designs.
  • 1DWPD ~15TB “read intensive” drives are NOW on the vSAN ESA VCG; for storage-dense, non-sustained, large-block write workloads these offer a great value.
  • Consider rebuild times, and consider also upgrading the number of NICs for vSAN or the NIC interfaces to 100Gbps when adding significant amounts of capacity to a node.

NIC

  • NICs certified in the IOVP can be leveraged for vSAN ESA ReadyNodes™.
  • NIC should be same or higher speed.
  • We allow adding additional NICs as needed.
  • If/when 10Gbps ReadyNode profiles are released, it is advised to still consider 25Gbps NICs, as they can operate at 10Gbps and support future switching upgrades (SFP28 interfaces are backwards compatible with SFP+ cables/transceivers).

Boot Devices

  • Boot device needs to be same or higher performance endurance class.
  • Boot device needs to be in the same drive family.

TPM

Please just buy a TPM. It is critically important for vSAN encryption key protection, securing the ESXi configuration, host attestation, and more. They cost $50 up front, but cost hours of annoying maintenance to install after the fact. I suggest throwing an NVMe drive at any sales engineer who leaves them off a quote.

NFS Native Snapshots: Should I just use vVols instead?

The ability to offload snapshots natively to an NFS filer has been around for a while. Commonly this was used with View Composer Array Integration (VCAI) to rapidly clone VDI images, and occasionally for VMware Cloud Director environments (Fast Clone for vApps). There were some caveats to consider:

  • Up until vSphere 7 Update 2 the first snapshot had to be a traditional redo log snapshot.
  • VMware blocks Storage vMotion for VMs with native snapshots. (You will need to use array replication and a bit of scripting to move these VMs.) This leads to the most important caveat.
  • A snapshot.alwaysAllowNative = “TRUE” setting for virtual machines was introduced. This allows a virtual machine on an NFS datastore with the VAAI plugin to create native snapshots regardless of whether its base disk is a flat one or not.
  • If the filer refuses to create a snapshot (most commonly seen when a filer refuses to allow snapshots while doing a background automated clone or replication on some storage platforms), it will revert to a redo log snapshot. It is worth noting that “alwaysAllowNative” doesn’t actually prevent this fallback behavior.
  • Some filer vendors will automatically inject snapshot.alwaysAllowNative = “TRUE” into VMs (a quick way to check a VM for this is shown after this list).
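
If you want to check whether a given VM already has this setting (a quick, hedged example from the ESXi shell; substitute your own datastore and VM names for the placeholders), you can grep the VM’s .vmx file:

# Look for the advanced setting in the VM's configuration file
grep -i alwaysAllowNative /vmfs/volumes/<datastore>/<vm name>/<vm name>.vmx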

The challenge with this in particular is that it can cause a problem. A chain that goes from native to redo log back to native (or redo log to native to redo log) is invalid and leads to disk corruption!

So what are my options if this is a risk in my environment?

I’ll first point out that vVols allow offloading of snapshots WHILE retaining support for Storage vMotion. It’s a fundamentally simpler and more elegant solution to the problem of natively offloading snapshots.

For most NFS VAAI users this should not be an issue, as the filer should just create native snapshots when asked. For platforms that have issues taking native snapshots when other background processes are running, consider disabling the background replication/cloning that is automatically tied to the snapshot tree. If this is not an option, consider not using snapshot.alwaysAllowNative and performing full clones, or not using the NFS VAAI clone offload at all. Hopefully in the future there will be a further patch to prevent this issue.

vSAN ESA Design Tips

Here’s a quick Twitter thread covering some top things to think about with vSAN ESA (Express Storage Architecture) design.

How do I secure and encrypt an ESXi Boot Device?

It’s time for a talk on Boot devices. No, we are not talking about SD cards, instead, we are going to talk about encryption and security of boot devices!

One trend lately has been to use PCIe-attached RAID controllers for a pair of M.2 SATA/NVMe devices that boot the server, for example the Dell BOSS (a great option!). One challenge for some customers is that these controllers often lack encryption support.

So first off. Do you even need to worry about this? What is the attack surface of an ESXi boot device?

Securing other keys – If you didn’t use TPMs for caching vSAN encryption keys, in theory, those would be there. This is easy to solve by spending $50 on a TPM, and the keys will be cached there instead.

You can pay $50 up front, or spend hours of your life in a data center manually trying to add these into a host.

Attestation – You may want to make sure someone didn’t meddle with the binaries, and that you can trust the full chain of code used to boot the system, including firmware. UEFI Secure Boot and host attestation (which requires a TPM) cover this. VMkernel.Boot.execInstalledOnly is a setting that makes sure arbitrarily uploaded binaries can’t be executed. Remember you don’t actually have to encrypt the full boot device to protect binary integrity; this is handled by verifying signatures and UEFI Secure Boot.

Protecting the configuration file from tampering and/or being read – While I find it unlikely anyone is going to physically do anything interesting with my ESXi information (oh no, they might learn I use time.vmware.com for NTP /s), there are some paranoid customers out there who have hosts in less-than-secure locations or consider the IP addresses of their DNS servers to be highly proprietary. Starting in vSphere 7 U2 the ESXi configuration is encrypted by default, and with a TPM the encryption keys will be securely sealed in the TPM. For more information on this see docs.vmware.com.

Summary of a secure boot chain

So with a TPM + Secure Boot + VMkernel.Boot.execInstalledOnly + TPM-sealed configuration encryption, a stolen or physically tampered-with boot device will not expose sensitive data or be usable to compromise a host.
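
A rough way to spot-check most of this chain from the ESXi shell (a sketch; option names and output vary a little between 7.x and 8.x releases):

# Is a TPM present, and was the host measured at boot?
esxcli hardware trustedboot get
# Configuration encryption mode and related requirements (vSphere 7 U2 and later)
esxcli system settings encryption get
# Check the execInstalledOnly kernel setting (enforce running only installed, signed binaries)
esxcli system settings kernel list -o execinstalledonly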

“Is this enough?”

Personally, I think the above techniques will cover 98% of customer requirements to secure boot devices and to encrypt and sign what matters, in a way that someone can’t do anything useful even with physical access to a boot device. For the truly paranoid, though, I would be remiss not to mention the following ways to 100% encrypt the entire boot device. Note that if you go down this path you would still likely want to implement the above steps anyway and will still need/want a TPM, so this is not necessarily an “or” option, as anyone this paranoid is going to need/want defense in depth.

Full Device Encryption

But what if my security team is demanding full volume encryption? Well for these cases there are some options.

  1. Buy a RAID controller that supports SEDs.
  2. Look at Intel Virtual RAID on CPU (VROC) for NVMe devices. Intel VMD is one system that can provide RAID 1 for NVMe boot devices without the need for an add-in card, and it can also manage encryption if SED NVMe devices are used. Note you will still need SEDs, as Intel VMD itself doesn’t do the encryption; it just passes off the keys to the out-of-band controller (iLO/iDRAC/CIMC, etc.).

Generally, you will need an external KMIP-compliant KMS to make this work securely, but again, talk to your server OEM.

Final Thoughts

I don’t claim to be the expert on vSphere Security or all compliance scenarios. I would love to hear your feedback and concerns. I’m on Twitter @Lost_signal.

Other reading:

https://www.truesec.com/hub/blog/secure-your-vmware-esxi-hosts-against-ransomware – Hat tip to Anders Olsson for collecting a lot of useful information on securing ESXi boot.

VeeamON (What I’m watching)

I’m going to keep a blog of sessions and events I’m checking out and interested in for VeeamON 2022. This will get updated as the week goes on (and may serve as the basis for some podcast interviews).

Object First

I’ve been tracking ObjectFirst.com out of the corner of my eye as a stealth project. They seem to be building “the best backup-optimized object storage system” (or something like it). I have few details and a few theories, but I am strongly looking forward to the announcement on Monday.

Lab Warz

I still remember the first time I sat down for Veeam’s quirky take on a competitive hands-on lab competition. The theming, practical skill testing, and adrenaline-pumping “time to do this fast!” feeling was unlike anything I’d ever seen at a conference. I see this listed as virtual only, so I look forward to seeing if I can barrel roll through it without too shameful of a score. Even if you don’t feel up to the challenge, see if you can learn a thing or two about some features you might be able to find value in.

Veeam Plug-in for SAP Now and Later

In a former life I used to deal with application-level recovery for various applications, ranging from the usual suspects (Exchange, SQL, Oracle) to a few weirder ones. I like occasionally checking out the backup and recovery of virtualized applications I never operated. It exposes me to challenges that are similar (distributed state concerns) but also to the uniqueness of metadata and the blending of VADP/CBT and native tooling. Way too many application backups end up maintained by DBA-written scripts with dubious alerting, and it’s good to see how Veeam is working with SAP and their Backint framework to offer protection in a way that is supported and allows for consistent restores while still using fancier hypervisor and storage-level snapshot offload. One unique workflow I wasn’t familiar with was a “restore license key” option on restore, which seemed like a pretty nice thing to include, as restoring state often includes small things people forget about.

Debanjan Banerjee does a great job walking through how the different pieces come together, and it serves as a good reminder of why virtualizing SAP is always a good idea.