May 12

What to look at and do (or not) when recovering from a cluster failure (Part 2)

In part one of this series, I highlighted a scenario where we lost quite a few hosts in a lab vSAN cluster caused by 3 failed boot devices and a power event that forced a reboot of the hosts. Before I get back into the step by step of the recovery I wanted to talk a bit about what we didn’t do.

What should you do?

If this is production please call GSS. They have unusually calm voices and can help validate decisions quickly and safely before you make them. They also have access to recovery tooling, and escalation engineers you do not have.
Try to get core services online first (DNS/NTP/vCenter). This makes restoring other services easier. In our case, we were lucky and had only partial service interruption here (1 of 2 DNS servers were impacted).

Cluster Health Checks

While, I much prefer to work in vCenter, in the event of vCenter having an outage, it is worth noting that vSAN health checks can be run without vCenter.

Run at the CLI
Run from the Native HTML5 client on each ESXi host. The cluster health is a distributed service that is independent of vCenter for core checks.

Solving the chicken egg monitoring problem since 2017!

When reviewing the impact on the vSAN cluster look at the Cluster Health Checks:

How many objects are re-syncing, and what is the progress.

Note, in this case I just captured a re-balance operation

2. How many Components are healthy vs. unhealthy

3. Drive status – How many drives and disk groups are offline. note, within the disk group monitoring you can see what virtual machine components were on the impacted disk groups.

4. Service Check. See how many hosts are reporting issues with vSAN related services. In my case this was the hint that one of my hosts had managed to partially boot, but something was wrong. Inversely if you may see a host that is showing disconnected from vCenter, but is still contributing storage. It is worth noting that vSAN can continue to run and process storage IO as long as the vSAN services start, and the vSAN network is functional. It’s partly for this reason that when you enable vSAN, the HA heartbeats move to the vSAN network, as it’s important to keep your HA fencing in line with storage.

5. Time is synchronized across the cluster. For security reasons, hosts will become isolated if clocks drift too far (Similar to active directory replication breaking, Kerberos authentication not working etc. Thankfully there is a handy health check for this.

Host 16 used a stratum 16 $10 rolex I bought for cheap while traveling.

What Not to do?

Don’t panic!

Ok, so we had to do a bit more than that…

Also, while you are at it, don’t reboot random hosts.

This advice isn’t even specifically vSAN advice, but unlike your training with Microsoft desktop operating systems, the solution to problems with ESXi is not always to “tactically reboot” a host by mashing reset from the iDRAC. You might end up rebooting a perfect health host that was in the middle of a resync, or HA operation. Rebooting more health hosts does a few things:

It causes more HA events. HA events trigger boot storms. large bursts of disk IO as an Operating system reboots, databases force log rechecks, in-memory databases rebuild their memory caches and other processes that are normally staggered.
Interrupt object rebuilds. In our case (3 hosts failures and FTT=1) we had some VM’s that we lost quorum on, but many more that only lost 1 of 3 pieces. Making sure all objects that can be repaired are repaired quickly was the first order of battle.
Rebooting hosts can dump logs or crash dumps that are not being written to persistent disk. GSS may want to scrape some data out of even a 1/2 dead host if possible.

Assemble the brain trust

Remember, always have a Irish guy named Myles ready to help fix things

A few other decisions came up as Myles, Teodora and I spoke about what we needed to do to recover the cluster. We also ruled out a few recovery methods and decided on a course of action to get the cluster stable, and then begin the process of proactively preventing this from impacting us with other hosts.

Salvage a boot device from a capacity device – We briefly discussed grabbing one of the capacity devices out of the dead hosts and using it as a boot device. Technically this would not be a supported configuration (or controller is not supported to act as both a boot device and a device hosting vSAN capacity devices). The challenge here is we wanted to get back 100% of our data and it would have been tedious to identify which disk group was safe to sacrifice in a host for this purpose. If we were completely unable to get remote hands to install boot devices or were only interested in the recovery of a single critical VM at all costs, this might have made sense to investigate.
Drive Switcharo– Another option for recovery has our remote hands pull the entire disk group out of the dead servers and shove them into free drive bays on existing healthy servers. Pete Koehler mentioned this is something GSS has had success and something I’d like to dedicate to its own blog topic at some point. Why does this work? Again, vSAN does not store metadata or file system structures on the boot devices, purposely to increase survivability in cases where the entire server must be replaced. This historically was not a common behavior in enterprise storage arrays that would often put this data on OS/vault drives (that might not be movable even, or embedded). Given we had adequate drive bays free to split the 6 impacted disk groups (2 per host) across the remaining 13 hosts in the cluster this was an option. In our case, we decided we didn’t want to deal with moving them back after this was done. My remote hand’s teams were busy enough with vSphere 7 launch tasks, and COVID related precautions were reducing the staffing levels.
Fancy boot devices – We decided to avoid trying to use SD cards going forward as our primary boot option (even mirrored). Once these impacted hosts were online and the cluster was healthy we had ops plug in all of our new boot devices so we could proactively one host at a time process a fresh install. In a perfect world we would have had M.2 boot devices, but adding a PCI-E riser for this purpose on 4-year-old lab hosts was a bit more than we wanted to spend.

What did we do?

In our case, we called our data center ops team and had them plug in some “random USB drives we have laying around” and began fresh installs to get the hosts online and restore access to all virtual machines. I ordered some high endurance Sandisk USB devices and as a backup some high endurance SD cards (Designed for 4K Dashcam video usage). Once these came in, we reinstalled ESXi to the USB devices allowing our ops teams to recover their USB devices. The fresh high-quality SD cards will be useful for staging ISOs inside the out of band, as well as serving as an emergency boot device in the event a USB device fails.

Next up in the series. A walk through of installing ESXi from bare metal, some changes we made to the hosts and I’ll answer the question of “what’s up withe snake hiding in our R&D datacenter”.

May 11

How to rebuild a VCF/vSAN cluster with multiple corrupt boot devices

Note: this is the first part of a series.

In my lab, I recently had an issue where a large number of hosts needed to be rebuilt. Why did they need to be rebuilt? If you’ve followed this blog for a while, you’ve seen the issues I’ve run into with SD cards being less than reliable boot devices.

Why didn’t I move to M.2 based boot devices? Unfortunately, these are rather old hosts and unlike modern hosts, there is not an option for something nice like a BOSS device. This is also an internal lab cluster used by the technical marketing group, so while important, it isn’t necessary “mission critical” by any means.

As a result of this, and a power hiccup I ended up with 3 hosts offline that could not restart. Given that many of my VM’s were set to only FTT=1 this means complete and total data loss right?

Wrong!

First off, the data was still safe on the disk groups of the 3 offline hosts. Once I can get the hosts back online the missing components will be detected and the objects will become healthy again (yah, data loss!). vSAN does not keep the metadata or data structures for the internal files systems and object layout on the boot devices. We do not use the boot device as a “Vault” (if your familiar with the old storage array term). If needed all of the drives in a dead host can be moved to a physically new host and recovery would be similar to the method I used of reinstalling the Hypervisor on each host.

What’s the damage look like?

Hopping into my out of band management (My datacenter is thousands of miles away) I discovered that 2 of the hosts could not detect their boot devices, and the 3rd failed to fully reboot after multiple attempts. I initially tried reinstalling ESXi on the existing devices to lifeboat them but this failed. As I noted in a previous blog, SD cards don’t always fully fail.

Live view of the SD cards that will soon be thrown into a Volcano

If vSAN was only configured to tolerate a single failure, wouldn’t all of the data at least be inaccessible with 3 hosts offline? It turns out this isn’t the case for a few reasons.

vSAN does not by default stripe data wide to every single capacity device in the cluster. Instead, it chunks data out into fresh components every 255GB (Note you are welcome to set strip width higher and force more sub-components being split out of objects if you need to).
Our cluster was large. 16 hosts and 104 physical Disks (8 disks in 2 disk groups per host).
Most VM’s are relatively small, so out of the 104 physical disks in the cluster, having 24 of them offline (8 per host in my case). still means that the odds of those 24 drives hosting 2 of the 3 components needed for a quorum is actually quite low.
A few of the more critical VM’s were moved to FTT=2 (vCenter, DNS/NTP servers) making their odds even better.

Even in the case of a few VM’s that were impacted (A domain Controller, some front end web servers), we were further lucky by the fact that these were redundant virtual machines already. Given both of the VMs providing these services didn’t fail, it became clear with the compounding ods in our favor that for a service to go offline was more in the odds of rolling boxcars twice, than a 100% guarantee.

This is actually something I blogged about quite a while ago. It’s worth noting that this was just an availability issue. In most cases of actual true device failure for a drive, there would normally be enough time between loss to allow for repair (and not 3 hosts at once) making my lab example quite extreme.

Lessons Learned and other takeaways:

Raise a few Small but important VM’s to a higher FTT level if you have enough hosts. Especially core management VMs.
vSAN clusters can become MORE resilient to loss of availability the larger they are, even keeping the same FTT level.
Use higher quality boot devices. M.2 32GB and above with “real endurance” are vastly superior to smaller SD cards and USB based boot devices.
Consider splitting HA service VM’s across clusters (IE 1 Domain Controller in one of our smaller secondary clusters).
For Mission-Critical deployments use of a management workload domain when using VMware Cloud Foundation, can help ensure the management is fully isolated from production workloads. Look at stretched clustering, and fault domains to take availability up to 11.
Patch and reboot your hosts often. Silently corrupt embedded boot devices may be lurking in your USB/SD powered hosts. You might not know it until someone trips a breaker and suddenly you need to power back on 10 hosts with dead SD devices. Regular patching will catch this one host at a time.
While vSAN is incredibly resilient always have BC/DR plans. Admins make mistakes and delete the wrong VMs. Datacenters are taken down by “Fire/Flood/Blood” all the time.

I’d like to thank Myles Grey and Teodora Todorova Hristov for helping me make sense of what happened and getting the action plan to put this back together and grinding through it.

May 1

Keeping track of VCF and vSAN cluster driver/firmware

Are you building out a new VMware Cloud Foundation cluster, and trying to make sure you stay up to date with your vSAN ReadyNodes driver/firmware updates? Good news, there are a few options for tracking new driver/firmware patches.

The first method is simple, try out the new vLCM functionality. This allows for seamless updates of firmware/drivers for drives and controllers as well as system BIOS and other devices. It also has integration to verify key driver/firmware levels for the vSAN VCG sub-components. For those of you looking into this go check, the VCG for compatible hardware check out this blog post.

What about for clusters where you can not use vLCM yet. Maybe, your servers are not yet supported?

The vSAN VCG notification service can help fill the gap. It allows you to subscribe to changes. Subscribing will set you up for email alerts that will show changes to driver and firmware versions, as well as when updates and major releases. You can sign up for individual components, as well as for an entire ReadyNode specification.

Changes are reflected in a clear color-coded view showing what has been removed and what has been added to replace the entry.

The ReadyLabs team keep continuing to make it easier to keep your VMware Cloud Foundation environment up to date. If you have any more questions about the service, be sure to check out the FAQ. If you have any questions on this or the vSAN VCG reach out by email to [email protected]

Apr 22

Understanding File System Architectures.

File System Taxonomy

I’ve noticed that Clustered File Systems, Global file systems, parallel file systems and distributed file systems are commonly confused and conflated. To explain VMware vSAN™ Virtual Distributed File System™ (VDFS) I wanted to highlight some things that it is not. I’ll be largely pulling my definitions from Wikipedia but I look forward to hearing your disagreements on twitter. It is work noting some file systems can have elements that cross the taxonomy of file system layers for various reasons. In some cases, some of these definitions are subcategories of others. In other cases, some file systems (GPFS as an example) can operate in different modes (providing RAID and data protection, or simply inherent it from a backing disk array).

Clustered File System

A clustered file system is a file system that is shared by being simultaneously mounted on multiple servers. Note, there are other methods of clustering applications and data that do not involve using a clustered file system.

Parallel file systems

Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance. While the vSAN layer mirrors some characteristics (Distributed RAID and striping) it does not 100% match with being a parallel file system.

Examples would include OneFS and GlusterFS.

shared-disk file system

shared disk file systems are a clustered file system but are not a parallel file system. VMFS is a shared disk file system. The most common form of a clustered file system that leverages a storage area network (SAN) for shared access of the underlying LBAs. Clients are forced to handle the translation of file calls, and access control as the underlying shared disk array has no awareness of the actual file system itself. Concurrency control prevents corruption. Ever mounted NTFS to 2 different windows boxes and wondered why it corrupted the file system? NTFS is not a shared disk file system and the different operating systems instances do not independently by default know how to cleanly share the partition when they both try to mount it. In the case of VMFS, each host can mount a given volume as read and write, while cleanly making sure that access to specific subgroups of LBA’s used for different VMDKs (or even shared VMDKs) is properly handled with no data corruption. This is commonly done over a storage area network (SAN) presenting LUNs (SCSI) or namespaces (NVMe over fabrics). protocol to share this is block-based and can range from Fibre Channel, iSCSI, FCoE, FCoTR, SAS, Infiniband etc.

Example of 2 hosts mounting a group of LUNs and using VMFS to host VMs

Examples would include: GFS2, VMFS, Apple xSAN (storenext).

Distributed file systems

Distributed file systems do not share block-level access to the same storage but use a network protocol to redirect access to the backing file server exposing the share within the namespace used. In this way, the client does not need to know the specific IP address of the backing file server, as it will request it when it makes the initial request and within the protocol (NFSv4 or SMB) be redirected. This is not exactly a new thing (DFS in Windows is a common example, but similar systems were layered on top of Novell based filers, proprietary filers etc). These redirects are important as they prevent the need to proxy IO from a single namespace server and allow the data path to flow directly from the client to the protocol endpoint that has active access to the file share. This is a bit “same same but different” to how iSCSI redirects allow connection to a target that was not specified in the client pathing, or ALUA pathing handles non-optimized paths in the block storage world. For how vSAN exposes this externally using NFS, Check out this blog, or take a look at this video:

The benefits of a distributed file system?

Access Transparency. This allows back end physical data migrations/rebuilds to happen without the client needing to be aware and re-pointing at the new physical location. clients are unaware that files are distributed and can access them in the same way as local files are accessed.
Transparent Scalability. Previously you would be limited to the networking throughput and resources of a single physical file server or host that hosted a file server virtual machine. With a distributed file system each new share can be distributed out onto a different physical server and cleanly allow you to scale throughput for front end access. In the case of VDFS, this scaling is done with containers that the shares are distributed across.
Capacity and IO path efficiency – Layering a scale-out storage system on top of an existing scale-out storage system can create unwanted copies of data. VDFS uses vSAN SPBM policies on each share and integrates with vSAN to have it handle the data placement and resiliency. In addition layering, a scale-out parallel file system on top of a scale-out storage system leads to unnecessary network hops for the IO path.
Concurrency transparency: all clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. This is distinctly different from how some global file systems operate.

It is worth noting that VDFS is a distributed file system that exists below the protocol supporting containers. A VDFS volume is mounted and presented to the container host using a secure direct hypervisor interface that bypasses TCP/IP and the vSCSI/VMDK IO paths you would traditionally use to mount a file system to virtual machine or container. I will explore more in the future. For now, Duncan Explains it a bit on this blog.

Examples include: VDFS, Mirosoft DFS, BlueArc Global Namespace

Global File System

Global File Systems are a form of a distributed file system where a distributed namespace provides transparent access to different systems that are potentially highly distributed (IE in completely different parts of the world). This is often accomplished using a blend of caching and the use of weak affinity. There are trade-offs in this approach as if the application layer is not understood by the client accessing the data you have to deal with manually resolving conflicting save attempts of the same file, or forcing one site to be “authoritative” slowing down non-primary site access. While various products in this space have existed they tend to be an intermediate step for an application-aware distributed collaboration platform (or centralizing data access using something like VDI). While async replication can be a part of a global file system, file replication systems like DFS-R would not technically qualify. Solutions like Dropbox/OneDrive have reduced the demand for this kind of solution.

Examples include: Hitachi HDI

Where do various VMware storage technologies fall within this?

VMFS – a Clustered file system, that specifically falls within the shared-disk file system. While powerful and one of the most deployed file systems in the enterprise datacenter, it was designed for use with larger files that are (With some exceptions) only accessed by a single host at a time. While support for higher numbers of files and smaller files has improved significantly over the years, general-purpose file shares are currently not a core design requirement for it.

vVols – Not a clustered file system. An abstraction layer for SAN volumes, or NFS shares. For block volumes (SAN) it leverages SUB-LUN units and directly mounts them to the hosts that need them.

VMFS-L – A non-clustered variant Used in vSAN prior to the 6.0 release. Also used for the ESXi installed volume. File system format is optimized for DAS. Optimization include aggressive caching with for the DAS use case, a stripped lockdown lock manager, and faster formats. You commonly see this used on boot devices today.

VDFS – vSAN Virtual Distributed File System. A Distributed file system that sits inside the hypervisor directly onto of vSAN objects providing the block back end. As a result, it can easily consume SPBM policies on a per-share basis. For anyone paying attention to the back end, you will notice that objects are automatically added and concatenated onto volumes when the maximum object size is reached (256GB). components behind these objects can be striped, or as a result of various reasons be automatically spanned and created across the cluster. It is currently exposed through protocol containers that export NFSv3 or NFSv4.1 as a part of vSAN file services. While VDFS does offer a namespace for NFSv4.1 one connections and handles redirection of share access, it does not currently globally redirect between disparate clusters, so it would not be considered a global file system.

Apr 6

1 Comment

vSAN ReadyNodes Additional Feature: vLCM Support

When picking out some new nodes for a VMware Cloud Foundation build-out, I noticed a new feature I could search for.

Here’s a quick explanation of what this new capability is, as well as some existing features:

vLCM Capable ReadyNode: This node is supported by the server OEM as being able to be patched by VMware LifeCycle Manager. This capability allows you to patch NICs, HBAs, Drives with new firmware and driver as well as update the BIOS. Currently, this includes servers from HPE Gen10 as well as select Dell 13 and 14 generation servers. For a quick demo of how vLCM can patch a host check out this video

SAS Expander: A typical SAS physical connection has 4 SAS channels. Most internal HBA and controllers only have 2 SAS physical connections and in a directly connected configuration only support 8 drives. SAS expanders switch the connection, allowing up to 254 per connection. The SAS expander must work tightly with the raid controller (both are often made by the same manufacturer) and firmware and driver versions for both must be kept in “sync” to prevent issues. SAS expanders also support SATA tunneling protocol that allows a SATA drive to emulate a SCSI device. For additional information on SAS expanders, see the vSAN design and sizing guide.

SSD/HDD Hotplug: Hotplug is the ability to add a device to a system while it is running. Useful for replacing failed devices, as well as expanding a ReadyNode without having to power off the host.

Intel VMD: VMD incorporated NVMe drives to have several modes for the drive’s amber LED such as on, off and flash to identify the NVMe drive. This allows device location for serviceability. VMD also enables hot-swap replacement without shutting down the system. The VMD device can intercept PCIe events due to hot plug and allow for safe, clean drive removal and re-insertion. With Intel VMD, servicing drives can be done online, minimizing service interruptions.

Apr 1

CIFS for VCF and vSAN?

The year was 2019 and at VMworld Barcelona, someone asked me “When will vSAN support CIFS?.” This is a question I get from time to time, and I responded the same as always.

“vSAN will NEVER support CIFS”

VMware Cloud Foundation 7, and vSAN 7 now offer native file services, starting with NFS v3 and NFS v4.1 as the first file protocols. Why was NFS chosen first? Why not CIFS?

A historical detour into what is CIFS…

CIFS (Common Internet File System) was effectively a Microsoft extension of IBM’s SMBv1 from around the time of Appletalk rising and falling in usage. It had some issues:

Despite “internet” being in the name, it is in a tie for last with NetBIOS for things you don’t want on the public internet.
There were lots of weird proprietary extensions, unstable client implementations.
Security is baaaaad (and not getting fixed). Us CERT says to stop using it.
After Microsoft deprecated it usage has plummeted to ancient legacy devices. Devices like that 15-year-old copier/scanner want to bash with a hammer, and that Windows XP machine that controls the HVAC system everyone forgot about.

Due to the opportunity for downgrade attacks from SMB2, Microsoft pushed out a service to disable it automatically. This effectively ended its era, and new versions of windows lack the binaries to use it (only in place upgrades still had it around).

Yes, that’s a service that exists to automatically remove a service. There’s got to be a better name for this?

“But John, xxxx vendor stills calls windows file shares CIFS”

I actually asked that vendor, why they call it a CIFS gateway and was told: “we have a few large customers, who haven’t updated their RFP templates from when new coke was still a thing…”

“John, will you stop pedantically correcting everyone who says CIFS, surely they mean SMB?”

The owner of the protocol Ned Pyle at Microsoft actually gets even more annoyed than me by people calling it CIFS.

Please stop saying CIFS when you mean SMB. If you don't like it, too bad; I'm the owner. I'll take my protocol and go home.
— Ned Pyle (@NerdPyle) December 10, 2016

What about SMB3.x

While SMB 3.x share still exists and holds lots of departmental shares, and roaming profiles, and various “junk drawers” of forgotten files, this is not a super exciting high growth right now. Sync and Share products (OneDrive/Dropbox) are for many shops slurping up a lot of this use case for unstructured data that needs to accessed by windows clients. It is worth noting that even the best 3rd party implementations of SMB3.x tend to cut corners on the full Microsoft server implementation, and many features associated with a windows file server (FSRM reporting and file screens, quotas, NTFS ACLs) are not actually a part of SMB and something that has to be implemented in the backing file system or emulated etc. Don’t worry, VMware is still looking at SMB 3.x support, but first, it’s time to address why NFS…

The better question: Why start with NFS?

When picking what protocols vSAN would support first, it was critical to look at what is driving new file share use cases in the data center, and specifically what are the file needs for Kubernetes developers. The goal of vSphere 7 with Kubernetes is to make VMware Cloud Foundation the premier platform for cloud-native application development and deployment. The existing ReadWriteOnce support delivered in vSAN 6.7U3 helps automate block workloads to containers using the CSI, but for applications that need ReadWriteMany volumes, a non-block shared file system option was needed.

NFSv3 strengths

In addition to the Kubernetes use case, there are a number of various infrastructure related use cases for NFS, ranging from a vCenter backup target, to content catalog, archive target, and repository share. NFSv3 especially does well with these use cases, as it’s simple, and the protocol has seen little interop issues in the over 20 years since it was ratified as a RFC. In general, it has aged a bit like a fine wine (as opposed to CIFS which as aged like milk sitting in the sun).

*I’m honestly not a cheese guy, but this is what I assume CIFS would look like as cheese*

NFSv4.1 – Back to the Future

One of the considerations, with file servers as an extension of VMware Cloud Foundation based HCI, is making sure that:

Performance scales linearly with the nodes
Consumption is cloud-like and can be easily automated

A critical feature that NFSv4.1 includes, that v3 does not, is the ability to use a virtual namespace across multiple file servers, and seamlessly redirect connections to the right one every time, without having to make the consumer of NFS go look anything up. I go into what this looks like in this blog a bit. As well as the following video.

So what’s the future of vSAN File Services?

While vSAN file services deliver a great experience for cloud-native services and infrastructure shares, it will continue to evolve to meet the needs of more and more applications and users as time goes on. The unique auto-scaling container structure can support adding additional containers to speak different protocols. Lastly, the unique hypervisor integrated IO path opens up some interesting future possibilities to extend VMware Cloud Foundation’s lead as the leading application platform.

Mar 19

2 Comments

vSAN File Services – How to find the connection URL

vSAN File Services adds a critical service to VMware Cloud Foundation. Layering a distributed protocol access layer on top of vSANs existing shared nothing distributed object store that can serve the NFS needs of Kubernetes as well as traditional services.

New shares setup in vSAN file services are balanced across the cluster. To find the IP address of which container you should connect to for a given share the interface offers this information for NFSv3.

Note NFSv4 is different. a NFSv4 referral enables a multi-server namespace to exist and seamlessly handle redirecting the client to the server that hosts a given directory or share. While it may appear as one namespace, All IO will not have to hairpin through the container owning the primary IP. Similar to iSCSI login Redirects, this simplifies setup, avoids the need for the client to attempt to connect to every node in the cluster.

What does this look like in the interface? This short 1 minute video may help:

VMware Cloud Foundation 4 is a powerful virtual machine and container platform. vSAN file services is critical to meeting the needs of modern applications and container workloads.

If your looking for more information on NFS Redirection the following may be useful:

The RFC – https://tools.ietf.org/html/rfc7530

Mar 13

vSAN File Services – A First Look at File Services for VCF

vSAN File Services adds a critical service to VMware Cloud Foundation. Layering a distributed protocol access layer on top of vSANs existing shared nothing distributed object store. Simplify operations and remove complexity by integrating File workloads into the infrastructure itself. For more information on how to use this with Kubernetes check out Myles Grey’s blog on the CSI integration.

I wanted to capture a few operational tasks withing vSAN File Services to show just how simple it is to setup and manage.

Setting up vSAN File Services

First off setting up vSAN is easy. You will need to setup 3-8 of the containers used for file services. each of these will need:

A unique IP address
DNS (forward and reverse)
subnet and gateway settings

In addition for the cluster you will need:

File Services Domain – This is a unique namespace for the cluster that will be used across the shares.
DNS servers – You can add multiple DNS servers for redundancy.
DNS Suffix – Note you can add multiple of these also.

Configuring a share on vSAN File Services

For each share you will need to configure:

A share name – This will be the path set after the namespace in NFSv4 or directly off the shares primary IP in NFSv3.
A Storage policy – Adjusting the RAID level here can help optimize capacity or resilience.
Share Warning threshold – This is a soft quota that will generate a vSAN health alarm when reached.
Share Hard quota – This is a hard quota. At the point this is reached writes will fail until data is deleted or this quota is raised.
Labels – These can be useful for categorizing a share (adding what department) data classification (compliance or security level) or other organizational methods. These can also be auto-generated when being created by Kubernetes making it easier to identify the MongoDB share the developers are having issues with etc.
Network access controls – These are IP based access control lists for who has read or write access to the shares (along with the usual root_squash ability for systems that need it and can only connect as root, but do be aware of the security implications of using this).

Upgrading vSAN File Services

This one is pretty Simple. vSAN will phone home and look for new versions of the file services OVF package. If the environment lacks internet connectivity from the vCenter a proxy can be used, or the file can be manually downloaded and uploaded to the vCenter.

File Services Upgrade

If you have any other vSAN File Services questions be sure to checkout the FAQ or ask on twitter @Lost_signal.

Jan 3

3 Comments

VMware vSphere Reliable Memory – A few thoughts

According to a study by Google, The annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM. This rate rises to 1.7–2.3% after seeing corrected errors. Hard errors are caused by physical factors, such as excessive temperature variation, voltage stress, or physical stress brought upon the memory bits. Soft errors are random bit flips, typically associated with alpha particle radiation, solar winds and are correctable.

As the number of DIMMs and density of them increases, I suspect this only gets worse and rapidly approaches 100% if I have something important to work on.

Odds of me seeing this increase to 100% the closer I am to recording a Demo

Now, what happens with a VMware host, when the CPU detects unrecoverable errors for memory? This depends on who gets the bad bit:

-VMkernel: Crash (i.e. PSOD) the ESX host unless the kernel is within MCE Safe context.

VMM: Kill the VM. (The Virtual Machine should restart).

User space: Kill the user world (Most processes can be restarted).

Now, what if we want some protection? It’s worth noting that using ECC memory provides some basic protection (A single bit randomly flipping), and more importantly, provides awareness against larger protections through active scrubbing (So we don’t commit corrupt data to disk). If we want to mitigate larger failures (Such as an entire memory device on a DIMM or a DIMM itself) we need to look for more advanced protection methods.

Memory Mirroring: This is pretty simple and fairly expensive. This involves mirroring all DIMMs, so that in the event of a DIMM failure the server will keep on running. This is only outmatched by more extreme triple-redundant quorum/voting systems used on spaceflight computers. This is only considered for mission-critical systems in extremely difficult to reach places (Submarine, diamond mine, etc).

Single Device Data Correction (SDDC) – Out of the normal 18 memory devices on a DIMM you kee 1 device for CRC and 1 device for parity. If one if the devices fails, its data can be reconstructed. This is called single-device data correction (SDDC). Think of this a bit like RAID 4 (dedicated parity device) with checksums stored on a dedicated device also rather than with the block of data. Note a +1 option, effectively keeps a “hot spare” device so that after a failure is mitigated, you can support another failure. For Intel The Silver/Bronze SKUs offer an adaptive variant called Adaptive Data Correction (ADC), at Bank granularity.

Double Device Data Correction (DDDC) – This is where things start to get fancy and weird. By combining two 4x DIMMs into the same memory channel you can run a double parity scheme across both devices. This comes with performance impacts (Memory throughput seems to be the main issue). This doesn’t seem to be recommended for high throughput applications (HPC).

Adaptive Double DRAM Device Correction (ADDDC) – New with the Intel Scalable series processors (2017), this enables the ability to avoid the performance pre-failure that the DDDC design normally imposes. Note this feature doesn’t work with 8x DIMM layouts (smaller 8 and 16GB DIMMs from what I’ve found). For Intel, the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC).

Other weird OEM options – You will find stuff like hot spare DIMMs, and exotic additional bit ECC, how often scrubbing is performed etc. Be careful with this stuff and talk to your OEM about the expected performance impact.

Address Range Partial Memory Mirroring – This is an intel specific technology with a bit of variety on the implementation depending on the OEM. Unlike DIMM mirroring (which is transparent) this requires a OS –> Firmware interface for the OS to be aware of. In this case this is what the vSphere reliable memory feature enables. How this works under the hood is kernel processes flagged for usage of this will use this memory and be protected up to and including a full DIMM failure. This feature requires Intel Xeon Platinum and Gold processors SKUs.

Let’s see what this looks like in VMware!

You can look up how much memory is considered reliable by using the ESXCLI hardware memory get command.

Before turning on feature:

[root@h2:~] esxcli hardware memory get
   Physical Memory: 549657530368 Bytes
   Reliable Memory: 0 Bytes
   NUMA Node Count: 2

After turning it on:

[root@h2:~] esxcli hardware memory get
   Physical Memory: 480938061824 Bytes
   Reliable Memory: 68619579392 Bytes
   NUMA Node Count: 2

On boot I can see that a bit of DRAM has been borrowed for “reliable memory” (54GB). Given this is 1/8th of the memory in the host that is a trade off (12.5%) but it’s is something for mission critical applications that might be worth considering. Digging around 12.5% is normal for a 13Gen server. Note this memory overhead comes 100% out of the first NUMA node (ESXTOP will confirm this).

It’s worth noting that more than the kernel can use this feature. Virutal Machines can be configured for it by following KB2146595 and using the VMX flag sched.mem.reliable = "True"

If you reserve 30% or more of a CPU you could trip an alarm called “Significant imbalance between NUMA nodes detected”.

How much should be reserved? The guidance in 5.5 was for at least 3GB, but if virtual machines are using this, or extensive services are being used ona host more may be needed.

There is a bit of variety in the OEM implementations.

Dell – The servers I tested this with by default grab either 12.5% or 25% of my hosts memeory depending on if I use the fault resilient mode, or the NUMA fault resilient mode.

HPE – offers overs 4GB reservation or 10% or 20% of memory above 4 GB.

Lenovo – Offers Mirroring of 4GB.

Fujitsu – Supports 4GB + %. Can be defined in UEFI (which makes me think other OEM’s might have unsuppoprted hacks to alter this in UEFI basd on Intel’s documentation).

SuperMicro – I found no mention of partial memory mirroring in their RAS guide. Note their RAS guide is a great read on the other parity based protections.

Should I configure reliable memory?

You fundamentally have to ask yourself a few questions:

What am I paying for RAM, and what is the overhead going to be? In the case of the Dell functionality I tested it appears the BIOS is reserving 64GB. Looking at 3rd party memory prices this is going to run me in the us about $439. Looking at the spot price of memory recently it seems DRAM pricing is hitting new lows. Maybe it is worth sparing some to increase residency for mission critical clusters?
What is my tolerance for a host failing from a bad DIMM? For this you need to look at your estimated DIMM failure rate, and consider the odds that something important in kernel space is crashed by the DIMM failure (most user world things and virtual machines will reboot if they hit a non-recoverable memory error). If this is Test/Dev I might not care, if I’m running an Oracle RAC cluster that backs the ERP for a fortune 50 company I might have more sensitivity.
Is HA and/or application clustering “good enough protection?
Should I extend this to Virtual Machines? The VMX flag allows configuration of virtual machines to attempt to fit into the reliable memory space. I’m not sure what happens when the host is given 64GB of reliable memory and I try configuring a 100GB reliable memory VM (More testing needed).
Is 4GB good enough? Some platforms (HPE) offer the ability to configure 4GB or 4GB + xx%. By lowering what is protected (but lowering the cost overhead) a blend of risk mitigation and cost control may be “good enough” for many.
Would I rather mirror a virtual machine between two hosts (SMP-FT) and just pay the extra overhead?
Is there a particle accelerator or evil supervillain lab next door? If the server will be operating near a major source of alpha particle radiation it may be worth considering full mirroring (or shielding the server!)

Nov 14

Improving NIC and switch performance for vSAN (and other IP storage)

This is going to be a short post collecting a few tricks to unlock some bottlenecks in storage networking that may grow over time:

Unfortunetly a lot of troubeshooting of networking performance stops earlier than it should. Two common incomplete troubleshooting workflows I’ve seen:

Someone checks that network utilization on a host isn’t near the link speed and says “network not the bottleneck”.
Someone calls the networking team and they look at the switchports utilization based on SNMP polling, or do a quick “Show interface” and don’t see obvious port errors (CRC, drops, giants etc). They proudly close the ticket as “Switches are fine!”

Buffer Configuration Considerations

In one of my labs where we have a Nexus 9000 series switch we found performance was looking a bit limited. Seeing higher than expected re-ransmits we dug into it deeper. We dug deeper into buffer utilization. Discovering that the default mesh configuration was limiting buffer access to 500 KB per port, we adjusted the buffers using the qos ns-buffer-profile ultra-burst command. This signifigantly opened up performance, reducing TCP incast inssues (which cause retransmits) and brought performance more in line with what we would expect for the cluster. For anyone looking for more inormation on this command (and how to look at buffers) see the QoS Guide. Note for solving buffer contention different switches will have different options for configuring buffers, priortizing which flows to drop first, and allocating buffer to ports. In other cases it may be simpler to just buy switches with deeper buffers to begin with. Rather than trying to chop apart a 12MB-40MB buffer, simply purchasing a switch with an 8GB buffer can avoid a lot of the need for buffer management consideration.

I”ve been asked about the HPE 5950 series switch. Digging into the CoS/QoS guide I found a few things:

You can detect how often you exceed a buffer with the display buffer usage interface command.

<switchname> Display buffer usage interface hundredgige 1/0/1

This command will be more useful than the “display buffer usage” as that only tracks usage over a 5 second rolling outage, vs the violation tracker that the interface counter can track (Which will detect very short microbursts that may be causing buffer full and retransmits conditions nad latency). Note the default buffer threashold is 70%.

burst-mode enable appears to be a similar command to the ultra-burst buffer configuration and is recomended for cases that include “Traffic enters a device from multiple same-rate interfaces and goes out of an interface with the same rate.” Given this scenario is exactly what we would see from TCP incast (multiple vSAN hosts trying to talk to the same host and filling a buffer), this is likely something you would want to turn on. As I don’t have one of these switches in my lab, I’d love any feedback anyone has from trying this command. If anyone from HPE Networking is reading this, feel free to reach out.

TCP dispatch queues tuning

In another example in the lab, a test of raw throughput was coming up short. A review of the back end disk groups showed a lack of congestion (latency was low, write cache fill rate was low). A review of network utilization showed only 30% utilization on the link speed, but high latency (20ms+ between the nodes).

Investigation showed that the throughput was bottlenecking on a single threaded TCP process (CPU for attached world at 100%). By raising the TCP RX queues from the default of 1 to 4, this eleminated this as a bottleneck and returned performance to expected levels.

Steps to set this are:

Set the advanced setting on the host

$ esxcfg-advcfg -s 4 /Net/TcpipRxDispatchQueues

Or for PowerCLI:

Get-AdvancedSetting -Entity <esxi host> -Name Net.TcpipRxDispatchQueues | Set-AdvancedSetting -Value ‘4’

Reboot the host once this is set.

To validate this setting:

$ esxcfg-advcfg -g /Net/TcpipRxDispatchQueues

It’s worth noting that Niel’s blog on vMotion tuning reported higher throughput per stream than I saw (His blog reports 15Gbps per stream). This may be a result of my lab hosts using inexpensive cheaper Intel 5xx series NICs that lack the advanced offloads that the Intel 700 or 800 series cards have. Mellanox CX Series cards also have similar capabilities. Without these offloads, more CPU is needed to push the same throughput and this would compound together to bring the performance ceiling even lower with the cheaper NICs.

Summery

For anyone seeing bottlenecks on lower-cost NIC’s, or wanting to push more than 15Gbps per host of vSAN traffic, keep an eye on this setting, and talk to GSS if you are concerned this default may be causing a bottleneck. For new hosts I’d strongly consider smarter NIC’s that contain hardware LRO/TSO, RSS VxLAN GENEVE offload capabilities and make sure that your driver and firmware are both up to date. Note in a future release this default may change.

If you have any feedback on these commands (or questions on other commands or switches!) reach out to me on twitter @Lost_Signal

Virtual Ramblings

Musing of a Storage guy in a Virtual World..