The Problem with 10Gbps

So it's time to stand up your new VMware cluster. You have reviewed your compute and storage requirements, and have picked hosts with 1-2TB of RAM, 100-300TB of storage, and 32-core x 2-socket CPUs, and you are ready to begin that important consolidation project. You will be consolidating 3:1 from older hosts, and before you deploy you get one additional requirement.

Networking Team: “We can only provision 2 x 10Gbps to each host”

You ask why, and get a number of $REASONS.

  • Looking at average utilization for the month it was below 10Gbps.
  • 25G/100Gbps cables and optics sound expensive.
  • Faster speeds seem unnatural and scary.
  • Networking speed is a luxury for people who have Tigers on gold leashes, and we needed to save money somewhere.
  • There is no benefit to operations.
  • We are not due to replace our top of rack switches until 2034.

Now all of these are bad reasons, but we will walk through them starting with the first one today.

What is the impact of slow networking on my host?

Now you may think that slow networking is a storage team problem, but undersized networking can impact a lot of different things. Other issues to expect from undersized networking:

1. Slower vMotions, higher stun times, and longer host evacuations. As you stuff more and more bandwidth-intensive traffic onto the same link, contention during host evacuations grows. This impacts maintenance mode operations and data resynchronization times.

2. Slow backup and restore. While backups may be slower, we can somewhat cheat slow networking using CBT (Changed Block Tracking) and only doing forever-incremental backups. Slow large data restore operations are the biggest concern for undersized networking. After a large-scale failure or ransomware attack, you may discover that rehydrating large amounts of data over 10Gbps is a lot slower than over 100Gbps. There is always a bottleneck in backup and restore speed, but the network is generally the cheapest resource to fix. You can try to mitigate this with scale-out backup repositories, more data movers/proxies, and more hosts and SAN ports, but in the end this is far less cost effective than upgrading the network to 25/50/100Gbps.

3. Slower networking for storage manifests itself as worse storage performance, specifically on large throughput operations, but also during short microbursts where latency will creep up. Keep in mind that 10Gbps sounds like a lot, but that is *per second*. If you are trying to get a large block of data in under 5ms, a single port can only move 6.25MB in that window (see the quick math after this list). As we try to pull average latencies down lower, we need to be cognizant of what that link speed means for burst requests. Overtaxed network storage will often mask the true peak demand as back pressure and latency creep in. Pete has a great blog on this topic.

4. Slower large batch operations. Migrations, database transform-and-load operations, and other batch jobs are often bandwidth constrained. You the operator may just see this as a 1-2 minute "blip", but turning a 1-2 minute response in an end-user application into a 10-20 second response can significantly improve the user experience of your application.

5. Tail latency. Applications with complicated chains of requests are often fundamentally bound by the one outlier in response times. Faster networking reduces the chance of contention somewhere in that 14-layer micro-service application the devops team has built.

6. Limitations on storage density. For HCI or any scale-out storage system you will want adequate network bandwidth to handle node failure gracefully. vSAN has a number of tricks to reduce this impact (ESA compresses network resync traffic, durability components), but at the end of the day you will not want 300TB in a vSAN/Ceph/Gluster/MinIO node on a 10Gbps connection. The insidious feedback loop of slow networking is that it forces expensive design decisions (lower density hosts, and more of them) that often mask the need for faster networking. Even non-scale-out platforms will eventually hit walls on density. A monolithic storage array can scale to a lot more density and run wider fan-out ratios using 100Gbps Ethernet than 10Gbps Ethernet.
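
To make that 5ms burst window concrete, here is the arithmetic as a quick shell one-liner (nothing VMware-specific, just bits-to-bytes conversion):

# 10Gbps is 10,000,000,000 bits/s; divide by 8 for bytes/s, then take a 5ms slice
echo $(( 10 * 1000 * 1000 * 1000 / 8 * 5 / 1000 )) bytes
# prints 6250000, i.e. roughly 6.25MB available inside a 5ms window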

Let us dig into the first and most common objection to upgrading the network.

“Looking at average utilization for the month it was below 10Gbps”

How do we as architects respond to this statement?

Networks are bursty. That is my short response. Pete Koehler calls this "the curse of averages". Most of the tooling people use to make this statement is SNMP monitoring tooling that polls every few minutes. This approach is fine for slowly changing things like temperature, or binary health events like "is the power supply dead?". Unfortunately for networking, a packet buffer can fill up and cause back pressure and congestion in as little as 100ms, and SNMP polling every 5 minutes is not going to catch that. Inversely, context around WHEN a network is saturated is important. If the network is saturated in the middle of the night when backups, database maintenance, or ETL jobs run, I might not actually care. Using an average, with a poor sampling frequency, across times when I both do and do not care about congestion, is about the worst possible way to make a design decision.
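
To see the curse of averages in numbers, consider a link that is completely saturated for 3 seconds of a 300-second SNMP poll window, or hit by a single 100ms line-rate microburst (made-up but representative numbers):

# 3 seconds of 100% utilization inside a 300s polling window
awk 'BEGIN { print 3/300*100 "% average" }'
# one 100ms line-rate microburst inside the same window
awk 'BEGIN { print 0.1/300*100 "% average" }'
# prints 1% and ~0.033%: both graphs look idle, both can be real congestion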


There are ways to understand congestion and its impacts. You may notice that outliers in storage latency polling correspond with high network utilization at the same time. You can also get smarter about monitoring and have switches deliver syslog information about buffer exhaustion to your operations tool, and overlay this with other metrics like high CPU usage or high storage latency to understand the impact of slow, undersized networking. (Screenshot of Log Insight generating an alarm.)

Why is observability on networking often bad?

Operations teams are often a lot more blind to networking limitations than they realize. It is true this tooling will never be perfect, as there are real challenges in getting to 100% complete network monitoring.


Why not Just SNMP poll every 100ms?

The more frequent the polling, the more likely the monitoring itself starts to create overhead that impacts the networking devices or hosts themselves. Anyone who has turned on debug logging on a switch and crashed it should understand this. Modern efforts to reduce this overhead (dedicated ASIC functions for observability, separation of observability from the data plane in switches) do exist. It is worth noting vSAN has a network diagnostic mode that polls down to 1-second granularity, which is pretty good for acute troubleshooting.

Can we just monitor links smarter?

Physical fiber taps that sit in line and sniff/process the size/shape/function/latency of every packet do exist. Virtual Instruments was a company that did this. People who worked there told me "storage arrays and networks lie a lot", but the cost of deploying fiber taps and dedicated monitoring appliances per rack often exceeds just throwing more merchant silicon at the problem and upgrading the network to 100Gbps.

What tooling exists today?

Event-driven tooling is often going to be the best way to detect network saturation. Newer ASICs and APIs exist, and even simply having the switch shoot a syslog event when congestion is happening can help you overlay networking problems with application issues. VMware Cloud Foundation's built-in log analytics tooling can help with this, and can overlay the VCF Operations performance graphs to get a better understanding of when the network is causing issues.

Can we Just squeeze traffic down the 10Gbps better?

A few attempts have been made to "make 10Gbps work". The reality is I have seen hosts that could deliver 120K IOPS of storage performance crippled down to 30K IOPS because of slow networking. Still, let us review the ways people try to make 10Gbps better…

Clever QoS to make slower networks viable?

CoS/DSCP were commonly used in the past to protect voice traffic over LANs or MPLS, and while they do exist in the datacenter, most customers rarely use them at top of rack. Segmenting traffic per VLAN, making sure you don't discover bugs in implementations, and making sure tags are honored end to end is a lot of operational work. While the vDS supports this, and people may configure it on a per port group basis for storage, NIOC traffic shaping is generally about as far as most people operationally want to go down this path.

Smarter Switch ASICs


Clever buffer management: "elephant traps" (dropping large packets to speed up smaller mice packets) and shared buffer management often worked to prevent one bursty flow, or one large packet, from hogging all the resources. This was common on some of the earlier Nexus switches, and I'm sure it was great if you had a mix of real-time voice and buffered streaming video on your switch, but it is frankly highly problematic for storage flows that NEED to arrive in order.

Deeper Buffer Switches?

The other side of this coin was moving from switch ASICs with 12 or 32MB of buffer to multi-GB buffers. These "ultra deep buffer switches" could help mitigate some port overruns and reduce the need for drops. VMware and others advocated for them for storage traffic and vSAN. With 10Gbps, moving from the lower end Trident to the higher end Jericho ASICs did show much better handling of micro-bursts and even sustained workloads; TCP incast was mitigated. As 25Gbps came out, though, we saw only a few niche switches configured this way, and their pricing was frankly so close to 100Gbps that just deploying a faster pipe from point A to point B has proven more cost effective than trying to put a bigger bucket under the leak in the roof.

What does faster networking cost?

While some of us may remember 100Gbps ports costing $1000+ a port, networking has gotten a lot cheaper. The same commodity ASICs (Trident 3, Jericho, Tomahawk) power the most common top of rack leaf and spine switches in the datacenter today. Interestingly enough, you can now even buy your hardware from one vendor and your switch OS or SDN management overlay from another, thanks to SONiC.

While vendors will try to charge large amounts for branded optics, all-in-one optical cables (AOCs) and passive TwinAx copper cables can often be purchased for $15-100 depending on length and temperature tolerance requirements. These cables remove the need to purchase a separate optic, and reduce issues with dust and port errors by being "welded shut" against the SFP28/QSFP transceiver.

Passive TwinAx or all-in-one optical cables are not that expensive. Pictured is a cheap passive TwinAx cable; for longer runs you will want to consider all-in-one optical. (Image from fs.com.)

$15 – $30 for 25Gbps passive cables

TINA – There Is No Alternative (to faster networking)

The future is increasingly moving core datacenter performance-intensive workloads to 100Gbps, with 25Gbps for smaller stacks (and 50Gbps possibly even replacing that soon). The cost economics are shifting, and the various tricks to squeeze more out of 10Gbps feel a bit like squeezing a single lemon to try to make 10 gallons of lemonade: "the juice isn't worth the squeeze." While many of the above problems of slow networking can be mitigated with more hosts, lower performance expectations, and longer operational windows, eventually it becomes clear that upgrading the network is more cost effective than throwing server hardware and time at a bad network.

vSAN 7 Update 1 What (Else) is new – Networking

I figured I’d cover in a blog some of the less obvious changes in vSAN 7 Update 1.

Simplified Layer 3 – vSAN has supported layer 3 (hosts within a cluster being on different subnets) since the early days. This is a popular topology when using stretched clusters and 2-node configurations. vSAN VMkernel ports share the same gateway setting specified for the management network. As the vSAN network is (ideally) on a completely different subnet, this meant a static route had to be set on each host. To simplify alternative gateway configuration, the vCenter Server UI now supports overriding the default gateway for a VMkernel port. ESXCLI or PowerCLI can still configure a gateway (there is even now an ESXCLI -g flag to set a default gateway).
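
For the command-line route, here is a minimal sketch (the interface name, addresses, and gateway are placeholders from a lab, not defaults):

# Old approach: a per-host static route toward the remote vSAN subnet
esxcli network ip route ipv4 add -n 192.168.60.0/24 -g 192.168.50.1
# Newer approach: override the default gateway directly on the vSAN VMkernel port
esxcli network ip interface ipv4 set -i vmk2 -t static -I 192.168.50.11 -N 255.255.255.0 -g 192.168.50.1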

Data-In-Transit encryption – historically, storage transport security focused on restricting access to the storage networks (dedicated VLANs for Ethernet, or hard zoning for Fibre Channel) or limited authentication and access filtering (NFS IP ACLs, IQN filtering, CHAP, soft zoning). If an adversary could capture the frames in transit on the storage network, none of these technologies (or even data at rest encryption) protected you from data exfiltration. To address this, vSAN now supports data-in-transit encryption. This leverages the FIPS 140-2 validated cryptographic modules to encrypt vSAN network traffic in flight, and allows custom rekey windows (the default is 1 day). No KMS is required for this solution to be deployed, and this feature complements other VMware in-flight encryption technology (encrypted vMotion, encrypted HCX/NSX tunnels, etc.) so you can now encrypt all the things.

Data-In-Transit Encryption is a single click to enable

General Performance and monitoring improvements

As customers move to 25Gbps and 100Gbps switching, further optimizations have been made to the networking stack to increase the parallelization of the CPU threads used for networking transport, improve how that parallelization is balanced, and reduce overall CPU consumption per thread. These benefits will be most pronounced with RAID 5/6 usage and multiple disk groups.

Network monitoring improvements have also been made to the vSAN network health checks. This will result in faster, more accurate automated network testing.

How to succeed as a professional tech podcast

Pete Flecha and I co-host the Virtually Speaking Podcast. We both get emails, calls, and texts from various people wanting to start a tech podcast. I figured I'd sort some of the most common advice into a single blog, with maybe some follow-up blogs for more in-depth gear reviews etc. Note: there are some strong opinions here that I hold loosely, and I look forward to the twitter banter that will follow.

Who

Because AI machine learning hasn’t figured out how to podcast yet, you are going to need to sort out who will come on the podcast.

Will you have a single host, a pair of hosts, a panel?

First, how many hosts you have is going to be partly a function of how long the podcast runs. It's hard to have a 6-minute podcast with a 5-person panel. Inversely, the more people involved in a podcast, the more hell you face trying to align the Gantt chart that is everyone's schedules and time zone requirements.

The vSpeaking method for hosts? We generally stick to 2 hosts and 2 guests, with a guest host filling in when one of us is traveling or in a conference session. Another benefit of the occasional guest host is they can help interview on a topic the regular hosts are still getting up to speed on. An example of this is when we tagged in Myles to help discuss containers and other devops hipster things with Chad Sakac.

What makes a good host?

This is going to be a bit of a philosophical discussion, as a lot of things can make for a good host, but here are some general thoughts on traits that help with success:

  1. Someone needs to push the show forward, keep it on task and time and be a good showrunner.
  2. Someone needs to know enough about the topic to discuss it, or ask the guests about it. Someone who’s willing to do the research on the product and field more than 5 minutes beforehand.
  3. Someone skilled and willing to do the editing.
  4. Someone who can handle publicizing the podcast. (The vSpeaking Podcast doesn’t pay for promoted tweets or google AdWords).
  5. People (if plural hosts) who are good at reading a conversation and knowing when to jump in and out.
  6. Flexible schedules. I've recorded at all kinds of weird hours to support overseas guests, and at weird hours while traveling to get US guests. Remember, the world does not revolve around the Pacific time zone.

Inversely there are some behaviors or environmental variables for hosts that may inhibit success.

Some traits can be problematic in hosts:

  1. People who lack commitment or “air cover from management” to stay committed beyond a few episodes. If your manager is going to be asking “what is the value in this” it may be problematic to maintain this as a work hours activity.
  2. A host who likes to hear the sound of their own voice. Remember it’s a podcast, not a monologue.
  3. A host who thinks they know more about the subject than they do. This will manifest itself in awkward questions, disagreements, and the host talking too damn much.
  4. Hosts with zero understanding of the technology. If it’s just a product marketing intern reading a FAQ, no one will listen to it (if you are lucky!)

Will there be guests on every episode?

Having a constantly rotating cast of guests provides a lot of benefits, as it helps open up the topics and information you bring in.

There is a class of podcasts that tends to diverge from this: the "news of the week" format. A solid example would be something like the Six Five. I'll caution that there are a LOT of podcasts in this camp, and even executed well it has some challenges. It may sound easier in that you do not have to schedule guests, but done properly it is a more time-consuming format, as it requires you to be the subject matter expert on everything. It is best executed in niche fields (rather than the entire tech industry), and by seasoned veterans who can provide more than surface commentary and repeating what TheRegister wrote. Expect to do twice as much research when you have to understand both the question AND the answer.

What

What topics will you cover?

This can be a difficult one. Creating 100 episodes on JUST Microsoft Exchange would be difficult, exhausting, and likely result in most people (even full-time Exchange admins) losing interest. There are ways to accomplish your goal of promoting a given product without making it the front-and-center focus of EVERY single episode. Find common adjacent products, products that integrate, and fundamental technologies that every admin using said product needs to know.

The vSpeaking Podcast method: While I wouldn't mix in an episode on woodworking, we've found that some back-to-basics episodes, topical episodes, and partner episodes help keep things fresh. A major commitment we made from the beginning was to have the majority of our content be "evergreen". The goal being that if someone discovers our podcast in 2020, they would find interest in listening to a lot of the back catalog. While there are some episodes that become dated quickly (event- or launch-specific episodes), even these might have elements or stories within them that resonate years later.

What questions do we ask?

Note: to execute this well you need to do research on the guest (stalk them on twitter, blogs, recent press releases, LinkedIn) and try to come up with questions that people in their field would want answered. DO NOT ASK THE GUEST TO WRITE THE QUESTIONS. It comes off as disrespectful (you don't care enough to learn enough to ask an educated question). It is OK to ask if there are any recent announcements or themes they might want to cover, but do not make the guest do all the work here.

Critical elements of a good question are not just what you ask, but what you do not ask.

Avoid leading questions

"Chad, it looks like customers really like VCF because it provides enterprise reliability, can you explain why this is the most important reason to buy it?" This is a terrrrrible question. If the question involves a run-on sentence, assumes the host knows more about the topic than the guest, and narrows the response to "Yes, Mr. Host, you sure are right and smart", you are "doing it wrong!" Instead, ask a less controlled question like "What are some common customer conversations you are having about VCF?" Sometimes you need to channel your inner Larry King and ask your questions like you know nothing. Starting with a level-set question, and then asking deeper follow-up questions off of it, is a better method to let the guest explain things and bring your audience with you, rather than jumping to a conclusion. A podcast series that interviews the greatest interviewers of all time is worth a listen.

How

The Gear

Gear is partly a function of the environment you record in. A $100 Shure SM58 works in a professional recording studio, but in a small office it may pick up sound bouncing off the hard walls. I've had the misfortune of living near firehouses. Oddly enough, "podcasting booths" or rooms in some office buildings and conferences often have some of the worst acoustics ever created (small rooms with hard glass surfaces are terrible, bouncing audio everywhere). Look for a blog series on "good/better/best" gear for audio.

For now I’ll start with what not to use as it’s likely easier.

USB condenser microphones – The most popular examples are the Blue Yeti/Snowball etc. Condensers are popular in studios (they can pull a lot of sound in). The challenge with using these for office or home recording is you tend to end up "recording the room" (bounce off walls), and these microphones are aggressive at picking up background noise (air conditioning, fans, etc. are easily picked up). You can do a lot worse than these (and we'll get to that in a minute), but for serious audio recording in a less-than-ideal environment, be prepared to put sound dampening on the walls, carpet the room, and turn off the air conditioning and fans. A downside of some of these USB mics is they often will not work with regular accessories (amplifiers, mixers, arms, custom-sized foam), so you end up in a proprietary ecosystem around them.

Anything with a phone number – The public phone system tends to drop the quality of everything down to a common codec, G.711. This codec from 1972 is a “narrowband” codec and is (along with a host of other things) part of the reason why business travel exists. People don’t listen to podcasts by dialing into a voicemail, and you should want your podcast to sound better than their voicemail.

What do I use?

I'm a fan of XLR-based microphones. They make any investment in the ecosystem reusable later.

Microphone – Heil PR40. It's dynamic and end-address (meaning it only records in a narrow angle), and that's actually good for me. I don't have a professional studio, the nursery is one room over, and fire trucks and barking dogs come past my office all the time. Note: this is an XLR microphone, so I'll need something to convert it to a digital signal. Pete uses the same microphone, and that helps us get a similar sound after we match volume levels at the start of the podcast.

Input/Digitizer – For now I primarily use a Blue Icicle that is directly connected (no XLR cable used). I had some issues with the XLR cable built into my cheap arm mount, and found that this setup avoided the need for anything to amplify the signal, as it goes straight from the microphone to a digital signal. I'm still figuring out cable management for the USB cable. I also own a Shure digitizer that costs twice as much, but it was way too easy to bump its gain knob. The Blue's knob requires some torque to turn, which means once you get it set you can largely ignore it.

Other things to go with the mic – I have the Heil foam windscreen to cut popping noises, and the Heil-branded shock mount (it prevents noise from when I bump the table while recording). I have an arm that is screwed into my desk (something cheap). If you are going to have a long XLR run, a quiet mic may need a pre-amp (a CloudLifter, for example) to boost the signal. There's no need to buy a mixer/board unless you are going to be blending multiple inputs (I'm not a DJ!).

In Person

For the road – When traveling to conferences we use tabletop stands and a Zoom H6 recorder. While it can act as a stand-alone recorder, we normally feed it into a Mac over USB into Audio Hijack and run some low-pass filters and other effects. It supports 4 XLR inputs with individual gain control, and it can act as a handheld recorder with an attached condenser microphone capsule. Other software like Loopback can be handy for sending sound from a soundboard into another output.

Remote recording

Over the years we’ve tried a couple different bits of software. We started with:

Skype – which worked for a bit, but quality problems got worse as Microsoft slowly ruined it.

Skype for business – (an unmitigated disaster given our implementation uses narrowband codecs for remote employees when you have more than 2 people on a call).

Zoom.us – We settled on Zoom. It has a few interesting recording capabilities, like the ability to record every channel independently and to let users record locally. If you have network quality concerns this can help offset them, allowing Pete, while editing the podcast, to delete parts where I was speaking over someone, or to assemble local audio from a guest who was cutting in and out. This shouldn't be needed often, but it's buried in the Zoom web settings.

Editing –

We are doing it live! – Bill R

While it may seem easier to just record an hour and a half of audio and dump it to SoundCloud, that is not what most people want to listen to. Part of the benefit of a podcast not being a live call is that it allows you to leave some (or a lot) of an episode on the cutting room floor. Things that you can cut out:

  1. Mid-program housekeeping discussions (clarifying whether we can introduce or avoid a given topic with the guest), or where you discuss the next segment.
  2. Deleting things where you accidentally leaked NDA content.
  3. Letting someone try to respond to something again (however, if you obsess on perfection and require 10 takes to get something right Podcasts may be the wrong medium).
  4. Guests' off-color humor that you'd rather your VP not listen to while making pancakes with their kids.
  5. Long awkward pauses. We like to stress to guests that if they want to sit and think for 10 seconds before responding, that's fine. Allowing pauses gives people time to provide awesome content. It doesn't sound great on the podcast, though, so we cut them out.
  6. When John (me) rambles off-topic or accidentally talks over someone.
  7. Good side conversations that might be useful in another montage episode on a topic.

The vSpeaking method? We REALLY try to keep podcasts to a 25-35 minute runtime when possible. Why this length? It is about the average commute time for a lot of our listeners (or time driving between customers for partners). We might split a conversation up. We tend to block 1 hour for recording, use 5-10 minutes for housekeeping and setup, and record 40-50 minutes of content that is then edited down.

At conferences like VMworld we will often grab shorter 5-15 minute interviews. We then stitch a collection of these into a longer episode using a windowing effect (we record an intro and outro for each segment). These "vignettes" might even be recorded as a small part of a larger episode. Episode 152 is an example of this format, where we combined an interview with Pat with pieces of an interview with Brian Madden that will make up a future episode. This started as a way for Pete and me to meet our goal of an episode every two weeks by adapting the "clip show" method from television. These days this method is more about building an episode around a topic and providing more opinions and voices.

When

It's worth noting that consistency is king in podcasting and publishing in general. If a podcast happens yearly and is 5 minutes long, or is daily and is 3 hours long, it will likely fail to grab a consistent listener base for a number of reasons. At least twice a month seems to be the minimum level of effort required for the length we run (25-45 minutes). Shorter podcasts (8-10 minutes) tend to require a much more frequent cadence to maintain listeners.

Length

A longer podcast allows for some introductory banter, a longer intro song, and a little more of the speakers' personalities to come out. A short podcast (5-10 minutes) might work well for fitting into someone's morning shower/shave/toothbrushing routine, but you'll need to cut it down to more "just the facts".

The vSpeaking method: While there is some variety in show length, 25-35 minutes tends to be the target run time. This aligns well with the average one-way commute time of 26.1 minutes, and a 1-hour workout fits two episodes.

Cadence

If you can't commit to bi-monthly (24 episodes in a year), it may not be worth the investment; it's hard to stay top of mind for listeners. Consistent cadence also helps. If you publish weekly, then skip 2 quarters, then come back weekly, it's hard to remain part of the listeners' "usual routine" where they listen to your podcast and make time for it. Assuming quality doesn't suffer, the more frequent the cadence, the better your subscriber numbers will look.

The vSpeaking method: We started bi-monthly, then shifted to bi-weekly, and recently we have shifted to a "mostly weekly" cadence.

Where

Where do you post it? Ideally, you want your podcast on every major podcast platform: the Apple Podcasts app, the Google Play store, and Spotify. You will want a web player that allows people to play it from a browser, and you will want a website to host show notes, speaker notes, and other information.

Internal only podcast, or password protected – Call me cynical, but I don’t have faith in internal podcasts.

  1. There’s too much friction vs. using the existing apps that people use on their devices.
  2. It's predicated on the myth that anything you post internally doesn't easily leak out to competitors. Let's be honest: if your secret competitive strategy is any good, it will be published on TheRegister years before it ships, or be in the hands of your competitors before the ink dries. This might work for short-form content whose embargo will quickly be lifted.

The vSpeaking method: We host vSpeakingpodcast.com using Zencast.fm, which provides hosting, and post episode blogs on Virtual Blocks. We briefly flirted with an internal-only podcast using Socialcast (sadly killed) or Podbean as the distribution method, but for the quality and time commitment we make, we couldn't justify it.

Conclusion

You need a few things to maintain a tech podcast.

  1. The drive to keep doing it. It has to be something you actually enjoy, otherwise committing to blocking the time and doing the pre-work to get guests will fall apart a month in.
  2. The right skills/talents. Pete is a fantastic showrunner: the host who keeps things moving, and the editor.
  3. A genuine technical curiosity for the topics you will cover.

Help/SOS – Podcast Emergency you say?

If you've reached this section and you are trying to start a quality podcast, please stop reading. If you are here because your boss came to you and said "we need you to start a podcast", keep reading. If you just discovered that your MBO/KPI for your quarterly bonus is tied to starting a podcast, this section of the guide is exclusively for you. You don't have any experience with any of this, and reading the above section has you convinced there is no way to be successful and it is too late to change this objective. It's true, there isn't really a market for un-edited, poorly recorded 8-episode podcasts run by product marketing on "why you should buy our cloud product!", but that isn't going to stop you from getting paid! Don't stress, you will still be able to get your bonus if you follow this guide.

Guests – Don’t try to get busy, highly in-demand guests. This might draw attention to the episode and highlight that this was just something slapped together to get a bonus.

KPI/MBO – Make sure the MBO/KPI didn't include download statistics. Choose a platform that doesn't provide these as part of its "free hosting" so you can blame that. In the event you have some minimum downloads required, just hire Mechanical Turk workers to send you screenshots of their phones saying "subscribed and downloaded".

Quality content? That will take too long. Just write out the "top 10 reasons you should buy our product". Find the sales/marketing PowerPoint and feel free to just read the speaker notes (or the slides, as we know speaker notes are for losers and slides should have 100 words per slide). This is a great opportunity to reuse other stale content. For bonus points, re-use content from someone who asked you to create the podcast, so you can blame them if they don't think the content is good!

Gear? Use the official corporate standard for interoffice phone calls, especially if its quality is terrible. This will reduce the desire of whoever is reviewing your bonus to listen long enough to realize there is no real content. Skype4Business and Webex dial-in bridges are good candidates.

Editing? Leave 20-30 minutes of dead air at the end of the episodes to make them look longer. This is especially useful for hiding the low-quality effort after the first episode, as it will prevent auto-playing the next episode.

Platform and marketing? – Consider posting it as an MP3 on an internal-only sales education portal. Avoiding hosting it in the outside world will help avoid scrutiny. Make sure the metadata tags are not tracked well, and that the only link to it is at the bottom of a weekly newsletter.

What if I want my product marketed by a podcast and do not want to bother with all this?

This is a bit easier than the above steps. Simply reach out to some podcasts that have existing followings in your space, and see if you can get guests on who will represent your product to that community. Sometimes this will cost money, sometimes it will not. Note: the vSpeaking Podcast does not do pay-for-play, but we don't judge others in the industry who do, as long as it is responsibly and legally disclosed.

There are also experienced people in the industry you can just outright hire, JMT being a good one right now.

Alternatively, if you want to produce a short video series, video (a YouTube playlist) is honestly a more popular format, and likely more conducive to what you are trying to accomplish.

Is that supported by VMware? (A breakdown of common misconceptions)

This reddit thread, about someone stuck in a non-supported configuration that is having issues, made me think it's time to explain which supported, partner-supported, and non-supported situations you should be aware of. This is not intended to be some giant pile of FUD that says "do what John says or beware your doom!". I wanted to highlight partners who are doing a great job of working within the ecosystem, as well as point out some potential gaps that I see customers are not always aware of.

I get a lot of questions about storage and what is supported. At VMware we have quite a few TAP partners and thousands of products that we happily jointly support. These partners are in our TAP program and have submitted their solutions for certification with tested results that show they can perform, and we have agreements to work together toward a common outcome (your performance, and your availability).

There are some companies who do not certify their solutions but have "partner verified" solutions. These solutions may have been verified by the partner, but generally involve the statement "please call your partner for support". While VMware will support other aspects of the environment (we will accept a ticket to discuss a problem with NTP that is unrelated to the storage system), you are at best looking at best-effort support on these solutions. Other partners may have signed up for TAP, but do not actually have any solution statement with us. To be clear, being in TAP alone does not mean a solution is jointly supported or verified.

VVOLs

VVols is an EXCELLENT feature that allows storage-based policy management to be extended for seamless management. Quite a few platforms support this today. If you're on a storage refresh, you should STRONGLY consider confirming that your partner supports VVols, which you can do by checking this link.

Any storage company that is looking at supporting VMware deployments at scale is looking at VVols. Management of LUNs and arrays becomes cumbersome as you grow, and introduces opportunity for error. You should ask your VMware storage provider where they are on supporting VVols, and what their roadmap is. You can also check whether your storage vendor supports VVols on the HCL here.

VAAI

VAAI is a great technology that allows LUN and NFS based systems to mitigate some performance and capability challenges. VCAI is a smaller subset that allows NFS based systems to accelerate linked clone offload; within NFS, a smaller subset have been certified for large-scale (2000 clones or more) operations. These are great solutions. I bring this up because it has come to my attention that some partners advertise support for these features but have not completed testing. This generally boils down to one of a few situations:

  1. They have their submission pending and will have this fixed within weeks.
  2. Their solution fails to pass our requirements of performance or availability during testing.
  3. They are a very small startup and are taking the risk of not spending the time and money to complete the testing.
  4. They are not focused on the VMware market and are more concerned with other platforms.

Please check with your storage provider and make sure that their CURRENT version is certified if you are going to enable and use VAAI. You do not want to be surprised by a corruption or performance issue and discover on a support call that you are in a non-supported configuration. In some cases partners have not certified newer platforms, so be aware of this as you upgrade your storage. Also, there are quite a lot of variations of VAAI (some may support ATS but not UNMAP), so look for the devil in the details before you adopt a platform with VAAI.
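
You can also ask the host directly which primitives each device reports. A quick check from the ESXi shell (output fields vary slightly by release):

# Per-device status of the block primitives (ATS, Clone, Zero, Delete/UNMAP)
esxcli storage core device vaai status get
# The "VAAI Status" field here gives a per-device summary
esxcli storage core device list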

Replication and Caching

Replication is a feature that many customers want to use (either with SRM, or as part of their own DR orchestration). We have a LOT of partners, and we have our own option plus two major APIs for supporting this today.

One is VADP (our traditional API associated with backups). Partners like Symantec, Commvault, and Veeam leverage this to provide backup and replication at scale for your environment. While it does use snapshots, I will note that improvements were made in 6.0 (no more helper snapshots!), and VVols' and VSAN's alternative snapshot systems provide much-needed performance improvements.

The other API is VAIO, which allows direct access to the I/O path without the need for snapshots. StorageCraft, EMC, and Veritas are leading the pack with adoption for replication here, with more to follow. This API also provides access for caching solutions from SanDisk, Infinio, and Samsung.

Lastly we have vSphere Replication. It works with compression in 6.x, it doesn't use snapshots unless you need guest processing, and it integrates nicely with SRM. It's not going to solve all problems (or else we wouldn't have an ecosystem), but it's pretty broad.

Some replication and caching vendors have chosen to use private, non-supported APIs (which in some cases have been marked for deprecation, as they introduce stability and potential security issues). Our support stance in this case again falls under partner-supported at best. While VMware is not going to invalidate your support agreement, GSS may ask you to uninstall your non-supported 3rd party solution to troubleshoot a problem.

OEM support

This sounds straightforward, but it isn't always. If someone is selling you something turnkey that includes vSphere pre-installed, they are in one of our OEM programs. Some examples you may know (Cisco/HP/Dell/SuperMicro/Fujitsu/HDS), but there are also smaller embedded OEMs you may not be aware of, who produce turnkey solutions the customer might not even realize run ESXi (think industrial controls, surveillance, and other black-box industry appliances that might be powered by vSphere if you look closely enough). OEM partners get the privilege of doing pre-installs, as well as in some cases offering the ability to bundle Tier 1 and Tier 2 support. Anyone not in this program can't provide integrated, seamless Tier 1/2 support, and any tickets they open will have to start over rather than escalate directly to Tier 3/engineering resources, potentially slowing down your support experience as well as requiring that multiple tickets be opened with multiple vendors.

Lastly, I wanted to talk about protocols.

VMware supports a LOT of industry-standard ways of accessing storage today: Fibre Channel, Fibre Channel over Ethernet, iSCSI, NFS, InfiniBand, SAS, SATA, and NVMe, as well as our own protocol for VMware VSAN. I'm sure more will be supported at some point (vague non-forward-looking statement!).

That said, there have been some failed standards that were never supported (ATA over Ethernet, which was pushed by Coraid, as an example), as they failed to gain widespread adoption.

There have also been other proprietary protocols (EMC's ScaleIO) that again fall under the partner verified and supported space, and are not directly supported by VMware support or engineering. If you're deploying ScaleIO and want VMware support for the solution, you would want to look at the older 1.31 release, which had supported iSCSI protocol support for the older ESXi 5.5 release, or check with EMC to see if they have released an updated iSCSI certification. The idea here again isn't that any ticket opened on an SSO problem will be ignored, just that support of this solution may involve multiple tickets, and you would likely not start with VMware support if it is a storage-related problem.

Now the question comes up from all of this.

Why would I look at deploying something that is not supported by VMware Support and Engineering?

  1. You don't have an SLA. If you have an end-to-end SLA you need something with end-to-end support (end of story). If this is a test/dev or lab environment, or one where you have temporary workloads, this could work.
  2. You are willing to work around to a supported configuration. In the case of ScaleIO, deploy ESXi 5.5 instead, and roll back to the older version to get iSCSI support. In that case, be aware that you may limit yourself on taking advantage of newer feature releases, and be aware of when the older product version's support will sunset, as this may shorten the lifecycle of the solution.
  3. You have faith the partner can work around future changes, and you can accept the slower cadence. Note, unless that company is public there are few consequences for them making forward-looking statements of support and failing to deliver on them. This is why VMware has to have a ridiculous amount of legal bumpers on our VMworld presentations…
  4. You are willing to accept being stuck with older releases, and their limitations and known issues. Partners who are in VAIO/VVols have advanced roadmap access (and in many cases help shape the roadmap). Partners using non-supported solutions and private APIs are often stuck with 6-9 months of reverse engineering to find out what changed between releases, as there is no documentation for how these APIs were changed (or how to work around their removal).
  5. You are willing to be the integrator of the solution. Opening multiple tickets and driving a resolution is something your company enjoys doing. The idea of becoming your own converged infrastructure ISV doesn't bother you. In this case, I would look at signing up to become an OEM embedded partner, if this is the value proposition you bring to the table.
  6. You want to live dangerously. You're a traveling vagabond who has danger for a middle name. Datacenter outages or 500ms of disk latency don't scare you, and your users have no power to usurp your rule and cast you out.

Fun with VSAN Storage Profiles

When VASA and storage profiles first came out, I really thought they were unimportant or overrated for smaller shops. Now that VSAN has broken my lab free of the rule of "one size of performance and data protection fits all", I've decided to get a bit creative to demonstrate what can be done. I've included some sample tiers, as well as my own guidance for staff on when to use them. Notice how Gold is not the highest tier (a traditional design mistake in clouds/labs). The reason for this is simple: if someone asks for something, I can simply ask them "do you want that on the gold tier?" and not end up giving them space reservations, cache reservations, triple mirroring, or striping after they demand gold tier for everything. This is key to reducing space wastage in environments where politics trumps resources in provisioning practices.

FunWithvSANProfiles

VSAN: A Call for Workloads!

After some brief fun setting up distributed switching, I'm starting my first round of benchmarks. I'm going to run with jumbo frames first, then without. My current testing harness is a copy of View Planner 3.0 and 3 instances of VMware I/O Analyzer. If anyone has any specific vSCSI traces they would like to send me, I'm up for running a couple (anyone got any crazy Oracle workloads?).

The VSAN Build – Part 2

I got the final parts (well, enough to bootstrap things) on Thursday, so the building has begun.

A couple quick observations on the switch, and getting VSAN up and ready for vCenter Server.

NetGear XS712T

1. Just because you mark a port as Untagged doesn't mean much by itself. To manage the switch from your laptop on a non-default VLAN, you'll need to set a PVID (Primary VLAN ID) to the VLAN you want to use for management. Also, management can only be done on a single IP/VLAN, so make sure to set up a port with a PVID on that VLAN before you change it (otherwise it's time for the reset switch).

(Screenshot: VLAN configuration on a Thunderbolt adapter.)

2. Mac users should be advised that you can tag VLANs and create an unlimited number of virtual interfaces, even on a Thunderbolt adapter. This is handy when using non-default VLANs for configuration. Click the plus sign in the bottom left corner of the Network control panel to make a new interface, then select the gear to manage it and change the VLAN.

3. It will negotiate 10Gig on a Cat5e cable (I'm going to go by Fry's and get some better cables at some point before benchmarking).

VSAN/vCenter
It's trivial to set up a single-host deployment.
First, create the VSAN cluster:
esxcli vsan cluster join -u bef029d5-803a-4187-920b-88a365788b12
(Alternatively, you can generate your own unique UUID.)
Next, find the NAA identifiers of a normal disk and an SSD by running this command:
esxcli storage core device list
Next, add the disks to the VSAN:
esxcli vsan storage add -d naa.50014ee058fdb53a -s naa.50015178f3682a73
After this you'll want to add a VMkernel port for VSAN and add some hosts, but with these commands you can have a one-node system up and ready for vCenter Server installation in under 15 minutes.
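For the VMkernel step, a minimal sketch (vmk1 is a placeholder for whatever interface will carry your VSAN traffic):

# Tag an existing VMkernel interface for VSAN traffic
esxcli vsan network ipv4 add -i vmk1
# Confirm the VSAN network configuration
esxcli vsan network list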

For this lab I’ll be using the vCenter Server Appliance.

After installing the OVA you'll want to run the setup script. You will need to first log in to the command line interface. Mac users be warned: mashing the Command key will send you to a different TTY.
The login is root/vmware. From the console, run the network setup script: /opt/vmware/share/vami/vami_set_network
It can be run with parameters attached for quicker setup:
/opt/vmware/share/vami/vami_set_network eth0 STATICV4 172.16.44.100 255.255.255.0 172.16.44.1
After doing this you can log in from your browser using HTTPS on port 5480 and finish the setup, for example https://172.16.44.100:5480.

My Dog

Meet Otto. He's a rescue who I guess is around 10 years old, weighs 11-12 pounds, and likes children and all people.

His likes are naps, sunning himself in the yard, and making me take him on 2-3 walks a day to keep me in shape.

He is weirdly quiet (Doesn’t bark), and doesn’t really make a mess or scratch at things.

I was told by my agent to attach a photo to applications for landlords to see, and since some of their platforms don't make this an option, I will just embed a link to this blog post. If you see him on a greenbelt trail, you or your children are welcome to pet him; he's extremely harmless.

Is HPE Tri-Mode Supported for ESA?

No.

Now, the real details are a bit more complicated than that. It is possible to use the 8SFF 4x U.3 TriMode (not x1) backplane kit, but only if the server was built out with only NVMe drives and no RAID controller/Smart Array. Personally I'd build off of E3 drives. For a full BOM review and a bit more detail, check out this twitter thread on the topic, where I go step by step through the BOM outlining what's on it, why, and what's missing.

How to configure a fast end to end NVMe I/O path for vSAN

A quick blog post, as this came up recently. Someone who was looking at NVMe-oF with their storage was asking how to configure a similar end-to-end vSAN NVMe I/O path that avoids SCSI and serial I/O queues.

Why would you want this? NVMe in general uses significantly less CPU per IOP than SCSI, commonly has simpler hardware requirements (no HBA needed), and can deliver higher throughput and IOPS at lower latency using parallel queuing.

This is simple:

  1. Start with vSAN certified NVMe drives.
  2. Use vSAN ESA instead of OSA (it was designed with NVMe and parallel queues in mind, with additional threading at the DOM layer, etc.).
  3. Start with 25Gbps Ethernet, but consider 50 or 100Gbps if performance is your top concern.
  4. Configure the vNVMe virtual adapter instead of the vSCSI or LSI BusLogic controllers.
  5. (Optional) Want to shed the bonds of TCP and lower networking overhead? Consider configuring vSAN RDMA (RoCE). It requires some specific configuration to implement and is not required, but for customers pushing the limits of 100Gbps of throughput it is something to consider; see the quick pre-checks after this list.
  6. Deploy the newest vSAN version. The vSAN I/O path has seen a number of improvements even since 8.0 GA that make it important to upgrade to maximize performance.
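
As promised above, a couple of host-side pre-checks before enabling RDMA (these are generic esxcli queries, a hedged sketch rather than an official checklist):

# List the NVMe adapters/devices the host sees (they should be on the inbox NVMe driver)
esxcli nvme device list
# Verify RDMA-capable (RoCE) NICs are present before enabling vSAN RDMA
esxcli rdma device list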

To get started, add an NVMe controller to your virtual machines, and make sure VMware Tools is installed in the guest OS of your templates.

Note that you can migrate existing VMDKs to vNVMe (I recommend doing this with the VM powered off). Before you do this, you will want to install VMware Tools (so the VMware paravirtual NVMe controller driver is present).

RDTBench – Testing vSAN, RDMA and TCP between hosts

A while back I was asking engineering how they tested RDMA between hosts and stumbled upon RDTBench. This is a traffic generator where you configure one host to act as a "server" and one to several hosts to act as clients communicating with it. This is a great tool for testing networking throughput before production use of a host, as well as for validating RDMA configurations, as it can be configured to generate vSAN RDMA traffic. Pings and iperf are great, but being able to simulate RDT (vSAN protocol) traffic has its advantages.

RDTBench traffic does show up on the vSAN performance service host networking graphs.
A few quick questions about it:

Where is it?

/usr/lib/vmware/vsan/bin

How do I run it?
You need to run it on two different hosts. One host will need to be configured to act as a client (by default it runs as a server). For the server I commonly use the -b flag to make it run bidirectionally on the transport; -p rdma will run it in RDMA mode to test RDMA.
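
A minimal sketch based on the flags mentioned above (the client-side options are not shown here, so check the help output for those):

# On the server host: run bidirectionally, using RDMA as the transport
/usr/lib/vmware/vsan/bin/rdtbench -b -p rdma
# On the client host(s): point rdtbench at the server; see ./rdtbench -h for the client flags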

If you are not sure RDMA is working, go ahead and turn vSAN RDMA on (Cluster > Configure > Networking). vSAN will "fail safe" back to TCP, but will tell you what is missing from your configuration.

./rdtbench -h will provide the full list of command help.

For now this tool primarily exists for engineering (where it is used for RDMA NIC validation) as well as support (as a more realistic alternative to iperf), but I'm curious how we can incorporate it into other workflows for testing the health of a cluster.

Where to run your vCenter Server? (On a vSAN Stretched Cluster)

In a perfect world, you have a management cluster that hosts your vCenter Server, and the management of every cluster lives somewhere else. Unfortunately the real world happens, and:

  • Something has to manage the management cluster.
  • Sometimes you need a cluster to be completely stand alone. 

Can I run the vCenter server on the cluster it manages?

It is FULLY supported to run the vCenter Server on the cluster that it is managing. HA will still work. If you want a deeper dive on this issue, this short video covers the question.

So what is the best advice when doing this?

  1. Use ephemeral port groups for all management networks. This prevents vDS chicken-and-egg issues that are annoying, but not impossible, to work around.
  2. I prefer to use DRS SHOULD rules so that vCenter will "normally" live on the lowest host number/IP address in the cluster. This is useful when vCenter is unhealthy and the management services are failing to start, as it makes it easy to find which host is running it. Make sure to avoid using MUST rules for this, as they would prevent vCenter from running anywhere else in the event that host fails.
You can attach VMkernel ports to an ephemeral port group even if the VCSA is offline

But what about a stretched cluster? I have a stand-alone host running the witness server; should I put vCenter there?

No, I would not recommend this design. It is always preferable to run the vCenter Server somewhere it will enjoy HA protection and not need to be powered off to patch a host. While vSAN stretched clusters support active/active operations, many customers configure them with most workloads running in the preferred datacenter location. If you use this configuration, I recommend you run the vCenter Server in the secondary location for a few reasons:

  1. In the event the primary datacenter fails, you will not be "operationally blind" while HA is firing off and recovering workloads. This avoids the operational blind spot that would otherwise exist for the few minutes while the vCenter Server fails over.
  2. It will act as a weathervane for the health of the secondary datacenter. It is generally good to have SOME sort of workload running at the secondary site to provide some understanding of how those hosts will perform, even if it is a relatively light load.

Disable Intel VMD for drives being used for VMware vSAN

My recommendation is to please disable Intel VMD (Volume Management Device) and use the native NVMe inbox driver to mount devices for VMware vSAN going forward. To be clear, Intel VMD is NOT a bad technology, but we do not need or want it in the I/O path for VMware vSAN. It can be useful for RAID-on-chip with NVMe boot devices. In addition, it was the only method to reliably get hotplug and serviceability (blink lights) before the NVMe spec was "finished", which is why it was sometimes used in some older, early NVMe vSAN configurations.

Looking at the VCG, a number of drives are only being certified using the inbox driver and not the Intel driver.

To disable it you need to configure the BIOS/UEFI. Here's an example for Lenovo (who I think ships with it enabled by default).

Jason Massae has a great blog that covers how to use Intel VMD in more detail, and Intel has their own documentation for non-vSAN use cases.
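
If you want to verify which driver is claiming your NVMe devices from the ESXi shell, here is a quick check (the driver names below are what I would expect to see, so verify against your build):

# Show storage adapters and the driver that claims each one
esxcli storage core adapter list
# The inbox NVMe driver appears as nvme_pcie on 7.0 and later; a VMD-managed
# device would show Intel's VMD driver (e.g. iavmd) instead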

Yes, you can change things on a vSAN ESA ReadyNode

First, I'm going to ask you to go check out the following KB and take 2-3 minutes to read it: https://kb.vmware.com/s/article/90343

Pay extra attention to the table in the document it links to.

Also go read Pete’s new blog explaining read intensive drive support.

So what does this KB mean in practice?

You can start with the smallest ReadyNode (currently this is an AF-2, but I'm seeing some smaller configs in the pipeline) and add capacity, drives, or bigger NICs, making changes based on the KB.

Should I change it?

The biggest thing to watch for is that adding TONS of capacity without increasing NIC sizes could result in longer-than-expected rebuilds. Putting 300TB into a host with 2 x 10Gbps NICs is probably not the greatest idea, while adding extra RAM or cores (or changing the CPU frequency 5%) is unlikely to yield any unexpected behaviors. In general, balanced designs are preferred (that is why the ReadyNode profiles exist as templates), but we understand sometimes customers need some flexibility, and because of that the KB above was created to support it.
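
For a rough sense of why 300TB behind 2 x 10Gbps is a problem, a quick back-of-the-envelope calculation (best case: full line rate, no protocol overhead, no competing traffic):

# Best-case hours to resync 300TB over 2 x 10Gbps (~2,500 MB/s aggregate)
echo $(( 300 * 1000 * 1000 / 2500 / 3600 )) hours
# prints 33: more than a day of rebuild traffic even at full line rate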

What can I change?

I've taken the original list and converted it to text, as well as added (in italics) some of my own commentary on what and how to change ESA ReadyNodes. I will be updating this blog as new hardware comes onto the ReadyNode certification list.

CPU

  • Same or higher core count with a similar or higher base clock speed is recommended.
  • Each vSAN ESA ReadyNode™ is certified against a prescriptive BOM.
  • Adding more memory than what is listed is supported by vSAN, provided vSphere supports it. Please maintain a balanced memory population configuration when possible.
  • If you want to scale storage performance with additional drives, consider more cores. While vSAN OSA was more sensitive to clock speed for scaling aggregate performance, vSAN ESA's additional threading makes more cores particularly useful for scaling performance.
  • As of the time of this writing, the minimum number of cores is 32. Please check the vSAN ESA VCG profile page for updates to see if smaller nodes have been certified.

Storage Devices (NVMe drives today)

  • The device needs to be of the same or higher performance/endurance class.
  • Storage device models can be changed to any vSAN ESA certified disk. Please confirm with the server vendor for storage device support on the server.
  • We recommend balancing drive types and sizes (homogeneous configurations) across nodes in a cluster.
  • We allow changing the number of drives, and drives at different capacity points (the change should be contained within the same cluster), as long as it meets the capacity requirement of the selected profile but does not exceed the max drives certified for the ReadyNode™. Please note that performance is dependent on the quantity of drives.
  • Mixed Use NVMe (typically 3DWPD) endurance drives are best for large-block, steady-state workloads. Lower endurance drives that are certified for vSAN ESA may make more sense for read-heavy, shorter duty cycle, storage-dense, cost-conscious designs.
  • 1DWPD ~15TB "Read Intensive" drives are NOW on the vSAN ESA VCG; for storage-dense, non-sustained large-block write workloads these offer great value.
  • Consider rebuild times, and consider also upgrading the number of NICs for vSAN, or the NIC interfaces to 100Gbps, when adding significant amounts of capacity to a node.

NIC

  • NICs certified in the IOVP can be leveraged for a vSAN ESA ReadyNode™.
  • The NIC should be the same or higher speed.
  • We allow adding additional NICs as needed.
  • If/when 10Gbps ReadyNode profiles are released, it is advised to still consider 25Gbps NICs, as they can operate at 10Gbps and support future switching upgrades (SFP28 interfaces are backwards compatible with SFP+ cables/transceivers).

Boot Devices

  • Boot device needs to be same or higher performance endurance class.
  • Boot device needs to be in the same drive family.

TPM

Please just buy a TPM. It is critically important for vSAN encryption key protection, securing the ESXi configuration, host attestation, and other uses. They cost $50 up front, but hours of annoying maintenance to install after the fact. I suggest throwing an NVMe drive at any sales engineer who leaves one off a quote.

NFS Native Snapshots: Should I Just Use vVols Instead?

The ability to offload snapshots natively to an NFS filer has been around for a while. Commonly this was used with View Composer Array Integration (VCAI) to rapidly clone VDI images, and occasionally for VMware Cloud Director environments (fast clones for vApps). There are some caveats to consider:

  • Up until vSphere 7 Update 2, the first snapshot had to be a traditional redo log snapshot.
  • VMware blocks Storage vMotion for VMs with native snapshots (you will need to use array replication and a bit of scripting to move these), which leads to the most important caveat.
  • A snapshot.alwaysAllowNative = "TRUE" setting for virtual machines was introduced. This allows a virtual machine on an NFS datastore with the VAAI plugin to create native snapshots regardless of whether its base disk is a flat one or not.
  • If the filer refuses to create a snapshot (most commonly seen when a filer refuses to allow snapshots while running a background automated clone or replication on some storage platforms), it will revert to a redo log snapshot. It is worth noting that "alwaysAllowNative" does not actually prevent this fallback behavior.
  • Some filer vendors will automatically inject snapshot.alwaysAllowNative = "TRUE" into VMs.

The particular challenge is that this can cause real damage: a chain that goes from native, to redo log, back to native (or from redo log, to native, to redo log) is invalid and leads to disk corruption!

So what are my options if this is a risk in my environment?

I'll first point out that vVols allows offloading of snapshots WHILE retaining support for Storage vMotion. It is a fundamentally simpler, more elegant solution to the problem of natively offloading snapshots.

For most NFS VAAI users this should not be an issue, as the filer should just create native snapshots when asked. For platforms that have trouble taking native snapshots while other background processes are running, consider disabling the background replication/cloning that is automatically tied to the snapshot tree. If that is not an option, consider not using snapshot.alwaysAllowNative and performing full clones, or not using the NFS VAAI clone offload at all. Hopefully in the future there will be a further patch to prevent this issue.
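
If you are unsure whether a given NFS datastore is even using the VAAI-NAS offload, here are a couple of quick host-side checks (the plugin VIB name varies by filer vendor, so the grep pattern below is just an assumption):

# Shows each NFS datastore and whether Hardware Acceleration (VAAI-NAS) is active
esxcli storage nfs list
# Look for the vendor's VAAI-NAS plugin VIB, if one is installed
esxcli software vib list | grep -i nas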