Skip to content

Posts from the ‘Uncategorized’ Category

vSAN Deduplication and Compression Tips!

I’ve been getting some questions lately and here are a few quick thoughts on getting the most out of this feature.

If you do not see deduplication or compression at all:

  1. See if the object space reservation policy has been set to above zero, as this reservation will effectively disable the benefits of deduplication for the virtual machine.
  2. Do not forget that swap is by default set to 100% but can be changed.
  3. If a legacy client or provisioning command is used that specifies “thick” or “Eager Zero Thick” this will override the OSR 100%. To fix this, you can reapply the policy. William Lam has a great blog post with some scripts on how to identify and resolve this.
  4. Make sure data is being written to the capacity tier. If you just provisioned 3-4 VM’s they may still be in the write buffer. We do not waste CPU or latency deduplicating or compressing data that may not have a long lifespan. If you only provisioned 10 VM’s that are 8GB each it’s quite possible that they have not destaged yet. If you are doing testing clone a lot of VM’s (I tend to create 200 or more) so you can force the destage to happen.

Performance anomalies (and why!) when testing vSAN’s deduplication and compression.

I’ve always felt that it’s incredibly hard to performance test deduplication and compression features, as real-world data has a mix of compressibility, and duplicate blocks and some notes I’ve seen from testing. Note: these anomalies often happen on other storage systems with these features and highlight the difficulty in testing these features.

  • Testing 100% duplicate data tends to make reads and writes better than a baseline of the feature off as you avoid any bottleneck on the destage from cache process, and the tiny amount of data will end up in a DRAM cache.
  • Testing data that compresses poorly on vSAN will show the little impact to read performance as vSAN will write the data fully hydrated to avoid any CPU or latency overhead in decompression (not that LZ4 isn’t a fast algorithm, to begin with).
  • Write throughput and IOPS for bursts that do not start to fill up the cache show little overhead. This is true, as the data is written non-compacted to reduce latency

These quirks stick out in synthetic testing, and why I recommend reading the space efficiencies guide for guidance on using this and other features.

New and noteworthy vSAN KB’s worth a read.

While vSAN Health Checks are constantly expanding, it’s still worth keeping up with the new KB’s to see what’s going on and if there are any issues you need to consider.

Here’s a few KB’s worth a read. 

vSAN 2017 Quarterly Advisory for Q2

This article includes links to important bug fixes in recent patch releases, outstanding issues, known workarounds and other informational articles.

2150957

 

File services support by NetApp ONTAP Select 9.2 for VMware vSAN datastores

This article provides information about NetApp’s ONTAP Select solution that offers file services on VMware vSAN datastore.

2151182

  

Setting up active-passive dual pathing with vSAN and vSphere

This article explains setting up active-passive dual pathing with vSAN and vSphere. This one is a bit interesting as it includes some information on the superiority of native drivers in handling internal duel path SAS fabrics in managing failover and failback.

2151225

 

Understanding vSAN memory consumption in ESXi 6.5.0d/ 6.0 U3 and later

This article provides information about memory consumption in the latest version of vSAN 6.2 (ESXi 6.0 Update 3 and later) and vSAN 6.6 (ESXi 6.5.0d and later) and a provides example scenarios.

    2113954

Duplicate SCSI IDs causing SATA drives in drive bay #1 to go missing from ESXi when running the nhpsa driver on Gen 9 HPE Synergy compute modules, HPE ProLiant DL-series servers that include a SAS expander

This document highlights an issue observed when using Gen 9 HPE Synergy compute modules or HPE ProLiant DL-series servers with ESXi 6.5, the native nhpsa driver and SATA drives. There’s a workaround for now (Leave drive bay 1 empty, or use a SAS device for it).

2150104

 

 

vSAN 6.6 Ondisk upgrade to version 5 fails with the error “A general system error occurred: Unable to complete Sysinfo operation…”

This is resolved by going to 6.6.1 and performing an update while using the re-sync throttle function. Note, if your on vSAN 6.6 you REALLY want to get to vSAN 6.6.1. Huge performance improvements, beyond bug fixes like this.

2151316

How to bulk create VMkernel Ports for vMotion and vSAN in vSAN 6.6

Quick post time!

A key part of vSAN 6.6 improvements is the new configuration assist menu. Common configuration requirements are tested, and wizards can quickly be launched that will do various tasks (Setup DRS, HA, create a vDS and migrate etc).

One of my least favorite repetitive tasks to do in the GUI is setup VMkernel Ports for vSAN and vMotion. Once you create your vDS and port groups, you can quickly create these in bulk for all host at once.

Once you put in the IP address for the first host in the cluster it will auto fill the remainder by adding one to the last octet. Note, this will use the order that hosts were added to the cluster (So always add them sequentially). Note you can also bulk set the MTU if needed.

If you have more questions about vSAN, vSAN networking, or want more demo’s check out the vSAN content, head over to storagehub.vmware.com

The GIF below walks through the entire process:

So Easy a caveman could do it!

VMware vSAN, Cisco UCS and Cisco ACI information

I’ve had a few questions regarding VMware vSAN with Cisco ACI.

While mostly the guidance for ACI is the same there are a few vendor specific considerations. upon internal testing we found some recommended configuration advise and specific concerns for the multicast querier. For more information see this new storage hub section of the networking guide. 

If your looking for General vSAN networking advice, be sure to read the networking guide.

If your looking for Cisco’s documentation regarding UCS servers and VMware vSAN it can be found here.

If your looking for guidance on configuring Cisco Controllers and HBA’s Peter Keilty has some great blogs on this topic. As a reminder while I would strongly prefer the Cisco HBA over the RAID controller if you use the RAID controller you will need the cache module to have proper queue depths.

 

Looking for VMware Storage Content?

Looking for Demo’s, Videos, Design and sizing guides, VVOLs, SRM, VSAN?

Go check out storagehub.vmware.com

Is that supported by VMware? (A breakdown of common misconceptions)

This reddit thread about someone stuck in a non-supported aronfiguration that is having issues made me think its time to explain what supported and partner supported and not supported situations you should be aware of. This is not intended to be some giant pile of FUD that says “Do what John says or beware your doom!”. I wanted to highlight partners who are doing a great job of working within the ecosystem as well as point out some potential gaps that I see customers not always aware of.

I get a lot of questions about storage, and what is supported. At VMware we have quite a few TAP parters and thousands of products that we happily jointly support. These partners are in our TAP program and have submitted their solutions for certification with tested results that show they can perform, and we have agreements to work together to a common outcome (Your performance, and your availability).

There are some companies who do not certify their solutions but have “partner verified” solutions. These solutions may have been verified by the partner, but generally involve the statement of “please call your partner for support”. While VMware will support other aspects in the environment (we will accept a ticket to discuss a problem with NTP that is unrelated to the storage system), you are at best looking for best effort support on these solutions.  Other partners may have signed up for TAP, but do not actually have any solution statement with us. To be clear, being in TAP alone does not mean a solution is jointly supported or verified.

VVOLs

VVOls is an EXCELLENT product that allows storage based policy management to be extended to allow seamless management. Quite a few platforms support this today. If your on a storage refresh, you should STRONGLY consider checking that your partner supports VVOL, and you can check by checking this link.

Any storage company who’s looking at supporting VMware deployments at scale is looking at VVOLs. management of LUNs and arrays as you grow becomes cumbersome and introduces opportunity for error. You should ask your VMware storage provider of where they are on support VVOLs, and what their roadmap is. You can also check the HCL to see if your storage vendor is supporting VVOLs by checking here.

VAAI

VAAI is a great technology that allows LUN and NFS based systems to mitigate some of the performance and capability challenges.  VCAI is a smaller subset that allows NFS based systems to accelerate linked clone offload. Within NFS a smaller subset have been certified for large scale (2000 clones or more) operations.  These are great solutions. I bring this up because it has come to my attention that some partners advertise support of these features but have not completed testing.  This generally boils down to 1 of 3 situations.

 

  1. They have their submission pending and will have this fixed within weeks.
  2. Their solution fails to pass our requirements of performance or availability during testing.
  3. They are a very small startup and are taking the risk of not spending the time and money to complete the testing.
  4. They are not focused on the VMware market and are more concerned with other platforms.

Please check with your storage provider and make sure that their CURRENT version is certified if you are going to enable and use VAAI. You do not want to be surprised by a corruption, or performance issue and discover from a support call that you are in a non-supported configuration.  In some cases some partners have not certified newer platforms so be aware of this as you upgrade your storage. Also there are quite a lot of variations of VAAI (Some may support ATS but not UNMAP) so look at the devil in the details before you adopt a platform with VAAI.

Replication and Caching

Replication is a feature that many customers want to use (either for use with SRM, or as part of their own DR orchestration).  We have a LOT of partners, and we have our own option and two major API’s for supporting this today.

One is VADP (our traditional API associated with backups). Partners like Symantec, Comvault, and Veeam leverage this to provide backup and replication at scale for your environment. While it does use snapshots, I will note in 6.0 improvements were made (no more helper snapshots!) and VVOLs and VSAN’s alternative snapshot system provides much needed performance improvements

The other API is VAIO that allows for direct access to the IO path without the need for snapshots. StorageCraft, EMC and Veritas are leading the pack with adoption for replication here with more to follow. This API also provides access also for Caching solutions from Sandisk, Infinio and Samsung.

Lastly we have vSphere replication. It works with compression in 6.x, it doesn’t use snapshots unless you need guest processing, and it also integrates nicely with SRM.  Its not going to solve all problems (or else we wouldn’t have an ecosystem) but its pretty broad.

Some replication and caching vendors have chosen to use private, non-supported API (that in some cases have been marked for depreciation as they introduce stability and potential security issues). Our supports stance in this case again falls under partner supported at best. While VMware is not going to invalidate your support agreement, GSS may ask you to uninstall your 3rd party solution that is not supported to troubleshoot a problem.

OEM support

This sounds straight forward, but it always ins’t. If someone is selling you something turnkey that includes vSphere pre-installed, they are in one of our OEM programs.  Some examples of this you may know (Cisco/HP/Dell/SuperMicro/Fujitsu/HDS) but all some other ones you may not be aware of smaller embedded OEM’s who produce turnkey solutions that the customer might not even be aware of running ESXi on (Think industrial controls, surveillance and other black box type industry appliances that might be powered by vSphere if you look closely enough). OEM partners get the privilege of doing pre-installs as well as also in some cases offering the ability to bundle Tier 1 and Tier 2 support. Anyone not in this program can’t provide integrated seamless Tier 1/2 support and any tickets that they open will have to start over rather than offer direct escalations to tier 3/engineering resources potentially slowing down your support experience as well as again requiring that multiple tickets be opened with multiple vendors.

Lastly, I wanted to talk about protocols.

VMware supports a LOT of industry standard ways today for accessing storage.  Fibre Channel, Fibre Channel over Ethernet, iSCSI, NFS, Infiniband, SAS, SATA, NVMe as well as our protocol for VMware VSAN. I’m sure more will be supported at some point (vague non-forward looking statement!).

That said there have been some failed standards that were never supported (ATA over Ethernet which was pushed by CoRAID as an example) as they failed to gain wide spread support.

There have also been other proprietary protocols (EMC’s Scale IO) that again fall under Partner Verified and Supported space, and are not directly supported by VMware support or engineering. If your deploying ScaleIO and want VMware support for the solution you would want to look at the older 1.31 release that had a supported iSCSI protocol support for the older ESXi 5.5 release or to check with EMC and see if they have released an updated iSCSI certification. The idea here again isn’t that any ticket opened on a SSO problem will be ignored, just that any support of this solution may involve multiple tickets, and you would likely not start with VMware support on if it is a storage related problem.

Now the question comes up from all of this.

Why would I look at deploying something that is not supported by VMware Support and Engineering?

  1. You don’t have a SLA. If you have an end to end SLA you need something with end to end support (end of story). If this is a test/dev or lab environment, or one where you have temporarily workloads, this could work.
  2. You are wiling to work around to a supported configuration. In the case of ScaleIO, deploy ESXI 5.5 instead, and roll back to the older version to get iSCSI support.  In the case be aware that you may limit yourself on taking advantage of newer feature releases and be aware of when the older product versions support will sunset as this may shorten the lifecycle of the solution.
  3. You have faith the partner can work around future changes and can accept the slower cadence.  Note, unless that company is public there are few consequences for them making forward looking statements of support and failing to deliver on them. This is why VMware has to have an a ridiculous amount of legal bumpers on our VMworld presentations…
  4. You are willing to accept being stuck with older releases, and their limitations and known issues.  Partners who are in VAIO/VVOLs have advanced roadmap access (and in many cases help shape the roadmap).  Partners using non-supported solutions, and private API’s are often stuck with 6-9 months of reverse engineering to try to find out what changed between releases as there is no documentation available for how these API’s were changed (or how to work around their removal).
  5. You are willing to be the integrator of the solution. Opening multiple tickets and driving a resolution is something your company enjoys doing.  The idea of becoming your own converged infrastructure ISV doesn’t bother you. In this case I would check with signing up to become an OEM embedded partner if this is what you view as the value proposition that you bring to the table.
  6. You want to live dangerously. Your a traveling vagabond who has danger for a middle name. Datacenter outages, or 500ms of disk latency don’t scare you, and your users have no power to usurp your rule and cast you out.

 

Dispelling myths about VSAN and flash.

I’ve been having the same conversation with several customers lately that is concerning.

Myth #1 “VSAN must use flash devices from a small certified list”

Reality: The reality is that that there are over 600 different flash devices that have been certified (and this list is growing).

Myth #2 “The VSAN certified flash devices are expensive!”

Reality: ” Capacity tier flash devices can be found in the 50-60 cents per GB range from multiple manufacturers. Caching tier devices can be found for under $1 per GB.  These prices have fallen from $2.5 a GB when VSAN was released in 2014. I expect this downward price trend to continue.

Myth #3 “I could save money with another vendor who will support using cheaper consumer grade flash. They said it would be safe”.

Reality: Consumer grade drives lack capacitors to protect both upper and lower pages.  In order to protect lower cost NAND these drives use volatile DRAM buffers to hold and coalesce writes. Low end consumer grade drives will ignore flush after write commands coming from the operating system, and on power loss can simply loose the data in this buffer.  Other things that can happen is meta data corruption (loss of the lookup table resulting in large portions of the drive becoming unavailable) shorn writes (where writes do not align properly with their boundary and loose data as well as improperly return it on read) and non-serialized writes that could potentially file system or application level recovery journals.  Ohio State and HP Labs put together a great paper on all the things that can (and will) go wrong here. SSD’s have improved since this paper, and others have done similar tests of drives with and without proper power loss protection. The findings point to enterprise class drives with power loss protection being valuable.

Myth #4 “Those consumer grade drives are just as fast!”

Reality: IO latency consistency is less reliable on writes and garbage collection takes significantly more time as there is less spare capacity to manage it.  Flash is great when its fast, but when its not consistent applications can miss SLA’s. If using consumer grade flash in a VSAN home lab, make sure you disable the high latency drive detection. In our labs under heavy sustained load we’ve seen some fairly terrible performance out of consumer flash devices.

In conclusion, there are times and places for cheap low end consumer grade flash (like in my notebook or home lab) but for production use where persistent data matters it should be avoided.

Upcoming Live/Web events…

Spiceworks  Dec 1st @ 1PM Central- “Is blade architecture dead” a panel discussion on why HCI is replacing legacy blade designs, and talk about use cases for VMware VSAN.

Micron Dec 3rd @ 2PM Central – “Go All Flash or go home”   We will discuss what is new with all flash VSAN, what fast new things Micron’s performance lab is up to, and an amazing discussion/QA with Micron’s team. Specifically this should be a great discussion about why 10K and 15K RPM drives are no longer going to make sense going forward.

Intel Dec 16th @ 12PM Central – This is looking to be a great discussion around why Intel architecture (Network, Storage, Compute) is powerful for getting the most out of VMware Virtual SAN.

Veeam On (Part 2)

It has been a great week.  VeeamOn has been a great conference and its clear Rick and the team really wanted to have a different spin on the IT conference.  While there are some impressive grand gestures, speakers and sessions (I Seriously feel like I’m in a spaceship right now) its the little details that stand out too.

My favorite little things so far.

  • General Session does not start at 8AM.
  • Breakfast runs till 9AM not 7:45AM like certain other popular conferences. (Late night networking and no food going into a General Session is not a good idea!).
  • Day 2 keynotes are by the partners, giving me a mini-taste of VMware, HP, Cisco, Netapp’s and Microsoft’s “Vision” for the new world of IT.
  • A really interesting mix of attendees.  I’ll go from having a conversation with 10TB to someone else with 7PB’s and some of the challenges we are discussing will be the same.
  • LabWarz  is easy to wander in and tests you skills in a far more “fun” and meaningful way than trying to dive a certification test between sessions.
  • The vendor expo isn’t an endless maze of companies but are companies (and specifically only the products within them) that are relevant to Veeam users.

Time to check the log…

You can see from the year 5 rings that there was great budget, and much storage was added!

Any time you open a ticket with VMware (or any vendor) the first thing they generally want you to do is pull the logs and send them over.  They then use their great powers (of grep) to try to find the warning signs, or results of a known issue (or new one!).  This whole process can take quite some time, and frustratingly some issues roll out of logs quickly, are buried in 10^14 of noise, or can only be found with an environment that is down and has not been rebooted.  I recently had a conference call with a vendor where they instructed a customer that we would have to wait for one (or more!) complete crashes to their storage array before they would be able to get the logs to possibly find a solution.

This is where LogInsight comes to the rescue.  With real time indexing, graphs that do not require you learn ruby to make, and machine learning to auto group similar messages you can find out why your data center has crashed in 15 minutes instead of 15 days.

Recently while deploying a POC I had a customer who complained of intermittent performance issues on a VDI cluster they couldn’t quite pin down.  Internal teams were arguing (Storage blamed network, network blamed AD, Windows/AD blamed the VMware admin).  A quick search for “error*,crit*,warn*” across all infastruture on the farm (Firewall/Switch/Fabric/DiskArray/Blades/<infinate number of locations View hides logs> returned thousands of unrelated errors for internal certificates not being signed and other non-interesting events.   LogInsight’s auto grouping allowed for quick filtering of the noise to uncover the smoking gun. A Fibre Channel connection inside of a blade chassis was flapping (from a poorly seated HBA).  IT was not bad enough to trigger port warnings on the switches, or an all paths down error, but it was enough to impact user experience randomly.  This issue was a ghost that had been plaguing them for two weeks at this point.  LogInsight found it in under 15 minutes of searching.  It was great to have clear evidence so we could end internal arguing as well as hold the vendor accountable so they couldn’t deflect blame to VMware or another product.

I’d encourage everyone to download a free trial and post back in the comments what obscure errors or ghosts in the machine you end up finding.