Posts from the ‘Uncategorized’ Category

VMware vSAN, Cisco UCS and Cisco ACI information

I’ve had a few questions regarding VMware vSAN with Cisco ACI.

While the guidance for ACI is mostly the same, there are a few vendor-specific considerations. During internal testing we found some recommended configuration advice and specific concerns for the multicast querier. For more information, see this new storage hub section of the networking guide.

If you’re looking for general vSAN networking advice, be sure to read the networking guide.

If you’re looking for Cisco’s documentation regarding UCS servers and VMware vSAN, it can be found here.

If you’re looking for guidance on configuring Cisco controllers and HBAs, Peter Keilty has some great blogs on this topic. As a reminder, while I would strongly prefer the Cisco HBA over the RAID controller, if you use the RAID controller you will need the cache module to have proper queue depths.


Looking for VMware Storage Content?

Looking for demos, videos, design and sizing guides, VVOLs, SRM, VSAN?

Go check out

Is that supported by VMware? (A breakdown of common misconceptions)

This reddit thread about someone stuck with issues in a non-supported configuration made me think it’s time to explain the supported, partner supported, and not supported situations you should be aware of. This is not intended to be some giant pile of FUD that says “Do what John says or beware your doom!”. I wanted to highlight partners who are doing a great job of working within the ecosystem, as well as point out some potential gaps that I see customers not always aware of.

I get a lot of questions about storage, and what is supported. At VMware we have quite a few TAP partners and thousands of products that we happily jointly support. These partners are in our TAP program and have submitted their solutions for certification with tested results that show they can perform, and we have agreements to work together to a common outcome (your performance, and your availability).

There are some companies who do not certify their solutions but have “partner verified” solutions. These solutions may have been verified by the partner, but generally involve the statement of “please call your partner for support”. While VMware will support other aspects of the environment (we will accept a ticket to discuss a problem with NTP that is unrelated to the storage system), you are looking at best effort support, at most, on these solutions. Other partners may have signed up for TAP, but do not actually have any solution statement with us. To be clear, being in TAP alone does not mean a solution is jointly supported or verified.


VVOLs is an EXCELLENT product that extends storage policy based management to allow seamless management. Quite a few platforms support this today. If you’re in a storage refresh, you should STRONGLY consider checking that your partner supports VVOLs, and you can check by following this link.

Any storage company that is looking at supporting VMware deployments at scale is looking at VVOLs. Management of LUNs and arrays as you grow becomes cumbersome and introduces opportunity for error. You should ask your VMware storage provider where they are on supporting VVOLs, and what their roadmap is. You can also check the HCL to see if your storage vendor is supporting VVOLs by checking here.


VAAI is a great technology that allows LUN and NFS based systems to mitigate some of the performance and capability challenges. VCAI is a smaller subset that allows NFS based systems to accelerate linked clone offload. Within NFS a smaller subset have been certified for large scale (2000 clones or more) operations. These are great solutions. I bring this up because it has come to my attention that some partners advertise support of these features but have not completed testing. This generally boils down to 1 of 4 situations.


  1. They have their submission pending and will have this fixed within weeks.
  2. Their solution fails to pass our requirements of performance or availability during testing.
  3. They are a very small startup and are taking the risk of not spending the time and money to complete the testing.
  4. They are not focused on the VMware market and are more concerned with other platforms.

Please check with your storage provider and make sure that their CURRENT version is certified if you are going to enable and use VAAI. You do not want to be surprised by a corruption, or performance issue and discover from a support call that you are in a non-supported configuration.  In some cases some partners have not certified newer platforms so be aware of this as you upgrade your storage. Also there are quite a lot of variations of VAAI (Some may support ATS but not UNMAP) so look at the devil in the details before you adopt a platform with VAAI.

Replication and Caching

Replication is a feature that many customers want to use (either for use with SRM, or as part of their own DR orchestration). We have a LOT of partners, and we have our own option and two major APIs for supporting this today.

One is VADP (our traditional API associated with backups). Partners like Symantec, Commvault, and Veeam leverage this to provide backup and replication at scale for your environment. While it does use snapshots, I will note that in 6.0 improvements were made (no more helper snapshots!) and the alternative snapshot system in VVOLs and VSAN provides much needed performance improvements.

The other API is VAIO, which allows for direct access to the IO path without the need for snapshots. StorageCraft, EMC and Veritas are leading the pack with adoption for replication here, with more to follow. This API also provides access for caching solutions from SanDisk, Infinio and Samsung.

Lastly we have vSphere replication. It works with compression in 6.x, it doesn’t use snapshots unless you need guest processing, and it also integrates nicely with SRM. It’s not going to solve all problems (or else we wouldn’t have an ecosystem) but it’s pretty broad.

Some replication and caching vendors have chosen to use private, non-supported APIs (that in some cases have been marked for deprecation as they introduce stability and potential security issues). Our support stance in this case again falls under partner supported at best. While VMware is not going to invalidate your support agreement, GSS may ask you to uninstall your unsupported 3rd party solution to troubleshoot a problem.

OEM support

This sounds straightforward, but it isn’t always. If someone is selling you something turnkey that includes vSphere pre-installed, they are in one of our OEM programs. You may know some examples (Cisco/HP/Dell/SuperMicro/Fujitsu/HDS), but there are also smaller embedded OEMs who produce turnkey solutions that the customer might not even be aware are running ESXi (think industrial controls, surveillance and other black box type industry appliances that might be powered by vSphere if you look closely enough). OEM partners get the privilege of doing pre-installs and, in some cases, the ability to bundle Tier 1 and Tier 2 support. Anyone not in this program can’t provide integrated, seamless Tier 1/2 support, and any tickets they open will have to start over rather than escalate directly to Tier 3/engineering resources, potentially slowing down your support experience and again requiring that multiple tickets be opened with multiple vendors.

Lastly, I wanted to talk about protocols.

VMware supports a LOT of industry standard ways today for accessing storage.  Fibre Channel, Fibre Channel over Ethernet, iSCSI, NFS, Infiniband, SAS, SATA, NVMe as well as our protocol for VMware VSAN. I’m sure more will be supported at some point (vague non-forward looking statement!).

That said, there have been some failed standards that were never supported (ATA over Ethernet, which was pushed by Coraid, as an example) as they failed to gain widespread support.

There have also been other proprietary protocols (EMC’s ScaleIO) that again fall under the Partner Verified and Supported space, and are not directly supported by VMware support or engineering. If you’re deploying ScaleIO and want VMware support for the solution, you would want to look at the older 1.31 release, which had supported iSCSI protocol support for the older ESXi 5.5 release, or check with EMC and see if they have released an updated iSCSI certification. The idea here again isn’t that any ticket opened on an SSO problem will be ignored, just that any support of this solution may involve multiple tickets, and you would likely not start with VMware support if it is a storage related problem.

Now the question comes up from all of this.

Why would I look at deploying something that is not supported by VMware Support and Engineering?

  1. You don’t have an SLA. If you have an end to end SLA you need something with end to end support (end of story). If this is a test/dev or lab environment, or one where you have temporary workloads, this could work.
  2. You are willing to work around to a supported configuration. In the case of ScaleIO, deploy ESXi 5.5 instead, and roll back to the older version to get iSCSI support. In that case, be aware that you may limit yourself on taking advantage of newer feature releases, and be aware of when support for the older product versions will sunset, as this may shorten the lifecycle of the solution.
  3. You have faith the partner can work around future changes and can accept the slower cadence. Note, unless that company is public there are few consequences for them making forward looking statements of support and failing to deliver on them. This is why VMware has to have a ridiculous amount of legal bumpers on our VMworld presentations…
  4. You are willing to accept being stuck with older releases, and their limitations and known issues. Partners who are in VAIO/VVOLs have advanced roadmap access (and in many cases help shape the roadmap). Partners using non-supported solutions and private APIs are often stuck with 6-9 months of reverse engineering to try to find out what changed between releases, as there is no documentation available for how these APIs were changed (or how to work around their removal).
  5. You are willing to be the integrator of the solution. Opening multiple tickets and driving a resolution is something your company enjoys doing. The idea of becoming your own converged infrastructure ISV doesn’t bother you. In this case I would look into signing up to become an OEM embedded partner if this is what you view as the value proposition that you bring to the table.
  6. You want to live dangerously. You’re a traveling vagabond who has danger for a middle name. Datacenter outages, or 500ms of disk latency, don’t scare you, and your users have no power to usurp your rule and cast you out.


Dispelling myths about VSAN and flash.

I’ve been having the same conversation with several customers lately that is concerning.

Myth #1 “VSAN must use flash devices from a small certified list”

Reality: There are over 600 different flash devices that have been certified (and this list is growing).

Myth #2 “The VSAN certified flash devices are expensive!”

Reality: Capacity tier flash devices can be found in the 50-60 cents per GB range from multiple manufacturers. Caching tier devices can be found for under $1 per GB. These prices have fallen from $2.5 a GB when VSAN was released in 2014. I expect this downward price trend to continue.

Myth #3 “I could save money with another vendor who will support using cheaper consumer grade flash. They said it would be safe”.

Reality: Consumer grade drives lack capacitors to protect both upper and lower pages. In order to protect lower cost NAND these drives use volatile DRAM buffers to hold and coalesce writes. Low end consumer grade drives will ignore flush after write commands coming from the operating system, and on power loss can simply lose the data in this buffer. Other things that can happen are metadata corruption (loss of the lookup table resulting in large portions of the drive becoming unavailable), shorn writes (where writes do not align properly with their boundary and lose data as well as improperly return it on read) and non-serialized writes that could potentially corrupt file system or application level recovery journals. Ohio State and HP Labs put together a great paper on all the things that can (and will) go wrong here. SSDs have improved since this paper, and others have done similar tests of drives with and without proper power loss protection. The findings point to enterprise class drives with power loss protection being valuable.

Myth #4 “Those consumer grade drives are just as fast!”

Reality: IO latency is less consistent on writes, and garbage collection takes significantly more time as there is less spare capacity to manage it. Flash is great when it’s fast, but when it’s not consistent applications can miss SLAs. If using consumer grade flash in a VSAN home lab, make sure you disable the high latency drive detection. In our labs under heavy sustained load we’ve seen some fairly terrible performance out of consumer flash devices.
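
To make the consistency point concrete, here is a minimal Python sketch comparing average latency with 99th percentile latency. The sample values are invented for illustration and are not measurements from any real drive, and the function name latency_summary is hypothetical.

```python
import statistics


def latency_summary(samples_ms):
    """Return mean, median and 99th percentile latency for a list of samples."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile, computed with integer math: ceil(0.99 * n).
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


# Hypothetical write latencies in milliseconds (made up for this example).
steady = [1.0] * 100                  # a drive that is slower on average but consistent
bursty = [0.5] * 98 + [30.0, 40.0]    # a drive that is usually fast but stalls now and then

steady_stats = latency_summary(steady)
bursty_stats = latency_summary(bursty)
print("steady:", steady_stats)
print("bursty:", bursty_stats)
```

In this made-up data the bursty series has a better median and a mean close to the steady one, but its p99 is 30 ms. An average alone would hide exactly the stalls that cause missed SLAs.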

In conclusion, there are times and places for cheap low end consumer grade flash (like in my notebook or home lab) but for production use where persistent data matters it should be avoided.

Upcoming Live/Web events…

Spiceworks  Dec 1st @ 1PM Central- “Is blade architecture dead” a panel discussion on why HCI is replacing legacy blade designs, and talk about use cases for VMware VSAN.

Micron Dec 3rd @ 2PM Central – “Go All Flash or go home”   We will discuss what is new with all flash VSAN, what fast new things Micron’s performance lab is up to, and an amazing discussion/QA with Micron’s team. Specifically this should be a great discussion about why 10K and 15K RPM drives are no longer going to make sense going forward.

Intel Dec 16th @ 12PM Central – This is looking to be a great discussion around why Intel architecture (Network, Storage, Compute) is powerful for getting the most out of VMware Virtual SAN.

Veeam On (Part 2)

It has been a great week. VeeamOn has been a great conference, and it’s clear Rick and the team really wanted to have a different spin on the IT conference. While there are some impressive grand gestures, speakers and sessions (I seriously feel like I’m in a spaceship right now), it’s the little details that stand out too.

My favorite little things so far.

  • General Session does not start at 8AM.
  • Breakfast runs till 9AM not 7:45AM like certain other popular conferences. (Late night networking and no food going into a General Session is not a good idea!).
  • Day 2 keynotes are by the partners, giving me a mini-taste of VMware, HP, Cisco, NetApp and Microsoft’s “Vision” for the new world of IT.
  • A really interesting mix of attendees. I’ll go from having a conversation with someone with 10TB to someone else with 7PB, and some of the challenges we are discussing will be the same.
  • LabWarz is easy to wander into and tests your skills in a far more “fun” and meaningful way than trying to dive into a certification test between sessions.
  • The vendor expo isn’t an endless maze of companies, but a set of companies (and specifically only the products within them) that are relevant to Veeam users.

Time to check the log…

You can see from the year 5 rings that there was great budget, and much storage was added!

Any time you open a ticket with VMware (or any vendor) the first thing they generally want you to do is pull the logs and send them over.  They then use their great powers (of grep) to try to find the warning signs, or results of a known issue (or new one!).  This whole process can take quite some time, and frustratingly some issues roll out of logs quickly, are buried in 10^14 of noise, or can only be found with an environment that is down and has not been rebooted.  I recently had a conference call with a vendor where they instructed a customer that we would have to wait for one (or more!) complete crashes to their storage array before they would be able to get the logs to possibly find a solution.

This is where LogInsight comes to the rescue. With real time indexing, graphs that do not require you to learn Ruby to make, and machine learning to auto group similar messages, you can find out why your data center has crashed in 15 minutes instead of 15 days.

Recently while deploying a POC I had a customer who complained of intermittent performance issues on a VDI cluster they couldn’t quite pin down. Internal teams were arguing (Storage blamed network, network blamed AD, Windows/AD blamed the VMware admin). A quick search for “error*,crit*,warn*” across all infrastructure on the farm (Firewall/Switch/Fabric/DiskArray/Blades/<infinite number of locations View hides logs>) returned thousands of unrelated errors for internal certificates not being signed and other non-interesting events. LogInsight’s auto grouping allowed for quick filtering of the noise to uncover the smoking gun. A Fibre Channel connection inside of a blade chassis was flapping (from a poorly seated HBA). It was not bad enough to trigger port warnings on the switches, or an all paths down error, but it was enough to impact user experience randomly. This issue was a ghost that had been plaguing them for two weeks at this point. LogInsight found it in under 15 minutes of searching. It was great to have clear evidence so we could end internal arguing as well as hold the vendor accountable so they couldn’t deflect blame to VMware or another product.
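
The idea behind grouping similar messages can be sketched in a few lines of Python. This is not how LogInsight works internally (it uses machine learning); it is a much simpler stand-in that replaces numbers with a placeholder so near-identical lines collapse into one group. The sample log lines, the regular expressions and the function names are all made up for illustration.

```python
import re
from collections import Counter

# Keep only lines that mention an error, critical or warning keyword.
SEVERITY = re.compile(r"\b(error|crit|warn)\w*", re.IGNORECASE)
# Treat hex values and plain/dotted/colon-separated numbers as "variable" fields.
VARIABLE = re.compile(r"\b(?:0x[0-9a-f]+|\d+(?:[.:]\d+)*)\b", re.IGNORECASE)


def signature(line):
    """Collapse variable fields so similar messages share one signature."""
    return VARIABLE.sub("<n>", line).strip().lower()


def group_messages(lines):
    """Return (signature, count) pairs, most frequent first."""
    counts = Counter(signature(line) for line in lines if SEVERITY.search(line))
    return counts.most_common()


# Hypothetical log lines, invented for this example.
sample = [
    "warning: certificate 1021 is not signed",
    "warning: certificate 1022 is not signed",
    "warning: certificate 1187 is not signed",
    "error: fc port 3 link down",
    "info: user 44 logged in",
]

groups = group_messages(sample)
for text, count in groups:
    print(count, text)
```

The noisy certificate warnings collapse into a single group at the top, and the rare message at the bottom of the list is where the interesting problem tends to hide.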

I’d encourage everyone to download a free trial and post back in the comments what obscure errors or ghosts in the machine you end up finding.

LSI Firmware VSAN

I’ve been talking to LSI over the past couple months in relation to VSAN and have a couple updates on issues and thoughts.


1. LSI support does not support their driver if it is purchased through an OEM.  They will not accept calls from VMware regarding this driver in this case either.  If you want LSI to support the VMware driver stack, you must buy direct from them.

2. LSI branded MegaRAID cards do not support JBOD (I understand that it is on the roadmap).  Dell and others are offering alternative firmwares that allow this, but they have no comment or support statement on this.

3. MegaRAID CLI can be used with RAID 0 to manage cards (I’ll release a guide if there is interest), and performance is comparable and on supported systems is very stable. Don’t rule it out, and with all the back and forth on support for JBOD it strangely might be the safer option until I get full testing reports from the Perc730 next week.

4. The Dell Perc730 has JBOD support now.  Despite being a MegaRAID I’m hearing good things in the field so far (I’ll update if I hear otherwise).

5. LSI prefers dealing with hardware vendors, and largely being a back end chipset manufacturer. A stronger relationship with VMware is needed (especially with PCI-Express networking on the horizon).

6. HP is switching to Adaptec for controllers.  Hopefully this should bring their JBODs onto the VSAN HCL and allow for supplier diversity.

7. I’ve heard statements from Dell that VMware is intensifying the testing procedures for VSAN.  It looks like this will catch H310/2208 type issues first.

8. Ignore the SM2208 on the HCL for pass through.  Neither VMware nor LSI will support it.

How to buy IT? (Part 1).

On twitter John Troyer asked “Are there books/classes on “How to purchase enterprise tech” or is it mostly analysts & tribal knowledge? Not a regular supply chain thing.”

There were some quick responses, with the general consensus that this was something learned by trial by fire, and a universal dread if a procurement department had any power in the decision.

Let’s outline some of the problems.

1. IT buying requires you actually understand what you’re purchasing. A storage array and fibre channel network isn’t just “Tubes, raw terabytes, and dollar signs” (despite one IT Director telling me this as he wrestled with whether he should buy a NetApp or an HDS, giving absolutely zero concern for performance, support agreements, or usable capacity).

2. IT purchasing isn’t always fully in IT’s control. Procurement will bid out that server, and end up with an H310 RAID controller that isn’t on the VSAN HCL, or will drop iLO, meaning setup will take 10x as long and fail to deliver. Procurement departments are often judged on money saved, and not on how badly they screw up time to deliver. Then again, IT staff often will declare ridiculous things ($500K in Cisco switching to power 3 Cisco UCS blades) or make terrible decisions. But should procurement really be the backstop on keeping them from drunkenly buying toys they don’t need?

3. The vendor/channel system means the only people with the information to properly consult on a product’s ability to deliver are, most of the time, the same people selling it to you. I saw a consulting firm do a 3 vendor shootout/needs assessment for backup for a customer. 2 out of 3 vendors refused to return the analyst’s phone calls. Do we really want to do business with opaque vendors, though, even if they are the 50 billion pound gorilla of the industry?

4. The amount of time that is often invested in finding the right solution can massively dwarf the amount of time required to actually implement. I once worked a storage deal that took 2 years. Servers were crashing, data was lost, and deck chairs were being shuffled, but the IT director was paralyzed by the decision. If a customer tells me they are about to buy storage, I warn them that they are inviting a barking carnival of vendors and nothing productive will be done until the product hits their loading dock. Get ready for bribe offers, teams of a dozen people showing up without being scheduled, and an amount of FUD that feels like an invading horde of barbarians. Once they smell blood in the water, every VAR within 500 miles will be at your door. In the end, is a quick decision, or a slow and methodical one, any better than throwing a dart at a Magic Quadrant? (Incidentally this is my theory of how some of the storage ones are scored).

Going over these 4 quickly.

1. This one is hard. If you don’t have SMEs in house, don’t just put yourself at the mercy of your vendors. DO NOT TRUST YOUR VARs. They will lean on whatever has more margin this quarter. Realize that your SMEs internally may have agendas (if the whole internal team is Cisco certified, and that is their value, then don’t take their recommendation for a Nexus 7K at face value). Pay someone to review this who is not going to be selling you the box. Pick several vendors to review, come up with a scoring system of needs and risks, and have a 3rd party arbitrate the scoring.

2. Procurement is the wrong department to prevent waste. Realize that saving 10% and having 1/4 of solutions ship to you incomplete isn’t “winning”. Start with making sure the solution will work and THEN look at cost control. Often cost control is what weak non-technical decision makers fall back on (they are afraid the solution will not work, and want to limit the damage). Push hard to understand (or make someone make you understand) the decision at hand. Don’t try to cost control a project you don’t think is going to work. Delay the decision until you understand, then pick decisively and correctly. If you don’t trust what your subordinates or consultants are recommending (because it’s often not working), don’t slash their budgets; replace them with people who can deliver on what they ask for.

3. Make sure you network and benchmark with others in your industry. From time to time it’s best to break from the herd if technology is a differentiation point, but limit this to where there is a really compelling value. Don’t pick a storage vendor with an experimental protocol when you’re a 5 billion dollar company with only 10TB of capacity needs for Tier 1. It’s not worth the savings. Conversely, recognize when your challenges are unique. If you have 50 field offices and your competitors have 5, it may be time to consider VDI despite no one else doing it.

4. Ask difficult questions, know your criteria, and know why the last solution worked or didn’t work if this is a migration. The vendors will try to tie you up in quicksand and keep themselves in play as long as possible. Strike hard and fast. If they can’t respond quickly to your needs then they don’t understand how to qualify them.
“I’m sorry, we are only accepting bids from vendors with 4 Hour onsite non-contractor support”
Know who’s not a fit before they call so you don’t waste time with non-starters.
“We need storage that can provide 10TB and 10,000 IOPS, with a 95% data skew of 8%, at an average block size of 16KB with a 75/25% read/write mix and a compressibility of less than 10%.”
Know your requirements for purchase down to the most granular bit if you don’t want to play 20 quotes.
“Our GPFS implementation will give you 20K IOPS with 4 SATA drives and no flash”.
“Average Dedupe is 500% so you’ll only need to buy 2TB of usable”
“FCoE will be [Insert anything useful]”.

Magic Server/Storage/Network pixie dust isn’t real. Watch out for ridiculous absolute statements.
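
As a rough sanity check on numbers like the ones quoted above, the arithmetic can be sketched in Python. The function names are hypothetical, and two assumptions are baked in: “16KB” is treated as 16 KiB, and “500% dedupe” is read as a 5:1 data reduction ratio.

```python
def workload_profile(iops, block_kib, read_fraction):
    """Split a total IOPS figure into reads and writes and derive throughput."""
    read_iops = iops * read_fraction
    write_iops = iops - read_iops
    throughput_mib_s = iops * block_kib / 1024  # KiB/s converted to MiB/s
    return {
        "read_iops": read_iops,
        "write_iops": write_iops,
        "throughput_mib_s": throughput_mib_s,
    }


def usable_needed_tb(logical_tb, reduction_ratio):
    """Usable capacity to buy for a given logical size and reduction ratio."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return logical_tb / reduction_ratio


# The requirement from the text: 10,000 IOPS, 16KB blocks, 75/25 read/write.
profile = workload_profile(iops=10_000, block_kib=16, read_fraction=0.75)

# The vendor's claim (assumed to mean 5:1) versus data that does not reduce at all.
claimed_tb = usable_needed_tb(10, 5.0)
worst_case_tb = usable_needed_tb(10, 1.0)

print(profile)
print("claimed:", claimed_tb, "TB; no reduction:", worst_case_tb, "TB")
```

If your own data barely compresses, the gap between 2TB and 10TB is the gap between the quote and what you will actually have to buy.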

Why VDI?

I was reading Justin Paul’s Justifying the Cost of Virtual Desktops: Take 2 and had some thoughts on where he sees the cost model of VDI. I know Brian Madden has talked at great length about all the false cost models for VDI that exist (and I’ve seen it in the field).

1. I agree with Justin on power, with some narrow changes. Unless it’s a massive deployment, another 4 hosts in the data center isn’t going to break the bank. Unless you’re forcing people to use thin clients, you’re also not saving anything real on the client side (and certain things (Lync, MMR etc) require Windows Embedded clients at a minimum anyways). The only case where I’ve successfully made this argument was a call center that was 24/7 and handled disaster operations in Houston. After Ike everyone learned how hard it is to find fuel, so anything that reduces the generator and battery backup budget actually has real implications.

2. Justin does make good points about SA and keeping up with the Windows OS releases on physical machines is just as expensive as VDA. Sadly this is only true if companies are not just standardizing on Windows 7 and running it into the ground for the next 5-7 years. Hey it worked for XP right?

3. While I agree a ticket system helps track time spent restoring machines etc, no one makes non-billable IT resources track time with the level of detail and meta tags/search needed to make building an in house ROI model possible. The best luck I’ve had is having people do a week-long survey broken down into 15 minute intervals, which is as close as you’ll get in house IT to go. It’s painful to get even that done. Unless your desktop support is outsourced (and you have access to their reports!), this is always going to be, sadly, a fuzzy, poorly tracked cost. I’d argue VMware Mirage (or an equally good application streaming/imaging system) can provide a lot of the opex benefits without the consolidation and other pro/cons of VDI. VDI extends beyond imaging and breakfix. It’s about mobility, security, and

4. People work from home today with VPN and Shadow IT (LogMeIn etc). The ability to do this isn’t what you sell, it’s the execution and polish (give a sales person a well maintained PCoIP desktop and they will grab their iPad and never come back to the office). It’s the little things (like ThinPrint letting them print to their home PC). Ultimately it isn’t the “occasional” or snow day remote users that sell VDI. It’s the road warriors and branch offices (who are practically the same thing, with as little attention as they typically get from central IT).