
VSAN is now up to 30% cheaper!

Ok, I’ll admit this is an incredibly misleading clickbait title. I wanted to demonstrate how the economics of cheaper flash make VMware Virtual SAN (and really any SDS product that is not licensed by capacity) cheaper over time. I also wanted to share a story of how older, slower flash became more expensive.

Let’s talk about a tale of two cities that had storage problems and faced radically different cost economics. One was a large city with lots of purchasing power and size, and the other was a small bedroom community. Who do you think got the better deal on flash?

Just a small town data center….

A 100 user pilot VDI project was kicking off. They knew they wanted great storage performance, but they could not invest in a big storage array with a lot of flash up front. They did not want to have to pay more tomorrow for flash, and wanted great management and integration. VSAN and Horizon View were quickly chosen. They used the per concurrent user licensing for VSAN so their costs would scale cleanly and predictably. Modern, fast enterprise flash was chosen that cost ~$2.50 per GB and had great performance. This summer they went to expand the wildly successful project, and discovered that the new version of the drives they had purchased last year now cost $1.40 per GB, and that other new drives on the HCL from the same vendor were available for ~$1 per GB. Looking at other vendors they found even lower cost options available. They upgraded to the latest version of VSAN and found improved snapshot performance, write performance and management. Procurement could be done cost effectively at small scale, and small projects could be added without much risk. They could even adopt the newest generation (NVMe) without having to forklift controllers or pay anyone but the hardware vendor.

Meanwhile in the big city…..

The second city was quite a bit larger. After a year-long procurement process and dozens of meetings they chose a traditional storage array/blade system from a Tier 1 vendor. They spent millions and bought years’ worth of capacity to leverage the deepest purchasing discounts they could. A year after deployment, they experienced performance issues and wanted to add flash. Upon discussing this with the vendor, they learned the only option was older, slower, small SLC drives. They had bought their array at the end of its sale window and were stuck with technology two generations old. It was also discovered the array would only support a very small number of them (the controllers and code were not designed to handle flash). The vendor politely explained that since this was not a part of the original purchase, the 75% discount off list that had applied to the original purchase would not apply, and they would need to pay $30 per GB. Somehow older, slower flash had become 4x more expensive in the span of a year. They were told they should have “locked in savings” and bought the flash up front. In reality though, they would have been locking in a high price for a commodity that they did not yet need. The final problem they faced was an order to move out of the data center into 2-3 smaller facilities and split up the hardware accordingly. That big storage array could not easily be cut into parts.
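
The pricing arithmetic in these two stories is easy to check. The sketch below is illustrative only: the function names are made up for this example, and it assumes the $30 per GB figure is undiscounted list price, which the story implies but does not state outright.

```python
def percent_drop(old_price, new_price):
    """Percentage by which a price fell from old_price to new_price."""
    return (old_price - new_price) / old_price * 100


def price_after_discount(list_price, discount):
    """Price paid when a fractional discount (0.75 = 75% off) applies."""
    return list_price * (1 - discount)


def lost_discount_multiplier(discount):
    """How many times more you pay at list price than at the discounted price."""
    return 1 / (1 - discount)


# Small town: ~$2.50/GB last year, $1.40/GB or ~$1.00/GB this year.
print(percent_drop(2.50, 1.40))
print(percent_drop(2.50, 1.00))

# Big city: $30/GB with the 75% discount gone (assumed to be list price).
print(price_after_discount(30.0, 0.75))
print(lost_discount_multiplier(0.75))
```

Losing a 75% discount is the same thing as paying four times the discounted price, which lines up with the "4x more expensive" figure in the story.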

There are a few lessons to take away from these environments.

  1. Storage should become cheaper to purchase as time goes on. Discounts should be consistent and pricing should not feel like a game show. Software licensing should not be directly tied to capacity or physical hardware and should “live” through a refresh.
  2. Adding new generations of flash and compute should not require disruption and “throwing away” your existing investment.
  3. Storage products that scale down and up without compromise lead to fewer meetings, lower costs, and better outcomes. Large purchases often lead to the trap of spending a lot of time and money on avoiding failure, rather than focusing on delivering excellence.

Veeam On (Part 2)

It has been a great week. VeeamOn has been a great conference and it’s clear Rick and the team really wanted to have a different spin on the IT conference. While there are some impressive grand gestures, speakers and sessions (I seriously feel like I’m in a spaceship right now), it’s the little details that stand out too.

My favorite little things so far.

  • General Session does not start at 8AM.
  • Breakfast runs till 9AM not 7:45AM like certain other popular conferences. (Late night networking and no food going into a General Session is not a good idea!).
  • Day 2 keynotes are by the partners, giving me a mini-taste of VMware’s, HP’s, Cisco’s, NetApp’s and Microsoft’s “Vision” for the new world of IT.
  • A really interesting mix of attendees. I’ll go from having a conversation with someone managing 10TB to someone else with 7PB, and some of the challenges we are discussing will be the same.
  • LabWarz is easy to wander into and tests your skills in a far more “fun” and meaningful way than trying to dive into a certification test between sessions.
  • The vendor expo isn’t an endless maze of companies but a set of companies (and specifically only the products within them) that are relevant to Veeam users.

Veeam On! (Part 1)


I’m in Vegas rounding out the conference tour (VMworld, SpiceWorld, VMworld, DellWorld) for what looks to be a strong finish. This is my first time at VeeamOn and I’m looking forward to briefings across the full Veeam portfolio. I’m looking forward to being shamed by the experts in LabWarz and getting my hands dirty with v9.

More importantly I’m looking forward to some great conversations. The reason why I value going to conferences goes beyond great sessions and discussions with vendors at the solutions expo. The conversations with end users (small, large and giant) help you learn where the limits are (and how to push past them) in the tools you rely on. I’ve had short conversations over breakfast that saved me six months of expensive trial and error that others had been through. A good conference will attract both small and massive scale customers and bring together great conversations that will help everyone change their perspective and get things done.

All good things…

I started my IT career as a customer. It was great having complete ownership of the environment but eventually I wanted more. I moved to the partner side and the past five years have been amazing. I have worked with more environments than I can count. It exposed me to diverse technical and operational challenges. It gave me the opportunity to see first hand, past the marketing, what worked and what did not work. I would like to thank everyone (customers, co-workers) I was able to work with directly who helped me reach this point in my career. I also want to thank the people who freely share with the greater community. Their blogs, their words of caution, their advice, and their presentations at conferences all contributed to helping me succeed. I will miss the amazing team at Synchronet but it was time for change.

Starting today, I will be in a new role at VMware in Technical Marketing for VMware VSAN. I am excited for this change, and look forward to the challenges ahead. In this position I hope to learn and give back to the greater community that has helped me reach this point. I will still blog various musings here, but look for VSAN and storage content at Virtual Blocks.

I look forward to the road ahead!

 

Time to check the log…

You can see from the year 5 rings that there was great budget, and much storage was added!

Any time you open a ticket with VMware (or any vendor) the first thing they generally want you to do is pull the logs and send them over. They then use their great powers (of grep) to try to find the warning signs, or results of a known issue (or new one!). This whole process can take quite some time, and frustratingly some issues roll out of logs quickly, are buried in 10^14 lines of noise, or can only be found with an environment that is down and has not been rebooted. I recently had a conference call with a vendor where they told a customer that we would have to wait for one (or more!) complete crashes of their storage array before they would be able to get the logs to possibly find a solution.

This is where LogInsight comes to the rescue. With real time indexing, graphs that do not require you to learn Ruby to make, and machine learning to auto-group similar messages, you can find out why your data center has crashed in 15 minutes instead of 15 days.

Recently while deploying a POC I had a customer who complained of intermittent performance issues on a VDI cluster they couldn’t quite pin down. Internal teams were arguing (Storage blamed network, network blamed AD, Windows/AD blamed the VMware admin). A quick search for “error*,crit*,warn*” across all infrastructure on the farm (Firewall/Switch/Fabric/DiskArray/Blades/<infinite number of locations View hides logs>) returned thousands of unrelated errors for internal certificates not being signed and other non-interesting events. LogInsight’s auto grouping allowed for quick filtering of the noise to uncover the smoking gun. A Fibre Channel connection inside of a blade chassis was flapping (from a poorly seated HBA). It was not bad enough to trigger port warnings on the switches, or an all paths down error, but it was enough to impact user experience randomly. This issue was a ghost that had been plaguing them for two weeks at this point. LogInsight found it in under 15 minutes of searching. It was great to have clear evidence so we could end the internal arguing as well as hold the vendor accountable so they couldn’t deflect blame to VMware or another product.
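
To give a feel for why grouping similar messages cuts through the noise, here is a minimal Python sketch of the idea. It is not how LogInsight works internally (that is not described here); it simply keeps lines matching the error*/crit*/warn* search, masks the variable parts (numbers, IP addresses, hex values), and counts what is left. The sample log lines and function names are invented for illustration.

```python
import re
from collections import Counter

# Same spirit as the "error*,crit*,warn*" search: keyword prefix plus any suffix.
KEYWORDS = re.compile(r"\b(error|crit|warn)\w*", re.IGNORECASE)

# Stand-alone hex values, numbers, and dotted or colon-separated numbers (IPs, times).
VARIABLE = re.compile(r"\b(?:0x[0-9a-f]+|\d+(?:[.:]\d+)*)\b", re.IGNORECASE)


def signature(line):
    """Replace variable tokens with a placeholder so similar lines collapse."""
    return VARIABLE.sub("<*>", line).strip()


def group_noisy_logs(lines):
    """Return (signature, count) pairs for matching lines, most common first."""
    counts = Counter(signature(line) for line in lines if KEYWORDS.search(line))
    return counts.most_common()


# Hypothetical sample data, not taken from a real environment.
sample = [
    "warning: certificate for host 10.0.0.12 is not signed",
    "warning: certificate for host 10.0.0.57 is not signed",
    "warning: certificate for host 10.0.0.93 is not signed",
    "info: backup completed in 42 seconds",
    "error: vmhba2 link down on port 3",
]

for sig, count in group_noisy_logs(sample):
    print(count, sig)
```

Once thousands of certificate warnings collapse into a single group, the handful of rare signatures (like a flapping link) become the short list worth reading.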

I’d encourage everyone to download a free trial and post back in the comments what obscure errors or ghosts in the machine you end up finding.

HDS G400/600 “It is required to install additional shared memory”

I have some DIMMS laying around here somewhere...


Quick post here! If you’re setting up a new Hitachi H800 (G400/600) and are trying to set up a Hitachi Dynamic Tiering pool you may get the following error: “To use a pool with the Dynamic Tiering function enabled, it is required to install additional shared memory.”

You will need to log in to the maintenance utility (this is what runs on the array directly). Here is the procedure.

The first step is figuring out how much memory you need to reconfigure. This will be based on how much capacity is being dedicated to Dynamic Provisioning Pools. As the documents reference Pb (little b, which is a bit odd), these numbers are smaller than they first appear.

  • No Extension DP – 0.2Pb with 5GB of Memory overhead
  • No Extension HDT – 0.5Pb with 5GB of Memory overhead
  • Extension 1 – 2Pb with 9GB of Memory overhead
  • Extension 2 – 6.5Pb with 13GB of Memory overhead
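
If the little b really does mean bits, the capacities above shrink by a factor of eight when expressed in bytes. The sketch below does that conversion for the four listed figures. It rests on two assumptions that the documentation should be checked against: that Pb means petabits, and that decimal units apply (1 Pb = 10^15 bits). The names used are made up for this example.

```python
BITS_PER_BYTE = 8
TB_PER_PB = 1000  # decimal units assumed


def petabits_to_terabytes(petabits):
    """Convert petabits to terabytes (assumes Pb = petabits, decimal units)."""
    return petabits * TB_PER_PB / BITS_PER_BYTE


# (setting, capacity in Pb, shared memory overhead in GB) as listed above
SHARED_MEMORY_SETTINGS = [
    ("No Extension DP", 0.2, 5),
    ("No Extension HDT", 0.5, 5),
    ("Extension 1", 2.0, 9),
    ("Extension 2", 6.5, 13),
]

for name, capacity_pb, memory_gb in SHARED_MEMORY_SETTINGS:
    capacity_tb = petabits_to_terabytes(capacity_pb)
    print(f"{name}: {capacity_pb} Pb is about {capacity_tb} TB, {memory_gb} GB overhead")
```

Under those assumptions, 0.2Pb works out to 25TB rather than 200TB, which is why it is worth double checking the units before picking an extension.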

There are also extensions 3 and 4 (which use 17GB and 25GB respectively); however, I believe they are largely needed for larger Shadow Image, Volume Migration, Thin Image, and TrueCopy configurations.

In the Maintenance Utility window, click Hardware > Controller Chassis. In the Controller Chassis window, click the CTLs tab. Click Install list, and then click Shared Memory. In the Install Shared Memory window pick which extensions you need and select install (and grab a cup of coffee because this takes a while). This can be done non-disruptively, but it would be best to do it at lower IO as you’re robbing cache from the array for the thin provisioning lookup table.

You can find all this information on page 171 of the following guide.


 

LSI Firmware VSAN

I’ve been talking to LSI over the past couple months in relation to VSAN and have a couple updates on issues and thoughts.

 

1. LSI support does not support their driver if it is purchased through an OEM.  They will not accept calls from VMware regarding this driver in this case either.  If you want LSI to support the VMware driver stack, you must buy direct from them.

2. LSI branded MegaRAID cards do not support JBOD (I understand that it is on the roadmap).  Dell and others are offering alternative firmwares that allow this, but they have no comment or support statement on this.

3. MegaRAID CLI can be used with RAID 0 to manage cards (I’ll release a guide if there is interest), and performance is comparable and on supported systems is very stable. Don’t rule it out, and with all the back and forth on support for JBOD it strangely might be the safer option until I get full testing reports from the Perc730 next week.

4. The Dell Perc730 has JBOD support now.  Despite being a MegaRAID I’m hearing good things in the field so far (I’ll update if I hear otherwise).

5. LSI prefers dealing with hardware vendors, and largely being a back end chip-set manufacturer. A stronger relationship with VMware is needed (especially with PCI-Express networking on the horizon).

6. HP is switching to Adaptec for controllers.  Hopefully this should bring their JBODs onto the VSAN HCL and allow for supplier diversity.

7. I’ve heard statements from Dell that VMware is intensifying the testing procedures for VSAN.  It looks like this will catch H310/2208 type issues first.

8. Ignore the SM2208 on the HCL for pass through.  Neither VMware nor LSI will support it.

The use cases for a Synology

I often run into a wide mix of high and low end gear that people use to solve challenges. Previously I wrote on why you shouldn’t use a Synology or cheap NAS device as your primary storage system for critical workloads, but I think it’s time to clarify where people SHOULD consider using a Synology in a datacenter environment.

A lot has been written about why you shouldn’t apply the same performance SLA to all workloads, but I’d argue the bigger discussion in maintaining SLAs without breaking the budget is treating up-time SLAs the same way. Not all workloads need HA, and not all workloads need 4 hour support agreements. There is a lot of redundant data and ephemeral data in the data center, and having a device that can cheaply store that data is key to not having to make compromises on those business critical workloads that do need it. I see a lot of companies evaluating start-ups, scaling performance and flash usage back, under-staffing IT ops, cutting out monitoring and management tools, and taking other cost saving but SLA crushing actions in order to free up the budget for that next big high up-time storage array. It’s time for small and medium enterprises to quit being fair to storage availability. It is time to consider that “good enough” storage might be worth the added management and overhead. While some of this can be better handled by data reduction technologies, and storage management policies and software, sometimes you just need something cheap. While I do cringe when I see RAID 5 Drobos running production databases, there are use cases, and here are a few I’ve found for the Synology in our datacenter over the years.

But my testing database needs 99.999999% uptime!


1. Backups, and data export/import – In a world where you often end up with 5 copies of your data (Remote Replicas for DR, Application team silo has their own backup and archive solution) using something that’s cheap for bulk image level backups isn’t a bad idea. The USB and eSATA ports make them a GREAT place for transferring data by mail (export or import of a Veeam seed) or for ingesting data (we used ours on Thanksgiving to import a customer’s VM that was fleeing the abrupt shutdown of a hosting provider in town). While it’s true you can pass USB through to a VM, I’ve always found it overly complicated, and generally slower than just importing straight to a datastore like the Synology can do.

2. Swing Migrations – For those of us using VMware VSAN, having a storage system that can cross clusters is handy in a pinch, and keeps downtime and the need to use Extended vMotion to a minimum. A quick and dirty shared NFS export means you can get a VM from vCenter A to vCenter B with little fuss.

 

3. Performance testing – A lot of times you have an application that runs poorly, and before you buy $40-100K worth of Tier 1 flash you want to know if it will actually run faster, or just chase its tail. A quick and dirty datastore on some low price Intel flash (S3500 or S3700 drives are under $2.50 a GB) can give you a quick rocket boost to see if that application can soar! (Or if that penguin will just end up CPU bound). A use case I’ve done is to put a VDI POC on the Synology to find out what the IOPS mix will look like with 20-100 users before you scale to production use for hundreds or thousands of users. Learning that you need to size heavy because of that terrible Access database application before you under invest in storage is handy.

4. A separate failure domain for network and management services – For those of us who live in 100% VMware environments, having something that can provide a quick NTP/DHCP/Syslog/SMTP/SMS/SSH service is handy. In the event of a datacenter apocalypse (i.e. an entire VMware cluster goes offline) this plucky little device will be delivering SMTP and SMS alerts, providing the services I would need to rebuild things, and giving me a place to review the last screams (syslog). While not a replacement for better places to run some of these services (I generally run DHCP off the ASAs and NTP off of the edge routers), in a small shop or lab this can provide some basic redundancy for some of these services if the normal network devices are themselves not redundant.

5. Staging – A lot of times we will have a project that needs to go live in a very short amount of time, and we often have access to the software before the storage or other hardware shows up. A non-active workload rarely needs a lot of CPU/Memory and can leech off of a no reservation resource pool, so storage is often the bottleneck. Rather than put the project on hold, having some bulk storage on a cheap NAS lets you build out the servers, then migrate the VMs once the real hardware has arrived, collapsing project time lines by a few days or weeks so you’re not stuck waiting on procurement, or the SAN vendor to do an install. For far less than the cost of a “rush install” of a Tier 1 array I can get a Synology full of drives onsite and set up before that big piece of disk iron comes online.

6. Tier 3 Workloads – Sometimes you have a workload that you could just recover, or that could be down for a week without violating a business SLA. Testing, log dumps, replicated archival data, and random warehouses that it would take more effort to sort through than hoard are another use case. Also the discussion of why you are moving it to the Synology opens up a talking point with the owner about why they need the data in the first place (and allows for bargaining, such as “if you can get this 10TB of syslog down to 500GB I’ll put it back on the array”). Realistically technology like VSAN and array auto-tiering has weakened the argument for using these devices in this way, but having something that borders on being a desktop recycling bin still has its place.

How to buy IT? (Part 1).

On Twitter John Troyer asked “Are there books/classes on “How to purchase enterprise tech” or is it mostly analysts & tribal knowledge? Not a regular supply chain thing.”

There were some quick responses with the general consensus this was something learned by trial and fire, and a universal dread if a procurement department had any power in the decision.

Let’s outline some of the problems.

1. IT buying requires you actually understand what you’re purchasing. A storage array and fibre channel network isn’t just “Tubes, RAW terabytes, and dollar signs” (despite one IT Director telling me this as he wrestled with whether he should buy a NetApp or an HDS, giving absolutely zero concern to performance, support agreements, or usable capacity).

2. IT purchasing isn’t always fully in IT’s control. Procurement will bid out that server, and end up with a H310 RAID controller that isn’t on the VSAN HCL, or will drop iLO, meaning setup will take 10x as long and fail to deliver. Procurement departments are often judged on money saved, and not on how badly they screw up time to deliver. Then again, IT staff often will declare ridiculous things (a $500K spend on Cisco switching to power 3 Cisco UCS blades) or make terrible decisions. But should procurement really be the backstop keeping them from drunkenly buying toys they don’t need?

3. The vendor/channel system means the only people with the information to properly consult on a product’s ability to deliver are, most of the time, the same people selling it to you. I saw a consulting firm do a 3 vendor shootout/needs assessment for backup for a customer. 2 out of 3 vendors refused to return the analyst’s phone calls. Do we really want to do business with opaque vendors, though, even if they are the 50 billion pound gorilla of the industry?

4. The amount of time that is often invested in finding the right solution can massively dwarf the amount of time required to actually implement it. I once worked a storage deal that took 2 years. Servers were crashing, data was lost, and deck chairs were being shuffled, but the IT director was paralyzed by the decision. If a customer tells me they are about to buy storage, I warn them that they are inviting a barking carnival of vendors and nothing productive will be done until the product hits their loading dock. Get ready for bribe offers, teams of a dozen people showing up without being scheduled, and an amount of FUD that feels like an invading horde of barbarians. Once they smell blood in the water, every VAR within 500 miles will be at your door. In the end, is a quick decision, or a slow and methodical one, any better than throwing a dart at a Magic Quadrant? (Incidentally this is my theory of how some of the storage ones are scored).

Going over these 4 quickly.

1. This one is hard. If you don’t have SMEs in house don’t just put yourself at the mercy of your vendors. DO NOT TRUST YOUR VARs. They will lean on whatever has more margin this quarter. Realize that your internal SMEs may have agendas (if the whole internal team is Cisco certified, and that is their value, then don’t take their recommendation for a Nexus 7K at face value). Pay someone to review this who is not going to be selling you the box. Pick several vendors to review, come up with a scoring system of needs and risks, and have a 3rd party arbitrate the scoring.

2. Procurement is the wrong department to prevent waste. Realize that saving 10% and having 1/4 of solutions ship to you incomplete isn’t “winning”. Start with making sure the solution will work and THEN look at cost control. Often cost control is what weak non-technical decision makers fall back on (they are afraid the solution will not work, and want to limit the damage). Push hard to understand (or make someone make you understand) the decision at hand. Don’t try to cost control a project you don’t think is going to work. Delay the decision until you understand, then pick decisively and correctly. If you don’t trust what your subordinates or consultants are recommending (because it’s often not working) don’t slash their budgets; replace them with people who can deliver on what they ask for.

3. Make sure you network and benchmark with others in your industry. While from time to time it’s best to break from the herd if technology is a differentiation point, limit this to where there is a really compelling value. Don’t pick a storage vendor with an experimental protocol when you’re a 5 billion dollar company with only 10TB of capacity needs for Tier 1. It’s not worth the savings. Inversely, recognize when your challenges are unique. If you have 50 field offices and your competitors have 5, it may be time to consider VDI despite no one else doing it.

4. Ask difficult questions, know your criteria, and know why the last solution worked or didn’t work if this is a migration. The vendors will try to tie you up in quicksand and keep themselves in play as long as possible. Strike, hard and fast. If they can’t respond quickly to your needs then they don’t understand how to qualify them.
“I’m sorry, we are only accepting bids from vendors with 4 Hour onsite non-contractor support”
Know who’s not a fit before they call so you don’t waste time with non-starters.
“We need storage that can provide 10TB, 10,000 IOPS with a 95% data skew of 8%, at an average block size of 16KB with a 75/25% read/write mix and a compressibility of less than 10%.”
Know your requirements for purchase down to the most granular bit if you don’t want to play 20 quotes.
“Our GPFS implementation will give you 20K IOPS with 4 SATA drives and no flash”.
“Average Dedupe is 500% so you’ll only need to buy 2TB of usable”
“FCoE will be [Insert anything useful]”.

Magic Server/Storage/Network pixie dust isn’t real. Watch out for ridiculous absolute statements.
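
One way to keep requirements that granular is to write them down as a structured record and derive the numbers a vendor quote has to meet. The sketch below uses the example requirement quoted above (10TB, 10,000 IOPS, 16KB average block size, 75/25 read/write). The class and field names are hypothetical, and the throughput figure is simple IOPS times block size arithmetic, not a sizing methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRequirement:
    """A written-down storage requirement (field names are illustrative)."""
    capacity_tb: float
    iops: int
    block_size_kb: int
    read_fraction: float  # 0.75 means a 75/25 read/write mix

    def read_iops(self):
        return self.iops * self.read_fraction

    def write_iops(self):
        return self.iops * (1 - self.read_fraction)

    def throughput_mib_per_s(self):
        # IOPS x block size, converted from KiB/s to MiB/s
        return self.iops * self.block_size_kb / 1024


requirement = StorageRequirement(
    capacity_tb=10, iops=10_000, block_size_kb=16, read_fraction=0.75
)

print(requirement.read_iops())
print(requirement.write_iops())
print(requirement.throughput_mib_per_s())
```

Having the derived numbers in hand makes it much easier to spot a quote that only works if the pixie dust is real.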

Is VDI really not “serious” production?

This post is in response to a tweet by Chris Evans (who I have MUCH respect for and is one of the people that I follow on a daily basis on all forms of internet media). The discussion on Twitter was unrelated (discussing the failings of XtremIO) and the point that triggered this post was when he stated VDI is “Not serious production”.

While I might have agreed 2-3 years ago when VDI was often in POC, or a plaything of remote road warriors or a CEO, VDI has come a long way in adoption. I’m working at a company this week with 500 users and ALL users outside of a handful of IT staff work in VDI at all times. I’m helping them update their service desk operations, and a minor issue with VDI (profile server problems) is a critical full stoppage of the business. Even all 3 of their critical LOB apps going down would have less of an impact. At least people could still access email, Jabber and some local files.

There are two perspectives I have from this.

1. Some people are actually dependent on VDI to access all those 99.99999% uptime SLA apps, so it’s part of the dependency tree.

2. We need to quit using 99.9% SLA up-time systems and processes to keep VDI up. It needs real systems, change control, monitoring and budget. 2 years ago I viewed vCOPS for View as an expensive necessity; now I view it as a must have solution. I’m deploying tools like LogInsight to get better information and telemetry on what’s going on, and training service desks on the fundamentals of VDI management (that used to be the task of a handful of sysadmins). While it may not replace the traditional PC and in many ways is a middle ground towards some SaaS web/mobile app future, it’s a lot more serious today than a lot of people realize.
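
The dependency tree point can be made concrete with a little arithmetic. If users can only reach an application through VDI, the two sit in series, and (assuming independent failures, which is a simplification) the availability users actually see is the product of the two. The sketch below uses the 99.9% and 99.99999% figures from the points above; the function names are made up for this example and a 365 day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24


def serial_availability(*availabilities):
    """Combined availability of components that all must be up (independent failures assumed)."""
    combined = 1.0
    for availability in availabilities:
        combined *= availability
    return combined


def downtime_hours_per_year(availability):
    """Expected downtime in hours per 365 day year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


vdi = 0.999        # 99.9% platform used to reach the app
app = 0.9999999    # 99.99999% application behind it

end_to_end = serial_availability(vdi, app)
print(end_to_end)
print(downtime_hours_per_year(app))
print(downtime_hours_per_year(end_to_end))
```

The weakest link dominates: a 99.9% access layer means roughly 8.76 hours of expected downtime a year no matter how many nines the application behind it has.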

I’ve often joked that VDI is the technology of last resort when no other reasonable offering made sense (keep data in the datacenter, solve apps that don’t work under RDS, organizations that can’t figure out patch/app distribution, highly mobile but poorly secured workforce). For better or for worse it’s become the best tool for a lot of shops, and it’s time to give it the respect it deserves.

At least the tools we use to make VDI serious today (VSAN/vCOPS/LogInsight/Horizon View 6) are a lot more serious than the stuff I was using 4 years ago.

My apologies for calling out Chris (which wasn’t really the point of this article), but I will thank him for giving me cause to reflect on the state of VDI “seriousness” today.

How does your organization view and depend on VDI today, and is there a gap in perception?