Skip to content

Posts from the ‘Storage’ Category

VSAN is now up to 30% cheaper!

Ok, I’ll admit this is an incredibly misleading click bait title. I wanted to demonstrate how the economics of cheaper flash make VMware Virtual SAN (and really any SDS product that is not licensed by capacity) cheaper over time. I also wanted to share a story of how older slower flash became more expensive.

Lets talk about a tale of two cities who had storage problems and faced radically different cost economics. One was a large city with lots of purchasing power and size, and the other was a small little bedroom community. Who do you think got the better deal on flash?

Just a small town data center….

A 100 user pilot VDI project was kicking off. They knew they wanted great storage performance, but they could not invest in a big storage array with a lot of flash up front. They did not want to have to pay more tomorrow for flash, and wanted great management and integration. VSAN and Horizon View were quickly chosen. They used the per concurrent user licensing for VSAN so their costs would cleanly and predictably scale. Modern fast enterprise  flash was chosen that cost ~$2.50 per GB and had great performance. This summer they went to expand the wildly successful project, and discovered that the new version of the drives they had purchased last year now cost $1.40 per GB, and that other new drives on the HCL from their same vendor were available for ~$1 per GB. Looking at other vendors they found even lower cost options available.  They upgraded to the latest version of VSAN and found improved snapshot performance, write performance and management. Procurement could be done cost effectively at small scale, and small projects could be added without much risk. They could even adopt the newest generation (NVMe) without having to forklift controllers or pay anyone but the hardware vendor.

Meanwhile in the big city…..

The second city was quite a bit larger. After a year long procurement process and dozens of meetings they chose a traditional storage array/blade system from a Tier 1 vendor. They spent millions and bought years worth of capacity to leverage the deepest purchasing discounts they could. A year after deployment, they experienced performance issues and wanted to add flash. Upon discussing with the vendor the only option was older, slower, small SLC drives. They had bought their array at the end of sale window and were stuck with 2 generations old technology. It was also discovered the array would only support a very small amount of them (the controllers and code were not designed to handle flash). The vendor politely explained that since this was not a part of the original purchase the 75% discount off list that had been on the original purchase would not apply and they would need to pay $30 per GB. Somehow older, slower flash had become 4x more expensive in the span of a year.  They were told they should have “locked in savings” and bought the flash up front. In reality though, they would  locking in a high price for a commodity that they did not yet need. The final problem they faced was an order to move out of the data center into 2-3 smaller facilities and split up the hardware accordingly.  That big storage array could not easily be cut into parts.

There are a few lessons to take away from these environments.

  1. Storage should become cheaper to purchase as time goes on. Discounts should be consistent and pricing should not feel like a game show. Software licensing should not be directly tied to capacity or physical and should “live” through a refresh.
  2. Adding new generations of flash and compute should not require disruption and “throwing away” your existing investment.
  3. Storage products that scale down and up without compromise lead to fewer meetings, lower costs, and better outcomes. Large purchases often leads to the trap of spending a lot of time and money on avoiding failure, rather than focusing on delivering excellence.

HDS G400/600 “It is required to install additional shared memory”

I have some DIMMS laying around here somewhere...

I have some DIMMS laying around here somewhere…

Quick post here! If your setting up a new Hitachi H800 (G400/600) and are trying to setup a Hitachi Dynamic Tiering pool you may get the following error. “To use a pool with the Dynamic Tiering function enabled, it is required to install additional shared memory.”

You will need to login to the maintenance utility (This is what runs on the array directly). Here is the procedure.

The first step is figuring how much memory you need to reconfigure. This will be based on how much capacity is being dedicated to Dynamic Provisioning Pools.  As the documents reference Pb (little b which is a bit odd) these numbers are smaller than they first appear.

  • No Extension DP – .2Pb with 5GB of Memory overhead
  • No Extension HDT – .5Pb with 5GB of Memory overhead
  • Extension 1 – 2Pb  with 9GB of Memory overhead
  • Extension 2 – 6.5Pb with 13GB of Memory overhead

There are also  extensions 3 and 4 (which use 17GB and 25GB respectively) however I believe they are largely needed for larger Shadow Image, Volume Migrations, Thin Image, and TrueCopy configurations.
In the Maintenance Utility window, click Hardware > Controller Chassis. In the Controller Chassis window, click the CTLs tab. Click Install list, and then click Shared Memory. In the Install Shared Memory window pick which extensions you need and select install (and grab a cup of coffee because this takes a while).  This can be done non-disruptively, but it would be best to do at lower IO as your robbing cache from the array for the thin provisioning lookup table.

You can find all this information on page 171 of the following guide.

Screen Shot 2015-07-27 at 9.13.09 AM

 

Is your SAN ready to format itself with Windows 7?

Quick warning. While rebooting an array for an update I noticed that it was configured to roll through 4 NIC’s worth of PXE on the boot order. If you have a PXE imaging server on the same subnet as one of the NIC’s this could cause the boot to hang, or worse if really misconfiguration re-image your array and destroy your data, or access to your data.

If your system is trying to PXE on reboot, please contact your vendor and discuss disabling it. While I noticed this on a Cisco Invicta, this isn’t really a Cisco problem and any modern software storage system this is possible. There is little more terrifying than watching a 300K storage device launch a windows PXE installer.

The long awaited HDS “Year Z” post.

HDS has been on a long journey that has led us to “Year Z”.  Almost four years ago the road map leaked to The Register that HDS would eventually merge their Modular (Then AMS, now HUS) with their Enterprise (VSP, now VSP G1000).  The promise of a single block operating system, with a unified file, object, branch NAS and block management suite was a long way off.

hitachi_nextgen_storage_small

Previously customers had to choose between platforms based on capacity, features, cost, up-time SLA’s, and performance. Often times a single feature requirement like storage virtualization would add a six figure amount to a bill of materials. Dependencies on High end ASIC’s made scaling down the costs of the VSP impossible.  Today Hitachi has solved these problems and delivered a single platform that allows product selection to largely be done entirely based on capacity and performance needs. Features to seamlessly flow from the smallest to the largest platform on the line card.  The G200, G400, G600, G800, G1000 provides a lot of sizes and price points without the confusion that multiple operating systems and system architectures provide.  As other vendors add more platforms and OS variants to address different markets, its interesting seeing HDS consolidate product families.  I’m curious if Netapp has some dusty old “One Platform” marketing slides that HDS can borrow.

My hats off to the engineers who managed to get full ASIC emulation running on the Intel Processors so that we can have VSP functionality without a six figure price tag. While I love the ugly duckling that is SNM2, it is good to see Hitachi moving on to faster, fancier management tools.

Screen Shot 2015-05-08 at 10.40.18 AM  Infrastructure Director and the new management tools looks to match the “pretty” GUI’s that modern storage managers are coming to expect, as well as powerful automation and provisioning workflows to make provisioning and management a largely automated task.

I’ve loved the HUS for providing “simple” but reliable storage.  I’ve often called it the pet rock of storage (configure, present everything to VMware, and stay in VMware for your management all day every day).  VVOL support allows for snapshot offloading (Faster Veeam backups!) and more granular feature management.  Most importantly it eliminates Vmware/Storage team miscommunication from causing VM’s to not get replicated, protected etc.

The use cases for a Synology

I often run into a wide mix of high and low end gear that people use to solve challenges. Previously I wrote on why you shouldn’t use a Synology or cheap NAS device as your primary storage system for critical workloads, but I think its time to clarify where people SHOULD consider using a Synology in a datacenter enviroment.

A lot has been written about why you shouldn’t apply the same performance SLA to all workloads
, but I’d argue the bigger discussion in maintaining SLA’s without breaking the budget is treating up-time SLA’s the same way. Not all workloads need HA, and not all workloads need 4 hour support agreements. There is a lot of redundant data, and ethereal data in the data center and having a device that can cheaply store that data is key to not having to make compromises to those business critical workloads that do need it. I see a lot of companies evaluating start-ups, scaling performance and flash usage back, under-staffing IT ops staff, cutting out monitoring and management tools, and other cost saving but SLA crushing actions actions in order to free up the budget for that next big high up-time storage array. Its time for small medium enterprises to quit being fair to storage availability. It is time to consider that “good enough” storage might be worth the added management and overhead. While some of this can be better handled by data reduction technologies, and storage management policies and software some times you just need something cheap. While I do cringe when I see RAID 5 Drobo’s running production databases, there are use cases and here’s a few I’ve found for the Synology in our datacenter over the years.

But my testing database needs 99.999999% uptime!

But my testing database needs 99.999999% uptime!

1. Backups, and data export/import – In a world where you often end up with 5 copies of your data (Remote Replica’s for DR, Application team silo has their own backup and archive solution) using something thats cheap for bulk image level backups isn’t a bad idea. The USB and ESATA ports make them a GREAT place for transferring data by mail (Export or import of a Veeam seed) or for ingesting data (Used ours on thanksgiving to import a customer’s VM that was fleeing the abrupt shutting down of a hosting provider in town). While its true you can pass USB through to a VM, I’ve always found it overly complicated, and generally slower than just importing straight to a datastore like the Synology can do.

2. Swing Migrations – For those of us Using VMware VSAN, having a storage system that can cross clusters is handy in a pinch, and keeps downtime and the need to use Extended vMotion to a minimum.  A quick and dirty shared NFS export means you can get a VM from vCenter A to vCenter B with little fuss.

 

Screen Shot 2014-12-28 at 4.51.55 PM3. Performance testing – A lot of times you have an application that runs poorly, and before you buy 40-100K worth of Tier1 flash you want to know if it will actually run faster, or just chase its tail. A quick and dirty datastore on some low price Intel flash (S3500’s or S3700 drives are under $2.50 a GB) can give you a quick rocket boost to see if that application can soar! (Or if that penguin will just end up CPU bound). A use case I’ve done is put a VDI POC on the Synology to find out what the IOPS mix will look like with 20-100 users before you scale to production use for hundreds or thousands of users. Learning that you need to size heavy because of that terrible access database application before you under invest in storage is handy.

4. A separate failure domain for network and management services – For those of us who live in 100% VMware environments, having something that can provide a quick NTP/DHCP/Syslog/SMTP/SMS/SSH service. In the event of datacenter apocalypse (IE an entire VMware cluster goes offline) this plucky little device will be delivering SMTP and SMS alerts, providing me services I would need to rebuild things, give me a place to review the last screams (syslog). While not a replacement for better places to run some of these services (I generally run DHCP off the ASA’s and NTP off of the edge routers) in a small shop or lab, this can provide some basic redundancy for some of these services if the normal network devices are themselves not redundant.

4. Staging – A lot of times we will have a project that needs to go live in a very short amount of time, and we often have access to the software before the storage or other hardware will show up. A non-active workload rarely needs a lot of CPU/Memory and can leech off of a no reservation resource pool, so storage is often the bottleneck. Rather than put the project on hold, having some bulk storage on a cheap NAS lets you build out the servers, then migrate the VM’s once the real hardware has arrived, collapsing project time lines by a few days or week so your not stuck waiting on procurement, or the SAN vendor to do an install. For far less than cost to do a “rush Install” of a Tier 1 array I can get a Synology full of drives onsite, setup before that big piece of disk iron comes online.

5. Tier 3 Workloads – Sometimes you have a workload that you could just recover, or if it was down for a week you wouldn’t violate a Business SLA. Testing, Log dumps, replicated archival data, and random warehouses that it would take more effort to sort through than horde are another use case. Also the discussion of why you are moving it the Synology opens up a talking point with the owner of why they need the data in the first place (and allows for bargaining, such as “if you can get this 10TB of syslog down to 500GB I’ll put it back on the array”). Realistically technology like VSAN and array auto-tier has driven down the argument for using these devices in this way, but having something that borders on being a desktop recycling bin.

Is VDI really not “serious” production?

This post is in response to a tweet by Chris Evans (Who I have MUCH respect for and is one of the people that I follow on a daily basis on all forms of internet media). The discussion on twitter was unrelated (Discussing the failings of XtremeIO) and the point that triggered this post was when he stated VDI is “Not serious production”.

While I might have agreed 2-3 years ago when VDI was often in POC, or a plaything of remote road warriors or a CEO, VDI has come a long way in adoption. I’m working at a company this week with 500 users and ALL users outside of a handful of IT staff work in VDI at all times. I”m helping them update their service desk operations and a minor issue with VDI (profile server problems) is a critical full stoppage of the business. Even if all of their 3 critical LOB apps going down would be less of an impact. At least people could still access email, jabber and some local files.

There are two perspectives I have from this.

1. Some people are actually dependent on VDI to access all those 99.99999% uptime SLA apps so its part of the dependency tree.

2. We need to quit using 99.9% SLA up-time systems and process’s to keep VDI up. It needs real systems, change control, monitoring and budget. 2 years ago I viewed vCOPS for View as an expensive necessity, now I view it as a must have solution. I’m deploying tools like LogInsight to get better information and telemetry of whats going on, and training service desks on the fundamentals of VDI management (that used to be the task of a handful of sysadmins). While it may not replace the traditional PC and in many ways is a middle ground towards some SaaS web/mobile app future, its a lot more serious today than a lot of people realize.

I’ve often joked that VDI is the technology of last resort when no other reasonable offering made sense (Keep data in datacenter, solve apps that don’t work under RDS, organizations who can’t figure out patch/app distribution, highly mobile but poorly secured workforce). For better or for worse its become the best tool for a lot of shops, and its time to give it the respect it deserves.

At least the tools we use to make VDI serious today (VSAN/VCOPS/LogInsight/HorzionView6) are a lot more serious than the stuff I was using 4 years ago.

My apologies, for calling our Chris (which wasn’t really the point of this article) but I will thank him for giving me cause to reflect on the state of VDI “seriousness” today.

How does your organization view and depend on VDI today, and is there a gap in perception?

Why you shouldn’t run BCA off a Synology (or QNAP, or other cheap Linux NAS)…

In my life as a VMware consultant I run into the following Mad Lib when trying to solve storage problems for Business critical Applications.

A customer discovers they have run out of (IOPS/Capacity/Throughput/HCL) with their existing (EMC/Dell/HP/Netapp) array. They sized only for Capacity without understanding that (RAID 6 with NL-SAS is slow, 2GB of Cache doesn’t deliver 250K IOPS). The have spent all their (Budget/rackspace/Power/Political Power/Moxie). There is also an awkward quiet moment where its realized that (Thick provisioning on Thick provisioning is wasteful, I can’t conjure IOPS out of a hat, Dedupe is only 6%, Snapshots are wasting 1/2 of their array and are still not real backs, They can’t use COW so SRM can’t test failover). Searching for solutions they hear from a junior tech that there is this new (home-made/SOHO appliance) that can meet their (Capacity/IOPS) needs at a cheap price point. And if they buy it, it probably will work… For a while.

Here’s whats missing from the discussion.

1. The business needs more than 3-5 days for parts replacement, or tickets being responded to. (Real experiences with these devices).

2. The business needs something not based on desktop class non-ECC RAM motherboards.

3. The Business needs REAL HCL’s that are verified and not tested on customers. (QNAP was saying Green drives that lacked proper TLER, and are not designed for RAID would be fine to use for quite a while).

4. The Business needs systems that are actually secured

Now I’ve heard the other argument “but John I’ll have 2 of them and just replicate!”

This is fine (once you realize that RSYNC and VMDK’s don’t play nice) until you get bit buy a code bug that hits both platforms. While technically on the VMware HCL, these guys are using open source targets (iSCSI and NFS) and are so incredibly removed from the upstream developers that they can’t quickly get anything fixed or verified quickly. 2 Systems that have a nasty iSCSI MPIO bug, or have a NFS timeout problem are worse than 1 system that “just works”. Also as these boxes are black box’s they often miss out from the benefits of open source (you patch and update on their schedules, which is why My QNAP had a version of OpenSSL at one point that was 4 years old despite being on the newest release). If both systems have hardware problems because of a power surge, or thermal problems, or user error or a bad batch your still stuck waiting days to get a fix. If its software you may be holding your breath for quite a while. With a normal server OEM or Tier 1 storage provider you have parts in 4 hours, and reliability and freedom that these boxes can’t match.

Now at this point your probably saying “but John, I need 40K IOPS and I don’t have 70K to shovel into an array.

And thats where Software Defined Storage bridges the gap. Now with SuperMicro You can get solid off the shelf servers with 4 hour support agreements without breaking the bank (This new parts support program is global BTW). For storage software you can use VMware VSAN, a platform that reduces, costs, complexity, and delivers great performance. You massively reduce your support foot print (one company for hardware, one for software) reducing operational costs and capital costs.

Nothing against the Synology, QNAP, Drobo of the world, but lets stick to the right tool for the right job!

KeyNote Part1

VMware: EVO:RAIL – It looks like our shift to SuperMicro for VSAN was the right choice. Will be be looking for
EVO:Rack – A vBlock without limits? We will see.

OpenStack – VMware is doing a massive amounts of code push to OpenStack so OpenStack can control vSphere, NSX etc allowing for people to run VMware API’s and OpenStack API’s for higher level functionality.

Containers – Docker, Google, Pivotal are allowing very clean and consistent operational deployments.

NSX – Moving security from the edge to Layer 2. Get ready to hear “Zero Trust networking”. The biggest challenge in Enterprise shops is they are going to have to define and understand their networking needs on a granular level. For once network security ability will outrun operational understanding. If your a Sysadmin today get ready to have to understand and defend every TCP connection your application makes, but take comfort in that policy engines will allow this discussion to only have to happen once.

Cloud Volumes – While I’m most excited about this as a replacement for Persona, there are so many use cases (Physical, Servers, Thinapp, Profiles, VDI) that I know its going to take some serious lab time to understand where all we can use this.

vCloud Air – In a final attempt to get SE’s everywhere to quit calling it “Veee – Cheese” VMware is re-branding the name. I was skeptical last year, but have found a lot of interest in clients in recent months as Hurricane Seasons closes in Houston

VMworld Day 1

I’m looking forward to this week and here are a few highlights of what I’ll be looking into.

On the tactical

1. Settling on a primary load balancing partner for VMware View. (Eying Kemp, anyone have any thoughts?). I’ve got a number of smaller deployments (few hundred users) that need non-disruptive maintenance operations, and patching on the infrastructure and are looking to take their smaller pilots or deployments forward.

2. Learn more about VDP-A designs, and best practices. I’ve seen some issues in the lab with snapshots not getting removed from the appliance and need to understand the scaling and design considerations better.

3. Check out some of the HOL updates. Find out if a VVOL lab in the office is worth the investment.

On the more general strategic goals.

Check out cutting edge vendors, and technologies from VMware.

CloudVolumes – More than just application layering. Server application delivery, Profile abstraction thats fast and portable, and a serious uplift to persona and ThinApp. Really interested in use cases having it used as a delivery method for Thinapp.

DataGravity – In the era of software defined storage this is a company making a case that an array can provide a lot of value still. Very interesting technology but the questions remain. Does it work? Does it Scale? Will they add more file systems, and how soon will EMC/HP buy them to bolt this logic into their Tier 1 arrays. Martin Glassborow has made a lot of statements that new vendors don’t do enough to differentiate, or that we’ve reached peak features (Snaps, replication, cache/tier flash, data reduction etc) but its interesting to see someone potentially breaking outside of this mold of just doing the same thing a little better or cheaper.

VSAN – Who is Marvin? What happened to Virsto? I’ve got questions and I hope someone has answers!

LSI 2008 Dell H310 VSAN rebuild performance concerns

Just a quick note for anyone seeing VSAN performance issues with Dell H310 or LSI 2008 controllers. Its not a secret that the LSI 2008 and H310 Dell with stock firmwares have a very shallow queue depth of 25 (a LSI 2208 in comparison is 600, and a Dell H910 is 975). These are some of the weakest cards to be certified on the HCL and for a small ROBO deployment, or a low VM count with low contention should be fine. Remember part of the benefit of SDS is you can scale down.

For a quick look at firmware queue depths check out Duncan’s article on this.

Now one user tried to push things a little to far, running 5 hosts, with only 3 with storage with 70+ VM’s using Dell H310’s. Performance and experience was fine, until he lost one node and a rebuild kicked off. Running 70 VM’s on 2 hosts combined with the replication overhead was too much and caused an interruption of service (but no data corruption). VMware support tied it back to poor performance of the H310 and heavy load on a degraded 2 Node system trying to rebuild.

In Synchronet’s Lab’s one my earlier VSAN lab builds ran into some odd write latency, that I had initially suspected was the result of a excessively cheap 10Gbps switch. While building out a validation of a solidly performing SMB bundle this week I sought out to get to the root cause. A quick test last Friday (running the vCenter Appliance install wizard) showed at ~250 IOPS a perfectly good Intel S3700 200GB SSD drive spiking over 30ms of write latency and continuing upward. This test was performed against a regular SSD and not a VSAN based datastore isolating the issue. Previous IOMeter workloads showed thousands of read IOPS but write IOPS choking pretty quickly with high latency and low cache rates. Subsequent lab equipment did not have this issue, but had used newer switches muddling the root cause.

We have upgraded our LSI cards to the current firmware, and as of this evening confirmed that the queue depth on the 2008 is increased to 600. Now technically while not supported the H310 is an LSI 2008 controller, and if this is for a lab or a POC or your feeling brave you can follow this guide here. (Note this is not endorsed or supported by VMware/Dell). I’ll see how far I can push writes this afternoon but I expect this should have made our similar issues go away. Alternatively upgrading to something with a queue depth of at least 600 should help fix this (its generally ~$150 per host for a mediocre HBA/RAID Controller that supports decent queue depths).

This is a quick flash (And reminder) of why its a good idea to work with a VSAN partner who validates their builds, understands the technology, and is ready to support you from architecture to implementation (Ok thats my quick advertisement). One thing I am offering right now is anyone interested in a quick architecture call, and a setup of the VSAN assessment tool (TM) I can hopefully help get you the raw information you need to make intelligent decisions on things like Queue Depth, and cache size, and understanding things like the data skew that is important to setting a foundation for a solid VSAN design.

As with any new technology I encourage everyone to do their homework. There’s a lot of FUD going around (and just lack of training and knowledge from a lot of vendors) and issues like this is part of why I’ve been happy to work with VMware’s software defined storage team and the hardware OEM’s for our customer builds. Greater flexibility and response on helpful information (like JBOD mode on LSI 2208’s in a previous post, or adjustments to the HCL, or documentation and firmwares to fix the queue depth issue). For those of you looking for a quick summery. VSAN is a great product and very powerful. Remember a configuration that would be fine for 5 VM’s in a branch isn’t quite going to look the same as for 100 VM’s in a datacenter, or 1000 in a VDI farm. Trying to build the cheapest VSAN configuration that has enough capacity should not be a goal, and cutting corners in the wrong areas can sneak up on you. I do expect some updated guidance from support and the HCL on this. I am hearing 256 but realistically given how cheap the 600 Queue Depth, and how annoying it is to swap out an HBA and re-cable things I’d encourage everyone to start at this point if it makes sense.

Special thanks to…

VMware – for getting a quick Root Cause Analysis, and from reading the story provided a steady voice of reason on support so nothing further drastic was done and stayed involved until the situation was resolved.
JasonGill – for providing us a good read and not jumping to conclusions that this was a software bug.
Brandon Wardlaw – for braving the controller upgrades and somehow not bricking any of my old lab gear.
LSI – For making an updated firmware for their old controllers and not hiding it.

Note, Expect follow up posts and edits. As I get more benchmarks and numbers about various queue depth’s I’ll post them (or if anyone has any send them to me!). I’m also going to be benchmarking some different switches (Cisco, Juniper, Brocade, Netgear) over the following weeks and hopefully if I have time publish some of our results on things like a 140mpps 1Gbps Brocade vs. terrifyingly cheap Netgear 10Gbps switch.

*UPDATE*

We did some testing in the lab without VSAN, just doing a basic vCenter Server install to SSDs (Intel 3500) on this datastore. With the 25 queue depth default firmware we saw 30ms of write latency at 352 write iops. Using an LSI based upgrade firmware with 600 queue depth we pushed 1500 iops at a sub 1ms of latency (.17 ms). Its pretty clear that outside of light/ROBO usage the H310 controller is unsuitable for VSAN usage until Dell supports the LSI code upgrade. What is very concerning is that 4 out of the8 of the Dell VSAN nodes are based on this underpowered HBA. This includes one with 15K drives that I’m guessing is meant to be a performance option. This raises a few questions.

1. Does Dell have a firmware upgrade from LSI to push out (We know it exists) to help resolve this.
2. Did Dell run any benchmarks or have any storage architects look at these configs before they submitted them as a VSAN ready build?
3. I’m hearing reports from the field that Dell reps are saying that VSAN isn’t supported for production workloads. Is this underpowered config partly to blame for this?

DellVSANhuh