Posts from the ‘Uncategorized’ Category

Time to check the log…

You can see from the year 5 rings that there was great budget, and much storage was added!

Any time you open a ticket with VMware (or any vendor), the first thing they generally want you to do is pull the logs and send them over. They then use their great powers (of grep) to try to find the warning signs, or the results of a known issue (or a new one!). This whole process can take quite some time, and frustratingly some issues roll out of the logs quickly, are buried in 10^14 lines of noise, or can only be found in an environment that is down and has not been rebooted. I recently had a conference call with a vendor where they instructed a customer that we would have to wait for one (or more!) complete crashes of their storage array before they would be able to get the logs to possibly find a solution.

This is where LogInsight comes to the rescue. With real time indexing, graphs that do not require you to learn Ruby to make, and machine learning that auto groups similar messages, you can find out why your data center has crashed in 15 minutes instead of 15 days.

Recently, while deploying a POC, I had a customer who complained of intermittent performance issues on a VDI cluster that they couldn’t quite pin down. Internal teams were arguing (storage blamed network, network blamed AD, Windows/AD blamed the VMware admin). A quick search for “error*,crit*,warn*” across all infrastructure on the farm (Firewall/Switch/Fabric/DiskArray/Blades/<infinite number of locations View hides logs>) returned thousands of unrelated errors for internal certificates not being signed and other non-interesting events. LogInsight’s auto grouping allowed for quick filtering of the noise to uncover the smoking gun: a Fibre Channel connection inside of a blade chassis was flapping (from a poorly seated HBA). It was not bad enough to trigger port warnings on the switches, or an all paths down error, but it was enough to impact user experience randomly. This issue was a ghost that had been plaguing them for two weeks at this point. LogInsight found it in under 15 minutes of searching. It was great to have clear evidence so we could end the internal arguing, as well as hold the vendor accountable so they couldn’t deflect blame to VMware or another product.
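
To give a feel for why grouping similar messages cuts through the noise so quickly, here is a minimal Python sketch. This is not LogInsight's algorithm (that is machine learning); it simply masks the variable parts of each line (numbers and hex IDs) so near-identical messages collapse into one pattern with a count. The sample log lines and the function names are made up for the example.

```python
import re
from collections import Counter

# Patterns for the "variable" parts of a log line (assumed, adjust to taste).
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b|\b[0-9a-fA-F]{8,}\b")
_NUM = re.compile(r"\d+")


def signature(line: str) -> str:
    """Replace IDs and numbers with placeholders so similar lines match."""
    line = _HEX.sub("<HEX>", line)
    return _NUM.sub("<N>", line)


def group_messages(lines):
    """Return (signature, count) pairs, rarest first."""
    counts = Counter(signature(line) for line in lines)
    return sorted(counts.items(), key=lambda item: item[1])


# Hypothetical sample lines, invented for this example.
SAMPLE = [
    "warn: certificate for host 12 is not signed",
    "warn: certificate for host 47 is not signed",
    "warn: certificate for host 3 is not signed",
    "error: fc port 5 link down on hba 0x1f",
]

if __name__ == "__main__":
    for sig, count in group_messages(SAMPLE):
        print(f"{count:>4}  {sig}")
```

Sorting rarest first is the point: the thousands of certificate warnings fold into a single line, and the one-off link error is what is left at the top.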

I’d encourage everyone to download a free trial and post back in the comments what obscure errors or ghosts in the machine you end up finding.

LSI Firmware VSAN

I’ve been talking to LSI over the past couple months in relation to VSAN and have a couple updates on issues and thoughts.


1. LSI support does not support their driver if it is purchased through an OEM.  They will not accept calls from VMware regarding this driver in this case either.  If you want LSI to support the VMware driver stack, you must buy direct from them.

2. LSI branded MegaRAID cards do not support JBOD (I understand that it is on the roadmap).  Dell and others are offering alternative firmwares that allow this, but they have no comment or support statement on this.

3. MegaRAID CLI can be used with RAID 0 to manage cards (I’ll release a guide if there is interest). Performance is comparable, and on supported systems it is very stable. Don’t rule it out; with all the back and forth on support for JBOD, it strangely might be the safer option until I get full testing reports from the Perc730 next week.

4. The Dell Perc730 has JBOD support now.  Despite being a MegaRAID I’m hearing good things in the field so far (I’ll update if I hear otherwise).

5. LSI prefers dealing with hardware vendors, and largely being a back end chip-set manufacturer. A stronger relationship with VMware is needed (especially with PCI-Express networking on the horizon).

6. HP is switching to Adaptec for controllers.  Hopefully this should bring their JBODs onto the VSAN HCL and allow for supplier diversity.

7. I’ve heard statements from Dell that VMware is intensifying the testing procedures for VSAN.  It looks like this will catch H310/2208 type issues first.

8. Ignore the SM2208 on the HCL for pass through.  Neither VMware nor LSI will support it.

How to buy IT? (Part 1).

On twitter John Troyer asked “Are there books/classes on “How to purchase enterprise tech” or is it mostly analysts & tribal knowledge? Not a regular supply chain thing.”

There were some quick responses with the general consensus this was something learned by trial and fire, and a universal dread if a procurement department had any power in the decision.

Let’s outline some of the problems.

1. IT buying requires you to actually understand what you’re purchasing. A storage array and Fibre Channel network isn’t just “Tubes, raw terabytes, and dollar signs” (despite one IT Director telling me this as he wrestled with whether he should buy a NetApp or an HDS, giving absolutely zero concern to performance, support agreements, or usable capacity).

2. IT purchasing isn’t always fully in IT’s control. Procurement will bid out that server and end up with a H310 RAID controller that isn’t on the VSAN HCL, or will drop iLO, meaning setup will take 10x as long and fail to deliver. Procurement departments are often judged on money saved, and not on how badly they screw up time to deliver. Then again, IT staff will often declare ridiculous things ($500K in Cisco switching to power 3 Cisco UCS blades) or make terrible decisions. But should procurement really be the back stop that keeps them from drunkenly buying toys they don’t need?

3. The vendor/channel system means the only people with the information to properly consult on a product’s ability to deliver are, most of the time, the same people selling it to you. I saw a consulting firm do a 3 vendor shootout/needs assessment for backup for a customer. 2 out of 3 vendors refused to return the analyst’s phone calls. Do we really want to do business with opaque vendors, even if they are the 50 Billion pound gorilla of the industry?

4. The amount of time invested in finding the right solution can massively dwarf the amount of time required to actually implement it. I once worked a storage deal that took 2 years. Servers were crashing, data was lost, and deck chairs were being shuffled, but the IT director was paralyzed by the decision. If a customer tells me they are about to buy storage, I warn them that they are inviting a barking carnival of vendors and nothing productive will be done until the product hits their loading dock. Get ready for bribe offers, teams of a dozen people showing up without being scheduled, and an amount of FUD that feels like an invading horde of barbarians. Once they smell blood in the water, every VAR within 500 miles will be at your door. In the end, is a quick decision, or a slow and methodical one, any better than throwing a dart at a Magic Quadrant? (Incidentally this is my theory of how some of the storage ones are scored.)

Going over these 4 quickly.

1. This one is hard. If you don’t have SMEs in house, don’t just put yourself at the mercy of your vendors. DO NOT TRUST YOUR VARs. They will lean on whatever has more margin this quarter. Realize that your internal SMEs may have agendas (if the whole internal team is Cisco certified, and that is their value, then don’t take their recommendation for a Nexus 7K at face value). Pay someone to review this who is not going to be selling you the box. Pick several vendors to review, come up with a scoring system of needs and risks, and have a 3rd party arbitrate the scoring.

2. Procurement is the wrong department to prevent waste. Realize that saving 10% and having 1/4 of solutions ship to you incomplete isn’t “winning”. Start with making sure the solution will work and THEN look at cost control. Often cost control is what weak non-technical decision makers fall back on (they are afraid the solution will not work, and want to limit the damage). Push hard to understand (or make someone make you understand) the decision at hand. Don’t try to cost control a project you don’t think is going to work. Delay the decision until you understand it, then pick decisively and correctly. If you don’t trust what your subordinates or consultants are recommending (because it’s often not working), don’t slash their budgets; replace them with people who can deliver on what they ask for.

3. Make sure you network and benchmark with others in your industry. From time to time it’s best to break from the herd if technology is a differentiation point, but limit this to where there is really compelling value. Don’t pick a storage vendor with an experimental protocol when you’re a 5 billion dollar company with only 10TB of capacity needs for Tier 1. It’s not worth the savings. Inversely, recognize when your challenges are unique. If you have 50 field offices and your competitors have 5, it may be time to consider VDI despite no one else doing it.

4. Ask difficult questions, know your criteria, and know why the last solution worked or didn’t work if this is a migration. The vendors will try to tie you up in quicksand and keep themselves in play as long as possible. Strike hard and fast. If they can’t respond quickly to your needs then they don’t understand how to qualify them.
“I’m sorry, we are only accepting bids from vendors with 4 Hour onsite non-contractor support”
Know who’s not a fit before they call so you don’t waste time with non-starters.
“We need storage that can provide 10TB, 10000 IOPS, with a 95% data skew of 8%, at an average block size of 16KB with a 75/25% read/write mix and a compressibility of less than 10%.”
Know your requirements for purchase down to the most granular bit if you don’t want to play 20 quotes.
“Our GPFS implementation will give you 20K IOPS with 4 SATA drives and no flash”.
“Average Dedupe is 500% so you’ll only need to buy 2TB of usable”
“FCoE will be [Insert anything useful]”.

Magic Server/Storage/Network pixie dust isn’t real. Watch out for ridiculous absolute statements.
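
To show how a spec like the one quoted above turns into numbers you can hold a vendor to, here is a small Python sketch. It uses the figures from the quotes (10,000 IOPS, 16KB blocks, a 75/25 read/write mix, 20K IOPS from 4 SATA drives, 500% dedupe on 10TB). The function names are made up, KB is treated as KiB, and the figure of roughly 100 random IOPS per 7.2K SATA drive is a rule-of-thumb assumption, not a measurement.

```python
def throughput_mib_s(iops: int, block_kib: int) -> float:
    """Sustained throughput implied by an IOPS figure and a block size."""
    return iops * block_kib / 1024


def split_iops(iops: int, read_pct: int) -> tuple[int, int]:
    """Split total IOPS into (read, write) for a given read percentage."""
    reads = iops * read_pct // 100
    return reads, iops - reads


def iops_per_drive(claimed_iops: int, drives: int) -> float:
    """What each drive would have to deliver for a vendor claim to hold."""
    return claimed_iops / drives


def usable_needed_tb(logical_tb: float, dedupe_pct: int) -> float:
    """Usable capacity to buy if a dedupe ratio (500% = 5:1) really held."""
    return logical_tb / (dedupe_pct / 100)


# Rule-of-thumb assumption: a 7.2K SATA drive does roughly 100 random IOPS.
ASSUMED_SATA_IOPS = 100

if __name__ == "__main__":
    print("throughput MiB/s:", throughput_mib_s(10_000, 16))
    print("read/write IOPS:", split_iops(10_000, 75))
    per_drive = iops_per_drive(20_000, 4)
    print("IOPS needed per SATA drive:", per_drive)
    print("multiple of rule of thumb:", per_drive / ASSUMED_SATA_IOPS)
    print("usable TB at 500% dedupe:", usable_needed_tb(10, 500))
```

If a claim needs each spindle to run at dozens of times the rule of thumb, or the capacity quote only works when the average dedupe ratio holds for your data, that is the pixie dust talking.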

Why VDI?

I was reading Justin Paul’s “Justifying the Cost of Virtual Desktops: Take 2” and had some thoughts on where he sees the cost model of VDI. I know Brian Madden has talked at great length about all the false cost models for VDI that exist (and I’ve seen them in the field).

1. I agree with Justin on power, with some narrow changes. Unless it’s a massive deployment, another 4 hosts in the data center isn’t going to break the bank. Unless you’re forcing people to use thin clients, you’re also not saving anything real on the client side (and certain things (Lync, MMR etc) require Windows Embedded clients at a minimum anyway). The only case where I’ve successfully made this argument was a call center that was 24/7 and handled disaster operations in Houston. After Ike everyone learned how hard it is to find fuel, so anything that reduces the generator and battery backup budget has real implications.

2. Justin does make a good point that SA, and keeping up with the Windows OS releases on physical machines, is just as expensive as VDA. Sadly this is only true if companies are not just standardizing on Windows 7 and running it into the ground for the next 5-7 years. Hey, it worked for XP, right?

3. While I agree a ticket system helps track time spent restoring machines etc, no one makes non-billable IT resources track time to the level of detail (with meta tags/search) needed to make building an in house ROI model possible. The best luck I’ve had is having people do a week long survey broken down into 15 minute intervals, which is as close as you’ll get in house IT to go. It’s painful to get even that done. Unless your desktop support is outsourced (and you have access to their reports!), this is sadly always going to be a fuzzy, poorly tracked cost. I’d argue VMware Mirage (or an equally good application streaming/imaging system) can provide a lot of the opex benefits without the consolidation and other pros/cons of VDI. VDI extends beyond imaging and breakfix. It’s about mobility, security, and

4. People work from home today with VPN and Shadow IT (LogMeIn etc). The ability to do this isn’t what you sell; it’s the execution and polish (give a sales person a well maintained PCoIP desktop and they will grab their iPad and never come back to the office). It’s the little things (like ThinPrint letting them print to their home printer). Ultimately it isn’t the “occasional” or snow day remote users that sell VDI. It’s the road warriors and branch offices (who are practically the same thing, with as little attention as they typically get from central IT).

The VSAN build 2 (Watch out for partitions!)

A quick post. My disks had been burn in tested by AcmeMicro, so they had partitions on them. To protect you from yourself, VSAN will refuse to install on disks with existing partitions.

First, run a quick check for the disk IDs (naa.##############).

~ # esxcli storage core device list
Display Name: Local SEAGATE Disk (naa.5000c500583aeb05)
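
With more than a handful of disks, picking the IDs out by eye gets old. Here is a minimal Python sketch that pulls them from saved command output. It assumes the IDs appear in parentheses on the "Display Name:" lines, as in the line shown above; the function name and the sample string are just for illustration.

```python
import re

# naa identifiers as they appear in Display Name lines,
# e.g. "(naa.5000c500583aeb05)".
_NAA = re.compile(r"\((naa\.[0-9a-fA-F]+)\)")


def disk_ids(device_list_output: str) -> list[str]:
    """Return the naa.* IDs found on 'Display Name:' lines, in order."""
    ids = []
    for line in device_list_output.splitlines():
        if line.strip().startswith("Display Name:"):
            ids.extend(_NAA.findall(line))
    return ids


# The one line of output shown above, used as sample input.
SAMPLE = "Display Name: Local SEAGATE Disk (naa.5000c500583aeb05)\n"

if __name__ == "__main__":
    print(disk_ids(SAMPLE))
```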

Once you have the IDs, check for partition tables. (Note that after the first line, which is the disk geometry, the first number on each line is the partition number. In this case I have a partition 1 and a partition 2.)

~ # partedUtil getptbl /dev/disks/naa.5000c500583aeb05
121601 255 63 1953525168
1 2048 718847 7 128
2 718848 1951168511 7 0

Last, we have to delete each partition by number.

~ # partedUtil delete /vmfs/devices/disks/naa.5000c500583aeb05 1
~ # partedUtil delete /vmfs/devices/disks/naa.5000c500583aeb05 2
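
If you have a pile of disks to clean, the same steps can be sketched in Python. This is only an illustration under assumptions: it parses getptbl output saved as text in the layout shown above (a geometry line followed by one line per partition, skipping any non-numeric line such as a label line if one is present), and it only builds the delete commands as strings without running anything. The function names are made up.

```python
def partition_numbers(getptbl_output: str) -> list[int]:
    """Return partition numbers from `partedUtil getptbl` output.

    Lines that do not start with a number are skipped. The first numeric
    line is the disk geometry; every numeric line after it describes one
    partition, and its first field is the partition number.
    """
    numeric = [
        line.split()
        for line in getptbl_output.splitlines()
        if line.strip() and line.split()[0].isdigit()
    ]
    return [int(fields[0]) for fields in numeric[1:]]


def delete_commands(disk_id: str, getptbl_output: str) -> list[str]:
    """Build (but do not run) one partedUtil delete command per partition."""
    path = f"/vmfs/devices/disks/{disk_id}"
    return [
        f"partedUtil delete {path} {number}"
        for number in partition_numbers(getptbl_output)
    ]


# The getptbl output shown above, used as sample input.
SAMPLE = """121601 255 63 1953525168
1 2048 718847 7 128
2 718848 1951168511 7 0
"""

if __name__ == "__main__":
    for command in delete_commands("naa.5000c500583aeb05", SAMPLE):
        print(command)
```

Printing the commands instead of executing them is deliberate: deleting partitions is destructive, so you want to eyeball the list before pasting it into the host.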

At this point we can now install VSAN and eat cake 🙂

VSAN build #2 Part 1 JBOD Setup and Blinkin Lights

(Update, the SM2208 controller in this system is being removed from the HCL for pass through.  Use RAID 0)

It’s time to discuss the second VSAN build. This time we’ve got something more production ready, properly redundant on switching and ready to deliver better performance. The platform used is the SuperServer F627R2-F72PT+.

The specs for each of the 4 nodes:

2 x 1TB Seagate Constellation SAS drives.
1 x 400GB Intel SSD S3700.
12 x 16GB DDR3 RAM (192GB).
2 x Intel Xeon E5-2660 v2 Processor Ten-Core 2.2GHz
The Back end Switches have been upgraded to the more respectable M7100 NetGear switches.

Now the LSI 2208 controller for this is not a pass through SAS controller but an actual RAID controller. This does add some setup, but it has a significant queue depth advantage over the 2008 in my current lab (600 vs 25). Queues are particularly important when bursts of writes drop out of cache to my SAS drives (say, from a VDI recompose). Deep queues also help SSDs optimize commands internally for write coalescing.
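
A rough way to see why queue depth matters is Little's Law: the number of IOs in flight equals IOPS times latency, so a controller that can only hold N outstanding commands tops out at about N divided by the per IO latency. The Python sketch below applies that to the two queue depths mentioned above. The 5 ms latency is an assumed figure for illustration only, not something measured on this build, and real controllers have other limits too.

```python
def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS from Little's Law: in-flight IOs / latency."""
    return queue_depth * 1000 / latency_ms


ASSUMED_LATENCY_MS = 5.0  # illustrative assumption, not a measurement

if __name__ == "__main__":
    for name, depth in (("LSI 2008", 25), ("LSI 2208", 600)):
        ceiling = max_iops(depth, ASSUMED_LATENCY_MS)
        print(f"{name}: queue depth {depth}, ceiling {ceiling:.0f} IOPS")
```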

If you go into the GUI at first you’ll be greeted with only RAID 0 as an option for setting up the drives. After a quick email to Reza at SuperMicro he directed me to how to use the CLI to get this done.

Ctrl + Y will get you into the MegaRAID CLI, which is required to set JBOD mode so SMART info will be passed through to ESXi.

$ AdpGetProp enablejbod -aALL // This will tell you the current JBOD setting
$ AdpSetProp EnableJBOD 1 -aALL //This will set JBOD for the Array
$ PDList -aALL -page24 // This will list all your devices
$ PDMakeGood -PhysDrv[252:0,252:1,252:2] -Force -a0 //This would force drives 0-2 as good
$ PDMakeJBOD -PhysDrv[252:0,252:1,252:2] -a0 //This sets drives 0-2 into JBOD mode

They look angry don’t they?

Now if you haven’t upgraded the firmware to at least MR5.5 (23.10.0.-0021) you’ll discover that you have red drive lights on your drives. You’ll want to grab your handy DOS boot disk and get the firmware from SuperMicro’s FTP.

I’d like to thank Lucid Solution’s guide for ZFS as a great reference.

I’d like to give a shout out to the people who made this build possible.

Phil Lessley @AKSeqSolTech for introducing me to the joys of SuperMicro FatTwins some time ago.
Synchronet, for continuing to fund great lab hardware and finding customers wanting to deploy revolutionary storage products.

Out of support, budget, capacity. The myth of the MacGyver IT Hero. (Part 1).

If you have worked in IT you’ve run into variations of the following question.
Help, my MD3000i that’s 10 years old and out of support/life is out of space and hanging on by a questionable backplane connection! How do I fix this/keep using it for 5 years?

Many IT staff (particularly in the SMB realm) get excited when they are faced with this challenge. They feel this is part of why IT exists. They quickly brandish their chewing gum, coat hanger, easy bake oven, and rubber chicken, and dive into these problems so they can brag about it and live to see another day. They are vaunted as mullet wearing heroes with 92 disk RAID 5 or QNAP based HA clusters that let them have enterprise like features on 1/10 the budget. They are convinced their goal is to run an IT shop with as little budget as possible, and they mask poor communication and architecture skills with a never ending series of heroic 28 hour battles.

I come not to praise this hero but to bury him. He is a risk to his business and our profession, and he needs to be stopped as he undermines the credibility of us all. There is doing more with less, and then there is the ridiculous that is our mullet wearing bandit. Let’s examine the cast of characters that leads to these messes.

Mr. “I don’t need support, just more GB!”

Out of support critical hardware is not something that just happens overnight. At some point in purchasing that shiny new VNXe, someone who had a 100K budget made a choice between buying the extra 2 years of support or getting more capacity/more RAM in the hosts. You’ll recognize this guy as he will often direct his entire budget at making one number really high. Expect to find quad socket hosts with 32GB of RAM, or possibly an all flash SAN with a single Fibre Channel switch. Everything will be redundant except the one thing he does not understand. Expect a terabyte of RAM, and a 4 disk RAID 5 in his SQL server.

Mr. Brand Name

This is the IT guy who’s convinced that solid architecture or support agreements are not needed as long as he’s got brand names. He will go out and pick up solid brand names (EMC/Cisco/VMware) but pick their small business offerings, which lack the support, features, or capacity that he needs. Expect to find Cisco/Linksys RV series routers and SG switches, or VNXe or EMC/Lenovo storage deployed for an Oracle RAC cluster. Do not be shocked when you discover he is running production servers on VMware GSX/Workstation/Fusion. He thinks he’s a hero as he’s got all the right “toys” without spending the real money required to get the right ones for the job.

Mr Open Source

In no way view this as an attack on open source. (I’m typing this into WordPress, and this server runs Apache on Linux, which in this case is the right tool for the job.) This IT guy’s lone goal is to spend nothing on software. If you ask him what storage he’s running he’ll mumble something about OpenSolaris ZFS, with Xen, and SquirrelMail for a 10 man office. He will often not fully understand the technologies that he is deploying, or have the skills to soundly deploy them, making things more difficult. He stands out in that he will deploy servers on a non-LTS Ubuntu Desktop Edition. Generally it will take days or even weeks for an RHCE to make sense of the network. Expect his keyboard to be switched to Dvorak. You can run a test to identify this guy from a rational open source skilled admin.

Mr. Chicken Little

Chicken Little blends in with normal functioning small IT admins, except he has one big flaw. He’s afraid of the sky falling. Every time someone mentions moving a simple, logical thing to the cloud (email, spam filtering, website hosting) he shrieks like a chicken fighting for his life.

The point of this post is partly to rant, and partly to explain something I’ve found to be common sense for a while.
Any project needs to scope the baseline RPO/RTO/reliability/availability, as well as capacity and performance baselines, before it should be signed off on. “What can I get for the change in my pockets?” is a game to play in a dollar store, not in IT. Saying no, or translating ridiculous budget reductions into reductions of user functionality and not reliability, are skills that every good IT pro should have. Sadly virtualization, overcommitment of resources, and the consumerization of IT have made this problem worse. Part 2 of this article will talk about strategies for assuring that budget is tied to further updates and roll outs, and how to overcome the budget cliffs and the problems of scaling infrastructure to meet “just in time” and other new trends from the operations side.

Are containers our future?

This is a quick post in reaction to Alex Benik’s post at Gigaom. While I like Gigaom’s commentary on the industry at large, they don’t always seem to understand infrastructure. Alex starts out by stating that the current industry practice of separating out applications with their own dedicated OS instances, and having low utilization, is a terrible problem. He almost paints hypervisors as part of the problem. He cites 7% CPU usage on EC2 instances as a key example of what is wrong with virtualization and usage.

I’ve got a few quick thoughts on this.

1. The reason Amazon EC2 can be so cheap is that Amazon can over subscribe instances heavily. Low average CPU workloads are the foundation for virtualization and all kinds of other industries (shared web hosting etc). He’s turned the reason virtualization is a great cost saving technology into a problem that needs to be solved. If everyone was running their instances at 100% all the time, then there would be a problem.

2. He’s assuming that CPU is the primary bottleneck. As others (Jonathan Frappier) have pointed out, storage is often the bottleneck. There comes a point where you can only get so much disk IO to a virtual machine. In large enterprises with shared storage arrays, bottlenecks in storage IO (queue depths on HBAs, LUNs etc) eventually start to crop up, and it becomes easier to scale out to more hosts than to try to scale deep. VMware and others have created technologies (CBRC, vFRC, vSAN) that will help with this. Memory is also both helping and hurting this density problem.

3. Until the recent era of large memory hosts, memory was often a bottleneck. As 64 bit databases and applications became ever hungrier to cache data locally, this waged a 2 factor war on CPU utilization. Hosts with VMs with 16GB of RAM quickly ran out of RAM before they ran out of CPU. Memory and disk IO also subtly influence CPU in ways you might not factor in. Once memory is exhausted on a host and over subscription is occurring, CPU usage can spike as processes take longer to finish. In memory workloads allow CPUs to process data quicker, and jobs finish sooner. Vendor recommendations for ridiculous memory allocations don’t help the situation either (I’m looking at you, Sage). When vendors recommend 64GB of RAM for a database server serving 150 users, it’s become clear that SQL monkeys everywhere have given up on actually doing proper indexing or archiving and instead are relying on memory cache. This demand on memory causes hosts to fill up long before CPU usage can become a problem, unless managers are willing to trust a balloon driver to intelligently swap out the “memory bloat”. (Internally and with customer vCloud deployments I’ve seen much better utilization by oversubscribing memory 2 or 3 times.) This is not a bad thing unless an application has scale. (It’s now cheaper to throw hardware at the problem than to write proper code/indexes/optimizations.)

4. He’s also forgetting the reason we went with virtualization in the first place: to separate out applications so that we could update them independently from each other, and to stop running into issues where rebooting a server to fix one application caused another application to go down. Anyone who’s worked in shared tenant container hosting can tell you that it’s not really that great. Compatibility matrices, larger failure zones and all kinds of problems can come up. For homogeneous web hosting it’s a fine solution. For the enterprise trying to mix diverse workloads it can be a nightmare. We use BSD containers internally for some websites, but beyond that we stick to hypervisors as a more general use, stable and easier to support platform. I’d argue JEOS, vFabric and other stripped down VM approaches are a better solution, as they enforce instance isolation while giving us massive efficiency gains over the kitchen sink deployments of old (I’m going to call out WebSphere on this).

Getting the ratio of CPU to memory to disk IO and capacity right is hard. Painfully hard. Given that CPU is often one of the cheapest components (and the most annoying to try to upgrade), it’s no wonder that IT managers everywhere, who come from a history of CPUs being the bottleneck, often get a little out of hand with overkill in CPU purchasing. I’ve been in a lot of meetings where I’ve had to argue with even internal IT staff that more CPU isn’t the solution (the graphs don’t lie!) while disk latency is through the roof. I’d strangely argue a current move to scale out (Nutanix/VSAN etc) might fix a lot of broken purchasing decisions (LOTS of CPU, low memory, low disk IO).
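
To put rough numbers on this, here is a small Python sketch of which resource fills a host first. The 7% average CPU figure, the 16GB VMs, and the 2 to 3 times memory oversubscription come from the discussion above. The host size (20 cores, 192GB of RAM) is borrowed from the VSAN build nodes elsewhere on this page, and the 1 vCPU per VM and the 70% CPU target are assumptions picked for the example. Percentages are kept as integers to avoid floating point surprises.

```python
def vms_by_cpu(cores: int, vcpus_per_vm: int,
               avg_util_pct: int, target_util_pct: int) -> int:
    """VMs that fit before average CPU demand reaches the target utilization."""
    return (cores * target_util_pct) // (vcpus_per_vm * avg_util_pct)


def vms_by_memory(host_gb: int, vm_gb: int, overcommit_pct: int = 100) -> int:
    """VMs that fit in RAM at a given memory overcommit (100 = none)."""
    return (host_gb * overcommit_pct) // (vm_gb * 100)


if __name__ == "__main__":
    cpu_limit = vms_by_cpu(cores=20, vcpus_per_vm=1,
                           avg_util_pct=7, target_util_pct=70)
    print("VMs the CPUs could carry:", cpu_limit)
    for overcommit in (100, 200, 300):
        fit = vms_by_memory(host_gb=192, vm_gb=16, overcommit_pct=overcommit)
        print(f"VMs that fit in RAM at {overcommit}% overcommit:", fit)
```

Even at 3x memory oversubscription the host runs out of RAM long before the CPUs get busy, which is the purchasing mistake described above in miniature.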

vSphere Distributed Storage and why it’s not going to be “production ready” at VMworld

vSphere Distributed Storage (or vSAN) is a potentially game changing feature for VMware. Being able to run its own flash caching, auto mirroring/striping storage system that’s fully baked into the hypervisor is powerful. Given that storage is such a huge part of the build out, it makes sense that this is a market in need of disruption.

Now, as we all hold our breath for VMworld, I’m going to give my prediction that it will not be listed as production ready from day one, and here are my reasons.

1. VMware is always cautious with new storage technologies. VMware got burned by the SCSI UNMAP fiasco, and since then has been slow to release storage features directly. NFS cloning for View underwent extensive testing and tech preview status.
2. VMware doesn’t like to release home grown products straight to production. They do this with acquisitions (Mirage, View, Horizon Data, vCops) but they tread carefully with internal products. They are not Microsoft (shipping a broken snapshot feature for two versions was absurd).
3. The trust and disruption need to happen slowly. Not everyone’s workload fits scale out, and encouraging people to “try it carefully” sets expectations right. I think it will be undersold by a lot, and talked down by a lot of vendors, but ultimately people will realize that it “just works”. I’m looking for huge adoption in VDI, where a single disk array can often cause awkward bottlenecks. This also blunts any criticisms from the storage vendor barking carnival, and lets support for it build up organically. Expect shops desperate for an easier, cheaper way to scale out VDI and vCloud environments to turn to this. From a market side I expect an uptick in 2RU servers being used, and the backplane network requirements pushing low latency top of rack 10Gbps switching further into the mainstream for smaller shops and hosting providers who have been holding out.

These predictions I’m making are based on my own crystal ball. I’m not currently under any NDA for this product.
No clue what I’m talking about? Go check out this video

Which Hyper-V admin are you?


Microsoft has tried really hard to get the message out that Hyper-V is a real platform that can solve real problems. Every Hyper-V deployment I run into tries to solve a few too many problems. I’ve noticed that Hyper-V admins are a special breed who generally fall into one of a few camps.

1. The Octopus/Hydra – Someone armed with best practices from the late 90’s is convinced that dedicated network cables on a per VM basis are the only way to deploy Hyper-V. The end result is 6 VMs and 14 network cables coming out of the host. The admins in these environments don’t get that just because a server is replacing 6 servers doesn’t mean it needs to take up as much rack space as before.

2. The Kitchen sink – The admins in these environments don’t let something like best practices slow them down. They try to cram as many roles as possible onto a single box. You can spot these guys as they have never heard of core edition.
DHCP – Check
DNS – Check
AD – Why not throw in ALL the FSMO roles?
Hyper-V – Sure!

3. MacGyver – Budget, what budget? This admin can be spotted by the questionable SuperMicro stuffed full of SATA drives and held together by duct tape. He likely chose Hyper-V for its support of fake RAID (dedicated storage controllers or arrays with cache are for fancy rich folk!). He will occasionally be seen with a cluster, but it will be based on SATA RAID 5 in a QNAP, and he thinks that it’s only a matter of time before he can get RSYNC to copy locked VHDs.

4. The Hoarder – You will know him by his snapshots. He’s convinced somehow that snapshots are a replacement for backups, and has 400GB of snapshots on an 80GB VM. Even introducing the thought of trying to consolidate them unnerves him. Eventually someone will come along with VMware Converter and clean up this mess, but otherwise it’s best to avoid eye contact with this disaster in the making.