

So you’re thinking about taking an offer… What do you need to know?

It’s a new year and I’m sure some of you had resolutions to look at a new job. As budgets “unfreeze”, new jobs are opening.

This post has some history. Before coming to VMware I worked for an IT consultancy, where I was at one time a hiring manager. It was always interesting seeing why people chose to stay, leave, or join our shop. Even when people left, I heard about their future moves because, as their previous manager, I was a common reference. On top of this, new hires would often ask me a million different questions about a job, trying to compare the old company with the new company regarding benefits (both compensation and non-compensation related).

From this, I’ve amassed an interesting list of:

1. Compensation that often is overlooked
2. Things you want to know about a job before you take it, for quality of life reasons
3. How to know if the grass is greener (or not) on the other side

While this list isn’t something you would send in whole to a recruiter, it’s information that through various sources you might want to try to understand before making a jump to a new job. The first half is Job questions; the second half is compensation questions.

 

The Job Questions…

What’s the team/dept/company’s view on Training?

If they don’t have a training programme or allow time for training/skills improvement that could be a red flag.

Why is the position open?

Growth, backfill, etc. This is the reverse of “why are you looking to leave your last job?”. If it’s the third time they’ve tried to fill the role, something may be off…

What are the expected hours? What are the exceptions, holidays, etc.?

I once worked an outage till 4 AM and then was expected to walk into the office by 8:30 AM. I was happy to leave that place. School districts might do four day work weeks in the summer; some oil/gas companies do 4 x 10’s or other weird schedules. Occasionally I have to take calls early or late (to deal with people in EMEA, ANZ, etc.).

Are there SLA’s in place?

What is expected of your team, and are they equipped to meet it?

What is the annual IT/Department budget?

What does the budget for your group look like? Which projects have been funded, and which are planned to be funded, can be a proxy for this question. You don’t want to walk into a shop with 8-year-old systems and no budget for replacement.

Who determines the IT budget?

What’s the process, who are the actors involved?

What’s the company’s position on opex/capex IT spend?

Lease vs. Buy. Are they balanced, or for financial reporting reasons (ROIC) are they 100% one or the other if possible?

Are they cloud (friendly, neutral, hostile)?

What are they using for cloud now, and what are they planning on migrating?

What does your current infrastructure look like?

A shiny brand new VxRAIL/UCP-HC cluster, or 200 physical servers running Windows 2000? How bad is the technical debt? What/where are the datacenters, who are the providers, and what is the networking (WAN, and edge/campus gear)? What storage vendors and hypervisors are in play?

What is the spread of the tasks expected and are they reasonable?

There’s nothing like being hired to be a data center architect and discovering that fixing printers is in your responsibilities. Skill growth requires you to focus on things that matter. Also, if managers see you fixing printers or doing other lower end work, they tend to mentally associate what you should be paid with the bottom 10% of work rather than the top 10% most valuable work.

What Services are outsourced?

Does someone manage printers, the WAN circuits, the storage, backups, the DR, etc.? Beware shops that don’t believe in outsourcing anything as they tend to view in-house labor as a “free” commodity.

What are they doing for DR? 

This question is a mix of what their plan is and what the reality is. How often is it tested? Do they hit the SRM failover button once a quarter, or do they have an out of date binder?

What is the targeted refresh cycle for Network/Servers/Storage?

Do they run stuff five years, ten years? Do they run gear beyond its natural life, or beyond support agreements?

What is the maintenance schedule?

Do they patch at all? Is there automation in patching?

What Compliance do they have?

PCI/HITECH/etc.

The Team Questions

Who will be the manager? Can I meet them? – It’s a red flag if you cannot meet your line manager. You will want to know the person who will assess your performance, impact your bonus, assign you good (or bad) projects, etc…

What is your biggest daily/weekly frustration?

The key thing to note is whether this is something you can stand, or something that’s fixable. Bonus points if you bring unique skills, or you will be working on a project to fix it. “Our Fibre Channel network is slow, but the HCI project you will be on should fix that!”

How is success measured?

Is there a forced stack rank? Are there general metrics that you target (uptime, on-time delivery of projects)?

Who is on the team? Can I meet them? Knowing who you work with is crucial. Are they talented, friendly, cooperative?

How does the team communicate? Are there daily meetings, do they use Slack, do they just use email, is everyone in the same building? What percentage of the team works remotely?

How is documentation handled? (Well documented Wiki, vs. the last guy torched Jira on the way out and you will be guessing passwords).

What are the platforms and Vendors?  Are you a CCIE and it’s an all Juniper shop? Don’t be scared! The key is knowing in what areas you will need training.

What is the new employee onboarding process?  – Will it be two days of well-orchestrated events, or will you still be waiting for a phone and computer 30 days later?

What are expectations for the first 90, 180, 365 days? Is there a project, milestone, or education path they need you to have accomplished? How long do they expect you to take to fit into the shoes?

What is the cross coverage?

Is there only one person who knows how to restore from backup? Is there cross training? This can be bad if you want to go on vacation…

What is the upward mobility? 

What are the expectations for moving up in title, rank, role/responsibility? Are there defined elements to your career path and climb, or will you be “IT Dude” until “Head IT Dude” retires?

*What about the Company*

What’s the company’s roadmap?

If they don’t know where they want to go then it’s going to be difficult to help steer them there.

What is the YoY growth? Is the company growing, or is it holding on for life support? Some industries are cyclical (Oil/Gas), some are past their prime (Sun Microsystems was a different company to work for in 1994 than in 2001).

How many Employees are at the company? At a five man company, you might have to put toner in the printer. At IBM you likely will not know that person’s name. Some people like large companies, some like smaller. There are pros and cons to both.

Is the company profitable under GAAP? Companies sometimes do crazy things like claim they are profitable if you exclude employee compensation. If a company is a tech startup growing 100% year over year, don’t expect this one to be true, but if it’s a mature public company, this is something you can look up.

If not, what is the timeline or pathway towards profitability? If it’s a startup, it may be planning on exiting soon, or taking more VC and growing to the moon. Both have their risks, make sure you understand them. What is the runway (how long at the current burn rate will they survive)?
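Runway is simple arithmetic once you have the two inputs. A minimal sketch, assuming a constant net burn rate; the function name and the dollar figures are hypothetical, not taken from any real company:

```python
def runway_months(cash_on_hand: float, monthly_burn: float) -> float:
    """Months the company can keep operating at a constant net burn rate."""
    if monthly_burn <= 0:
        # Break-even or profitable: cash is not shrinking, so runway is unbounded.
        return float("inf")
    return cash_on_hand / monthly_burn


# Hypothetical example: $6M in the bank, $500K net burn per month.
print(runway_months(6_000_000, 500_000))
```

In practice burn tends to grow with headcount, so treat the result as an upper bound and ask when the next funding round is expected.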

What is the company’s competitive advantage? Is it low cost? Is it Intellectual property? Is it market saturation/penetration? This can shed some light on how the company operates. A ruthless lean manufacturing company might give employees 8-year-old laptops because they are cheap on capital spending.

What is the biggest roadblock to scaling the company?

Is it sales, marketing, operations, R&D?

What challenges does the company have at the moment? What do you foresee coming?

This can be quite telling; it can show that they’ve taken the time to identify and address challenges. Identifying key competitors here can help quite a bit.

Compensation Questions

1099 or W-2 (US). Contractor? A contractor who’s a W-2 employee of the contracting company? Full-time employee of the end customer? There are LOTS of ways to chop this. There are tax implications of being 1099. Note, there are potential issues with being a 1099 as a tech worker if you are treated like a full-time employee.

Pay Cycle – You shouldn’t be living paycheck to paycheck, but knowing the cycle makes sense. If you’re rolling from a weekly to a monthly cycle you may need to move some things around to handle the change in cash flow.

Salary Base and its growth – can it grow? Is there an org chart with clear steps to moving up and getting bumps in pay? Does everyone get 1% raises and stagnate till they leave? A company that hasn’t given raises in 5 years has given everyone a pay cut.

OTE Bonus. Is it a cash value or a multiplier based on base pay? Is it tied to metrics or to your boss’s and director’s random fancy? (This isn’t that bad, but you need to know who decides it.) While there is an “On Target Earnings” figure, nothing stops you from getting over 100%. The best way to see how real this is is to check Glassdoor and existing employees who’ve been there 4-5 years. Sometimes a bonus is real; sometimes it is “virtual”. How often is the bonus paid out, and will they pro-rate a partial bonus for a new employee joining mid-cycle?  I once had a co-worker leave for a job that he thought paid 10% more, but he forgot to ask whether they had a bonus. At the end of the year, he learned they didn’t have them (or raises) and discovered he didn’t make more money.

Insurance – PPO/HSA/HMO/EPO/POS all have different issues. What’s in network vs. out of network? Also dental and vision insurance. What about medications?  Eyecare health insurance is a scam/pre-payment program. Use EyeBuyDirect or some online place to buy glasses, or max out your HSA and get LASIK if you can. Reddit has a good thread explaining the difference here and how to compare.
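The co-worker story is easy to check with arithmetic before you sign. A minimal sketch, assuming the bonus is a flat percentage of base and is paid out in full; the salaries and the 10% figures below are hypothetical:

```python
def annual_cash(base: float, bonus_pct: float = 0.0) -> float:
    """Base salary plus a bonus expressed as a percentage of base."""
    return base + base * bonus_pct / 100


# Hypothetical offers: the new job pays 10% more base but has no bonus.
old_job = annual_cash(100_000, bonus_pct=10)
new_job = annual_cash(110_000)
print(old_job, new_job, new_job - old_job)
```

With these numbers the “10% raise” is a wash in the first year, and it falls behind in later years if the old job also gives raises and the new one does not.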

Education

School, College, Certifications, Classes. – Do they pay for certification tests, if so how many attempts? The key one to test the seriousness of this is to ask others in the department what they have spent in the past year.

Conferences – Tacking onto certifications, do they pay for VMworld? Do they cover travel and hotels? Are you banned from events in Vegas even if they are a lower cost than San Francisco? (not uncommon in SLED).

Sabbatical – In our company you can apply for 3-month transfers to wildly different jobs to learn how that role functions. You can also do a 1-week education track (take education in something unrelated).

Stock and Investment Compensation

RSUs (Restricted Stock Units) – If you keep getting these every year on a standard 2-5 year vesting schedule (depends on company and grant window), you eventually end up with a rather nice kicker. This also is nice if your stock doubles within a given year (well, except for capital gains). The longer you stay, the stickier these become, and the more a company wants you to stay, the more of them they will give you to “handcuff” you to the company. A decent 6 figure pile of these is nice and can be used as leverage with a company that wants to poach you, as a reason they had better give you a bigger base (or a bigger pile of them!).

Stock Options – Inversely, if you work for a startup, you might get stock options. These are a LONG shot gambling game (like 2% pay off), but I know some guys whose stock is trading in the $30s and whose options were in the $2 range, so assuming they make it to lockout I expect to get a call to hang out on their yacht. Personally, there are so many ways to screw the employee, like clawbacks/ratchet clauses, that I don’t put much faith in these.  https://tldroptions.io

ESPP – Buy stock at a discount (See above comments). Note these are bought at a 10-15% discount based on the price at the beginning or end of the window (whichever is lower), so it’s a game of heads I win, tails you lose against the market and can pay pretty well (or just be a nice couple grand of cash). I’ve had windows where I made 15%, sometimes I’ve made 115%. These are structured so that you make money no matter what, but read the fine print.
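To see why the lookback makes this “heads I win, tails you lose”, here is a minimal sketch of the return on one purchase window. It assumes you sell immediately at the end-of-window price and ignores taxes, fees, and contribution limits; the function name and share prices are hypothetical:

```python
def espp_return(start_price: float, end_price: float, discount_pct: float = 15.0) -> float:
    """Gain on an ESPP purchase as a fraction of the money put in.

    The purchase price is the discount applied to the LOWER of the
    start-of-window and end-of-window prices (the lookback).
    """
    purchase_price = min(start_price, end_price) * (1 - discount_pct / 100)
    return (end_price - purchase_price) / purchase_price


# Hypothetical windows: stock flat, stock falls 20%, stock doubles.
for start, end in [(100, 100), (100, 80), (100, 200)]:
    print(f"{start} -> {end}: {espp_return(start, end):.1%}")
```

Even when the stock falls, the discount is applied to the lower ending price, so an immediate sale still comes out ahead under these assumptions. The risk comes from holding the shares afterwards or from plan rules in the fine print.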

ESOP – The weird retirement type cousin of ESPP. I hear these are more common overseas.

Flexibility in work

Paternity leave – Some places do partial pay, some do maternal OR paternal, and some do maternal only AFTER you burn out your PTO. Note maternity, paternity, and adoption leave may have different rules. I’ve got a family member whose company policy is six months. My wife is a pediatrician at a children’s hospital. She gets zero. This is all over the place in the US.

Vacation – My first job I had zero vacation for the first year. Note at some companies this is more negotiable than salary; sometimes it’s less. Are sick days different? Do you need a doctor’s note? Are there blackout times for vacation (VMworld I’m pretty sure is a non-starter in my current role)? Do they make you take vacation for conferences (Yes I’ve seen this a lot sadly…)?

Flex Time/Overtime pay – Can you turn overtime into time off? If you come in early can you leave early? Do you get paid for overtime (even if you’re an exempt employee some places will still pay if approved)? Does the company miscategorize helpdesk as exempt or engage in other questionable legal practices?

Commute Costs – Company Car, parking pass, bus pass, toll pass? What’s the non-reimbursed depreciation? What is the $ per mile they allow for trips to the datacenter? Do you get a car allowance (EMEA this is more common)?

Work from home/anywhere – Can I just leave town on Wed/Thursday and go to a beach house to finish working out the week? There are HUGE cost savings to working from home, but do pay attention if you need to supply your own desk, chairs, monitors, etc.

Expense

Do they let you do your own booking? Do they require a corporate credit card (no points can be brutal, to the point of $20-30K easily for some people in compensation)? Can you expense travel lounges on long flights? Can you expense more than $15 for lunch with a customer? Using Lyft instead of downtown and airport parking has cut the mileage on my car to almost nothing.

Travel

Travel Points and status – Traveling for work a lot adds up. Note this is NON-taxable (weird exclusion). So when traveling, I can get hotel points and airline points. With Southwest, I have a companion pass (my wife flies free with me), and with Marriott, I get free cocktails and appetizers in the afternoon and breakfast in the morning in the executive lounge. I get free upgrades with Marriott when traveling, so that $150 small room can turn into a 40th-floor suite sometimes.

Travel Policy – Do they make you fly 18 hours, five hops to save $100?

Do they put you in first class if the flight is over 4 hours?

Do you stay in the Motel 8 and have to share a room (or PAY for your spouse’s 1/2 of the room if they happen to travel with you!)? Do they make you fly in the morning you are presenting when it’s 12 time zones away, or do they put you up in the hotel for the weekend to adjust to the time zone, and be a tourist for the weekend?

Team Offsite, outings, parties, etc. – Got a team offsite and can you expense going snowmobiling or something cool? Beer bash for finishing release? If you are on campus are there free movie nights and other things. Does the boss cover happy hour on Friday?

Retirement stuff

401K – What’s the match? Is it partial? Does it take a while to get vested? What can you invest in? Are the default options all garbage or can you keep fees low and put money into low-fee index funds?

401A – Like a 401K match but you don’t have to put money in, they just put x% of your salary. Common in Education and non-profits.
457(b) – Can withdraw from it without early penalty if you no longer work for the said employer. This one carries risks if the employer goes insolvent.
403B – A lower overhead 401K plan with no match. Common in Education and non-profits.

Pension – These do exist in a few places still in the US. More common overseas.

MISC.

Equipment allowance. My wife spends money on books or stethoscopes. Some people can expense screens, laptops, mice. We have vending machines for phone chargers, mice, etc. around our offices.

Telecom – Will they cover your cell phone or data plan? Did they buy you a pager to get out of paying your cell phone bill (I had one of these in 2008)?

Gym reimbursement  – Do they pay for gym memberships?

ESXi 6.5 Patch 2 – vSAN Support Insight!

ESXi 6.5 Patch 2 is out, and with it comes a product improvement that I’ve been excited about for quite some time. The KB for what’s new can be found here.

Three storage improvements came out with this release.

  • vSAN Support Insight (including a dedicated customer bulletin with more details on this feature)
  • Adaptive resynchronization (Previously released for 6.0) – Adaptive Resync adjusts the bandwidth share allocated to Resync I/O to minimize impact to client I/O. With this feature, Resync speed will adaptively adjust during off-peak and high peak I/O cycles. During off-peak cycles Resync will speed up and during high peak cycles Resync will slow down. This ensures Resync continues to make progress while minimizing impact to the client I/O.
  • Multipath support for SAS systems – “vSAN now enables multiple redundant paths from server to storage with no setup required, when used with a supported multipath driver. An example of such a system is HPE Synergy.”
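To make the adaptive resync idea concrete, here is a toy model of a bandwidth share that shrinks under client load but never drops to zero. This is NOT vSAN’s actual algorithm, which is not described in this post; the function, the 20% floor, and the 80% ceiling are all hypothetical values chosen for illustration:

```python
def resync_share(client_load: float, min_share: float = 0.2, max_share: float = 0.8) -> float:
    """Toy model: fraction of bandwidth granted to resync I/O.

    client_load is the fraction of bandwidth client I/O currently wants (0.0-1.0).
    Resync gets whatever clients are not using, bounded by a floor (so resync
    always makes progress) and a ceiling (so clients always keep some headroom).
    """
    client_load = min(max(client_load, 0.0), 1.0)  # clamp bad input
    return min(max(1.0 - client_load, min_share), max_share)


# Off-peak, moderate, and peak client load.
for load in (0.0, 0.5, 1.0):
    print(f"client load {load:.0%} -> resync share {resync_share(load):.0%}")
```

The floor is the part that matches the “continues to make progress” behaviour described above: even at peak client load, resync keeps a guaranteed slice.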

vSAN Support Insight is revolutionary in its ability to change the support experience and accelerate product improvements. Support for vSphere has typically revolved around a predictable script. You call in, and if your issue isn’t easily triaged you may need to export logs. This process has some challenges because:

1. It takes time to pull logs and upload them.

2. If the issue your cluster has impacts availability of the logs, this can drag out getting a resolution.

3. Additional logs may be needed to compare before/after with the issue.

On the support side of things, the initial call often begins with you trying to articulate your issue and describe your environment and any relevant details. The support staff are essentially “blind” on that initial call until you can describe enough of the environment, push logs, or set up a WebEx/remote session to show the issue.

vSAN Support Insight helps with these challenges by automatically pushing configuration, health, and performance telemetry to VMware. Removing these delays is critical to improving support outcomes. This phone home data set also provides a framework for future product improvements, future support enhancements, and better cross correlation of issues for engineering.

Blog
blogs.vmware.co…upport-insight/

Video
storagehub.vmwa…-demonstration/

StorageHub Documentation
storagehub.vmwa…support-insight

HBA all the way! (and what is this HBA 330+ thing?!?)

Duncan wrote a great blog summarizing why HBAs are a better choice than RAID controllers. Looking back, we’ve seen a shift with some of our OEMs to even go so far as to have their ready nodes always configured for HBA controllers due to their simplicity, lower cost, and fast performance.

One question that has come up recently is “What is the HBA 330+?”. Dell customers may have noticed that the HBA 330 became the default option on their 13th generation ReadyNodes some time ago. On Dell 14th generation quotes, it shows up with a “+” added to the card name, causing some concern that maybe this device is not the same one that was certified. Upon consulting with the vSAN ReadyLabs, it seems this card has the exact same PCI ID, and is, in fact, the exact same HBA. Only minor cabling changes were made, and they in no way impact its recommended driver, firmware, or certification status. This is currently the ONLY certified option for Dell 14G ReadyNode servers and I expect it to likely stay that way until NVMe replaces SCSI for customers.

Going forward I expect NVMe to increasingly replace SAS/SATA, and in this case,  we will see a mixture of direct PCI-Express connections, or connections through a PCI-E crossbar. All NVMe ready nodes I’ve seen tested are showing that replacing the HBA  leads to lower latency, less CPU overhead, and consistent outcomes.

 

 

vSAN Deduplication and Compression Tips!

I’ve been getting some questions lately and here are a few quick thoughts on getting the most out of this feature.

If you do not see deduplication or compression at all:

  1. See if the object space reservation policy has been set to above zero, as this reservation will effectively disable the benefits of deduplication for the virtual machine.
  2. Do not forget that swap is by default set to 100% but can be changed.
  3. If a legacy client or provisioning command is used that specifies “thick” or “Eager Zero Thick”, this will override the policy with an OSR of 100%. To fix this, you can reapply the policy. William Lam has a great blog post with some scripts on how to identify and resolve this.
  4. Make sure data is being written to the capacity tier. If you just provisioned 3-4 VMs they may still be in the write buffer. We do not waste CPU or latency deduplicating or compressing data that may not have a long lifespan. If you only provisioned 10 VMs that are 8GB each, it’s quite possible that they have not destaged yet. If you are doing testing, clone a lot of VMs (I tend to create 200 or more) so you can force the destage to happen.

Performance anomalies (and why!) when testing vSAN’s deduplication and compression.

I’ve always felt that it’s incredibly hard to performance test deduplication and compression features, as real-world data has a mix of compressibility and duplicate blocks. Here are some notes on what I’ve seen from testing. Note: these anomalies often happen on other storage systems with these features and highlight the difficulty in testing them.

  • Testing 100% duplicate data tends to make reads and writes better than a baseline with the feature off, as you avoid any bottleneck on the destage from cache process, and the tiny amount of data will end up in a DRAM cache.
  • Testing data that compresses poorly on vSAN will show little impact to read performance, as vSAN will write the data fully hydrated to avoid any CPU or latency overhead in decompression (not that LZ4 isn’t a fast algorithm to begin with).
  • Write throughput and IOPS for bursts that do not start to fill up the cache show little overhead. This is because the data is written non-compacted to reduce latency.

These quirks stick out in synthetic testing, which is why I recommend reading the space efficiencies guide for guidance on using this and other features.

New and noteworthy vSAN KB’s worth a read.

While vSAN Health Checks are constantly expanding, it’s still worth keeping up with the new KB’s to see what’s going on and if there are any issues you need to consider.

Here are a few KBs worth a read.

vSAN 2017 Quarterly Advisory for Q2

This article includes links to important bug fixes in recent patch releases, outstanding issues, known workarounds and other informational articles.

2150957

 

File services support by NetApp ONTAP Select 9.2 for VMware vSAN datastores

This article provides information about NetApp’s ONTAP Select solution that offers file services on VMware vSAN datastore.

2151182

  

Setting up active-passive dual pathing with vSAN and vSphere

This article explains setting up active-passive dual pathing with vSAN and vSphere. This one is a bit interesting as it includes some information on the superiority of native drivers in handling internal dual path SAS fabrics in managing failover and failback.

2151225

 

Understanding vSAN memory consumption in ESXi 6.5.0d/ 6.0 U3 and later

This article provides information about memory consumption in the latest version of vSAN 6.2 (ESXi 6.0 Update 3 and later) and vSAN 6.6 (ESXi 6.5.0d and later) and provides example scenarios.

    2113954

Duplicate SCSI IDs causing SATA drives in drive bay #1 to go missing from ESXi when running the nhpsa driver on Gen 9 HPE Synergy compute modules, HPE ProLiant DL-series servers that include a SAS expander

This document highlights an issue observed when using Gen 9 HPE Synergy compute modules or HPE ProLiant DL-series servers with ESXi 6.5, the native nhpsa driver and SATA drives. There’s a workaround for now (Leave drive bay 1 empty, or use a SAS device for it).

2150104

 

 

vSAN 6.6 Ondisk upgrade to version 5 fails with the error “A general system error occurred: Unable to complete Sysinfo operation…”

This is resolved by going to 6.6.1 and performing an update while using the re-sync throttle function. Note, if you’re on vSAN 6.6 you REALLY want to get to vSAN 6.6.1. It brings huge performance improvements, beyond bug fixes like this.

2151316

When is the right time to transition to vSAN?

 

When is the right time to swap to vSAN?

Some people say: When you refresh storage!

Others say it’s: When you refresh Servers!

They are both right. It’s not an “or”; both are great times to look at it. Let us dig deeper…

Amazing ROI on switching to HCI can come from a full floor sweep that is tied to refreshing with faster servers and newer storage that costs less to acquire and maintain. There are even awesome options for people who want another level of wrapped support and deployment (VxRAIL, UCP-HC).

But what about cases where an existing server or storage investment makes a wholesale replacement seem out of reach?  What about the guy who just bought storage or servers yesterday and learned about vSAN (or new features that they needed, like Encryption or local protection) today?

Let’s split these situations up and discuss how to handle them.

What happens when my existing storage investment is largely meeting my needs? What should I do with the server refresh?

Nothing prevents you from buying ReadyNodes without drives and adding them later as needed without disruption. Remember ESXi includes the vSAN software so there will be nothing to “install” other than drives in the hosts. HBAs are the most common missing feature from a new server, and a proper high queue depth vSAN certified HBA is relatively cheap (~$300). That’s a solid investment. Not having to take a server offline later to raise the hood and install something is instant ROI on those components. Remember with Dell/Lenovo/SuperMicro/Fujitsu, vSAN Config Assist will handle deploying the right driver/firmware for you at the push of a button.

Some other housecleaning items to do when you’re deploying new hosts (on the newest vSphere!) to get you vSAN ready down the road.

  1. See if the storage is vVols compatible. If it is, start deploying it. SPBM is the best way to manage storage going forward, and vSAN and vVols both share this management plane. As you move forward into vSAN, having vRA, vCloud Director, OpenStack and other tools that leverage SPBM configured to use it will allow you to leverage your existing storage investment more efficiently. It’s also a great way to familiarize yourself with vSAN management. Being able to expose storage choice into vRA to end users is powerful. Remember, VAIO and VM Encrypt also use SPBM, so it’s time to start migrating your storage workflows over to it!
  2. Double check your upcoming support renewals to make sure that you don’t have a spike creeping up on you. Having a cluster of vSAN deployed and tested, with hosts ready to expand rapidly, puts you in a better position to avoid getting cornered into one more year of expensive renewals. Also watch out for other cost creep. Magic stretched cluster virtualization devices or licensing, FCoE gear, fabric switches, structured cabling for Fibre Channel expansion, and special monitoring tools for fabrics all have hidden capex and support costs. [LOL]
  3. Look at expansion costs on that storage array. Arrays will often be discounted deeply on the initial purchase, but expansion can sometimes be 2-3x what the initial purchase cost was! Introducing vSAN for expansion guarantees lower cost per GB as you expand (vSAN doesn’t tax drives or RAM like other solutions).
  4. Double check those promised 50x dedupe ratios and insanely low latency figures! Often data efficiency claims include snapshots, thin provisioning, linked clones and other basic features. Also, check to see that you’re getting the performance you need.

What happens when my servers were just refreshed, but I need to replace storage?

If your servers are relatively new (Xeon v3/v4/Intel Scalable/AMD EPYC) then there is a good chance that adding the needed pieces to turn them into ReadyNodes is not far off. Check out the ready node bill of materials to see if your existing platform will work. See what it needs and reach out to your server vendor for the needed HBA (and possibly NIC) upgrades to get them ready for vSAN. Your vSAN SE’s and account teams can help!

 

 

 

How big should my vSAN or vSphere cluster be?

This is a topic that comes up quite a bit. A lot has been written previously about how big should your vSphere clusters be and Duncan’s musings on this topic are still very valid.

It generally starts with:

“I have 1PB in my storage frame today, can I build a 1PB vSAN cluster?”

The short response is yes: you can certainly build a PB vSAN cluster, and build 64 node clusters (there are customers who have broken 2 PB within a cluster, and customers with 64 node clusters), but you should stop and think about whether you should.

You want 16PB in a single rack, and 99.9999999% availability?

We have to stop and think about things beyond cost control when designing availability. I always chuckle when people talk about arrays having seven 9’s of availability. The question to ask yourself is: if the storage is up but the network is down, does anyone care? Once we include things “outside of storage” we often find that the reality of uptime is more limited. The actual environmentals (power, cooling) of a datacenter are rated at best 99.98% by the Uptime Institute. Traditionally we tried to make the floor tile that our gear sat on as resilient as possible.

 

 

James Hamilton of Amazon has pointed to WAN connectivity as another key bottleneck to uptime.

 “The way most customers work is that an application runs in a single data center, and you work as hard as you can to make the data center as reliable as you can, and in the end you realize that about three nines (99.9 percent uptime) is all you’re going to get,”

The Uptime Institute has done a fair amount of research in this space, and historically their definition of a Tier IV facility involved providing only up to 99.99% uptime (4 nines).

 

Getting beyond 4 nines of uptime for remote users (who are at the mercy of half finished internet standards like BGP) is possible but difficult.

Availability must be able to account for the infrastructure it rests on, and resiliency in storage and applications must account for the physical infrastructure.
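A quick way to see why seven 9’s on the array alone does not mean much is to multiply the availability of everything the application depends on. This is a minimal sketch that assumes every layer is required and that failures are independent (both simplifications); the 99.98% facility figure comes from the text above, while the array and WAN figures are hypothetical:

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up (independent failures)."""
    return prod(components)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Hypothetical stack: seven-nines array, 99.98% facility, three-nines WAN.
stack = serial_availability(0.9999999, 0.9998, 0.999)
print(f"end-to-end availability: {stack:.4%}")
print(f"array alone: {downtime_minutes_per_year(0.9999999):.2f} min/year")
print(f"whole stack: {downtime_minutes_per_year(stack):.0f} min/year")
```

The chain can never be better than its weakest link, so the extra nines on the array are invisible to a remote user behind a three-nines WAN.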

 

Let’s review traditional storage cost and operational concepts and why we today have reached a point where customers are putting over 1PB into a storage pool.

  1. Capital Costs – Some features may be licensed per frame, and significant discounts may be given if large purchases are made up front rather than as capacity is needed. Sparing capacity and overhead as a % of a storage pool become smaller if your growth rate is fixed.
  2. Opex – While many storage frames may have federation tools, there are still processes that are often done manually, particularly for change control reasons because of the scale of an outage of a frame (I talked to a customer who had one array fail and take out 4000 VMs including their management virtual machines).
  3. Performance – Wide striping, or on hybrid systems aggregating cache, controllers, and ports, reduces the chance of a bottleneck being reached.
  4. Patching/Change Control – Talking to a lot of customers, I find they are often running the same firmware that their storage array came with. The 15 second “gap” in IO as controllers are upgraded is often viewed as a huge risk. This is made worse because the most risk-averse application on the cluster effectively dictates patching and change control windows. No one enjoys late-night, all-hands-on-deck patching windows for storage arrays.

     (Image caption: The next Change Control Window for my Array is 2022)

  5. Parallel remediation in patch windows – Deploying more storage systems means more manual intervention. Traditional arrays often lack good tools for management and monitoring of parallel remediation. Often, more storage arrays means more change control windows.
  6. Aligning the planets on the HCL – To upgrade a Fibre Channel array, you must upgrade ESXi, the array, the fabric version, the Fibre Channel HBA firmware, and the server BIOS to align with the ESXi upgrade. This is a lot of moving parts, all of which carry the risk of a corner case being hit.

 

Let’s review how vSAN addresses these costs without driving you to put everything in one giant cluster.

  1. Capital Costs – vSAN licensing is per socket, and hosts can be deployed with empty drive bays. Drives for commodity servers regularly fall in price, making it cheaper to purchase what you need now and add drives to hosts as needed to meet capacity growth. Overhead for spare rebuild capacity does shrink as you add hosts, but nothing forces you to fill each host with capacity up front, and partially full servers incur no additional licensing costs.
  2. Opex – vSAN’s normal management plane (vCenter) is easily federated, and storage policies span clusters without any additional work. Lifecycle management, such as controller updates from Configuration Assist, and health monitoring alerts easily roll up to a single pane of glass.
  3. Performance – All Flash has changed the game. You no longer need 1000 spindles and wide striping to get fast or consistent performance. Pooling workloads on a three-tier storage architecture with storage arrays actually increases the chance that you might saturate throughput, or buffers on Fibre Channel switches.
  4. Patching – vSAN patching can be done using the existing tools for updating ESXi (VMware Update Manager), and lifecycle updates for storage controllers can be pushed with a single click from the UI in vSAN 6.6. Customers already have ESXi patching windows and processes in place, and maintenance mode with vMotion is a trusted and battle-tested means to evacuate a host.
  5. Parallel remediation – VMware Update Manager (VUM) can remediate multiple clusters in parallel. This means you can patch as many (or as few) clusters at once as you like, and when used with DRS this is fully automated, including placement of virtual machines.
  6. Aligning the planets on the HCL – Additional intelligence has been added to vSAN to include remediation of firmware. vSAN does not use proprietary Fibre Channel fabrics, is integrated into ESXi, and does not need proprietary fabric HBAs, which significantly reduces the number of planets to align when planning an upgrade window.
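The capital cost point about rebuild overhead shrinking as hosts are added can be illustrated with simple arithmetic. This is a rule-of-thumb sketch, not VMware sizing guidance: it assumes uniform hosts and that you hold back enough free capacity to re-protect the contents of a given number of failed hosts. The function name is made up for the example.

```python
def rebuild_reserve_fraction(hosts, host_failures=1):
    """Fraction of raw cluster capacity held back for rebuilds.

    Rule-of-thumb model (an assumption, not official sizing guidance):
    with uniform hosts, reserving one host's worth of capacity costs
    1/N of the cluster.
    """
    if hosts <= host_failures:
        raise ValueError("need more hosts than tolerated host failures")
    return host_failures / hosts


for hosts in (4, 8, 16, 32):
    fraction = rebuild_reserve_fraction(hosts)
    print(f"{hosts:>2} hosts: {fraction:.1%} of raw capacity held back")
```

Under this model the reserve drops quickly over the first few hosts and then flattens out, so moderately sized clusters already capture most of the benefit.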

In summary: while vSAN can certainly scale to multi-PB cluster sizes, you should ask whether you actually need to scale up this much. In many cases you would be better served by running multiple clusters at scale.

vSAN Backup and SPBM policies.

I get asked a lot of questions about how backup works with vSAN. For the most part it’s a simple request for a vendor support statement and VADP/CBT documentation. The benefit of native vSAN snapshots (better performance!) does come up, but I will point out there is more to backups and restores than just the basics. Let’s look at how one vendor (Veeam) integrates SPBM into their backup workflow.

 

SPBM policies can tie into availability and restore planning. When setting up your backup or replication software, make sure that it can restore a VM to its SPBM policy, and that it supports custom policy mapping. If only the default cluster policy is used for restores, you may have to run a large restore job and then re-align block locations a second time to apply the intended policy. This could result in a 2x or longer restore time. Check out this video for an example of what backup and restore SPBM integration looks like.

While questions are often about how to customize SPBM policies to increase the speed of backups (on hybrid, possibly by increasing the stripe width), I occasionally get questions about how to make restores happen more quickly.

A common situation for restores is that a volume needs to be recovered and attached to a VM simply to recover a few files, or to allow temporary access to a retired virtual machine. In a perfect world you can use application or file level recovery tools from the backup vendor, but in some situations an attached volume is required. Unlike a normal restore, this copy of the data being recovered and presented is often ephemeral. In other cases, the speed of recovery of a service is more important than the protection of its running state (maybe a web application server that does not contain the database). In both these cases I thought it worth looking at creating a custom SPBM policy that favors speed of recovery over actual protection.

 

In this example I’m using a Failures To Tolerate (FTT) setting of 0. The reason for this is twofold.

  1. Reduce the capacity used by the recovered virtual machine or volume.
  2. Reduce the time it takes to hydrate the copy.

In addition I’m adding a stripe width of 4. This policy will increase the recovery speed by splitting the data across multiple disk groups.
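As a rough illustration of the trade-off, the sketch below compares this recovery policy with a protected one. It assumes RAID-1 mirroring, where FTT=n keeps n+1 full copies of the data, and it ignores witness components and the splitting of very large objects. The `MirrorPolicy` class, the policy names, and the 500 GB disk size are hypothetical stand-ins and not part of any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorPolicy:
    """Hypothetical stand-in for an SPBM policy using RAID-1 mirroring."""

    name: str
    failures_to_tolerate: int
    stripe_width: int = 1

    def replicas(self):
        # Mirroring keeps one extra full copy per failure tolerated.
        return self.failures_to_tolerate + 1

    def raw_capacity_gb(self, vmdk_gb):
        # Raw capacity consumed, ignoring witnesses and metadata.
        return vmdk_gb * self.replicas()

    def data_components(self):
        # Each replica is split into stripe_width stripes.
        return self.stripe_width * self.replicas()


fast_restore = MirrorPolicy("fast-restore", failures_to_tolerate=0, stripe_width=4)
protected = MirrorPolicy("protected", failures_to_tolerate=1)

for policy in (fast_restore, protected):
    print(
        f"{policy.name}: {policy.raw_capacity_gb(500)} GB raw, "
        f"{policy.data_components()} data components"
    )
```

Under these assumptions the FTT=0 policy writes half as much data as FTT=1 for the same disk, while the stripe width spreads that single copy across four components.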

Now it should be noted that some backup software allows you to run a copy directly from the backup software itself (Veeam’s vPower NFS server is an example). At larger scale this can often tax the performance of the backup storage itself. This temporary recovery policy could be used for some VMs to speed the recovery of services when protection of data can be waived for the short term.

Now what if I decide I want to keep this data long term? In this case I could simply change the policy attached to the disk or VM to a safer FTT=1 or 2 setting.

How to bulk create VMkernel Ports for vMotion and vSAN in vSAN 6.6

Quick post time!

A key part of the vSAN 6.6 improvements is the new Configuration Assist menu. Common configuration requirements are tested, and wizards can quickly be launched that will do various tasks (set up DRS and HA, create a vDS and migrate, etc.).

One of my least favorite repetitive tasks to do in the GUI is setting up VMkernel ports for vSAN and vMotion. Once you create your vDS and port groups, you can quickly create these in bulk for all hosts at once.

Once you put in the IP address for the first host in the cluster, it will auto-fill the remainder by adding one to the last octet. Note that this uses the order in which hosts were added to the cluster (so always add them sequentially). You can also bulk set the MTU if needed.
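If you want to check the resulting addressing plan before clicking through the wizard, the auto-fill behavior is easy to mimic. The sketch below is an approximation of what is described above and not the wizard’s actual code. The host names are hypothetical, and how the wizard handles running off the end of the last octet is not covered here, so the sketch simply refuses.

```python
import ipaddress


def autofill_addresses(first_ip, host_names):
    """Assign sequential IPs to hosts in the order they are listed.

    Mimics the described behavior: the first host gets first_ip and
    each following host gets the last octet plus one.
    """
    start = ipaddress.IPv4Address(first_ip)
    assignments = {}
    for offset, host in enumerate(host_names):
        candidate = start + offset
        # Refuse to roll over into the next third octet (assumption:
        # the post does not say what the wizard does in this case).
        if int(candidate) >> 8 != int(start) >> 8:
            raise ValueError(f"ran out of last-octet values at {host}")
        assignments[host] = str(candidate)
    return assignments


# Hypothetical host names, listed in the order they joined the cluster.
hosts = ["esx01.lab.local", "esx02.lab.local", "esx03.lab.local"]
for host, ip in autofill_addresses("192.168.10.11", hosts).items():
    print(f"{host}: {ip}")
```

Because the mapping depends only on list order, a host added out of sequence shifts every address after it, which is why adding hosts sequentially matters.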

If you have more questions about vSAN or vSAN networking, or want more demos, head over to storagehub.vmware.com

The GIF below walks through the entire process:

So Easy a caveman could do it!