
Posts tagged ‘vSAN’

How to rebuild a VCF/vSAN cluster with multiple corrupt boot devices

Note: this is the first part of a series.

In my lab, I recently had an issue where a large number of hosts needed to be rebuilt. Why did they need to be rebuilt? If you’ve followed this blog for a while, you’ve seen the issues I’ve run into with SD cards being less than reliable boot devices.

Why didn’t I move to M.2 based boot devices? Unfortunately, these are rather old hosts and, unlike modern hosts, there is no option for something nice like a BOSS device. This is also an internal lab cluster used by the technical marketing group, so while important, it isn’t necessarily “mission critical” by any means.

As a result of this, plus a power hiccup, I ended up with 3 hosts offline that could not restart. Given that many of my VMs were set to only FTT=1, this means complete and total data loss, right?

Wrong!

First off, the data was still safe on the disk groups of the 3 offline hosts. Once I can get the hosts back online, the missing components will be detected and the objects will become healthy again (yay, no data loss!). vSAN does not keep the metadata or data structures for the internal file systems and object layout on the boot devices. We do not use the boot device as a “vault” (if you’re familiar with the old storage array term). If needed, all of the drives in a dead host can be moved to a physically new host, and recovery would be similar to the method I used of reinstalling the hypervisor on each host.

What’s the damage look like?

Hopping into my out of band management (my datacenter is thousands of miles away), I discovered that 2 of the hosts could not detect their boot devices, and the 3rd failed to fully reboot after multiple attempts. I initially tried reinstalling ESXi on the existing devices to lifeboat them, but this failed. As I noted in a previous blog, SD cards don’t always fully fail.

Live view of the SD cards that will soon be thrown into a Volcano

If vSAN was only configured to tolerate a single failure, wouldn’t all of the data at least be inaccessible with 3 hosts offline? It turns out this isn’t the case for a few reasons.

  1. vSAN does not by default stripe data wide across every single capacity device in the cluster. Instead, it chunks data out into fresh components every 255GB. (Note: you are welcome to set the stripe width higher and force more sub-components to be split out of objects if you need to.)
  2. Our cluster was large: 16 hosts and 104 physical disks (8 disks in 2 disk groups per host).
  3. Most VMs are relatively small, so out of the 104 physical disks in the cluster, having 24 of them offline (8 per host in my case) still means the odds of those 24 drives hosting 2 of the 3 components needed for a quorum are actually quite low.
  4. A few of the more critical VMs (vCenter, DNS/NTP servers) were moved to FTT=2, making their odds even better.

Even in the case of the few VMs that were impacted (a domain controller, some front end web servers), we were further lucky in that these were already redundant virtual machines. Given that both of the VMs providing these services didn’t fail, it became clear that with the compounding odds in our favor, a service actually going offline was closer to rolling boxcars twice than a 100% guarantee.

This is actually something I blogged about quite a while ago. It’s worth noting that this was just an availability issue. In most cases of actual true device failure for a drive, there would normally be enough time between losses to allow for repair (and not 3 hosts at once), making my lab example quite extreme.
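If you want to see for yourself which objects are degraded while hosts are down (and confirm they are repairable rather than lost), a quick check from any surviving host’s ESXi shell is sketched below. These are standard esxcli vSAN debug calls; exact output formats vary by release.

$ esxcli vsan debug object health summary get

$ esxcli vsan debug object list | less

The summary buckets objects by state (healthy, reduced availability with no rebuild, and so on), which maps directly to the “the data is still there, it just needs the hosts back” situation described above.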

Lessons Learned and other takeaways:

  1. Raise a few small but important VMs to a higher FTT level if you have enough hosts, especially core management VMs (see the sketch after this list).
  2. vSAN clusters can become MORE resilient to loss of availability the larger they are, even keeping the same FTT level.
  3. Use higher quality boot devices. M.2 32GB and above with “real endurance” are vastly superior to smaller SD cards and USB based boot devices.
  4. Consider splitting HA service VM’s across clusters (IE 1 Domain Controller in one of our smaller secondary clusters).
  5. For mission-critical deployments, using a management workload domain with VMware Cloud Foundation can help ensure that management is fully isolated from production workloads. Look at stretched clustering and fault domains to take availability up to 11.
  6. Patch and reboot your hosts often. Silently corrupt embedded boot devices may be lurking in your USB/SD powered hosts. You might not know it until someone trips a breaker and suddenly you need to power back on 10 hosts with dead SD devices. Regular patching will catch this one host at a time.
  7. While vSAN is incredibly resilient always have BC/DR plans. Admins make mistakes and delete the wrong VMs. Datacenters are taken down by “Fire/Flood/Blood” all the time.
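To put lesson 1 into practice, here is a rough PowerCLI sketch of creating an FTT=2 vSAN policy and assigning it to a handful of management VMs. The policy and VM names are placeholders, and it assumes the SPBM cmdlets included with recent PowerCLI releases; adjust to taste.

# Build an FTT=2 rule set and wrap it in a policy (names are hypothetical)
$cap     = Get-SpbmCapability -Name "VSAN.hostFailuresToTolerate"
$rule    = New-SpbmRule -Capability $cap -Value 2
$ruleSet = New-SpbmRuleSet -AllOfRules $rule
$policy  = New-SpbmStoragePolicy -Name "vSAN-FTT2-Mgmt" -AnyOfRuleSets $ruleSet

# Apply it to the small-but-important VMs (placeholder names)
Get-VM vcenter01, dns01, ntp01 | Get-SpbmEntityConfiguration |
    Set-SpbmEntityConfiguration -StoragePolicy $policy

Keep in mind that FTT=2 with mirroring needs at least 5 hosts (2n+1) to place components.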

I’d like to thank Myles Grey and Teodora Todorova Hristov for helping me make sense of what happened and getting the action plan to put this back together and grinding through it.

Understanding File System Architectures.

File System Taxonomy

I’ve noticed that clustered file systems, global file systems, parallel file systems and distributed file systems are commonly confused and conflated. To explain VMware vSAN™ Virtual Distributed File System™ (VDFS), I wanted to highlight some things that it is not. I’ll be largely pulling my definitions from Wikipedia, but I look forward to hearing your disagreements on Twitter. It is worth noting that some file systems can have elements that cross the taxonomy of file system layers for various reasons. In some cases, some of these definitions are subcategories of others. In other cases, some file systems (GPFS as an example) can operate in different modes (providing RAID and data protection, or simply inheriting it from a backing disk array).

Clustered File System

A clustered file system is a file system that is shared by being simultaneously mounted on multiple servers. Note, there are other methods of clustering applications and data that do not involve using a clustered file system.

Parallel file systems

Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance. While the vSAN layer mirrors some characteristics (Distributed RAID and striping) it does not 100% match with being a parallel file system.

Examples would include OneFS and GlusterFS.

Shared-disk file systems

Shared-disk file systems are clustered file systems, but they are not parallel file systems. VMFS is a shared-disk file system. This is the most common form of clustered file system, and it leverages a storage area network (SAN) for shared access to the underlying LBAs. Clients are forced to handle the translation of file calls and access control, as the underlying shared disk array has no awareness of the actual file system itself. Concurrency control prevents corruption. Ever mounted NTFS to 2 different Windows boxes and wondered why it corrupted the file system? NTFS is not a shared-disk file system, and the different operating system instances do not, by default, know how to cleanly share the partition when they both try to mount it. In the case of VMFS, each host can mount a given volume as read and write, while cleanly making sure that access to the specific subgroups of LBAs used for different VMDKs (or even shared VMDKs) is properly handled with no data corruption. This is commonly done over a storage area network (SAN) presenting LUNs (SCSI) or namespaces (NVMe over Fabrics). The protocol used to share this is block-based and can range from Fibre Channel, iSCSI, FCoE, FCoTR, SAS, InfiniBand, etc.

Example of 2 hosts mounting a group of LUNs and using VMFS to host VMs

Examples would include: GFS2, VMFS, Apple Xsan (StorNext).

Distributed file systems

Distributed file systems do not share block-level access to the same storage, but instead use a network protocol to redirect access to the backing file server exposing the share within the namespace used. In this way, the client does not need to know the specific IP address of the backing file server; it will be redirected within the protocol (NFSv4 or SMB) when it makes the initial request. This is not exactly a new thing (DFS in Windows is a common example, but similar systems were layered on top of Novell-based filers, proprietary filers, etc). These redirects are important as they prevent the need to proxy IO through a single namespace server and allow the data path to flow directly from the client to the protocol endpoint that has active access to the file share. This is a bit “same same but different” to how iSCSI redirects allow connection to a target that was not specified in the client pathing, or how ALUA pathing handles non-optimized paths in the block storage world. For how vSAN exposes this externally using NFS, check out this blog, or take a look at this video:

The benefits of a distributed file system?

  1. Access transparency. This allows back end physical data migrations/rebuilds to happen without the client needing to be aware and re-point at the new physical location. Clients are unaware that files are distributed and can access them in the same way local files are accessed.
  2. Transparent Scalability. Previously you would be limited to the networking throughput and resources of a single physical file server or host that hosted a file server virtual machine. With a distributed file system each new share can be distributed out onto a different physical server and cleanly allow you to scale throughput for front end access. In the case of VDFS, this scaling is done with containers that the shares are distributed across.
  3. Capacity and IO path efficiency – Layering a scale-out storage system on top of an existing scale-out storage system can create unwanted copies of data. VDFS uses vSAN SPBM policies on each share and integrates with vSAN to have it handle data placement and resiliency. In addition, layering a scale-out parallel file system on top of a scale-out storage system leads to unnecessary network hops in the IO path.
  4. Concurrency transparency: all clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. This is distinctly different from how some global file systems operate.

It is worth noting that VDFS is a distributed file system that exists below the protocol-supporting containers. A VDFS volume is mounted and presented to the container host using a secure direct hypervisor interface that bypasses TCP/IP and the vSCSI/VMDK IO paths you would traditionally use to mount a file system to a virtual machine or container. I will explore this more in the future. For now, Duncan explains it a bit on this blog.

Examples include: VDFS, Microsoft DFS, BlueArc Global Namespace

Global File System

Global file systems are a form of distributed file system where a distributed namespace provides transparent access to different systems that are potentially highly distributed (i.e. in completely different parts of the world). This is often accomplished using a blend of caching and weak affinity. There are trade-offs in this approach: if the application layer is not understood by the client accessing the data, you have to deal with manually resolving conflicting save attempts of the same file, or force one site to be “authoritative,” slowing down non-primary site access. While various products in this space have existed, they tend to be an intermediate step toward an application-aware distributed collaboration platform (or centralizing data access using something like VDI). While async replication can be a part of a global file system, file replication systems like DFS-R would not technically qualify. Solutions like Dropbox/OneDrive have reduced the demand for this kind of solution.

Examples include: Hitachi HDI

Where do various VMware storage technologies fall within this?

VMFS – A clustered file system that specifically falls within the shared-disk category. While powerful and one of the most deployed file systems in the enterprise datacenter, it was designed for use with larger files that are (with some exceptions) only accessed by a single host at a time. While support for higher numbers of files and smaller files has improved significantly over the years, general-purpose file shares are currently not a core design requirement for it.

vVols – Not a clustered file system. An abstraction layer for SAN volumes or NFS shares. For block volumes (SAN), it leverages sub-LUN units and directly mounts them to the hosts that need them.

VMFS-L – A non-clustered variant used in vSAN prior to the 6.0 release. Also used for the ESXi install volume. The file system format is optimized for DAS. Optimizations include aggressive caching for the DAS use case, a stripped-down lock manager, and faster formats. You commonly see this used on boot devices today.

VDFS – vSAN Virtual Distributed File System. A distributed file system that sits inside the hypervisor directly on top of vSAN objects providing the block back end. As a result, it can easily consume SPBM policies on a per-share basis. For anyone paying attention to the back end, you will notice that objects are automatically added and concatenated onto volumes when the maximum object size is reached (256GB). Components behind these objects can be striped, or for various reasons be automatically spanned and created across the cluster. It is currently exposed through protocol containers that export NFSv3 or NFSv4.1 as a part of vSAN file services. While VDFS does offer a namespace for NFSv4.1 connections and handles redirection of share access, it does not currently globally redirect between disparate clusters, so it would not be considered a global file system.

Improving NIC and switch performance for vSAN (and other IP storage)

This is going to be a short post collecting a few tricks to unlock some bottlenecks in storage networking that may grow over time:

Unfortunately, a lot of troubleshooting of networking performance stops earlier than it should. Two common incomplete troubleshooting workflows I’ve seen:

  1. Someone checks that network utilization on a host isn’t near the link speed and says “network not the bottleneck”.
  2. Someone calls the networking team and they look at the switchport utilization based on SNMP polling, or do a quick “show interface” and don’t see obvious port errors (CRC, drops, giants etc). They proudly close the ticket as “Switches are fine!”

Buffer Configuration Considerations

In one of my labs, where we have a Nexus 9000 series switch, we found performance was looking a bit limited. Seeing higher than expected retransmits, we dug deeper into buffer utilization. Discovering that the default mesh configuration was limiting buffer access to 500 KB per port, we adjusted the buffers using the qos ns-buffer-profile ultra-burst command. This significantly opened up performance, reducing TCP incast issues (which cause retransmits) and brought performance more in line with what we would expect for the cluster. For anyone looking for more information on this command (and how to look at buffers), see the QoS Guide. Note that for solving buffer contention, different switches will have different options for configuring buffers, prioritizing which flows to drop first, and allocating buffer to ports. In other cases it may be simpler to just buy switches with deeper buffers to begin with. Rather than trying to chop apart a 12MB-40MB buffer, simply purchasing a switch with an 8GB buffer can avoid a lot of the need for buffer management consideration.
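For reference, the change on the Nexus 9000 side boiled down to a single global configuration command, plus a show command to confirm the active profile. This is a hedged sketch from memory; on the platform we used, the command carries a hardware prefix, and exact syntax and profile names vary by model and NX-OS release, so check the QoS guide for your switch.

switch# configure terminal
switch(config)# hardware qos ns-buffer-profile ultra-burst
switch# show hardware qos ns-buffer-profile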

I’ve been asked about the HPE 5950 series switch. Digging into the CoS/QoS guide, I found a few things:

You can detect how often you exceed a buffer with the display buffer usage interface command.

<switchname> display buffer usage interface hundredgige 1/0/1

This command will be more useful than “display buffer usage”, as that only tracks usage over a 5 second rolling window, vs. the violation tracker that the interface counter can track (which will detect very short microbursts that may be causing buffer-full conditions, retransmits, and latency). Note the default buffer threshold is 70%.

burst-mode enable appears to be a similar command to the ultra-burst buffer configuration and is recommended for cases that include “Traffic enters a device from multiple same-rate interfaces and goes out of an interface with the same rate.” Given this scenario is exactly what we would see from TCP incast (multiple vSAN hosts trying to talk to the same host and filling a buffer), this is likely something you would want to turn on. As I don’t have one of these switches in my lab, I’d love any feedback anyone has from trying this command. If anyone from HPE Networking is reading this, feel free to reach out.

TCP dispatch queues tuning

In another example in the lab, a test of raw throughput was coming up short. A review of the back end disk groups showed a lack of congestion (latency was low, write cache fill rate was low). A review of network utilization showed only 30% utilization on the link speed, but high latency (20ms+ between the nodes).

Investigation showed that the throughput was bottlenecking on a single-threaded TCP process (CPU for the attached world at 100%). Raising the TCP RX dispatch queues from the default of 1 to 4 eliminated this bottleneck and returned performance to expected levels.

Steps to set this are:

Set the advanced setting on the host

$ esxcfg-advcfg -s 4 /Net/TcpipRxDispatchQueues

Or for PowerCLI:

Get-AdvancedSetting -Entity <esxi host> -Name Net.TcpipRxDispatchQueues | Set-AdvancedSetting -Value '4'

Reboot the host once this is set.

To validate this setting:

$ esxcfg-advcfg -g /Net/TcpipRxDispatchQueues
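If you need to roll this out to every host in a cluster rather than one at a time, a simple PowerCLI loop along these lines works (the cluster name is a placeholder, and each host still needs a reboot for the change to take effect):

Get-Cluster "vSAN-Cluster" | Get-VMHost | ForEach-Object {
    Get-AdvancedSetting -Entity $_ -Name "Net.TcpipRxDispatchQueues" |
        Set-AdvancedSetting -Value 4 -Confirm:$false
}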

It’s worth noting that Niel’s blog on vMotion tuning reported higher throughput per stream than I saw (his blog reports 15Gbps per stream). This may be a result of my lab hosts using cheaper Intel 5xx series NICs that lack the advanced offloads the Intel 700 or 800 series cards have. Mellanox CX series cards also have similar capabilities. Without these offloads, more CPU is needed to push the same throughput, which compounds to bring the performance ceiling even lower on the cheaper NICs.

Summary

For anyone seeing bottlenecks on lower-cost NICs, or wanting to push more than 15Gbps per host of vSAN traffic, keep an eye on this setting, and talk to GSS if you are concerned this default may be causing a bottleneck. For new hosts, I’d strongly consider smarter NICs that contain hardware LRO/TSO, RSS, and VXLAN/GENEVE offload capabilities, and make sure that your driver and firmware are both up to date. Note that in a future release this default may change.
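As a quick way to check what driver and firmware a NIC is actually running before chasing offload features, the standard esxcli commands are below (vmnic0 is a placeholder; the get output includes the driver name, driver version, and firmware version):

$ esxcli network nic list

$ esxcli network nic get -n vmnic0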

If you have any feedback on these commands (or questions on other commands or switches!) reach out to me on twitter @Lost_Signal

Peanut Butter is Not Supported with vSphere/Storage Networking/vSAN/VCF

 From time to time I get oddball questions where someone asks about how to do something that is not supported or a bad idea. I’ll often fire back a simple “No” and then we get into a discussion about why VMware does not have a KB for this specific corner case or situation. There are a host of reasons why this may or may not be documented but here is my monthly list of “No/That is a bad idea (TM)!”.

How do I use VMware Cloud Foundation (VCF) with a VSA/Virtual Machine that can not be vMotion’d to another host?

This one has come up quite a lot recently with some partners and storage vendors who use VSAs (a virtual machine that locally consumes storage to replicate it) incorrectly claiming this is supported. The issue is that SDDC Manager automates upgrade and patch management. In order to patch a host, all running virtual machines must be removed. This process is triggered when a host is placed into maintenance mode and DRS carefully vMotions VMs off of the host. If there is a virtual machine on the host that cannot be powered off or moved, this will cause lifecycle operations to fail.

What about if I use the VSA’s external lifecycle management to patch ESXi?

The issue is that running multiple host patching systems is a “very bad idea” (TM). You’ll have issues with SDDC Manager not understanding the state of the hosts, and coordination of non-ESXi elements (NSX perhaps using a VIB) would also be problematic. The only exceptions to using SDDC Manager with external lifecycle tooling are select vendor LCM solutions that have done the customization and interop work (examples include VxRail Manager, the Redfish to HPE Synergy integration, and packaged VCF appliance solutions like UCP-RS and VxRack SDDC). Note these solutions all use vSAN, avoid the VSA problem, and have done the engineering work to make things play nice.

JAM also not supported!

Should I use a Nexus 2000 (or other low-performing network switch) with vSAN?

While vSAN does not currently have a switch HCL (watch this space!), I have written some guidance specifically about FEXs on this personal blog. The reality is there are politics to getting a KB written saying “not to use something,” and it would require cooperation from the switch vendors. If anyone at Cisco wants to work with me on a joint KB saying “don’t use a FEX for vSAN/HCI in 2019,” please reach out to me! Before anyone accuses me of not liking Cisco, I’ll say I’m a big fan of the C36180YC-R (ultra deep buffers, RAWR!), and I have seen some amazing performance out of this switch recently when paired with Intel Optane.

Beyond the FEX, I’ve written some neutral switch guidance on buffers on our official blog. I do plan to merge this into the vSAN Networking Guide this quarter. 

I’d like to use RSPAN against the vDS and mirror all vSAN traffic; I’d like to run all vSAN traffic through an ASA firewall, a Palo Alto, an IDS, or a Cisco ISR; I’d like to route vSAN traffic through an F5… and similar requests.

There’s a trend of security people wanting to inspect “all the things!”.  There are a lot of misconceptions about vSAN routing or flowing or going places.

Good Ideas! – There are some false assumptions that you can’t do the following. While these may add complexity or not be supported on VCF or VxRail in certain configurations, they certainly are just fine with vSAN from a feasibility standpoint.

  1. Routing storage traffic is just fine. Modern enterprise switches can route OSPF/static routes at wire speed in their ASIC offloads. vSAN is supported over layer 3 (you may need to configure static routes; see the example after this list), and this is a “good idea” on stretched clusters so spanning tree issues don’t crash both datacenters!
  2. vSAN over VxLAN/VTEP in hardware is supported.
  3. vSAN over VLAN-backed port groups on NSX-T is supported.
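For the layer 3 case in point 1, the static route is added per host from the ESXi shell. A hedged example with placeholder addresses (the remote vSAN subnet and the local gateway for the vSAN vmkernel network):

$ esxcli network ip route ipv4 add --network 192.168.20.0/24 --gateway 192.168.10.1

$ esxcli network ip route ipv4 list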

Bad Ideas!

Frank Escaros-Buechsel with VMware support once told someone, “While we do not document that as not supported, it’s a bit like putting peanut butter in a server. Some things we assume are such bad ideas no one would try them, and there is only so much time to document all bad ideas.”

  1. Trying to mirror high throughput flows of storage or vMotion from a VDS is likely to cause performance problems. While I’m not sure of a specific support statement, I’m going to kindly ask you not to do this. If you want to know how much traffic is flowing and where, consider turning on sFlow/jFlow/NetFlow on the physical switches and monitoring from that point. vRNI can help quite a bit here!
  2. Sending iSCSI/NFS/FCoE/vSAN storage traffic to an IDS/firewall/load balancer. These devices do not know how to inspect this traffic (trust me, they are not designed to look at SCSI or NVMe packets!), so you’ll get zero security value out of this process. If you are looking for virus binaries, you’re better off using NSX guest introspection and regular antivirus software. Because of the volume, you will hit the wire-speed limits of these devices. Beyond the path latency, you will quickly introduce drops and retransmits and murder storage traffic performance. Outside of some old niche inline FC encryption blades (that I think NetApp used to make), inline storage security devices are a bad idea. While there are some carrier-grade routers that can push 40+ Gbps of encryption (MLXe’s, I vaguely remember, did this), the costs are going to be enormous, and you’ll likely be better off just encrypting at the vSCSI layer using the VM Encryption VAIO filter. You’ll get better security than IPsec/MACsec without massive costs.

Did I get something wrong?

Is there an Exception?

Feel free to reach out and let’s talk about why your environment is a snowflake that deviates from these general rules of things “not to do!”

Should I use a Nexus 2000 series (Cisco FEX) with VMware vSAN?

This question has come up a few times with customer networking teams, and it’s one that I must admit I’m surprised we are still having to have in 2019.

The short response is no. You should avoid using these devices with vSAN, and in general with virtualization or storage traffic.

They were designed for a time when low utilization of physical servers or low-density virtualization was the norm. At the same time, the price for 10Gbps ports on fast switches was incredibly expensive.

Cisco’s troubleshooting notes on Cisco FEX make a few statements.

Move any servers with bursty traffic flows such as storage arrays and video endpoints off of the FEX and connect them directly to the base ports of the parent switch.

Common questions that have come up at VMworld and in other discussions:

Q: Why should I listen to a guy who does storage and virtualization about networking?

A: I don’t disagree. How about one of the Co-Founders of the company that built the FEX?

 

 

Q: What is VMware doing to fix this with vSAN?

A: This isn’t really a VMware problem. Storage and other large traffic flows like vMotion suffer on Cisco FEX devices. Note that other east/west-heavy traffic flows suffer in lightly buffered, oversubscribed environments. vMotion and NSX are also not going to perform their best without real switch ports.

Q: What are some model numbers for the device?

A: Devices:

N2K-C2148T-1GE

N2K-C2224TP-1GE

N2K-C2248TP-1GE

N2K-C2232PP-10GE

N2K-C2232TM-10GE

N2K-C2248TP-E-1GE

N2K-C2232TM-E-10GE

N2K-C2248PQ-10GE

N2K-C2348UPQ-10GE

B22

Q: My networking team told me they are just like an external line card for a switch chassis?

A: Your networking team is incorrect. A real switch port can send traffic to another port without hair-pinning through another device. It’s arguable that a hub would provide a more direct route for packets from one port to another than what the FEX product line offers. Modern switches also offer much larger buffers that can help mitigate TCP incast and other issues that you will see at scale.

Q: How do I determine if my networking team has deployed Cisco FEX devices?

A: This can be difficult without physical inspection due to known issues with Cisco Discovery Protocol (CDP) not working correctly with some configurations of the devices. One sign is that if the port on the switch has an incredibly high designation like 100/1/1, you may be looking at a FEX. It’s best to have your data center operations teams inspect the racks and take note of model numbers, in the same way you would have them physically inspect for cardboard or other things you don’t want in your datacenter. Ultimately the best solution is preventative. Talk to your networking teams about the risks of using FEX devices before they are deployed.
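If you’d rather not walk the datacenter, a hedged PowerCLI sketch that pulls the CDP neighbor information vCenter already collects for each vmnic is below; FEX-attached ports tend to show up with those very high port designations. The property names come from the vSphere API’s QueryNetworkHint call, and the output layout is just an illustration.

Get-VMHost | ForEach-Object {
    $vmhost = $_
    $netSys = Get-View $vmhost.ExtensionData.ConfigManager.NetworkSystem
    foreach ($hint in $netSys.QueryNetworkHint($null)) {
        if ($hint.ConnectedSwitchPort) {
            [PSCustomObject]@{
                Host   = $vmhost.Name
                Vmnic  = $hint.Device
                Switch = $hint.ConnectedSwitchPort.DevId
                PortId = $hint.ConnectedSwitchPort.PortId
            }
        }
    }
}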

Q: What are some alternatives to look at?

I’m happy to take comments from other networking people about this but I’ve seen two general choices that customers use instead.

For Cisco customers who need FCoE, the Nexus 56xx, 6000, and 7000 offer real switch ports as well as larger buffers. Note: older Nexus 50xx and 55xx have relatively small VoQ buffers that tend to not scale well with larger clusters.

For customers not needing FCoE support (which should be most customers in 2019), the C36180YC-R offers:

  • 10/25Gbps access ports
  • A massive 8 GB of port buffer
  • A fast modern multi-core ASIC

VMworld 2018

Another year another VMworld. I’ve got a few sessions I will be presenting:

 

HCI1473BU The vSAN I/O Path Deconstructed: A Deep Dive into the Internals of vSAN
??? Mystery Session: 7/27 at 3:30PM
HCI1769BU We Got You Covered: Top Operational Tips from vSAN Support Insight
HCI3331BU Better Storage Utilization with Space Reclamation/UNMAP

 

The vSAN I/O Path Deconstructed is an interesting inside look at the IO path of vSAN and the reasoning behind it.

We Got You Covered: Top Operational Tips from vSAN Support Insight shows off the phone home capabilities of vSAN and can help address your questions about what and how this data is used. We are also going to discuss how you can leverage similar views of performance as GSS and engineering to identify how to get the most out of vSAN.

HCI3331BU is a session that has been years in the making for me. “Where did my space go” is a question I get often. We will explain where that missing PB of storage went and how to reclaim it. The savings from implementing UNMAP should be able to fund your next VMworld trip!

Lastly, I’ve got a mystery session that should be unveiled later. Follow me on Twitter @Lost_Signal, and I’ll talk about what it will be when the time comes.

Pete and I will be recording the vSpeaking Podcast LIVE at the HCI Zone (found near the VMware booth). We’ve got some new guests as well as some favorites lined up.

How big should my vSAN or vSphere cluster be?

This is a topic that comes up quite a bit. A lot has been written previously about how big your vSphere clusters should be, and Duncan’s musings on this topic are still very valid.

It generally starts with:

“I have 1PB in my storage frame today, can I build a 1PB vSAN cluster?”

The short response is yes, you can certainly build a PB-scale vSAN cluster and build 64 node clusters (there are customers who have broken 2 PB within a cluster, and customers with 64 node clusters), but you should stop and think about whether you should.

You want 16PB in a single rack, and 99.9999999% availability?

We have to stop and think about things beyond cost control when designing for availability. I always chuckle when people talk about arrays having seven 9’s of availability. The question to ask yourself is: if the storage is up but the network is down, does anyone care? Once we include things “outside of storage,” we often find that the reality of uptime is more limited. The actual environmental systems (power, cooling) of a datacenter are rated at best 99.98% by the Uptime Institute. Traditionally we tried to make the floor tile that our gear sat on as resilient as possible.
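As a rough back-of-the-napkin illustration (simplified, and assuming the failure domains are independent): 0.999999999 for the array × 0.9998 for facility power and cooling × 0.999 for WAN connectivity ≈ 0.9988, or about 99.88%, which is less than three nines for the service as a whole no matter how many nines the storage frame claims.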

 

 

James Hamilton of Amazon has pointed to WAN connectivity as another key bottleneck to uptime.

 “The way most customers work is that an application runs in a single data center, and you work as hard as you can to make the data center as reliable as you can, and in the end you realize that about three nines (99.9 percent uptime) is all you’re going to get,”

The Uptime Institute has done a fair amount of research in this space, and historically their definition of a Tier IV facility involved providing only up to 99.99% uptime (4 nines).

 

Getting beyond 4 nines of uptime for remote users (who are at the mercy of half-finished internet standards like BGP) is possible, but difficult.

Availability must account for the infrastructure it rests on, and resiliency in storage and applications must account for the physical infrastructure.

 

Let’s review traditional storage cost and operational concepts, and why we have reached a point today where customers are putting over 1PB into a single storage pool.

  1. Capital costs – Some features may be licensed per frame, and significant discounts may be given if large purchases are made up front rather than as capacity is needed. Sparing capacity and overhead as a percentage of a storage pool become smaller if your growth rate is fixed.
  2. Opex – While many storage frames may have federation tools, there are still processes that are often done manually, particularly for change control reasons, because of the scale of an outage of a frame (I talked to a customer who had one array fail and take out 4000 VMs, including their management virtual machines).
  3. Performance – Wide striping, or on hybrid systems aggregating cache, controllers, and ports, reduced the chance of a bottleneck being reached.
  4. Patching/Change Control – “The next change control window for my array is 2022.” Talking to a lot of customers, they are often running the same firmware their storage array came with. The risk, or the 15-second “gap” in IO as controllers are upgraded, is often viewed as a huge risk. This is made worse by the fact that the most risk-averse application on the cluster effectively dictates patching and change control windows. No one enjoys late night all-hands-on-deck patching windows for storage arrays.

  5. Parallel remediation in patch windows – Deploying more storage systems means more manual intervention. Traditional arrays often lack good tools for management and monitoring of parallel remediation. Oftentimes, more storage arrays means more change control windows.
  6. Aligning the planets on the HCL – To upgrade a Fibre Channel array, you must upgrade ESXi, the array, the fabric version, the Fibre Channel HBA firmware, and the server BIOS to align with the ESXi upgrade. This is a lot of moving parts, all of which carry risks of a corner case being identified.

 

Let’s review how vSAN addresses these costs without driving you to put everything in one giant cluster.

  1. Capital costs – vSAN licensing is per socket, and hosts can be deployed with empty drive bays. Drives for regular servers regularly fall in price, making it cheaper to purchase what you need now and add drives to hosts as needed to meet capacity growth. Overhead for spare capacity for rebuilds does reduce as you add hosts, but nothing forces you to fill each host with capacity up front, and no additional licensing costs will be invoked by having partially full servers.
  2. Opex – vSAN’s normal management plane (vCenter) is easily federated, and storage policies span clusters without any additional work. Lifecycle management like controller updates from Config Assist, and health monitoring alerts, easily rolls up to a single pane of glass.
  3. Performance – All-flash has changed the game. You no longer need 1000 spindles and wide striping to get fast or consistent performance. Pooling workloads with a 3-tier storage architecture and storage arrays actually increases the chance that you might saturate throughput, or buffers on Fibre Channel switching.
  4. Patching – vSAN patching can be done simply using existing tools for updating ESXi (VMware Update Manager), and lifecycle updates for storage controllers can be pushed with a simple click from the UI in vSAN 6.6. Customers already have ESXi patching windows and processes deployed, and maintenance mode with vMotion is a trusted and battle-tested means to evacuate a host.
  5. VMware Update manager (VUM) can remediate multiple clusters in parallel. This means you can patch as many (or as few) clusters, and when used with DRS this is fully automated including placement of virtual machines.
  6. Additional intelligence has been deployed for vSAN to include remediation of firmware. Given that vSAN does not use proprietary Fibre Channel fabrics, is integrated into ESXi, and lacks the need for proprietary fabric HBAs, this significantly reduces the number of planets to align when planning an upgrade window.

In summary: while vSAN can certainly scale to multi-PB cluster sizes, you should consider whether you actually need to scale up this much. In many cases, at scale, you would be better served by running multiple clusters.

Did you get a fake ReadyNode?

We’ve all been there…

Maybe it’s the streets of NYC, or a corner stall in a mall in Bangkok, or even Harwin St. here in Houston. Someone tried to sell you a cut-rate watch or sunglasses. Maybe the lettering was off, or the gold looked a bit flaky, but you passed on that possibly non-genuine watch or sunglasses. It might have even been made in the same factory, but it is clear the QC might have issues. You would not expect the same outcome as getting the real thing. The same thing can happen with ReadyNodes.

Real ReadyNodes for VMware vSAN have a couple of key points.

They are tested. All of the components have been tested together and certified. Beware anyone in software-defined storage who doesn’t have some type of certification program, as this opens the door to lower quality components or hardware/driver/firmware compatibility issues. VMware has validated satisfactory performance with the ReadyNode configurations. A real ReadyNode looks beyond “will these components physically connect” to whether they will actually deliver.

vSAN ReadyNodes offer choice. ReadyNodes are available from over a dozen different server OEMs. The VMware vSAN Compatibility Guide also offers over a thousand verified hardware components to supplement these ReadyNodes for further customization. ReadyNodes are not limited to a single server or component vendor.

They are 100% supported by VMware. Real VMware ReadyNodes don’t require virtual machines to mount, present, or consume storage, or require non-VMware-supported VIBs to be installed.

They are mature. They run a 7th-release, battle-tested, mature hypervisor-integrated storage stack.

So what do you do if you’ve ended up with a fake ReadyNode? Unlike the fake watch I had to throw away, you can check with the vSAN Compatibility Guide and see if you can, with minimal controller or storage device changes, convert your system in place over to vSAN. Remember, if you’re running ESXi 5.5 Update 1 or newer, you already have the vSAN software installed. You just need to license and enable it!
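A quick way to confirm what is already on a given host (a hedged one-liner from the ESXi shell; it reports whether vSAN is enabled and, if so, the host’s cluster membership):

$ esxcli vsan cluster get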

Virtual SAN performance Service: What is it? (And what about these other things)

One of the newest and most exciting features of Virtual SAN 6.2 is the new performance service. This is an ESXi-native performance monitoring system with API as well as UI access.

One misconception I wanted to be clear on: it does not require the use of vCenter Operations Manager or the vCenter database. Instead, the Virtual SAN performance service uses the Virtual SAN object store to store its data in a distributed and protected fashion. For smaller environments that do not want the overhead of VSOM, this is a great solution and will complement the existing tools.

Now why would you want to deploy VSOM if this turnkey, simple, low-overhead performance system is native? Quite a few reasons:

  • VSOM offers longer-term granular performance tracking. The native Virtual SAN performance service uses the same roll-up schedule as vCenter’s normal performance graphs.
  • VSOM allows for forecasting and capacity planning as it analyzes trends.
  • VSOM allows overlaying performance from multiple areas and systems (including things like switching and application KPIs) to do root cause and anomaly analysis and correlation.
  •  VSOM offers powerful integration with LogInsight allowing event correlation with performance graphs.
  • VSOM allows for rolling up performance information across hundreds (or thousands of sites) into larger dashboards.
  • In heterogeneous environments using traditional storage, VSOM allows collecting fabric and array performance information.


vCenter Server vDisk Advanced metrics

So if I don’t enable this service (or deploy VSOM), what do I get? You still get basic latency, IOPS, and throughput information from the normal vCenter performance graphs by looking at the vDisk layer. You miss out on back-end component views (things like internal SSD queues and latency) as well as datastore/cluster-wide metrics, but you can still troubleshoot basic issues with the built-in performance graphs.

What about VSAN Observer? For those of you who remember, previously this information was only available by using the Ruby vSphere Console (RVC). VSAN Observer provides powerful visibility, but it had a number of limits:

  • It was designed originally for internal troubleshooting and lacks consistency with the vCenter UI.
  • It ran on its own web service separately and was not integrated into the existing vCenter graphs.
  • It was manually enabled from the RVC CLI
  • It could not be accessed by API
  • It was not recommended to run it continuously, or to deploy a separate Virtual machine/Container to run it from.

All of these limitations have been addressed with the Virtual SAN performance service.

I expect the performance service will largely replace VSAN Observer use. VSAN Observer will still be useful for customers who have not upgraded to VSAN 6.2, or where you do not have capacity available for the performance database.

There is an extensive amount of metrics that can be reviewed. It offers “top down” visibility of cluster-wide performance, and virtual machine IOPS and latency.

 

 

 

 


Individual device metrics

The Virtual SAN performance service also offers “bottom up” visibility into device latency and queues on individual capacity and cache devices. For quick troubleshooting of issues, or verification of performance, it is a great and simple tool that can be turned on with a single checkbox.

 

 

 

Requirements:

vSphere 6.0u2

vCenter 6.0u2 (For UI)

Up to 255GB of capacity on the Virtual SAN datastore (You can choose the storage policy it uses).

In order to enable it, simply follow these instructions.
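If you prefer automation over the UI checkbox, later PowerCLI releases expose the same toggle. A hedged sketch with a placeholder cluster name:

Get-VsanClusterConfiguration -Cluster "vSAN-Cluster"
Set-VsanClusterConfiguration -Configuration "vSAN-Cluster" -PerformanceServiceEnabled $true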

 

How to handle isolation with scale out storage

I would like to say that this post was inspired by Chad’s guide to storage architectures. When talking to customers over the years, a recurring problem surfaced. Storage in smaller enterprises historically tended towards people going “all in” on one big array. The idea was that by consolidating the purchasing of all of the different application groups and teams, they could get the most “bang for the buck.” The upsides are obvious (fewer silos and consolidation of resources and platforms mean lower capex/opex costs). The performance downsides were annoying but could be mitigated (normally noisy neighbor performance issues). That said, the real downsides to having one (or a few) big arrays are often found hidden on the operational side.

  1. Many customers trying to stretch their budget often ended up putting Test/Dev/QA and production on the same array (I’ve seen Fortune 100 companies do this with business-critical workloads). This leads to one team demanding 2-year-old firmware for stability while the teams needing agility try to get upgrades. The battle between stability and agility gets fought regularly in change control committee meetings, further wasting people’s time.
  2. Audit/regime change/regulatory/customer demands require an air gap be established for a new or existing workload. Array partitioning features are nice, but the demands often extend beyond this.
  3. In some cases, organizations that had previously shared resources would part ways. (divestment, operational restructuring, budgetary firewalls).

Feed me RAM!

“Not so stealthy database”

Some storage workloads just need more performance than everyone else, and often the cost of the upgrade is increased by the other workloads on the array that will gain no material benefit. Database administrators often point to a lack of dedicated resources when performance problems arise. Providing isolation for these workloads historically involved buying an exotic non-x86 processor and a “black box” appliance that required expensive specialty skills on top of significant capex cost. I like to call these boxes “cloaking devices,” as they are often completely hidden from the normal infrastructure monitoring teams.

A benefit to using a scale-out (Type III) approach is that the storage can be scaled down (or even divided). VMware vSAN can evacuate data from a host and allow you to shift its resources to another cluster. Hybrid nodes can push up to 40K IOPS (and all-flash over 100K), allowing even smaller clusters to hold their own on disk performance. It is worth noting that the reverse action is also possible: when a legacy application is retired, the cluster that served it can be upgraded and merged into other clusters. In this way the isolation is really just a resource silo (the least threatening of all IT silos). You can still use the same software stack and leverage the same skill set while keeping change control, auditors, and developers happy. Even the database administrators will be happy to learn that they can push millions of orders per minute with a simple 4-node cluster.
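The evacuation described above can be driven from the vSphere client, or per host from the ESXi shell. A hedged example that enters maintenance mode while migrating all vSAN data off the host (expect it to run for a while on large disk groups):

$ esxcli system maintenanceMode set -e true -m evacuateAllData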

In principle I still like to avoid silos. If they must exist, I would suggest trying to find a way that the hardware that makes them up is highly portable and reusable, and vSAN and vSphere can help with that quite a bit.