Should “Luck” factor into your DR/BC plans?
The greatest storage and systems administrator of all time was Montgomery “Scotty” Scott. No matter how far outside of design the ship was pushed he generally found away after saying “the ship can’t take anymore” to find the capacity to prevent disaster. His key secret?
- Expectation setting (He always looked good when he under promised and over delivered).
- Hiding reserve capacity (A key tallent in many storage management practices).
- A magic ability to get limitless budget for repairs, replacement parts and ships.
The reality in storage management is we can not all be Scotty (nor should we need to be). Sometimes we end up in scenarios that the system was not designed for. Thankfully there are sometimes capabilities of storage systems that vendors can expose that allow us to opportunistically exceed design expectations and “win” in Kobayashi Maru “no-win” scenarios. When data has gone missing or is expected to be gone for good, what is involved in your plan?
When planning disaster recovery or business continuity should you include “Might be there” safety nets? When drafting target Recovery Point Objectives (RPO) or Recovery Time Objectives (RTO) should or can you count on these vs. properly investing in a good backup/disaster recovery solution?
Restore Accidentally Deleted LUN
A lot of data loss scenarios are murkier to plan for than you realize Accidently deleting a LUN is a shockingly common occurrence. Poorly updated LUN number abstraction maps, and separation of duties (3 people involved in identifying a volume to delete on a SQL cluster) can all lead to this.
Some storage arrays have magic un-delete buttons. This can range from a trashcan to an obscure command that requires support to invoke. This capability is generally contingent on free space being available to retain the data that was deleted. I’m always nervous about including this in an RPO/RTO promise. The problem is in an out of space condition one of two things happen when counting no this capability:
1. The array will go read-only and crash every virtual machine (well abruptly pause if VAAI is working)
2. The snapshots will auto-delete
“But John I don’t have a high enough change rate, and I run my array at 20% usage!”
This may be true, but ransomware has a nasty habit of:
1. Re-writing all of your data.
2. Encrypting the data so that 4x dedupe and compression turn to a negative dedupe rate. Either of these activities can trigger an out-of-space condition.
You also need to be concerned with ransomware like IO activities coming from your users/application owners:
- DBA decides to turn on encryption on a database and doesn’t tell anyone.
- Large batch process re-writes the data
- Large data ingestion events
“But why would this problem happen at the same time I’m deleting a LUN?”
One of these things often causes the other. An out-of-space condition will often make all volumes on an array go read-only. This generally forces a storage admin to delete LUNs quickly. This outage often can happen at weird hours without proper caffeine, visibility, or communication.
Capacity Reservation Mitigation
Preventing out of space conditions (to prevent this scenario) can be done by “always provisioning thick” and reserving 110% capacity for snapshots, but practically the costs associated with doing this with storage that doesn’t tier into cheap S3 isn’t a feasible solution for all but the most deep-pocketed of datacenters. It may be tempting to “throw primary storage” at this problem, but that budget is often better invested in other mitigations.
Unplanned Data Loss
Other scenarios where “maybe I can recover your data” tools come into play are failures that exceed the design of the storage platform.
An example that would cause this is the rebuild of your 92 SATA disk RAID 5 hits a Latent Sector Error (LSE) causing an Unrecoverable read error. A single read failure in this situation causes the raid rebuild to stall. In theory, your data is lost. Depending on your platform and the tooling of your storage partner though, you may be able to accept a small amount of data loss and force the rebuild to go forward anyway.
Luck based rebuilds on multi-drive failure
Some platforms limit the rebuild domain for an LSE impact by using per volume RAID/rebuilds (vSAN does this) to reduce the impact of a drive failure that exceeds tolerance. Depending on how the error works you could be accepting an unspecified corruption of a few files or you could be hoping for “luck” in where the error is to not lose data. The only thing I like to count on in design for these is the speed of recovery. Rather than need to invoke a disaster recovery plan on 3 of 100 drives failing simultaneously, knowing I only need to rehydrate 3% (or potentially much much less) of the data from backup helps with planning cache/simultaneous restore plans.
Overriding split-brain protection
Specific to vSAN if you had a thermal meltdown in the data center on your HCI cluster and lost quorum and 1 copy of the data on a RAID 1 mirror from the cascading cooling failures you would have data unavailability. You can call support and they can upload a recovery tool to attempt to defy the angry storage gods and clone a full copy.
All of these scenarios involve a few things:
1. Operational failures.
2. Design failures of some kind.
3. Require the equivalent of a D20 dice roll to get your data back.
If you needed one of these “might be there” recovery options to hit an RPO/RTO/SLA it generally can be solved by better design.
How to better prevent accidental deletions
If you live in a data center with highly SILO’d ITIL operations, miscommunications are a risk in all operational changes that involve storage volumes/LUNs. There are a few ways though to improve communications and reduce errors between the storage and virtualization teams.
vSphere Storage APIs for Storage Awareness (VASA) allows VMware administrators better visibility into the storage layer. This allows VMware administrators to have a vision into what the internal volume numbers are for a given virtual machine or datastore.
Virtual Volumes simplifies communication even further by offloading the deletion task entirely to the VMware administrator. Deleting a virtual machine automatically deletes the associated volumes with it, removing any miscommunication between the VMware and storage team.
Operational Methods To Prevent Accidental LUN deletion
The best operational advice for storage arrays I have is to train your staff to disconnect LUNs and then wait 48-72 hours before deleting LUNs. There shouldn’t be an urgent need to delete a LUN.
“But John we urgently need that space back!”
Pretty much all modern storage arrays support TRIMUNMAP/DEALLOCATE as a way to allow the operating system/hypervisor to perform deletions from a higher layer and push through those deleted blocks. Rather than blindly deleting an entire volume, making sure deletions of VMDKs are pushed through from VMFS is a much safer/easier alternative. Auto shrinking VMDKs also allow for deletions from guest OSs to be pushed through end to end. The closer you can delete data to the application the less chance you risk miscommunication.
Lastly, using vSAN or vVols simplifies this further. If you delete a VMDK the space is freed up, and vSAN supports thin volumes shrinking by UNMAP/TRIM from the guest OS in the virtual machine. vSAN and vVols pierce through layers of abstraction to make storage capacity management just a simpler way to handle things.
These various “tricks” are great when they work. I still don’t think they play a primary role in planning your recovery speed, or the point of recovery for recovering from failure. The smartest thing Scotty ever did was keep his “might work” tools in his back pocket and promise only what the ship was designed for.
This blog came about from a conversation with some of the other Veeam Vanguards.