
Posts from the ‘Uncategorized’ Category

VMware vSphere Reliable Memory – A few thoughts

According to a study by Google, the annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM. This rate rises to 1.7–2.3% after a DIMM has seen corrected errors. Hard errors are caused by physical factors, such as excessive temperature variation, voltage stress, or physical stress on the memory bits. Soft errors are random bit flips, typically associated with alpha particle radiation or solar wind, and are correctable.


As the number of DIMMs and their density increase, I suspect this only gets worse, and the odds rapidly approach 100% whenever I have something important to work on.

Odds of me seeing this increase to 100% the closer I am to recording a Demo

Now, what happens with a VMware host when the CPU detects an unrecoverable memory error? It depends on who gets the bad bit:

VMkernel: Crash (i.e. PSOD) the ESXi host, unless the kernel is within an MCE-safe context.

VMM: Kill the VM (the virtual machine should restart).

User space: Kill the user world (most processes can be restarted).

Now, what if we want some protection? It’s worth noting that ECC memory provides some basic protection (against a single bit randomly flipping) and, more importantly, provides detection of larger failures through active scrubbing (so we don’t commit corrupt data to disk). If we want to mitigate larger failures (such as an entire memory device on a DIMM, or a DIMM itself) we need to look at more advanced protection methods.

Memory Mirroring: This is pretty simple and fairly expensive. It involves mirroring all DIMMs, so that in the event of a DIMM failure the server keeps running. This is only outmatched by the more extreme triple-redundant quorum/voting systems used on spaceflight computers, and is generally only considered for mission-critical systems in extremely difficult to reach places (submarine, diamond mine, etc.).

Single Device Data Correction (SDDC) – Out of the normal 18 memory devices on a DIMM, you keep 1 device for CRC and 1 device for parity. If one of the devices fails, its data can be reconstructed. Think of this a bit like RAID 4 (dedicated parity device), with checksums also stored on a dedicated device rather than alongside the block of data. Note: a +1 option effectively keeps a “hot spare” device, so that after a failure is mitigated you can survive another failure. For Intel, the Silver/Bronze SKUs offer an adaptive variant called Adaptive Data Correction (ADC), at bank granularity.

Double Device Data Correction (DDDC) – This is where things start to get fancy and weird. By combining two x4 DIMMs on the same memory channel, you can run a double parity scheme across both devices. This comes with performance impacts (memory throughput seems to be the main issue), so it isn’t generally recommended for high-throughput applications (HPC).

Adaptive Double DRAM Device Correction (ADDDC) – New with the Intel Xeon Scalable series processors (2017), this avoids the pre-failure performance penalty that the DDDC design normally imposes. Note this feature doesn’t work with x8 DIMM layouts (the smaller 8 and 16GB DIMMs, from what I’ve found). For Intel, the Platinum/Gold SKUs offer ADDDC.

Other weird OEM options – You will find things like hot spare DIMMs, exotic additional-bit ECC, knobs for how often scrubbing is performed, etc. Be careful with this stuff and talk to your OEM about the expected performance impact.

Address Range Partial Memory Mirroring – This is an Intel-specific technology with some variety in implementation depending on the OEM. Unlike DIMM mirroring (which is transparent), this requires an OS-to-firmware interface that the OS must be aware of, and this is what the vSphere reliable memory feature enables. Under the hood, kernel processes flagged for it are placed in this memory and are protected up to and including a full DIMM failure. This feature requires Intel Xeon Platinum and Gold processor SKUs.

Let’s see what this looks like in VMware!

You can look up how much memory is considered reliable by using the ESXCLI hardware memory get command.

Before turning on feature:

[root@h2:~] esxcli hardware memory get
   Physical Memory: 549657530368 Bytes
   Reliable Memory: 0 Bytes
   NUMA Node Count: 2

After turning it on:

[root@h2:~] esxcli hardware memory get
   Physical Memory: 480938061824 Bytes
   Reliable Memory: 68619579392 Bytes
   NUMA Node Count: 2

On boot I can see that a chunk of DRAM has been set aside for “reliable memory” (roughly 64GB). Given this is 1/8th of the memory in the host, that is a real trade-off (12.5%), but for mission critical applications it might be worth considering. Digging around, 12.5% appears to be normal for a Dell 13th-generation server. Note this memory overhead comes 100% out of the first NUMA node (ESXTOP will confirm this).
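
Rough math from the esxcli output above:

549,657,530,368 bytes ÷ 1024^3 ≈ 512 GiB installed
68,619,579,392 bytes ÷ 1024^3 ≈ 64 GiB set aside as reliable memory
64 GiB ÷ 512 GiB ≈ 12.5% (1/8th of the host)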

It’s worth noting that more than just the kernel can use this feature. Virtual machines can be configured for it by following KB2146595 and using the VMX flag sched.mem.reliable = "True"
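
As a quick illustration, the same flag can be set with PowerCLI. This is only a sketch (the VM name is hypothetical, and KB2146595 remains the authoritative reference); the change typically takes effect after the VM is power-cycled:

$vm = Get-VM -Name "critical-db-01"
New-AdvancedSetting -Entity $vm -Name "sched.mem.reliable" -Value "True" -Confirm:$false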

If you reserve 30% or more of a NUMA node’s memory, you could trip an alarm called “Significant imbalance between NUMA nodes detected”.

How much should be reserved? The guidance in vSphere 5.5 was for at least 3GB, but if virtual machines are using this, or extensive services are running on a host, more may be needed.

There is a bit of variety in the OEM implementations.

Dell – The servers I tested by default grab either 12.5% or 25% of the host’s memory, depending on whether I use Fault Resilient Mode or NUMA Fault Resilient Mode.

HPE – Offers a 4GB reservation, or 10% or 20% of memory above 4GB.

Lenovo – Offers Mirroring of 4GB.

Fujitsu – Supports 4GB + a percentage, which can be defined in UEFI (which makes me think other OEMs might have unsupported hacks to alter this in UEFI, based on Intel’s documentation).

SuperMicro – I found no mention of partial memory mirroring in their RAS guide. Note their RAS guide is a great read on the other parity based protections.

Should I configure reliable memory?

You fundamentally have to ask yourself a few questions:

  1. What am I paying for RAM, and what is the overhead going to be? In the case of the Dell functionality I tested, it appears the BIOS is reserving 64GB. Looking at 3rd party memory prices, this is going to run me, in the US, about $439. Looking at the spot price of memory recently, DRAM pricing seems to be hitting new lows. Maybe it is worth sparing some to increase resiliency for mission critical clusters?
  2. What is my tolerance for a host failing from a bad DIMM? For this you need to look at your estimated DIMM failure rate, and consider the odds that something important in kernel space is crashed by the DIMM failure (most user world things and virtual machines will reboot if they hit a non-recoverable memory error). If this is Test/Dev I might not care, if I’m running an Oracle RAC cluster that backs the ERP for a fortune 50 company I might have more sensitivity.
  3. Is HA and/or application clustering “good enough” protection?
  4. Should I extend this to Virtual Machines? The VMX flag allows configuration of virtual machines to attempt to fit into the reliable memory space. I’m not sure what happens when the host is given 64GB of reliable memory and I try configuring a 100GB reliable memory VM (More testing needed).
  5. Is 4GB good enough? Some platforms (HPE) offer the ability to configure 4GB or 4GB + xx%. By lowering what is protected (but lowering the cost overhead) a blend of risk mitigation and cost control may be “good enough” for many.
  6. Would I rather mirror a virtual machine between two hosts (SMP-FT) and just pay the extra overhead?
  7. Is there a particle accelerator or evil supervillain lab next door? If the server will be operating near a major source of alpha particle radiation it may be worth considering full mirroring (or shielding the server!)

Improving NIC and switch performance for vSAN (and other IP storage)

This is going to be a short post collecting a few tricks to unlock some bottlenecks in storage networking that may grow over time:

Unfortunately, a lot of troubleshooting of networking performance stops earlier than it should. Two common incomplete troubleshooting workflows I’ve seen:

  1. Someone checks that network utilization on a host isn’t near the link speed and says “network not the bottleneck”.
  2. Someone calls the networking team and they look at the switchports utilization based on SNMP polling, or do a quick “Show interface” and don’t see obvious port errors (CRC, drops, giants etc). They proudly close the ticket as “Switches are fine!”

Buffer Configuration Considerations

In one of my labs, where we have a Nexus 9000 series switch, we found performance was looking a bit limited. Seeing higher than expected retransmits, we dug deeper into buffer utilization. Discovering that the default mesh buffer profile was limiting buffer access to 500 KB per port, we adjusted the buffers using the qos ns-buffer-profile ultra-burst command. This significantly opened up performance, reducing TCP incast issues (which cause retransmits) and bringing performance more in line with what we would expect for the cluster. For anyone looking for more information on this command (and how to look at buffers), see the QoS Guide. Note that for solving buffer contention, different switches will have different options for configuring buffers, prioritizing which flows to drop first, and allocating buffer to ports. In other cases it may be simpler to just buy switches with deeper buffers to begin with: rather than trying to chop apart a 12MB-40MB buffer, simply purchasing a switch with an 8GB buffer can avoid a lot of the need for buffer management.
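
For reference, on the Nexus 9000 this is a global configuration command. The sketch below is from memory (the "hardware" prefix and the show command are my assumptions; verify syntax against the QoS guide for your platform and release):

switch# configure terminal
switch(config)# hardware qos ns-buffer-profile ultra-burst
switch(config)# end
switch# show hardware qos ns-buffer-profile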

I’ve been asked about the HPE 5950 series switch. Digging into the CoS/QoS guide I found a few things:

You can detect how often you exceed a buffer with the display buffer usage interface command.

<switchname> display buffer usage interface hundredgige 1/0/1

This command is more useful than plain “display buffer usage”, as that only tracks usage over a 5-second rolling window, versus the violation counter the per-interface command tracks (which will detect very short microbursts that may be causing buffer-full conditions, retransmits, and latency). Note the default buffer threshold is 70%.

burst-mode enable appears to be a similar command to the ultra-burst buffer configuration and is recommended for cases that include “Traffic enters a device from multiple same-rate interfaces and goes out of an interface with the same rate.” Given this scenario is exactly what we would see from TCP incast (multiple vSAN hosts trying to talk to the same host and filling a buffer), this is likely something you would want to turn on. As I don’t have one of these switches in my lab, I’d love any feedback from anyone who has tried this command. If anyone from HPE Networking is reading this, feel free to reach out.
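
If I did have one to test, I would expect the workflow to look roughly like the following (sketched from the commands above; the view burst-mode enable is entered from, and whether a save or reboot is required, should be confirmed in the CoS/QoS guide):

<switchname> system-view
[switchname] burst-mode enable
[switchname] quit
<switchname> display buffer usage interface hundredgige 1/0/1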

TCP dispatch queues tuning

In another example in the lab, a test of raw throughput was coming up short. A review of the back end disk groups showed a lack of congestion (latency was low, write cache fill rate was low). A review of network utilization showed only 30% utilization on the link speed, but high latency (20ms+ between the nodes).

Investigation showed that throughput was bottlenecking on a single-threaded TCP process (the CPU for the attached world was at 100%). Raising the TCP RX dispatch queues from the default of 1 to 4 eliminated this bottleneck and returned performance to expected levels.

Steps to set this are:

Set the advanced setting on the host

$ esxcfg-advcfg -s 4 /Net/TcpipRxDispatchQueues

Or for PowerCLI:

Get-AdvancedSetting -Entity <esxi host> -Name Net.TcpipRxDispatchQueues | Set-AdvancedSetting -Value '4'

Reboot the host once this is set.

To validate this setting:

$ esxcfg-advcfg -g /Net/TcpipRxDispatchQueues

It’s worth noting that Niels’ blog on vMotion tuning reported higher throughput per stream than I saw (his blog reports 15Gbps per stream). This may be a result of my lab hosts using inexpensive Intel 5xx series NICs that lack the advanced offloads the Intel 700 or 800 series cards have (Mellanox CX series cards also have similar capabilities). Without these offloads, more CPU is needed to push the same throughput, which compounds to bring the performance ceiling even lower on the cheaper NICs.

Summary

For anyone seeing bottlenecks on lower-cost NICs, or wanting to push more than 15Gbps per host of vSAN traffic, keep an eye on this setting, and talk to GSS if you are concerned this default may be causing a bottleneck. For new hosts I’d strongly consider smarter NICs that offer hardware LRO/TSO, RSS, and VXLAN/Geneve offload capabilities, and make sure that your driver and firmware are both up to date. Note that in a future release this default may change.

If you have any feedback on these commands (or questions on other commands or switches!) reach out to me on twitter @Lost_Signal

How to pick NICs for VMware vSAN powered HCI

This post has been a long time coming, as network interface cards are an often overlooked component. Many people have mistakenly assumed all NICs are the same, or are simple commodities. Note that most of this advice also applies generically to vSphere and Ethernet storage. Here is a short list of features and considerations when picking a network interface card. I’m hoping this series will spawn more posts on picking switches, as well as troubleshooting NICs and switches.

Offload Features

LRO/LSO – Large Receive Offload and Large Send Offload allow packets to be broken up when transmitting and coalesced when receiving. Note: TCP segmentation offload (TSO) is a very common form of LSO and you will often see the terms used interchangeably. This reduces CPU overhead. The VMware Performance Team has a great blog showcasing what this looks like for virtual machines; LRO can reduce CPU overhead on 64KB workloads by as much as 90%.
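
As a quick check (a sketch; which of these options actually matters varies by NIC driver and version), the host-side LRO/TSO settings can be read with esxcli:

esxcli system settings advanced list -o /Net/UseHwTSO
esxcli system settings advanced list -o /Net/TcpipDefLROEnabled
esxcli system settings advanced list -o /Net/Vmxnet3HwLRO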

Receive Side Scaling (RSS) – Helps distribute load across multiple CPU cores. At higher throughput it is possible that a single CPU thread cannot fully saturate larger network interfaces; in sample testing, a 40Gbps NIC could only push 15Gbps when using a single core. RSS is also critical for VXLAN/NSX performance. Note RSSv2 is supported by a limited subset of cards (it appears to allow rebalancing at a more granular level).
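
Whether RSS is available and enabled is typically exposed as a driver module parameter. As a sketch (ixgben is just an example module name; the module and parameter names differ per driver):

esxcli system module parameters list -m ixgben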

Geneve/VxLAN encapsulation support – For customers using NSX, hardware support for overlays again helps increase performance and should be added to the shopping list when selecting a NIC.

RDMA over Converged Ethernet (RoCEv1/RoCEv2) – While vSAN doesn’t yet support RDMA, support for vSphere VMs (PVRDMA) and iSER (iSCSI Extensions for RDMA) is shipping today, and hopefully additional support will come for other traffic classes in the future. RDMA significantly lowers latency and CPU usage while increasing throughput. Note, RoCE avoids the use of TCP, while iWARP (a competing standard) runs RDMA over TCP. RoCE requires not only NICs that support it, but also physical switches that support it. Mellanox is a popular vendor for both NICs (ConnectX-4) and switches, as they have a long history of working with RDMA.

NSX-T Virtual Distributed Switch (N-VDS) – NICs that support this will feature “N-VDS Enhanced Data Path” support on the vSphere VCG. This includes the ability for traffic to flow over the NUMA-aware Enhanced Data Path. While this is normally something reserved for NFV workloads, the gains in throughput are massive. This blog is a good starting point.

https://blogs.vmware.com/networkvirtualization/files/2018/06/enhanced.datapath.png
Turn this path up to 11

MISC CNA/Storage Options.

iSCSI HBAs have come in and out of vogue over the years. In general, I have some concerns about the quality of QA the vendors are doing for a feature that sees so little usage now that the software initiator is the overwhelming majority of the market. FCoE has been removed/deprecated from some Intel NICs and in general is falling out of favor.

A software FCoE initiator now exists, but in general the fad of FCoE (which was always a bridge technology) seems to be slowly going away. For those still using FCoE CNAs, be sure to confirm that vVols secondary LUN support is available.

NVMe over Fabrics (NVMe-oF) – Support for NVMe over an Ethernet fabric is slowly gaining interest among customers looking for the lowest latency possible. While it is possible to run over Fibre Channel, 100Gbps RDMA Ethernet is also a promising option.

Other Considerations

Supported Maximums – Not a huge issue for most, but some NICs have caveats on the maximum supported number that can be placed in a host. This is often tied to driver memory allocations. This information can be found in the vSphere maximums as well as in the notes on the VCG entry for a NIC.

Link Layer Discovery Protocol (LLDP) – Technically normally a software feature, but on some NIC families this is bizarrely made a hardware feature. This can have…. interesting results depending on the implementation.

Nothing to see here, rogue ARPs are normal!

Stable/Fast Driver and Firmware, and engineers to maintain them – I’m not sure why I have to list this as a “NIC feature,” but it is. There are NICs out there that have had hilariously unstable firmware for years (note this isn’t a vSphere issue, but a general issue across OS platforms). Some vendors will take an issue and rapidly RCA it in their testing lab. Others will ask you to be a crash test dummy and run debug drivers in production “until it happens again.” Questions to ask an OEM: “If we have an issue, do you have the hardware to recreate it, and where physically is this lab?” If you are repeatedly being asked to run async drivers (that are not tested/validated by VMware), this may be a sign that the vendor doesn’t have adequate engineering behind the card.

The PSODs are a feature?

Flow control – This is something you really only should be using on 1Gbps; otherwise turn it off. It causes more problems than it solves, and there are frankly better ways (CoS/DSCP) to prioritize traffic under contention.

Management APIs for CIM Provider and OCSD/OCBB Support – This can allow for better out-of-band monitoring of the NIC. If there are no good ways to interrogate health and pull logging from the card, recreating issues can be painful.

Wake on LAN – Really you should be waking servers using the out-of-band management, but occasionally there is a use for this.

So what do these features mean for a HCI Architect?

Lower host CPU usage means more CPU available for processing storage and running virtual machines, enabling increased virtualization consolidation.

Higher throughput per core (as a result of LRO/TSO) means higher performance can be achieved per core by reducing unnecessary CPU usage. This allows faster resync operations (commonly 64KB I/Os), as well as higher overall throughput. LRO/TSO/RSS help prevent single-threaded networking processes from becoming bottlenecks.
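
Rough, illustrative math on why coalescing matters (not a benchmark):

10 Gbps ÷ (1,500 bytes × 8 bits) ≈ 833,000 frames per second for the host to process without coalescing
64 KB LRO aggregate ÷ ~1,460 bytes of TCP payload per frame ≈ 45 frames handled as a single receive event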

Lower Packets Per Second (PPS) – By consolidating packets with TSO, fewer packets must traverse the physical switches. Many switch ASICs have limits on how many PPS they can process and will be forced to delay packets, negatively impacting performance.

Caveat Emptor – Some NICs have had a troubled history with these features and may require driver/firmware updates to be stable. Some vendors may label a feature as an offload but in reality still process it in the CPU. Some features may only be supported in specific driver versions, or might even be quietly deprecated and scrubbed from the datasheets. Note, the subtle humor of release note writers should not be underestimated: “may lead to connectivity issues” may translate as “causes the host to crash, and causes a plague of locusts to infest the datacenter.” As always, trust but verify.

Feedback – Did I get something wrong? Did I not properly abbreviate NVMe over Fabrics? Is FCoE still the future of storage 10 years later? Reach out to me on twitter @Lost_Signal and let’s continue this conversation.

Should I use a Nexus 2000 series (Cisco FEX) with VMware vSAN?

This question has come up a few times with customer networking teams, and it’s a conversation that, I must admit, confuses me that we are still having in 2019.

The short response is no. You should avoid using these devices with vSAN, and in general with virtualization or storage traffic.

They were designed for a time when low utilization of physical servers or low-density virtualization was the norm. At the same time, the price for 10Gbps ports on fast switches was incredibly expensive.

Cisco’s troubleshooting notes on Cisco FEX make a few statements.

Move any servers with bursty traffic flows such as storage arrays and video endpoints off of the FEX and connect them directly to the base ports of the parent switch.

Common questions that have come up at VMworld and in other discussions:

Q: Why should I listen to a guy who does storage and virtualization about networking?

A: I don’t disagree. How about one of the Co-Founders of the company that built the FEX?

 

 

Q: What is VMware doing to fix this with vSAN?

A: This isn’t really a VMware problem. Storage and other large traffic flows like vMotion suffer on Cisco FEX devices, as do other east/west-heavy traffic flows in lightly buffered, oversubscribed environments. vMotion and NSX are also not going to perform their best without real switch ports.

Q: What are some model numbers for the device?

A: Devices:

  • N2K-C2148T-1GE
  • N2K-C2224TP-1GE
  • N2K-C2248TP-1GE
  • N2K-C2232PP-10GE
  • N2K-C2232TM-10GE
  • N2K-C2248TP-E-1GE
  • N2K-C2232TM-E-10GE
  • N2K-C2248PQ-10GE
  • N2K-C2348UPQ-10GE
  • B22

Q: My networking team told me they are just like an external line card for a switch chassis?

A: Your networking team is incorrect. A real switch port can send traffic to another port without hair-pinning through another device. It’s arguable that a hub would provide a more direct route for packets from one port to another than what the FEX product line offers. Modern switches also offer much larger buffers that can help mitigate TCP incast and other issues that you will see at scale.

Q: How do I determine if my networking team has deployed Cisco FEX devices?

A: This can be difficult without physical inspection, due to known issues with Cisco Discovery Protocol (CDP) not working correctly with some configurations of these devices. One sign is the port designation on the switch: if it is incredibly high (e.g. 100/1/1), you may be looking at a FEX. It’s best to have your data center operations teams inspect the racks and take note of model numbers, in the same way you would have them physically inspect for cardboard or other things you don’t want in your datacenter. Ultimately the best solution is preventative: talk to your networking teams about the risks of using FEX devices before they are deployed.
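
If you can get read access to the parent switch, NX-OS will also list any attached fabric extenders and their model numbers directly:

switch# show fex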

Q: What are some alternatives to look at?

I’m happy to take comments from other networking people about this but I’ve seen two general choices that customers use instead.

For Cisco customers that need FCoE, the Nexus 56xx, 6000, and 7000 offer real switch ports as well as larger buffers. Note: older Nexus 50xx and 55xx have relatively small VoQ buffers that tend not to scale well with larger clusters.

For customers not needing FCoE support (which should be most customers in 2019), the C36180YC-R offers:

  • 10/25Gbps access ports
  • A massive 8 GB of port buffer
  • A fast modern multi-core ASIC

vSAN: The Case for Single Socket

I had previously discussed, in a rather click-bait-titled article, how falling storage prices were making vSAN cheaper, as the economics of flash plus perpetual socket licensing made the cost per GB (and cost per VM) for vSAN go down over time. It’s 2019, and I wanted to examine what replacing those old hosts might look like.

 

 

2014/2015 host

  • 2 x 8 core processors (Intel Xeon v3)
  • 3.6TB to 6TB RAW capacity (3 or 5 x 1.2TB 10K RPM drives) with 400-800GB of SATA/SAS SSD Cache.
  • 128 GB RAM.
  • 4 host cluster

Now, this host would require two sockets of vSAN and vSphere licensing, for a total of 8 sockets across the four-host cluster.
Going into 2019, we are ready to replace these nodes and deploy new ReadyNodes.

Moving forward to 2019 a host might look something like this:

  • 1 x 16 core processor (Xeon Scalable, or AMD EPYC, now a very viable option).
  • 256 GB of RAM
  • 32TB of RAW NVMe with 2 x 800GB of NVMe Cache.
  • 8 host cluster

Let’s look at a few improved density capabilities:

  • Single socket, because realistically this system should be able to run more than twice the workload of the older two-socket host.
  • Less overhead. The CPU overhead reserved for N+1 capacity planning drops from 25% of the cluster for a four-host cluster to 12.5% for an 8-host cluster.
  • AMD EPYC allows for use of all memory channels and PCI lanes so for workloads that will fit within a processor scaling out has some natural efficiency.
  • Single socket cuts the licensing count in half for hosts, allowing me to run twice the hosts for the same licensing cost.
  • All-flash lets me use erasure coding along with deduplication and compression, and the 5-8x more RAW capacity means I can store a LOT more data. RAID 6 vs FTT=2 mirroring cuts capacity overhead in half (rough math after this list).
  • Performance Improvements as a result of the all NVMe configuration along with code improvements going from older to newer vSAN IO path code.
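
Rough math on that capacity point (ignoring slack space, dedupe, and per-node overheads): 8 hosts x 32TB = 256TB RAW. With FTT=2 mirroring (3x overhead) that is roughly 85TB usable; with FTT=2 RAID 6 (1.5x overhead) it is roughly 170TB usable.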

A couple other big wins here:

1. Unlike a storage array or appliance, I didn’t have to re-buy licensing. Nothing was “thrown away” in this transition. No sales guy came out and scared me with a renewal quote before offering me some insane trade-in play to some new platform I’m not familiar with. At no point in the purchase of new hardware did I feel the process should be described as a “Barking Carnival” or a “Goat Rodeo”. I might upgrade some license versions, but the existing value carried forward, only the delta cost applied, and no one was trying to force me into a transaction.

2. I doubled my host count without doubling my licensing or support renewals. Say it with me “I should expect MORE VALUE from my HCI solution as the investment ages, not less!”.

3. I increased my compute capabilities by 4x or more with the same socket count. Inversely, if I was a large shop not growing, I might even consolidate my fleet, or shift some licenses to cover DR or other environments to expand on my vSAN success. I can even split up these nodes to do this. A traditional storage frame, unfortunately, can’t be cut into three pieces and reused at 3 sites.

4. I massively increased my storage performance and capacity footprint. The old hardware may get phased out for a lab, used for DR, or sent to recycling, but my vSAN investment carried forward through this growth. No one at VMware “taxed” me for adding hosts, cores, performance, or capacity to expand my environment.

 

VSAN is now up to 30% cheaper!

Where did my host go….

UPDATE: https://kb.vmware.com/s/article/53749
VMware and Intel have a KB for workarounds on this issue.

I was reading Bob Plankers’ colorful complaints about his Intel X710/XL710/X722/XXV710 family of NICs and figured I’d do some digging, ask around among people I know who have them, and summarize some things I learned from using them as a customer.

A few observations:

  • These problems are not specific to vSphere. People running Linux and Windows on bare metal ran into these issues
  • While a lot has been focused on the LRO/TSO issue, there is another separate issue tied to LLDP and duplicate mac addresses being created.

First Issue LRO/TSO

This KB sums up the issue quite well by pointing out that these features can cause PSODs. Checking with some friends who used to be able to reproduce this at the drop of a hat, the newest driver/firmware is a lot more stable in this regard, but it can still happen. Some people are leaving these features disabled to stay safe, while others are hungry for the small CPU gains they deliver. How do I remediate it? Beyond manually setting it on the hosts, Jase McCarty has a great script that will do this in bulk for a cluster.
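
For reference, the host-level knobs involved are standard advanced settings, so they can be flipped with esxcli. This is only a sketch; take the exact list of options (and whether a reboot is needed) from the KB or Jase’s script:

esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0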

Next up: The case of the disappearing host!

The common symptom is that management on a host will cease to function (pings will drop) and the host will disappear from vCenter. Sometimes something more catastrophic happens (HA triggers, host isolation is triggered, storage or vMotion fails). If you look closely in Log Insight, you will see your switches reporting MAC address flapping (you are sending your switches’ syslog into Log Insight, RIGHT?!?)

So what’s going on here?

How is VMK0 special?

This goes back to the special behavior of VMK0, where it steals the MAC address from a physical port. This is handy for new cluster setups: the OEM can provide the MAC addresses before delivery, so people can put them in their DHCP reservations and get started without needing to physically touch the hosts to figure out which one is which.

Why is this card special?

This card is unique in that there is a special LLDP agent that runs on the card and intercepts LLDP packets. Previously I associated LLDP with simply sending information on what’s plugged in where (which is why you should turn it on for send/receive with your VDS). In this case, though, the LLDP agent will also update where a MAC is located.

Why does this happen when the two come together?

The challenge comes when VMK0 moves to a different physical switch port and tries to move the MAC address with it. You get a fun battle between the LLDP agent of the physical port and the VMK0 that is behind a different physical port. A good old-fashioned duplicate-MAC fight ensues, and it manifests as a host going offline completely, or flipping back and forth based on the update/hold-down interval on the switch. (Side note: any real networking people, feel free to correct me on my terminology here; I dropped out of my CCNA class in 2008.)

Why did I lose more than management (or what am I doing wrong)?

Given most people use VMK0 for management by vCenter (and, for non-vSAN clusters, HA heartbeats happen here), this can have a lot of interesting behaviors, like loss of management or the host isolation response being triggered. This is another great reminder of why you should use datastore heartbeating, or vSAN, which will not depend on VMK0 for heartbeats.

Also, if you are running EVERYTHING on VMK0 (storage, vMotion), which is NOT a recommended practice (isolate your storage and vMotion networks!), you could see all of the virtual machines crash and other fun things.

Workarounds?

So there are a few ways to possibly work around this.

  1. You could simply avoid using VMK0 with this card. Either disconnecting it and using a new VMK4 or so forth for whatever it was being used for. This is simple, it’s easy (outside of disconnecting and reconnecting hosts) and doesn’t require you touch the network beyond having one extra IP on the management network to make it easier.
  2. You could change the MAC address manually to something in the random VMware MAC address space (need to clarify if this is supported, but it’s simple enough and avoids this issue). Note that the MAC would be set back if you ever remove and recreate VMK0.
  3. If you trust your networking team, you could try asking them to hardcode the MAC address to specific ports in the CAM tables of the switch. I would look at this only as a last resort, if operationally you can’t physically change anything on the hosts but need an extreme workaround.
  4. *EDIT* It looks like running LACP across the original physical port and another port will work around the issue. The switch isn’t going to care where the frame comes from, so this should avoid the ARP fight. Balancing for VMK0 across physical ports will not be great, but as long as it is management only you will likely not care too much. (Thanks to Simon for this discussion.)
  5. *EDIT* Try putting VMK0 on a tagged NON-Native VLAN. It can’t get in a fight with the LLDP agent for the MAC address if it’s on a completely different broadcast domain (Thanks to Broc Yanda for this idea).

What else is going on that I don’t know about vSphere Networking?

This week I also learned about shadow vmnics.

Tango Eagle Bravo

*Coming to a vSAN support call near you*

“Sir, It looks like Tango Eagle Bravo is the problem”.

 

Why does this sound like something out of a Nicolas Cage movie? Let me explain.

Today, vSAN out of the box can phone home performance, configuration, and health telemetry to support and engineering using the vSAN Support Insight functionality. Note this phone-home data builds an obfuscation map by default, so that hostnames, virtual machine names, and network information are not exposed. By using your vCenter UUID, support and engineering can drill further into the environment and diagnose many common issues without necessarily needing a full manual log collection.

If you want to inspect a sample of what it looks like you can read through this JSON file here.

What happens when Support finds an issue and explains the secret code name for the Virtual Machine or host that is the problem? Where do you find a secret decoder ring to make sense of this?

In the vSphere Web Client, navigate to the vSAN Cluster > Configure > vSAN > Health and Performance > Online Health Check. Click on the Download Obfuscation Map

In the CLI on the VCSA?

  1. SSH into vCenter Server Appliance.
  2. Run command: cd /var/log/vmware/vsan-health/
  3. The obfuscation mapping file is <uuid>_obfuscationTableForHuman.json.gz.
  • Windows Environment:
    1. Login to Windows vCenter Server machine.
    2. Open C:\Program Files\VMware\vCenter Server\logs\vsan-health
    3. The obfuscation mapping file is <uuid>_obfuscationTableForHuman.json.gz.
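
Back on the VCSA, once you have located the file, a quick way to read it (substitute your own UUID for the placeholder) is:

cd /var/log/vmware/vsan-health/
zcat <uuid>_obfuscationTableForHuman.json.gz | less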

What if you are not phoning home CEIP data? 

It’s time to turn it on. It’s less information than a normal log collection would include, and by having it phone home regularly you are in a better situation to get faster support should you need it. For setup and network requirements check out this storage hub section.

What happens if you do not have compliance needs requiring you to speak in code, and would rather VMware just have direct access to your virtual machine and host names? You can email the map, or upload and attach it to the ticket. Support can bind it in vSAN Support Insight, but it will expire in 7 days.

 

 

What is in the obfuscation map?

Here is a sample.map file.

Protected: The performance problem is in another castle


So you are thinking about taking an offer… What do you need to know?

It’s a new year, and I’m sure some of you had resolutions to look for a new job. As budgets “unfreeze”, new jobs are opening.

This post has some history. Prior to coming to VMware I worked for an IT consultancy and was, at one time, a hiring manager. It was always interesting seeing why people chose to stay, leave, or join our shop. Even when people left, I heard about their future moves by being a common reference for their previous manager. On top of this, new hires would often ask me a million different questions about a job, trying to compare the old company with the new company regarding benefits (both compensation and non-compensation related).

From this, I’ve amassed an interesting list of:

1. Compensation that often is overlooked
2. Things you want to know about a job before you take it for the quality of life reasons
3. How to know if the grass is greener (or not) on the other side

While this list isn’t something you would send in whole to a recruiter, it’s information that through various sources you might want to try to understand before making a jump to a new job. The first half is Job questions; the second half is compensation questions.

 

The Job Questions…

What’s the team/dept/company’s view on training?

If they don’t have a training programme or allow time for training/skills improvement that could be a red flag.

Why is the position open?

Growth, backfill, etc. This is the reverse of “why are you looking to leave your last job?”. If it’s the 3rd time they have tried to fill the role, something may be off…

What are the expected hours? What are the exceptions, holidays, etc.?

I once worked an outage till 4 AM and then was expected to walk into the office by 8:30 AM; I was happy to leave that place. School districts might do four-day work weeks in the summer; some oil/gas companies do 4 x 10’s or other weird schedules. Occasionally I have to take calls early or late (to deal with people in EMEA, ANZ, etc.).

Are there SLA’s in place?

What is expected of your team, and are they equipped to meet it?

What is the annual IT/Department budget?

What does the budget for your group look like? What projects have been funded, as well as what is planned to be funded, can be a proxy for this question. You don’t want to walk into a shop with 8-year-old systems and no budget for replacement.

Who determines the IT budget?

What’s the process, who are the actors involved?

What’s the company’s position on opex/capex IT spend?

Lease vs. buy. Are they balanced, or, for financial reporting reasons (ROIC), are they 100% one or the other wherever possible?

Are they cloud (friendly, neutral, hostile)?

What are they using for cloud now, and what are they planning on migrating?

What does your current infrastructure look like?

Shiny brand-new VxRail/UCP-HC cluster, or 200 physical servers running Windows 2000? How bad is the technical debt? What/where are the datacenters, who are the providers, what is the networking (WAN, and edge/campus gear)? What storage vendors and hypervisors are in play?

What is the spread of the tasks expected and are they reasonable?

There’s nothing like being hired to be a data center architect and discovering that fixing printers is in your responsibilities. Skill growth requires you to focus on things that matter. Also, if managers see you fixing printers or doing other lower-end work, they tend to mentally associate what you should be paid with the bottom 10% of work rather than the top 10%.

What Services are outsourced?

Does someone manage printers, the WAN circuits, the storage, backups, the DR, etc.? Beware shops that don’t believe in outsourcing anything as they tend to view in-house labor as a “free” commodity.

What are they doing for DR? 

This question is a mix of what is their plan, and what is a reality. How often is it tested? Do they hit the SRM failover button once a quarter, or do they have an out of date binder?

What is the targeted refresh cycle for Network/Servers/Storage?

Do they run stuff five years, ten years? Do they run gear beyond its natural life, or beyond support agreements?

What is the maintenance schedule?

Do they patch at all, is there automation in patching?

What compliance do they have?

PCI/HITECH/etc.

The Team Questions

Who will be the manager? Can I meet them? – It’s a red flag if you can not meet your line manager. You will want to know the person who will assess your performance, impact your bonus, assign you good (or bad) projects, etc…

What is your biggest daily/weekly frustration?

Key things to note is if this is something you can stand, or if this is something that’s fixable. Bonus points if you bring unique skills, or you will be working on a project to fix it. “Our Fibre Channel network is slow, but the HCI project you will be on should fix that!”

How is success measured?

Is there a forced Stack rank? Are there general metrics that you target (uptime, on-time delivery of projects?).

Who is on the team? Can I meet them? Knowing who you work with is crucial. Are they talented, friendly, cooperative?

How does the team communicate? Are there daily meetings, do they use Slack, do they just use email, is everyone in the same building? What percentage of the team works remotely?

How is documentation handled? (Well-documented wiki, vs. the last guy torched Jira on the way out and you will be guessing passwords.)

What are the platforms and vendors?  Are you a CCIE and it’s an all-Juniper shop? Don’t be scared! The key is knowing in what areas you will need training.

What is the new employee onboarding process?  – Will it be two days of well-orchestrated events, or will you still be waiting for a phone and computer 30 days later?

What are expectations for the first 90, 180, 365 days? Is there a project, milestone, or education path they need you to have accomplished? How long do they expect it to take you to grow into the role?

What is the cross coverage?

Is there only one person who knows how to restore from backup? Is there cross training? This can be bad if you want to go on vacation…

What is the upward mobility? 

What are the expectations for moving up in title, rank, role/responsibility? Are there defined steps to your career path, or will you be “IT Dude” until “Head IT Dude” retires?

*What about the Company*

What’s the company’s roadmap?

If they don’t know where they want to go, then it’s going to be difficult to help steer them there.

What is the YoY growth? Is the company growing, or is it holding on life support? Some industries are cyclical (oil/gas); some are past their prime (Sun Microsystems was a very different company to work for in 1994 vs. 2001).

How many employees are at the company? At a five-man company, you might have to put toner in the printer. At IBM, you likely will not know that person’s name. Some people like large companies, some like smaller ones. There are pros and cons to both.

Is the company profitable under GAAP? Companies sometimes do crazy things like claim they are profitable if you exclude employee compensation. If a company is a tech startup growing 100% year over year, don’t expect this one to be true, but if it’s a mature public company, this is something you can look up.

If not, what is the timeline or pathway towards profitability? If it’s a startup, it may be planning on exiting soon, or taking more VC and growing to the moon. Both have their risks, make sure you understand them. What is the runway (how long at the current burn rate will they survive)?

What is the company’s competitive advantage? Is it low cost? Is it intellectual property? Is it market saturation/penetration? This can shed some light on how the company operates. A ruthless lean-manufacturing company might give employees 8-year-old laptops because they are cheap on capital spending.

What is the biggest roadblock to scaling the company?

Is it sales, marketing, operations, R&D?

What challenges does the company have at the moment? What do you foresee coming?

This can be quite telling; it can show that they’ve taken the time to identify and address challenges. Identifying key competitors here can help quite a bit.

Compensation Questions

1099 or W-2 (US)? Contractor? A contractor who is a W-2 of the contracting company? Full-time employee of the end customer? LOTS of ways to chop this. There are tax implications to being 1099. Note, there are potential issues with being a 1099 as a tech worker if you are treated like a full-time employee.

Pay cycle – You shouldn’t be living paycheck to paycheck, but knowing the cycle makes sense; if you’re rolling from a weekly to a monthly cycle, you may need to move some things around to handle the change in cash flow.

Salary base and its growth – Can it grow? Is there an org chart with clear steps to moving up and getting bumps in pay? Does everyone get 1% raises and stagnate till they leave? A company that hasn’t given raises in 5 years has given everyone a pay cut.

OTE bonus – Cash value, or a multiplier based on base pay? Tied to metrics, or to your boss and director’s random fancy? (This isn’t that bad, but you need to know who decides it.) While there is an “On Target Earnings” figure, nothing stops you from getting over 100%. The best way to see how real this is is to check with Glassdoor and existing employees who’ve been there 4-5 years. Sometimes a bonus is real; sometimes they are “virtual”. For the bonus, how often is it paid out, and will they pro-rate a partial bonus for a new employee joining mid-cycle? I once had a co-worker leave for a job that he thought paid 10% more, but he forgot to ask if they had a bonus. At the end of the year, he learned they didn’t have bonuses (or raises) and discovered he didn’t make more money.

Insurance – PPO/HSA/HMO/EPO/POS all have different issues. What’s in-network vs. out-of-network? Also dental and vision insurance. What about medications? Eyecare insurance is a scam/pre-payment program; use EyeBuyDirect or some online place to buy glasses, or max out your HSA and get LASIK if you can. Reddit has a good thread explaining the differences and how to compare.

Education

School, college, certifications, classes – Do they pay for certification tests, and if so, how many attempts? The key way to test the seriousness of this is to ask others in the department what they have spent in the past year.

Conferences – Tacking onto certifications, do they pay for VMworld? Do they cover travel and hotels? Are you banned from events in Vegas even if they are lower cost than San Francisco? (Not uncommon in SLED.)

Sabbatical – In our company you can apply for 3-month transfers to wildly different jobs to learn how that role functions, or do a 1-week education track (take education in something unrelated).

Stock and Investment Compensation

RSUs (Restricted Stock Units) – If you keep getting these every year on a standard 2-5 year vesting schedule (depends on company and grant window), you eventually end up with a rather nice kicker. This is also nice if your stock doubles within a given year (well, except for capital gains). The longer you stay, the stickier these become, and the more a company likes you, the more they will give you to “handcuff” you to the company. A decent 6-figure pile of these is nice and can be used as leverage with a company that wants to poach you: why they had better give you a bigger base (or a bigger pile of them!).

Stock options – Inversely, if you work for a startup, you might get stock options. These are a LONG-shot gambling game (like a 2% payoff), but I know some guys whose stock is trading in the $30s and whose options were in the $2 range, so assuming they make it to the end of the lockup, I expect to get a call to hang out on their yacht. Personally, there are so many ways to screw the employee, like clawback/ratchet clauses, that I don’t put much faith in these.  https://tldroptions.io

ESPP – Buy stock at a discount (see above comments). Note these are bought at a 10-15% discount off the beginning- or ending-window price (whichever is lower), so it’s a game of heads I win, tails you lose against the market and can pay pretty well (or just be a nice couple grand of cash). I’ve had windows where I made 15%; sometimes I’ve made 115%. These are structured so that you make money no matter what, but read the fine print.
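
A quick illustrative example of how the lookback plus discount stacks up (made-up numbers): if the stock is $100 when the window opens and $120 when it closes, you buy at 85% of the lower price, so 85% x $100 = $85, which is an immediate gain of about 41% if you sell at the closing price.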

ESOP – The weird retirement type cousin of ESPP. I hear these are more common overseas.

Flexibility in work

Paternity leave – Some places do partial pay, some do maternal OR paternal, and some do maternal only AFTER you burn through your PTO. Note maternity, paternity, and adoption leave may have different rules. I’ve got a family member whose company policy is six months; my wife is a pediatrician at a children’s hospital and she gets zero. This is all over the place in the US.

Vacation – My first job, I had zero vacation for the first year. Note that at some companies this is more negotiable than salary; sometimes it’s less. Are sick days separate? Do you need a doctor’s note? Are there blackout times for vacation (VMworld, I’m pretty sure, is a non-starter in my current role)? Do they make you take vacation for conferences? (Yes, I’ve seen this a lot, sadly…)

Flex time/overtime pay – Can you turn overtime into time off? If you come in early, can you leave early? Do you get paid for overtime (even if you’re an exempt employee, some places will still pay if approved)? Does the company miscategorize helpdesk as exempt, or engage in other legally questionable practices?

Commute Costs – Company Car, parking pass, bus pass, toll pass? What’s the non-reimbursed depreciation? What is the $ per mile they allow for trips to the datacenter? Do you get a car allowance (EMEA this is more common)?

Work from home/anywhere – Can I just leave town on Wednesday/Thursday and go to a beach house to finish out the work week? There are HUGE cost savings to working from home, but do pay attention to whether you need to supply your own desk, chairs, monitors, etc.

Expense

Do they let you do your own booking? Do they require a corporate credit card (no points can be brutal, to the point of $20-30K in lost compensation for some people)? Can you expense travel lounges on long flights? Can you expense more than $15 for lunch with a customer? Using Lyft instead of downtown and airport parking has cut my car mileage expense to nearly nothing.

Travel

Travel points and status – Traveling for work a lot adds up. Note this is NON-taxable (a weird exclusion). So when traveling, I can get hotel points and airline points. With Southwest, I have a companion pass (my wife flies free with me), and with Marriott, I get free cocktails and appetizers in the afternoon and breakfast in the morning in the executive lounge. I also get free upgrades with Marriott when traveling, so that $150 small room can turn into a 40th-floor suite sometimes.

Travel policy – Do they make you fly 18 hours, five hops, to save $100?

Do they put you in first class if the flight is over 4 hours?
Do you stay in the Motel 8 and have to share a room (or PAY for your spouse’s half of the room if they happen to travel with you!)? Do they make you fly in the morning you are presenting when it’s 12 time zones away, or do they put you up in the hotel for the weekend to adjust to the time zone and be a tourist?

Team offsites, outings, parties, etc. – Got a team offsite, and can you expense going snowmobiling or something cool? Beer bash for finishing a release? If you are on campus, are there free movie nights and other things? Does the boss cover happy hour on Friday?

Retirement stuff

401K – What’s the match? Is it partial? Does it take a while to get vested? What can you invest in? Are the default options all garbage, or can you keep fees low and put money into low-fee index funds?

401A – Like a 401K match, but you don’t have to put money in; they just put in x% of your salary. Common in education and non-profits.

457(b) – Can be withdrawn from without early penalty if you no longer work for said employer. This one carries risks if the employer goes insolvent.

403B – A lower-overhead 401K-style plan with no match. Common in education and non-profits.

Pension – These do exist in a few places still in the US. More common overseas.

MISC.

Equipment allowance – My wife spends money on books and stethoscopes. Some people can expense screens, laptops, mice. We have vending machines for phone chargers, mice, etc. around our offices.

Telecom – Will they cover your cell phone or data plan? Did they buy you a pager to get out of paying your cell phone bill (I had one of these in 2008)?

Gym reimbursement – Do they pay for gym memberships?

Negotiating Compensation

Look, I’m not an ex-recruiter (but I’m friends with some). I did run across this video from an ex-tech-company recruiter talking about how to deal with some common situations.

ESXi 6.5 Patch 2 – vSAN Support Insight!

ESXi 6.5 Patch 2 is out, and with it comes a product improvement that I’ve been excited about for quite some time. The KB for what’s new can be found here.

Three storage improvements came out with this release.

  • vSAN Support Insight (including a dedicated customer bulletin with more details on this feature)
  • Adaptive resynchronization (previously released for 6.0) – Adaptive Resync adjusts the bandwidth share allocated to resync I/O to minimize the impact on client I/O. With this feature, resync speed adjusts adaptively between off-peak and peak I/O cycles: during off-peak cycles resync speeds up, and during peak cycles it slows down. This ensures resync continues to make progress while minimizing the impact on client I/O.
  • Multipath support for SAS systems – “vSAN now enables multiple redundant paths from server to storage with no setup required, when used with a supported multipath driver. An example of such a system is HPE Synergy.”

vSAN Support Insight is revolutionary in its ability to change the support experience and accelerate product improvements. Support for vSphere has typically revolved around a predictable script: you call in, and if your issue isn’t easily triageable you may need to export logs. This process has some challenges because:

1. It takes time to pull logs and upload them.

2. If the issue your cluster has impacts availability of the logs, this can drag out getting a resolution.

3. Additional logs may be needed to compare before/after the issue.

On the support side of things, the initial call often begins with you trying to articulate your issue and describe your environment and any relevant details. The support staff are essentially “blind” on that initial call until you can describe enough of the environment, push logs, or set up a WebEx/remote session to show the issue.

vSAN Support Insight helps with these challenges by automatically pushing configuration, health, and performance telemetry to VMware. Removing these delays is critical to improving support outcomes. This phone-home data set also provides a framework for future product improvements, future support enhancements, and better cross-correlation of issues for engineering.

Blog
blogs.vmware.co…upport-insight/

Video
storagehub.vmwa…-demonstration/

StorageHub Documentation
storagehub.vmwa…support-insight