vSAN Sizing and RVtools Tips

VMware has released a new vSAN sizing tool!

Guidance on how to use the tool is included in the design and sizing guide on StorageHub.

A lot of partners like using RVtools (a great way to make a simple capture of inventory, health, and configuration) to collect storage capacity information, as well as a snapshot of compute allocations.

  • If you have a large number of powered-off VMs, have a serious discussion about whether they will all be started or needed at any time. If not, consider excluding them from compute sizing.
  • Use the health tab to look for zombie VMs, and see if these cold VMs can be deleted or migrated out.
  • Look for open snapshots, and see if these need to be collapsed (which can save space).
  • Be aware of the difference between the two storage metrics (allocated vs. consumed MB). If you intend to keep using thin provisioning, you do not need to size for all of the allocated capacity. In the video, this is a significant capacity difference.
  • If the existing solution has VMs tied to storage demands (storage management VMs, VSAs) that will be deprecated by vSAN, be sure to exclude them.
  • Have a serious discussion about whether the vCPU to physical core ratio is “working” or whether they see performance issues. I’ve seen people be both too conservative (1:1 in test/dev) and too aggressive (20:1 for databases!). You can see the existing ratios on the host tab.
  • Pay attention to CPU generations. A vintage Xeon 5500 will be crushed clock for clock by new EPYC processors.
  • Realize you can change the CPU configuration (Cluster advanced options). Some people may want to optimize their CPU model for licensing (commonly 16 cores for Windows, or possibly fewer cores but a higher clock speed for Oracle). You can change these assumptions.
  • Be sure to check out the health tab, and look through the host configs. Make sure NTP is set up on hosts! Use this as an opportunity to see if the existing environment is even healthy.
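
To make a few of the tips above concrete, here is a minimal Python sketch that totals allocated vs. consumed storage and works out a vCPU to core ratio from a CSV export. Everything specific in it is an assumption for illustration: the column names (`VM`, `Powerstate`, `CPUs`, `Provisioned MB`, `In Use MB`), the sample rows, and the `summarize` helper are hypothetical, so check them against the headers in your own export. Powered-off VMs are left out of the vCPU count by default but still counted for storage, since they keep consuming capacity.

```python
import csv
import io

# Hypothetical sample data; column names are assumed, not taken from a real export.
SAMPLE_VINFO = """VM,Powerstate,CPUs,Provisioned MB,In Use MB
app01,poweredOn,4,204800,61440
db01,poweredOn,8,1048576,524288
old-test,poweredOff,2,102400,20480
vsa01,poweredOn,4,51200,51200
"""


def summarize(csv_text, exclude_vms=(), include_powered_off=False, physical_cores=None):
    """Total vCPUs and storage for sizing, skipping VMs that will not move to vSAN."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["VM"] not in exclude_vms]

    # Compute sizing: optionally ignore powered-off VMs.
    compute_rows = rows if include_powered_off else [
        r for r in rows if r["Powerstate"] == "poweredOn"
    ]
    vcpus = sum(int(r["CPUs"]) for r in compute_rows)

    # Storage sizing: allocated is the worst case, consumed is what thin provisioning uses today.
    summary = {
        "vcpus": vcpus,
        "allocated_mb": sum(int(r["Provisioned MB"]) for r in rows),
        "consumed_mb": sum(int(r["In Use MB"]) for r in rows),
    }
    if physical_cores:
        summary["vcpu_per_core"] = vcpus / physical_cores
    return summary


if __name__ == "__main__":
    # Exclude the storage appliance VM that vSAN would replace.
    print(summarize(SAMPLE_VINFO, exclude_vms={"vsa01"}, physical_cores=4))
```

The gap between `allocated_mb` and `consumed_mb` is the number to discuss if thin provisioning stays in place.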

Have any more tips and tricks? Check out the comments section below!

Where did my host go….

UPDATE: https://kb.vmware.com/s/article/53749
VMware and Intel have a KB for workarounds on this issue.

I was reading Bob Plankers' colorful complaints about his Intel X710/XL710/X722/XXV710 family of NICs and figured I’d do some digging, ask around among people I know who have them, and summarize some things I learned from using them as a customer.

A few observations:

  • These problems are not specific to vSphere. People running Linux and Windows on bare metal ran into these issues.
  • While a lot of attention has been focused on the LRO/TSO issue, there is another, separate issue tied to LLDP and duplicate MAC addresses being created.

First issue: LRO/TSO

This KB sums up the issue quite well by pointing out that these features can cause PSODs. According to some friends who used to be able to reproduce this at the drop of a hat, the newest driver/firmware is a lot more stable in this regard, but it can still happen. Some people are leaving these features disabled to stay safe, while others are hungry for the small CPU gains they deliver. How do I remediate it? Beyond manually setting it on the hosts, Jase McCarty has a great script that will do this in bulk for a cluster.

Next up: The case of the disappearing host!

The common symptom is that management on a host will cease to function (pings will drop) and the host will disappear from vCenter. Sometimes something more catastrophic happens (HA triggers, host isolation is triggered, storage or vMotion fails). If you pay close attention to LogInsight, you will see your switches are reporting MAC address flapping (you are sending your switches' syslog into LogInsight, RIGHT?!?).

So what’s going on here?

How is VMK0 special?

This goes back to the special behavior of VMK0, where it steals the MAC address from a physical port. This is handy for new cluster setups: people who get the MAC addresses from the OEM before delivery can put them in their DHCP reservations and get started without needing to physically touch the hosts to know which one is which.

Why is this card special?

This card is unique in that there is a special LLDP agent that runs on the card and intercepts LLDP packets. Previously I associated LLDP with simply sending information on what’s plugged in where (which is why you should turn it on for send/receive with your VDS). In this case, though, the LLDP agent will also update where a MAC is located.

Why does this happen when the two come together?

The challenge comes when VMK0 moves to a different physical switch port and tries to move the MAC address with it. The LLDP agent on the original physical port and the VMK0 behind a different physical port now both claim the same MAC. A good old-fashioned duplicate MAC entry ARP battle ensues, and this is going to manifest itself as a host going offline completely, or flipping back and forth based on the update hold-down interval on the switch. (Side note: any real networking people, feel free to correct me on my terminology here. I dropped out of my CCNA class in 2008.)
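
As a rough illustration of what the switch sees during this fight, here is a small Python sketch that counts how often a MAC address jumps between ports in a sequence of learn events. The `count_mac_moves` helper, the MAC addresses, and the port names are all made up for the example; this models the idea of a flapping MAC table entry, not any particular switch's log format.

```python
from collections import defaultdict


def count_mac_moves(learn_events):
    """Count port-to-port moves per MAC.

    learn_events is an iterable of (mac, port) pairs, in the order the
    switch saw frames with that source MAC.
    """
    last_port = {}
    moves = defaultdict(int)
    for mac, port in learn_events:
        mac = mac.lower()
        # A move is the same MAC showing up on a different port than last time.
        if mac in last_port and last_port[mac] != port:
            moves[mac] += 1
        last_port[mac] = port
    return dict(moves)


# Hypothetical events: the same MAC alternates between two ports.
events = [
    ("AA:BB:CC:00:00:01", "Eth1/1"),  # frame from the agent on the original port
    ("aa:bb:cc:00:00:01", "Eth1/2"),  # VMK0 traffic after it moved uplinks
    ("aa:bb:cc:00:00:01", "Eth1/1"),
    ("aa:bb:cc:00:00:01", "Eth1/2"),
    ("aa:bb:cc:00:00:02", "Eth1/3"),  # a well-behaved MAC that never moves
]

if __name__ == "__main__":
    print(count_mac_moves(events))
```

A MAC that keeps racking up moves between the same two ports is the pattern to look for in the switch logs.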

Why did I lose more than management (or what am I doing wrong!)?

Given most people use VMK0 for management by vCenter (and for non-vSAN clusters HA heartbeats happen here), this can produce a lot of interesting behaviors, like loss of management or the host isolation response being triggered. This is another great reminder of why you should use datastore heartbeating, or vSAN, which will not depend on VMK0 for heartbeats.

Also, if you are running EVERYTHING on VMK0 (storage, vMotion), which is NOT a recommended practice (isolate storage and vMotion networks!), you could see all of the virtual machines crash and other fun things.

Workarounds?

So there are a few ways to possibly work around this.

  1. You could simply avoid using VMK0 with this card, either by disconnecting it or by using a new VMK4 (or so forth) for whatever it was being used for. This is simple, it’s easy (outside of disconnecting and reconnecting hosts), and it doesn’t require you to touch the network beyond having one extra IP on the management network to make it easier.
  2. You could change the MAC address manually to something in the random VMware MAC address space (need to clarify if this is supported, but it’s simple enough and avoids this issue). Note that the MAC would be set back if you ever remove and recreate VMK0.
  3. If you trust your networking team, you could try asking them to hardcode the MAC address to specific ports in the CAM tables of the switch. I would look at this only as a last resort if operationally you can’t physically change anything on the hosts but need an extreme workaround.
  4. *EDIT* It looks like running LACP across the original physical port and another port will work around the issue. The switch isn’t going to care where the frame comes from, so this should reduce or eliminate the chance of an ARP fight. Balancing for VMK0 across physical ports will not be great, but as long as it is management only you will likely not care too much. (Thanks to Simon for this discussion).
  5. *EDIT* Try putting VMK0 on a tagged NON-Native VLAN. It can’t get in a fight with the LLDP agent for the MAC address if it’s on a completely different broadcast domain (Thanks to Broc Yanda for this idea).
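
For workaround 2, a quick sketch of picking a replacement MAC in Python. It assumes the range VMware documents for statically assigned virtual machine MACs (00:50:56:00:00:00 through 00:50:56:3F:FF:FF); whether that range is the right or supported choice for VMK0 is the same open question noted above. The `random_static_vmware_mac` helper is hypothetical, and you would still need to check the result does not collide with anything already on the network.

```python
import random


def random_static_vmware_mac(rng=random):
    """Return a random MAC in 00:50:56:00:00:00 - 00:50:56:3F:FF:FF (assumed range)."""
    fourth = rng.randint(0x00, 0x3F)  # fourth octet is capped at 0x3F in this range
    fifth = rng.randint(0x00, 0xFF)
    sixth = rng.randint(0x00, 0xFF)
    return "00:50:56:%02x:%02x:%02x" % (fourth, fifth, sixth)


if __name__ == "__main__":
    print(random_static_vmware_mac())
```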

What else is going on that I don’t know about vSphere Networking?

This week I also learned about shadow vmnics.