Tuesday 2 October 2012

vSphere vMotion network outages

During heavy vMotion operations I was experiencing intermittent network outages of ESXi hosts and some VMs running within the cluster.
This seemed to get progressively worse over a period of several weeks until, almost every time a host was placed into maintenance mode, there would be a network outage affecting some VMs and even other ESXi hosts within the same cluster.

After initially looking at the network infrastructure, we noticed a large flood of unicast traffic around the time of the vMotions on the VLAN that was shared by vMotion and some Windows-based VMs (such traffic is to be expected during vMotion operations).

Now, VMware best practice is to have vMotion and ESXi management on their own separate VLANs or networks, but this had never been an issue previously with this cluster, which was about four years old and had been upgraded over that time from ESX 3.5 to ESXi 5.0 U1 (its current state). There had also been no significant network changes during this period that could have had a waggling finger pointed at them, so it was not obvious how we had come to this issue.
It seemed reasonable to start thinking that the gradual changes and growth of the cluster had caused this issue for us. Over the various versions these hosts have run, the vMotion feature has been greatly enhanced and improved, and the number of simultaneous vMotions a host can support has also increased from 2 to 4 (or 8 with 10GbE), as can be seen here:

(taken from the vSphere 5.1 Documentation center here)
Network Limits for Migration with vMotion

Operation   ESX/ESXi Version   Network Type       Maximum Cost
vMotion     3.x                1GigE and 10GigE   2
vMotion     4.0                1GigE and 10GigE   2
vMotion     4.1, 5.0           1GigE              4
vMotion     4.1, 5.0, 5.1      10GigE             8

There had also been sizable growth in the number of hosts and virtual machines in this cluster, and the hosts had been increased in capacity along the way too.  This all resulted in a much heavier demand on vMotion when placing a host into maintenance mode, as I would often be looking at somewhere between 20 and 50 virtual machines being migrated across the cluster.

It turned out that this was in fact our issue, and as we had spare capacity within our hosts, due to the recent removal of some iSCSI connections to this cluster, we were able to hook up a couple of dedicated vMotion NICs per host and place them into their own VLAN, away from management and any other systems.
vSphere 5.0 gives us the ability to use more than one vMotion NIC per host. All that was needed was to create two new vmkernel ports on a new vSwitch, with vmkernel port 1 bound to vmnicX as active and vmnicY as standby, then just reverse the configuration for the second vmkernel port.
Once the two new vMotion ports were created and assigned IPs on the new VLAN, I just removed the old vMotion port which was in the shared VLAN, and that was all of the configuration needed.
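For anyone who would rather script this than click through the vSphere Client, below is a rough sketch of the same configuration using pyVmomi (the Python vSphere API bindings). The vSwitch and port group names, IP addresses and vmnic numbers are placeholders for illustration only, and "host" is assumed to be an already-retrieved HostSystem object; this is not how I did it at the time, just an equivalent automation sketch.

    from pyVmomi import vim

    def add_multinic_vmotion(host, nic_a="vmnic4", nic_b="vmnic5"):
        """Create a dedicated vSwitch with two vMotion vmkernel ports on one ESXi host."""
        net_sys = host.configManager.networkSystem

        # New standard vSwitch carrying only the two dedicated vMotion uplinks
        vss_spec = vim.host.VirtualSwitch.Specification()
        vss_spec.numPorts = 128
        vss_spec.bridge = vim.host.VirtualSwitch.BondBridge(nicDevice=[nic_a, nic_b])
        net_sys.AddVirtualSwitch(vswitchName="vSwitch-vMotion", spec=vss_spec)

        # Two port groups with mirrored active/standby NIC teaming
        for pg_name, active, standby in (("vMotion-1", nic_a, nic_b),
                                         ("vMotion-2", nic_b, nic_a)):
            pg_spec = vim.host.PortGroup.Specification()
            pg_spec.name = pg_name
            pg_spec.vswitchName = "vSwitch-vMotion"
            pg_spec.vlanId = 0  # 0 = untagged; set the dedicated vMotion VLAN ID if trunked
            policy = vim.host.NetworkPolicy()
            policy.nicTeaming = vim.host.NetworkPolicy.NicTeamingPolicy()
            policy.nicTeaming.nicOrder = vim.host.NetworkPolicy.NicOrderPolicy(
                activeNic=[active], standbyNic=[standby])
            pg_spec.policy = policy
            net_sys.AddPortGroup(portgrp=pg_spec)

        # One vmkernel port per port group, tagged for vMotion traffic
        vnic_mgr = host.configManager.virtualNicManager
        for pg_name, ip in (("vMotion-1", "10.10.50.11"), ("vMotion-2", "10.10.50.12")):
            nic_spec = vim.host.VirtualNic.Specification()
            nic_spec.ip = vim.host.IpConfig(dhcp=False, ipAddress=ip,
                                            subnetMask="255.255.255.0")
            vmk = net_sys.AddVirtualNic(portgroup=pg_name, nic=nic_spec)
            vnic_mgr.SelectVnicForNicType("vmotion", vmk)

The mirrored active/standby teaming is what lets vMotion drive both uplinks at once while each vmkernel port still has a failover path. Once the new ports are confirmed working, the old vMotion vmkernel port on the shared VLAN can be removed (RemoveVirtualNic on the host's network system), matching the manual steps above.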

I performed a few test migrations after that, and the performance improvement was easily visible even without measuring it.  Windows VMs with 4 GB of RAM used to take 1-2 minutes to move between hosts, and they are now completing within 30 seconds.

Best of all, when entering maintenance mode on a host, even one running many VMs, we are no longer getting any network outages and the process is a lot speedier too. Happy times again!
