Nothing kills a virtualization initiative faster than an inadequate or poorly implemented storage platform. Here’s how ESXTOP can help you track down an issue…
Nothing will hurt a virtual infrastructure, or rather the perceived experience of a virtual infrastructure, like a poorly designed, poorly implemented or misconfigured (or should I say, ‘sub-optimally configured’) storage platform. It’s the cornerstone of any virtual infrastructure and essential for the features that really drive a vSphere implementation: vMotion (and therefore DRS and HA), snapshots, backup strategies, you name it; your storage platform is key to making it effective in your environment. In an ideal world your storage design will incorporate capacity and workload requirements and will work right off the bat, but this is a less than perfect world and we don’t always get the luxury of a green field.
So, chances are that at some point in the career of every vSphere administrator or consultant, you’ll need to investigate a storage performance issue. It isn’t always evident at first that an issue lies with your storage platform; high guest CPU usage, for example, could well be caused by a storage bottleneck somewhere in the stack. So before you run off on a hiding to nothing to your storage team, it pays to first gather some evidence. Enter ESXTOP. There are many different issues you could be facing, but in general latency isn’t a bad place to start your investigations.
Either from the ESXi Shell or via SSH, enter ‘esxtop’ to bring up the interface.
What we’re interested in today is disk latency stats, so hit ‘u’ (esxtop’s keys are case-sensitive) to bring up the disk device statistics.
Next let’s focus on the latency, so hit ‘f’ to edit the fields.
The active fields are the ones with an * next to them; here we’ll de-select the ID field by hitting ‘b’. In fact, all we want for now is A and I, which give us the device name and the latency stats. Press space when done and you’ll have your latency stats.
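If you’d rather capture these counters for offline analysis, esxtop also has a batch mode (e.g. `esxtop -b -d 2 -n 30 > stats.csv`) that writes every counter to CSV. Here’s a minimal Python sketch of pulling just the latency columns out of such a capture. The sample data and exact column names below are illustrative, not copied from a real host, so check the header row of your own capture (the latency counters contain “MilliSec/Command” in their names):

```python
import csv
import io

# Illustrative stand-in for esxtop batch-mode output (esxtop -b ... > stats.csv);
# a real capture has one column per counter, named like
# \\host\Physical Disk(naa.xxx)\Average Guest MilliSec/Command
SAMPLE = (
    '"Time",'
    '"\\\\esx01\\Physical Disk(naa.600a)\\Average Guest MilliSec/Command",'
    '"\\\\esx01\\Physical Disk(naa.600a)\\Average Driver MilliSec/Command"\n'
    '"10:00:00","12.5","11.9"\n'
    '"10:00:02","26.1","25.4"\n'
)

def latency_rows(reader):
    """Yield (timestamp, {counter_name: value}) for latency counters only."""
    header = next(reader)
    wanted = {i: name for i, name in enumerate(header)
              if "MilliSec/Command" in name}
    for row in reader:
        yield row[0], {wanted[i]: float(row[i]) for i in wanted}

rows = list(latency_rows(csv.reader(io.StringIO(SAMPLE))))
for ts, stats in rows:
    for col, val in stats.items():
        if val > 20:  # flag anything that looks high for a closer look
            print(ts, col.split("\\")[-1], val)
```

From here you can graph the values over time, which is far easier to take to a storage team than a screenshot of the interactive view.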
As the below VMware diagram shows, ESXTOP will help you find out exactly where in the stack your issue lies, giving you stats for DAVG, KAVG, QAVG, and GAVG…
So what do these values mean? Let’s have a look.
GAVG – ‘Guest’ Average – This is the value that will be most indicative of any potential performance issues that might affect your VMs. GAVG represents the average amount of time an IO request takes to complete, as perceived by the VM guest OS. Essentially this value is the sum of DAVG and KAVG (QAVG is already counted as part of KAVG), and anything consistently higher than 5ms could indicate that the VM may be suffering due to storage latency.
KAVG – ‘Kernel’ Average – The KAVG value represents the average time each IO request spends in the VMkernel, from the moment it leaves the Virtual Machine Monitor (VMM) until it hits the HBA driver.
QAVG – ‘Queue’ Average – This is the average time each IO request spends in the host’s storage queue. In a properly configured environment this value should never be more than 1ms; any higher and you should look at increasing the queue depth to ease your issues.
DAVG – ‘Device’ Average – This indicates the average time each IO request spends on the storage hardware itself. The DAVG value is the one you need to take to your storage team as evidence that there’s something not quite right. By the time the IO reaches the DAVG counter it’s out of the vSphere stack and onto the physical storage hardware, so this could be an issue with drivers, the storage network, the controller, the array, the disks in the array and so on.
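The relationship between the counters (GAVG ≈ DAVG + KAVG, with QAVG a component of KAVG) lends itself to a quick first-pass triage script. This is only a sketch of that reasoning; the function name and thresholds here are hypothetical, so tune them for your own environment:

```python
def triage(davg, kavg, qavg, gavg_threshold=20.0):
    """Rough first-pass diagnosis from esxtop disk latency counters (all in ms).

    GAVG (guest-observed latency) is approximately DAVG + KAVG,
    and QAVG is already included within KAVG.
    Thresholds are illustrative, not official VMware guidance.
    """
    gavg = davg + kavg  # approximate guest-perceived latency
    if gavg < gavg_threshold:
        return "healthy"
    if davg >= kavg:
        return "device"  # latency is beyond the host: take DAVG to the storage team
    if qavg > 1.0:
        return "queue"   # IOs are waiting in the host queue: consider a deeper queue
    return "kernel"      # time is being spent inside the VMkernel itself

print(triage(davg=25.0, kavg=0.5, qavg=0.1))  # device-side latency
```

Run against each device in turn, this tells you at a glance whether to look at the host, the queue, or to go knock on the storage team’s door.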
There are so many facets to vSphere storage that ESXTOP will prove invaluable when someone comes to you and says ‘My VM is running slow’; storage stats and latency are a good place to start in pinpointing any issues. I hope this has been useful; for any further specific diagnostics, please feel free to get in touch!