Queue Limits
I/O passes through several storage queues on its way to the disk drives. On the VMware side there are three of them: the VM queue, the LUN queue and the HBA queue. The VM and LUN queues usually have a depth of 32 operations. This means that an ESX host can have no more than 32 active operations to a LUN at any moment. The same is true for VMs: each VM can have up to 32 active operations to a datastore. If multiple VMs share the same datastore, their combined I/O still cannot exceed the LUN queue limit of 32 operations (in vSphere 5 the per-LUN queue depth for QLogic HBAs was increased from 32 to 64 operations). The HBA queue is much larger and can hold several thousand operations.
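The way these limits stack up can be illustrated with a small Python sketch. It is a deliberately simplified model, not a description of how the VMkernel scheduler really works: the function name `split_active_queued` and its parameters are made up for illustration, and the default depths of 32 are simply the usual values quoted above.

```python
def split_active_queued(outstanding_per_vm, vm_queue_depth=32, lun_queue_depth=32):
    """Simplified model: return (active, queued) commands for one LUN on one host.

    outstanding_per_vm: outstanding commands each VM wants to issue to the LUN.
    Both depths default to the usual value of 32 mentioned in the text.
    """
    # Each VM can hand at most vm_queue_depth commands to the LUN queue;
    # anything beyond that waits in the VM's own queue.
    admitted = [min(n, vm_queue_depth) for n in outstanding_per_vm]
    waiting_in_vm_queues = sum(outstanding_per_vm) - sum(admitted)

    # All VMs on the host share one LUN queue, so their combined
    # active commands cannot exceed lun_queue_depth.
    active = min(sum(admitted), lun_queue_depth)
    waiting_for_lun_slot = sum(admitted) - active

    return active, waiting_in_vm_queues + waiting_for_lun_slot


if __name__ == "__main__":
    # Two VMs with 20 outstanding commands each share one datastore.
    print(split_active_queued([20, 20]))
```

With two VMs at 20 outstanding commands each, the model reports 32 active and 8 queued commands: neither VM hits its own limit, but together they overflow the LUN queue.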
Queue Monitoring
You can monitor the storage queues of an ESX host from the console. Run “esxtop” and press “d” to view disk adapter statistics, then press “f” to open the field selection and press “d” to add Queue Stats.
The AQLEN column shows the queue depth of the storage adapter. CMDS/s is the real-time number of commands per second (IOPS). DAVG is the latency of the “driver – HBA – fabric – array SP” path and should stay below 20 ms; higher values mean that the storage is not coping. KAVG is the time an operation spends in the hypervisor kernel and should stay below 2 ms.
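These two rules of thumb are easy to turn into a check. The following Python sketch assumes you have already read DAVG and KAVG (in milliseconds) off the esxtop screen; the function and constant names are hypothetical, and the thresholds are just the 20 ms and 2 ms values given above.

```python
DAVG_LIMIT_MS = 20.0  # device latency threshold quoted in the text
KAVG_LIMIT_MS = 2.0   # kernel latency threshold quoted in the text


def check_adapter_latency(davg_ms, kavg_ms):
    """Return a list of warnings for one adapter's DAVG and KAVG readings."""
    warnings = []
    # The text asks for DAVG to be *less than* 20 ms, so 20 ms itself is flagged.
    if davg_ms >= DAVG_LIMIT_MS:
        warnings.append("DAVG %.1f ms: storage is not coping" % davg_ms)
    # Likewise KAVG should be less than 2 ms.
    if kavg_ms >= KAVG_LIMIT_MS:
        warnings.append("KAVG %.1f ms: commands are waiting in the kernel" % kavg_ms)
    return warnings


if __name__ == "__main__":
    for message in check_adapter_latency(davg_ms=25.0, kavg_ms=0.3):
        print(message)
```

An empty list means both values are within the limits.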
Press “u” to see disk device statistics. Press “f” to open the field selection and press “f” to add Queue Stats. Here you will see the number of active (ACTV) and queued (QUED) operations per LUN. %USD is the percentage of the queue depth in use. If %USD reaches 100 and operations appear in the QUED column, then again the storage cannot keep up with the load and you need to redistribute the workload across spindles.
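As a rough illustration of that rule, the Python sketch below computes %USD as the share of the LUN queue depth (the LQLEN column) taken by active commands and flags a LUN whose queue is full while commands are waiting. The function names are hypothetical and the inputs are assumed to be copied from the esxtop screen.

```python
def percent_used(actv, lqlen):
    """%USD: share of the LUN queue depth taken by active commands, in percent."""
    if lqlen <= 0:
        raise ValueError("queue depth must be positive")
    return 100.0 * actv / lqlen


def lun_is_saturated(actv, qued, lqlen):
    """True when the queue is full (%USD at 100) and commands are waiting."""
    return percent_used(actv, lqlen) >= 100.0 and qued > 0


if __name__ == "__main__":
    print(percent_used(16, 32))          # half of a 32-deep queue in use
    print(lun_is_saturated(32, 5, 32))   # full queue with commands waiting
```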
Other Disk Statistics
· CMDS/s
Number of commands issued per second.
· READS/s
Number of read commands issued per second.
· WRITES/s
Number of write commands issued per second.
· MBREAD/s
Megabytes read per second.
· MBWRTN/s
Megabytes written per second.
Latency values are reported for all IOs, read IOs and write IOs. All values are averages over the measurement interval.
· All IOs: KAVG/cmd, DAVG/cmd, GAVG/cmd, QAVG/cmd
· Read IOs: KAVG/rd, DAVG/rd, GAVG/rd, QAVG/rd
· Write IOs: KAVG/wr, DAVG/wr, GAVG/wr, QAVG/wr
· GAVG
This is the round-trip latency that the guest sees for all IO requests sent to the virtual storage device.
Q: What is the relationship between GAVG, KAVG and DAVG?
A: GAVG = KAVG + DAVG
· KAVG
These counters track the latency that a command incurs inside the ESX kernel.
The KAVG value should be very small in comparison to the DAVG value and should be close to zero. When there is a lot of queuing in ESX, KAVG can be as high as, or even higher than, DAVG. If this happens, check the queue statistics, which are described under Queue Statistics.
· DAVG
This is the latency seen at the device driver level. It includes the round-trip time between the HBA and the storage.
DAVG is a good indicator of the performance of the backend storage. If IO latencies are suspected to be causing performance problems, DAVG should be examined. Compare IO latencies with corresponding data from the storage array. If they are close, check the array for misconfiguration or faults. If not, compare DAVG with corresponding data from points in between the array and the ESX Server, e.g., FC switches. If this intermediate data also matches DAVG values, it is likely that the storage is under-configured for the application. Adding disk spindles or changing the RAID level may help in such cases.
· QAVG
The average queue latency. QAVG is part of KAVG.
Response time is the sum of the time spent in queues in the storage stack and the service time spent by each resource in servicing the request. The largest component of the service time is the time spent in retrieving data from physical storage. If QAVG is high, another line of investigation is to examine the queue depths at each level in the storage stack.
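To see how the latency counters above fit together, here is a small Python sketch that takes KAVG, DAVG and QAVG readings and reports how much of the guest-visible latency comes from the device, the kernel and queuing. The function name and the returned keys are hypothetical; the only relationships used are the ones stated above, GAVG = KAVG + DAVG and QAVG being part of KAVG.

```python
def latency_breakdown(kavg_ms, davg_ms, qavg_ms):
    """Split guest-visible latency into its components (all inputs in ms)."""
    if qavg_ms > kavg_ms:
        # Assumption taken from the text: QAVG is a part of KAVG.
        raise ValueError("QAVG is part of KAVG and cannot exceed it")
    gavg_ms = kavg_ms + davg_ms  # GAVG = KAVG + DAVG
    return {
        "gavg_ms": gavg_ms,
        "device_share": davg_ms / gavg_ms if gavg_ms else 0.0,
        "kernel_share": kavg_ms / gavg_ms if gavg_ms else 0.0,
        "queue_share": qavg_ms / gavg_ms if gavg_ms else 0.0,
    }


if __name__ == "__main__":
    print(latency_breakdown(kavg_ms=2.0, davg_ms=6.0, qavg_ms=1.0))
```

A large device share points at the array or the fabric, while a large kernel or queue share points at queuing inside ESX.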
Queue Statistics:
· AQLEN
The storage adapter queue depth. This is the maximum number of ESX Server VMKernel active commands that the adapter driver is configured to support.
· LQLEN
The LUN queue depth. This is the maximum number of ESX Server VMKernel active commands that the LUN is allowed to have.
· WQLEN
The World queue depth. This is the maximum number of ESX Server VMKernel active commands that the World is allowed to have. Note that this is a per-LUN maximum for the World.
· ACTV
The number of commands in the ESX Server VMKernel that are currently active. This statistic is only applicable to worlds and LUNs. See also %USD.
· QUED
The number of commands in the VMKernel that are currently queued. This statistic is only applicable to worlds and LUNs.
Queued commands are commands waiting for an open slot in the queue. A large number of queued commands may be an indication that the storage system is overloaded. A sustained high value for the QUED counter signals a storage bottleneck which may be alleviated by increasing the queue depth.
Note that there are queues in different storage layers. You might want to check the QUED stats for both devices and worlds.
· LOAD
The ratio of the sum of VMKernel active commands and VMKernel queued commands to the queue depth. This statistic is only applicable to worlds and LUNs.
The sum of the active and queued commands gives the total number of outstanding commands issued by that virtual machine.
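The LOAD definition is a one-line formula, sketched here in Python for a single world or LUN. The function name is hypothetical, and the inputs are assumed to be the ACTV, QUED and queue depth values read from esxtop.

```python
def queue_load(actv, qued, queue_depth):
    """LOAD: (active commands + queued commands) / queue depth."""
    if queue_depth <= 0:
        raise ValueError("queue depth must be positive")
    return (actv + qued) / queue_depth


if __name__ == "__main__":
    # 32 active and 16 queued commands against a 32-deep queue.
    print(queue_load(32, 16, 32))
```

A value above 1 means that more commands are outstanding than the queue can hold, so some of them are waiting for a slot.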
Error Statistics:
· ABRTS/s
The number of commands aborted per second.
Aborts can indicate that the storage system is unable to meet the demands of the guest operating system. Abort commands are issued by the guest when the storage system has not responded within an acceptable amount of time, e.g. 60 seconds on some Windows operating systems. Also, resets issued by a guest OS on its virtual SCSI adapter are translated to aborts of all the commands outstanding on that virtual SCSI adapter.
· RESETS/s
The number of commands reset per second.
For NFS datastores press "u":
Reads/s, Writes/s, MBreads/s, MBwrites/s, cmd/s, GAVG/s (guest latency).
GAVG is the round-trip response time as perceived by the guest operating system. It is calculated with the formula GAVG = DAVG + KAVG.