Thursday, October 2, 2014

Windows Perfmon counters I use to locate possible resource bottlenecks

Disk counters:

·         PhysicalDisk\Avg. Disk sec/Read: the average time, in seconds, to read data from the disk. If the value is larger than 25 ms (0.025 in Perfmon), the disk system is experiencing latency on reads. For mission-critical servers, the acceptable threshold is much lower. The most likely fix is to move to a faster disk system.

·         PhysicalDisk\Avg. Disk sec/Write: the average time, in seconds, to write data to the disk. If the value is larger than 25 ms (0.025 in Perfmon), the disk system is experiencing latency on writes. For mission-critical servers, the acceptable threshold is much lower. The most likely fix is to move to a faster disk system.

 
Summary:
The performance monitoring metric people quote most often is disk queue length. While this is an important counter, on SAN systems it is almost impossible to use as an accurate metric.

Why?
Because the usual "rule of thumb" for bad performance is a queue length greater than 2 per disk drive. However, when you have a SAN with 100 drives, you have no idea how many of them are actually backing your volume.

When you start focusing on response time, which is what really matters most, the queue length starts to become irrelevant...

1.  Perfmon reports these latency counters in seconds, so a number like ".010" means 10 milliseconds.

3. Note that performance problems are often tied to firmware revisions, HBA configuration, and/or BIOS issues.

4. I use the following table to determine the meaning of the data:

<10ms  excellent
<20ms  reasonable
>20ms  bad
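
The conversion from seconds and the table above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_disk_latency` is made up for this post, and it assumes you already have an Avg. Disk sec/Read or Avg. Disk sec/Write sample in seconds (for example from a Perfmon log exported to CSV). The table does not say what to do with exactly 20 ms, so the sketch assumes that counts as bad.

```python
def classify_disk_latency(avg_disk_sec):
    """Rate one Avg. Disk sec/Read or sec/Write sample (given in seconds)."""
    latency_ms = avg_disk_sec * 1000.0  # Perfmon reports seconds, e.g. .010
    if latency_ms < 10:
        return "excellent"
    if latency_ms < 20:
        return "reasonable"
    # Assumption: exactly 20 ms is treated the same as anything above it.
    return "bad"


if __name__ == "__main__":
    for sample in (0.005, 0.015, 0.030):
        print(sample, classify_disk_latency(sample))
```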

 
CPU counters:

·        System\Processor Queue Length:  number of threads in the processor queue. The server doesn't have enough processor power if the value is more than two times the number of CPUs.

·        Processor\% Processor Time: percentage of elapsed time the processor spends executing a non-idle thread. If the percentage is greater than 85 percent, the processor is overwhelmed and the server may require more processing power.
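
As a rough sketch, the two CPU rules above could be checked like this. The function and parameter names are hypothetical; the values are assumed to be sustained averages you have already collected, not single spikes.

```python
def cpu_warnings(queue_length, cpu_count, processor_time_pct):
    """Apply the two CPU rules of thumb to already-collected averages."""
    warnings = []
    # Rule 1: Processor Queue Length above two times the number of CPUs.
    if queue_length > 2 * cpu_count:
        warnings.append("Processor Queue Length is more than 2x the CPU count")
    # Rule 2: % Processor Time above 85 percent.
    if processor_time_pct > 85:
        warnings.append("% Processor Time is above 85 percent")
    return warnings


if __name__ == "__main__":
    # Example: 4 CPUs, queue length of 10, 90 percent busy.
    for line in cpu_warnings(10, 4, 90.0):
        print(line)
```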


Memory counters:

·         Memory\Cache Bytes: the amount of memory being used for the file system cache. There may be a disk bottleneck if this value is greater than 300 MB.

·         Memory / Available MBytes - at least 10% of physical memory should be free and available. Less than that usually indicates insufficient memory, which can increase paging activity. Consider adding more RAM if that happens.

·         Memory / Pages/sec – should not be higher than 1000. A sustained number higher than that usually indicates heavy paging, and there may be a memory leak.

·         Paging File / % Usage – should not be greater than 10%.

·         Memory\% Committed Bytes In Use: the ratio of committed virtual memory to the commit limit. A value greater than 80 percent indicates insufficient memory. The solution is to add more memory.
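
Put together, the memory thresholds above could be checked with something like the sketch below. All names are hypothetical, the inputs are assumed to be averages taken from a Perfmon log, and "300 MB" is assumed to mean 300 * 1024 * 1024 bytes.

```python
CACHE_BYTES_LIMIT = 300 * 1024 * 1024  # assumption: 300 MB read as MiB


def memory_warnings(total_mbytes, available_mbytes, cache_bytes,
                    pages_per_sec, paging_file_usage_pct,
                    committed_in_use_pct):
    """Apply the memory rules of thumb to already-collected averages."""
    warnings = []
    if cache_bytes > CACHE_BYTES_LIMIT:
        warnings.append("Cache Bytes above 300 MB: possible disk bottleneck")
    if available_mbytes < 0.10 * total_mbytes:
        warnings.append("Available MBytes below 10% of RAM")
    if pages_per_sec > 1000:
        warnings.append("Pages/sec above 1000: heavy paging, possible leak")
    if paging_file_usage_pct > 10:
        warnings.append("Paging File % Usage above 10%")
    if committed_in_use_pct > 80:
        warnings.append("% Committed Bytes In Use above 80%")
    return warnings


if __name__ == "__main__":
    # Example: a 16 GB server with about 1 GB available.
    for line in memory_warnings(16384, 1000, 400 * 1024 * 1024, 1500, 25, 90):
        print(line)
```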


Network counters:

·         Network Interface / Output Queue Length - measures the length of the output packet queue in packets.

healthy – 0
caution – 1-2
critical – >2
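
The same scale, written as a small Python helper. The function name is made up for illustration, and the value is assumed to be an Output Queue Length sample you have already collected.

```python
def classify_output_queue(queue_length):
    """Map a Network Interface Output Queue Length sample to the scale above."""
    if queue_length <= 0:
        return "healthy"
    if queue_length <= 2:
        return "caution"
    return "critical"


if __name__ == "__main__":
    for sample in (0, 1, 2, 5):
        print(sample, classify_output_queue(sample))
```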
