Tuesday, July 28, 2009

iostat: (r/s + w/s) * svctm = %util on Linux

iostat -x is very useful for checking disk I/O activity. It is often said to "check that %util is less than 100%" or "check that svctm is less than 50ms", but please do not trust these numbers blindly. For example, the following two cases (a DBT-2 load on MySQL) used the same disks (two HDDs, RAID1) and both reached almost 100% util, but their performance was very different (no.2 was about twice as fast as no.1).
# iostat -xm 10
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          21.16   0.00     6.14    29.77    0.00  42.93

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        2.60  389.01  283.12  47.35   4.86   2.19     43.67      4.89  14.76   3.02  99.83

# iostat -xm 10
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          40.03   0.00    16.51    16.52    0.00  26.94

Device:  rrqm/s  wrqm/s     r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        6.39  368.53  543.06  490.41   6.71   3.90     21.02      3.29   3.20   0.90  92.66

100% util does not mean the disks cannot go any faster. For example, command queuing (TCQ/NCQ) or a battery-backed write cache can often boost performance significantly. For random-I/O-oriented applications (most cases), I pay attention to r/s and w/s: r/s is the number of read requests that were issued to the device per second, and w/s is the number of write requests that were issued to the device per second (copied from the man page). r/s + w/s is the total number of I/O requests per second (IOPS), so it is easy to check whether the disks are working as expected. For example, a few thousand IOPS can be expected from a single Intel SSD drive. For sequential I/O operations, r/s and w/s can be significantly affected by Linux parameters such as max_sectors_kb even when throughput does not differ, so I also check other iostat variables such as rrqm/s and rMB/s.
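If you want to watch IOPS outside of iostat, here is a minimal sketch (my illustration, not code from iostat itself) that computes r/s, w/s and %util for one device by sampling /proc/diskstats twice; the device name "sdb" and the 1-second interval are assumptions you would adjust:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct sample {
    unsigned long long rd_ios;    /* field 1 after the name: reads completed */
    unsigned long long wr_ios;    /* field 5: writes completed */
    unsigned long long tot_ticks; /* field 10: milliseconds spent doing I/O */
};

/* Scan /proc/diskstats for the named device; return 0 on success. */
static int read_sample(const char *dev, struct sample *s)
{
    char line[256], name[32];
    FILE *fp = fopen("/proc/diskstats", "r");
    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        /* major minor name rd_ios rd_merges rd_sectors rd_ms
           wr_ios wr_merges wr_sectors wr_ms in_flight io_ms ... */
        if (sscanf(line, "%*u %*u %31s %llu %*u %*u %*u %llu %*u %*u %*u %*u %llu",
                   name, &s->rd_ios, &s->wr_ios, &s->tot_ticks) == 4
            && strcmp(name, dev) == 0) {
            fclose(fp);
            return 0;
        }
    }
    fclose(fp);
    return -1;
}

int main(void)
{
    const char *dev = "sdb"; /* assumed device name */
    const int interval = 1;  /* seconds between samples */
    struct sample a, b;

    if (read_sample(dev, &a) != 0)
        return 1;
    sleep(interval);
    if (read_sample(dev, &b) != 0)
        return 1;

    double rs = (double) (b.rd_ios - a.rd_ios) / interval;
    double ws = (double) (b.wr_ios - a.wr_ios) / interval;
    /* tot_ticks is in ms: ms of I/O per (interval * 1000) ms of wall clock */
    double util = (double) (b.tot_ticks - a.tot_ticks) / (interval * 10.0);

    printf("r/s=%.2f  w/s=%.2f  IOPS=%.2f  %%util=%.2f\n", rs, ws, rs + ws, util);
    return 0;
}

Since this reads the same kernel counters iostat reads, its output should track iostat's r/s, w/s and %util closely.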

What about svctm? Actually, Linux's iostat calculates svctm from r/s, w/s and %util. Here is an excerpt from iostat.c (annotations mine):

...
nr_ios = sdev.rd_ios + sdev.wr_ios;            /* reads + writes in the interval */
tput   = ((double) nr_ios) * HZ / itv;         /* = r/s + w/s */
util   = ((double) sdev.tot_ticks) / itv * HZ; /* time spent doing I/O */
svctm  = tput ? util / tput : 0.0;             /* service time = util / IOPS */
...
/* rrq/s wrq/s r/s w/s rsec wsec rkB wkB rqsz qusz await svctm %util */
printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",
       ((double) sdev.rd_merges) / itv * HZ,   /* rrqm/s */
       ((double) sdev.wr_merges) / itv * HZ,   /* wrqm/s */
       ((double) sdev.rd_ios) / itv * HZ,      /* r/s */
       ((double) sdev.wr_ios) / itv * HZ,      /* w/s */
...

The latter (the printf arguments) means the following:
r/s = sdev.rd_ios / itv * HZ
w/s = sdev.wr_ios / itv * HZ

The former (the svctm calculation) means the following:
svctm = util / ((sdev.rd_ios + sdev.wr_ios) * HZ / itv)
      = util / (r/s + w/s)

If %util is 100%, svctm is just 1 / (r/s + w/s) seconds, or 1000 / (r/s + w/s) milliseconds: the inverse of IOPS. In other words, svctm * (r/s + w/s) is always 1000 when %util is 100%, so checking svctm is practically the same as checking r/s and w/s (as long as %util is close to 100%). The latter (IOPS) is much easier, isn't it?
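You can verify this against the two outputs above, using the general relation svctm = 1000 * (%util / 100) / (r/s + w/s):

no.1: 1000 * 0.9983 / (283.12 + 47.35) = 3.02 ms, matching the reported svctm of 3.02
no.2: 1000 * 0.9266 / (543.06 + 490.41) = 0.90 ms, matching the reported svctm of 0.90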

6 comments:

PaulM said...

So the take-home message is:

Use iostat -xm and look at svctm when %util is near 100%. Smaller svctm values are better, since they mean higher IOPS :)

If you look at the two outputs you produced, the 2nd set clearly shows this.

Thanks for the post. One of those nice posts that explain commonly used Unix tools in more detail.

Have Fun

famzah said...

This is so not accurate :)

"%util" is the total time the device spent in doing I/O (sdev.tot_ticks) divided by the wall clock time (itv). This gives us an idea about how busy a device is.

Example: If we are doing measurements every 1000 ms and for these 1000 ms the device spent 950 ms in doing I/O, then %util is 95%. It is very likely that this device cannot do any faster with the *current* I/O load.

"svctm" is the average service time for I/O requests. This means how much time the device spent on average for every I/O operation. This is the hardware time an I/O operation took to finish.

"svctm" is equal to the total time the device spent in doing I/O (sdev.tot_ticks) divided by the total number of I/O operations (nr_ios). Every device has its mechanical characteristics on how quickly it can do an I/O operation.

For example, a typical hard disk device performs an I/O operation in less than 5 ms. If you have high values for "svctm", this usually means that your device performs poorly in regards to hardware and probably needs replacement.

---

Note that all of the above applies to whole devices (e.g. /dev/sda), not to partitions (e.g. /dev/sda2). As you can see in "Documentation/iostats.txt", there are also statistics for partitions, but a discussion of those would require even more explanation.

Yoshinori Matsunobu said...

famzah,

If disk configurations are not optimal, disks might reach util=100% very quickly, but that is not the upper limit. Check the two iostat results in my blog entry again: the only difference between them is whether the write cache on the disks is enabled or not. People might decide to purchase additional H/W components when seeing the first iostat result (because %util is close to 100% and svctm seems small enough (3.02)), but in this case that would be a wrong decision. Checking multiple iostat parameters (IOPS + %util, or svctm + %util) and understanding typical IOPS in database applications (IOPS on HDDs can sometimes be high thanks to the write cache and command queuing) are important, and that is what I wanted to convey in this post.

famzah said...

Yoshinori,

I fully agree that %util=100% doesn't always mean the device cannot perform faster. It means it cannot perform faster with the current "configuration" and "load".

Tweaks like enabling the cache can surely increase the performance of a device. That does not make %util less trustworthy, though :) And yes, you should check at least IOPS + %util to see whether you have reached your hardware limits or whether there is some configuration issue.

However, I still cannot grasp your idea about "svctm"... We should check "svctm" at all times, not only when %util=100%. And regardless of the device's %util or IOPS load, "svctm" should always stay below a certain hardware-dependent value (like below 5 ms for most hard disks).

P.S. I do not want to start a war here, I'm just commenting on the semantics of "%util" and "svctm".

Yoshinori Matsunobu said...

famzah,

Yes, I agree that checking svctm is a good practice. svctm might be high when IOPS is low, even if %util is low (svctm = %util / IOPS, so low IOPS makes svctm high); for example, %util=5% with 5 IOPS gives svctm = 1000 * 0.05 / 5 = 10 ms. http://www.itworld.com/UIR961001perf is interesting, though I have not encountered this on Linux yet.

famzah said...

I haven't encountered this recently either, and I'm currently monitoring the "svctm" values of more than 230 Linux servers.

This seems like a (rare) bug, not a rule. In any case, when "svctm" is high, people should determine the following:
* is my device failing? (examine IOPS + util, and run benchmarks)
* am I hitting the iostat bug? (in 99.9% of cases it's the first one, a hardware failure)

My recent experience shows that "svctm" never exceeds the expected maximum hardware service time under normal conditions :)

Anyway, all is clear now and I have no further comments. Thanks for the article, it helps people understand "%util" and "svctm" better.
