Monday, March 10, 2014

Why buffered writes are sometimes stalled

 Many people think buffered write (write()/pwrite()) is fast because it does not do disk access. But this is not always true. Buffered write sometimes does disk access by itself, or waits for some disk accesses by other threads. Here are three common cases where write() takes longer time (== causing stalls).

1. Read Modify Write

Suppose the following logic. Opening aaa.dat without O_DIRECT/O_SYNC, writing 1000 bytes sequentially for 100,000 times, then flushing by fsync().
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 100000; i++) {
    write(fd, buf, 1000);
 You might think each write() will finish fast enough (at least less than 0.1ms) because it shouldn't do any disk access. But it is not always true.
 Operating System manages I/O by page. It's 4KB for most Linux environments. If you'd modify 1000 bytes of the 4KB page from offset 0, Linux first needs to read the 4KB page, modify first 1000 bytes, then write the page back. The page will be sooner or later written to disk. Yes, reading the page is needed. This is called RMW (Read Modify Write). If the page was not cached in filesystem cache (page cache), reading the page from disk is needed, which may take tens of milliseconds on HDD.
 This problem often happens when overwriting large files. You can easily repeat the problem by the following steps.
  • 1. Creating a large file (cached in fs cache)
 dd if=/dev/zero of=aaa bs=4096 count=1000000
  • 2. Uncache the file (i.e. echo 3 > /proc/sys/vm/drop_caches)
  • 3. Writing to the file (using write()/pwrite()) => the target page does not exist in fs cache. So reading from disk. You can verify that by iostat.
 There are a couple of solutions to avoid slow Read Modify Write.

Appending a file, not updating in place

 Appending a file means newly allocated pages are always cached, so slow Read Modify Write issue doesn't happen.
 In MySQL, binary logs are appended, not overwritten. InnoDB log files are always overwritten so this workaround can't be used.
 Note that if you need full durability (calling fsync/fdatasync() per each write()), appending is much more expensive than overwriting for most filesystems. It is well known that sync_binlog=1 is very slow in MySQL ~5.5, and the main reasons were group commit was broken and appending + fsync() was not fast. In 5.6 group commit was supported so multi-threaded write throughput improved a lot.

Always cache target files within filesystem cache

 If target pages are cached in filesystem cache (page cache), write() doesn't hit disk reads.
The obvious disadvantage is that this approach wastes memory. RAM is expensive. If you have 128GB InnoDB log files, you won't like to give 128GB RAM for InnoDB log files. The RAM should be allocated for InnoDB buffer pool.

Aligning write() I/O unit size by 4KB multiple

 If you always write with Linux page size aligned (4KB multiple), Read Modify Write issue can be avoided.
 One approach to do aligned write is zero-filling. Below is an example.
  memset(buf, 0, 4096); // zerofill
  memset(buf, data1, 1000); // set application data (1000 bytes)
  pwrite(fd, buf, 4096, 0); // write 4KB

  memset(buf2, 0, 4096);
  memset(buf2, data2, 1000);
  pwrite(fd, buf2, 4096, 4096);
 In this example, you write 1000 bytes of application data twice, but actually writing 4KB data twice, 8KB in total. 8192-2000=6192 bytes are zero-filled data. Disk reads don't happen by this approach, even if they are not cached in filesystem cache.
 This approach needs more space. In the above case you wasted 6192 bytes.
 Another approach is remembering application data offset, but writing by 4KB aligned.
  memset(buf, 0, 4096); // zerofill
  memset(buf, data1, 1000); // set application data (1000 bytes)
  pwrite(fd, buf, 4096, 0);

  memset(buf, 0, 4096);
  memset(buf+1000, data2, 1000); // set next application data from real offset
  pwrite(fd, buf, 4096, 0);
This approach doesn't waste space.
InnoDB log file write unit is not aligned by 4KB, so Read Modify Write issue exists. Some MySQL fork distributions fix the problem by writing 4KB aligned data. Facebook MySQL takes the latter approach, and we're using in production. Percona Server also supports aligned write but it's stated as beta quality.

2. write() may be blocked for "stable page writes"

Dirty pages

 Most write commands don't flush to disk immediately. write()/pwrite() functions write to Linux page cache and mark them dirty, unless the target file is opened with O_DIRECT/O_SYNC. write() and pwrite() are basically the same except that pwrite() has an option to set offset. I mean write() here as dirty page write.
 Dirty pages are sooner or later flushed to disk. This is done by many processes/functions, such as bdflush, fsync(), sync_file_range(). Behavior to flush to disk highly depends on filesystem. I consider only XFS here.
 When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done. This is called "Stable Page Write". This may cause write() stalls, especially when using slower disks. Without write cache, flushing to disk takes ~10ms usually, ~100ms in bad cases.

 Suppose the following example. There are two clients. One is writing 10 bytes repeatedly via write(), the other is calling fsync() to the same file.
Thread 1:
  fd=open("aaa.dat", O_WRONLY);
  while(1) {
    t1 = get_timestamp
    write(fd, buf, 10);
    t2 = get_timestamp

Thread 2:
  fd=open("aaa.dat", O_WRONLY);
  while(1) {
 If you run on a slower disks (HDD with write cache disabled), you may notice sometimes write() (t2-t1) takes more than 10ms. Taking time for fsync() is expected because fsync() flushes dirty pages and metadata into disk, but buffered write also sometimes takes time due to stable page write.

 Another annoying issue is while write() is blocked, the write() holds an exclusive inode mutex. It blocks all writes and disk reads to/from the same file (all pages within the same file).

Disabling Stable Page Writes

 Hopefully there is a patch to disable Stable Page Write, and some Linux distributions support it. 
With this patch, on most filesystems write() no longer waits for dirty page writeback. It helps not to cause write() latency spikes.

3. Waiting for journal block allocation in ext3/4

 If you are using ext3 or ext4, you may still suffer from occasional write() stalls, even if disabling stable page writes. This happens when write() waits for journal block allocation. Example stack trace is as follows. Easier workaround is using xfs, which gives no such stalls.

  Returning from: 0xffffffff81167660 : __wait_on_buffer+0x0/0x30 [kernel]
  Returning to : 0xffffffff811ff920 : jbd2_log_do_checkpoint+0x490/0x4b0 [kernel]
  0xffffffff811ff9bf : __jbd2_log_wait_for_space+0x7f/0x190 [kernel]
  0xffffffff811fab00 : start_this_handle+0x3b0/0x4e0 [kernel]
  0xffffffff811faceb : jbd2__journal_start+0xbb/0x100 [kernel]
  0xffffffff811fad3e : jbd2_journal_start+0xe/0x10 [kernel]
  0xffffffff811db37e : ext4_journal_start_sb+0x7e/0x1d0 [kernel]
  0xffffffff811bd757 : ext4_da_write_begin+0x77/0x210 [kernel]
  0xffffffff810deaea : generic_file_buffered_write+0x10a/0x290 [kernel]
  0xffffffff810e0171 : __generic_file_aio_write+0x231/0x440 [kernel]
  0xffffffff810e03ed : generic_file_aio_write+0x6d/0xe0 [kernel]
  0xffffffff811b732f : ext4_file_write+0xbf/0x260 [kernel]
  0xffffffff8113907a : do_sync_write+0xda/0x120 [kernel]
  0xffffffff8113993e : vfs_write+0xae/0x180 [kernel]
  0xffffffff81139df2 : sys_pwrite64+0xa2/0xb0 [kernel]
  0xffffffff8156a57b : system_call_fastpath+0x16/0x1b [kernel]


 Question: Why does buffered write() sometimes stall? It just writes to kernel buffer and doesn't hit disk.
  1. write() does disk read when needed. To avoid this issue you need to append a file, not overwrite. Or use OS page aligned writes.
  2. write() may be blocked for "stable page writes". To avoid this issue you need to use newer Linux kernel supporting disabling stable page writes.
  3. If you really care about latency, I don't recommend using ext. Use xfs instead.

 Many of the issues can be mitigated by using battery/flash backed write cache on raid controller, but this is not always possible, and battery often expires.
 In the next post, I'll describe about why sync_file_range(SYNC_FILE_RANGE_WRITE) sometimes stalls.

No comments:

Post a Comment