Thursday, March 13, 2014

How sync_file_range() really works

 There is a relatively new, platform-dependent flushing function called sync_file_range(). Some databases (not MySQL) use sync_file_range() internally.
  Recently I investigated stall issues caused by buffered writes and sync_file_range(). I learned a lot during the investigation, but I don't think these behaviors are well known to the public. Here I summarize my understanding.

Understanding differences between sync_file_range() and fsync()/fdatasync()

 sync_file_range() has some important behavioral differences from fsync().
  • sync_file_range() has a flag to flush to disk asynchronously, while fsync() always flushes to disk synchronously. sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes ("async sync_file_range()"), and sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes ("sync sync_file_range()"). With async sync_file_range(), you can *usually* call sync_file_range() very quickly and let Linux flush the pages to disk later. As I describe later, async sync_file_range() is actually not always asynchronous, and is sometimes blocked waiting for writeback. It is also important that I/O errors can't be reported back to the caller with async sync_file_range(). (See the sketch after this list.)
  • sync_file_range() lets you specify a file range (starting offset and size) to flush to disk, while fsync() always flushes all dirty pages of the file. Ranges are rounded outward to page boundaries. For example, sync_file_range(fd, 100, 300) flushes from offset 0 to 4096 (all of page #1), not just bytes 100 through 400, because the minimum I/O unit is a page.
  • sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing, while fsync() waits until both data and metadata are flushed. fdatasync() skips flushing metadata if the file size does not change (fsync() may also skip it in that case, depending on the filesystem). sync_file_range() does not wait for metadata flushing even when the file size changes. So if a file is being extended (appended to, rather than overwritten in place), sync_file_range() does not guarantee the data can be recovered after a crash, while fsync()/fdatasync() do guarantee it.
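 To make the two modes concrete, here is a minimal sketch (fd and the 1MB range are just for illustration; on Linux, sync_file_range() needs _GNU_SOURCE and <fcntl.h>):
  #define _GNU_SOURCE
  #include <fcntl.h>

  /* async: submit the first 1MB of the file for writeback and (usually)
     return immediately; writeback errors are not reported back */
  sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE);

  /* sync: submit the range, then wait until the data (not metadata)
     reaches disk */
  sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);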
 sync_file_range() behavior also depends heavily on the kernel version and filesystem.
  • xfs flushes neighboring dirty pages in addition to the specified range. For example, sync_file_range(fd, 8192, 8192) does not only trigger flushing pages #3 to #4, but may also flush many more dirty pages (up to page #16, for instance). This works very well on HDD because the I/O unit size becomes bigger. In general, synchronously writing 1MB 1,000 times is much faster than writing 4KB 256,000 times. ext3 and ext4 don't do neighbor page flushing.

  sync_file_range() is generally faster than fsync() because it can restrict flushing to specific dirty page ranges and skips waiting for metadata flushing. But sync_file_range() can't be used to guarantee durability, especially when the file size changes.
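  If you do need durability for an appended file, here is a hedged illustration of the difference (fd and len are illustrative, and this is not a pattern from any particular database):
  /* starts writeback of the range but leaves new-size metadata unflushed:
     an appended file may not be recoverable after a crash */
  sync_file_range(fd, 0, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

  /* flushes the data and the metadata required to retrieve it (such as the
     new file size): the appended data is recoverable */
  fdatasync(fd);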

  The practical use of sync_file_range() is where you don't need full durability but do want to control (reduce) dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write because it relies on HDFS, and HDFS can recover from its replicas. Compactions write huge volumes of data, so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() periodically, 1MB at a time over 1,000 calls, gives a more stable workload than flushing 1GB at once. RocksDB also uses sync_file_range().
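  As a rough sketch of that pattern (the file name and sizes here are mine for illustration, not taken from HBase or RocksDB):
  static char buf[1 << 20];  /* 1MB chunk */
  int fd = open("compaction.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  /* write 1GB, asking the kernel to start writeback every 1MB instead of
     letting ~1GB of dirty pages accumulate and flush in one burst */
  for (off_t off = 0; off < ((off_t)1 << 30); off += sizeof(buf)) {
    write(fd, buf, sizeof(buf));
    sync_file_range(fd, off, sizeof(buf), SYNC_FILE_RANGE_WRITE);  /* async */
  }
  close(fd);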

Async sync_file_range() is not always asynchronous

 Sometimes you may want to flush pages/files earlier than the kernel writeback threads (bdflush) would, in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both take a long time (~10ms) on HDD if the RAID write cache is disabled. You probably don't want to execute them from a user-facing thread.
 How about calling async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from a user-facing thread? It's not supposed to wait for I/O, so latency should be minimal. But I don't recommend calling sync_file_range() from a user-facing thread like that. It is actually not always asynchronous, and there are many cases where it waits for disk I/O.
  Below are a couple of examples where async sync_file_range() takes a long time. In these examples, I assume stable page writes are already disabled.

Stall Example 1: Small range sync_file_range()

single thread
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 100000; i++) {
    write(fd, buf, 1000); // not aligned page write
    sync_file_range(fd, i*1000, 1000, SYNC_FILE_RANGE_WRITE); // async
  }
 In example 1, with the stable page write fix, write() won't wait for dirty pages to be written back to disk. But sync_file_range() does wait for writeback.
 When stable page writes are disabled, a page can be under writeback and marked dirty again at the same time. Below is an example scenario.
1. write() -> marks page 1 dirty
2. sync_file_range(SYNC_FILE_RANGE_WRITE) -> submits a writeback request for page 1
3. write() -> marks page 1 dirty again (does not wait, thanks to the stable page write fix)
4. sync_file_range(SYNC_FILE_RANGE_WRITE) -> waits until page 1 is written back

 In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until the flush to disk triggered by the first sync_file_range() completes, which may take tens of milliseconds.
 Here is an example stack trace when sync_file_range() is blocked.
                sleep_on_page
                __wait_on_bit
                wait_on_page_bit
                write_cache_pages
                generic_writepages
                xfs_vm_writepages
                do_writepages
                __filemap_fdatawrite_range
                filemap_fdatawrite_range
                SyS_sync_file_range
                tracesys
                sync_file_range
                __libc_start_main

Stall Example 2: Bulk sync_file_range()

 What happens if you call write() multiple times and then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? In the example below, write() is called 21 times, then sync_file_range() triggers a flush, and the pattern repeats.
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 21; i++) {
    write(fd, buf, 1000); // not aligned page write
  }
  sync_file_range(fd, 0, 16384, SYNC_FILE_RANGE_WRITE);
  for(i=22; i< 42; i++) {
    write(fd, buf, 1000);
  }
  sync_file_range(fd, 16384, 32768, SYNC_FILE_RANGE_WRITE);
 Unfortunately, sync_file_range() may take time in this case too. On xfs it works as below. Since xfs flushes neighboring pages via sync_file_range(), a page can be under writeback and marked dirty again at the same time.
1. write() -> pages 1~6 become dirty
2. sync_file_range() (pages 1~4) -> triggers writeback for pages 1~4, plus neighbors 5 and 6 (in xfs)
3. write() -> pages 6~11 become dirty
4. sync_file_range() (pages 5~8) -> waits for page 6 to be flushed to disk

 Note that if the write volume (and the overall disk busy rate) is low enough relative to disk speed, page 6 should already be flushed to disk before the second sync_file_range() starts. In that case it shouldn't have to wait at all.

Stall Example 3: Aligned page writes

 The main reason async sync_file_range() was blocked above is that write() was not aligned to the page size. What if we do fully aligned page writes (writing in 4KB multiples)?
 With aligned page writes, async sync_file_range() does not wait as shown in Examples 1 and 2, and gives much better throughput. But even with aligned page writes, async sync_file_range() sometimes waits for disk I/O.
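 For comparison with Example 1, here is a sketch of the fully aligned variant that avoids the Example 1/2 stalls (the 4KB buffer is illustrative):
  char buf[4096];  /* exactly one page per write() */
  fd = open("aaa.dat", O_WRONLY | O_CREAT, 0644);
  for (i = 0; i < 100000; i++) {
    write(fd, buf, 4096);  /* offsets and sizes are 4KB multiples */
    sync_file_range(fd, (off_t)i * 4096, 4096, SYNC_FILE_RANGE_WRITE);
  }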
 sync_file_range() submits page write I/O requests to the disk. If there are many outstanding read/write requests in the disk queue, new I/O requests are blocked until a free slot becomes available in the queue. This blocks sync_file_range() too.
 The queue size is managed via /sys/block/sdX/queue/nr_requests. You may increase it to a larger value:
echo 1024 > /sys/block/sda/queue/nr_requests
 This mitigates stalls in sync_file_range() on busy disks, but it won't solve the problem entirely. If you submit many more write I/O requests, read requests take longer to serve (writes starving reads), which badly hurts user-facing query latency.

Solution for the stalls

 Make sure to use a Linux kernel that supports disabling stable page writes; otherwise write() itself can be blocked. My previous post covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to be asynchronous, but in many patterns it actually blocks on writeback, so I don't recommend calling sync_file_range() from a user-facing thread if you really care about latency. Calling sync_file_range() from a background (not user-facing) thread is a better solution.
 Buffered writes and sync_file_range() are important for some databases like HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well-known best practices. HLog writes are buffered and not flushed to disk per write (put operation). Some HBase/Hadoop distributions support sync_file_range() to reduce outstanding dirty pages. From the operating system's point of view, HLog files are append-only, and they are not small (64MB by default). This means all HLog writes go to a single disk in a JBOD configuration, so that single disk tends to be overloaded. An overloaded disk takes longer to flush dirty pages (via sync_file_range() or bdflush), which may block further sync_file_range() calls. To get better latency, it is important to use a Linux kernel that supports disabling stable page writes, and to call sync_file_range() from background threads rather than user-facing threads.
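 Here is a minimal sketch of that background-flusher pattern, assuming a single writer thread. All names (user_write, flusher, the 1MB chunk size, the 100ms poll interval) are hypothetical illustrations, not code from any real system.
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <pthread.h>
  #include <unistd.h>

  int fd;                  /* opened elsewhere, e.g. an HLog-style append-only file */
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  off_t written = 0;       /* bytes appended so far by the writer thread */
  off_t flushed = 0;       /* bytes already submitted for writeback */

  /* user-facing thread: buffered write only; never calls sync_file_range() */
  void user_write(const char *buf, size_t len) {
    write(fd, buf, len);
    pthread_mutex_lock(&lock);
    written += len;
    pthread_mutex_unlock(&lock);
  }

  /* background thread: submits dirty ranges in ~1MB chunks; if this blocks on
     writeback or on a full disk queue, no user-facing thread is stalled */
  void *flusher(void *arg) {
    for (;;) {
      pthread_mutex_lock(&lock);
      off_t begin = flushed, end = written;
      pthread_mutex_unlock(&lock);
      if (end - begin >= (1 << 20)) {  /* hypothetical 1MB flush unit */
        sync_file_range(fd, begin, end - begin, SYNC_FILE_RANGE_WRITE);
        pthread_mutex_lock(&lock);
        flushed = end;
        pthread_mutex_unlock(&lock);
      }
      usleep(100 * 1000);  /* hypothetical 100ms poll interval */
    }
    return NULL;
  }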

