Thursday, March 13, 2014

How sync_file_range() really works

 There is a relatively new and platform dependent flushing function called sync_file_range(). Some databases (not MySQL) use sync_file_range() internally.
  Recently I investigated stall issues caused by buffered write and sync_file_range(). I learned a lot during investigation but I don't think these behaviors are well known to the public. Here I summarize my understandings.

Understanding differences between sync_file_range() and fsync()/fdatasync()

 sync_file_range() has some important behavior differences from fsync().
  • sync_file_range() has a flag to flush to disk asynchronously. fsync() always flushes to disk synchronously. sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes (async sync_file_range()), sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes (sync sync_file_range()). With async sync_file_range(), you can *usually* call sync_file_range() very quickly and let Linux flush pages to disk later. As I describe later, async sync_file_range() is actually not always asynchronous, and is sometimes blocked for writeback. It is also important that I/O errors can't be notified when using async sync_file_range().
  • sync_file_range() allows to set file ranges (starting offset and size) to flush to disk. fsync() always flushes all dirty pages of the file. Ranges are rounded to page unit size. For example, sync_file_range(fd, 100, 300) will flush from offset 0 to 4096 (flushing page#1), not limited from offset 100 to 300. This is because minimum I/O unit is page.
  • sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing. fsync() waits until flushing both data and metadata are done. fdatasync() skips flushing metadata if file size does not change (fsync() also skips flushing metadata in that case, depending on filesystem). sync_file_range() does not wait metadata flushing even though file size changes. If a file is not overwritten (=appended), sync_file_range() does not guarantee the file can be recovered after crash, while fsync()/fdatasync() guarantee that.
 sync_file_range() behavior highly depends on kernel version and filesystem.
  • xfs does neighbor page flushing, in addition to specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing page #3 to #4, but also flushing many more dirty pages (i.e. up to page#16). This works very well for HDD because I/O unit size becomes bigger. In general, synchronously writing 1MB * 1000 times is much faster than writing 4KB * 256,000 times. ext3 and ext4 don't do neighbor page flushing.

  sync_file_range() is generally faster than fsync() because it can control dirty page ranges and skips waiting for metadata flushing. But sync_file_range() can't be used for guaranteeing durability, especially when file size changes.

  Practical usage of the sync_file_range() is where you don't need full durability but you want to control(reduce) dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write because HBase relies on HDFS and HDFS can recover from HDFS replicas. Compactions write huge volume of data so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() 1MB * 1000 times periodically gives more stable workloads than flushing 1GB at one time. RocksDB also uses sync_file_range().

Async sync_file_range is not always asynchronous

 Sometimes you might want to flush pages/files more earlier than relying on kernel threads (bdflush), in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both takes longer time (~10ms) on HDD if RAID write cache is disabled. You probably don't want to execute from user-facing thread.
 How about using async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from user-facing thread? It's supposed not to wait for i/o, so latency should be minimal. But I don't recommend using sync_file_range() from user facing thread like that. This is actually not always asynchronous, and there are many cases it takes time for waiting for disk i/o.
  I'm showing a couple of examples where async sync_file_range() takes longer time. In the following examples, I assume stable page writes are already disabled.

Stall Example 1: Small range sync_file_range()

single thread
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 100000; i++) {
    write(fd, buf, 1000); // not aligned page write
    sync_file_range(fd, i*1000, 1000, SYNC_FILE_RANGE_WRITE); // async
 In example 1, with stable page write fix, write() won't wait for dirty pages written to disk(writeback). But sync_file_range() actually waits for writeback.
 When stable page write is disabled, there is a possibility that a page is both writeback in progress and marked dirty. Below is an example scenario.
1. write() -> marking page 1 dirty
2. sync_file_range(SYNC_FILE_RANGE_WRITE) -> sending writeback request on page 1
3. write() -> making page 1 dirty (not waiting with stable page write fix)
4. sync_file_range(SYNC_FILE_RANGE_WRITE) -> waiting until page 1 is written back

 In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until flushing to disk triggered by the first sync_file_range() is done, which may take tens of milliseconds.
 Here is an example stack trace when sync_file_range() is blocked.

Stall example 2: Bulk sync_file_range()

 What happens if calling write() multiple times then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? In below example, calling write() 21 times then triggering flush by sync_file_range().
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 21; i++) {
    write(fd, buf, 1000); // not aligned page write
  sync_file_range(fd, 0, 16384, SYNC_FILE_RANGE_WRITE);
  for(i=22; i< 42; i++) {
    write(fd, buf, 1000);
  sync_file_range(fd, 16384, 32768, SYNC_FILE_RANGE_WRITE);
 Unfortunately, sync_file_range() also may take time in this case too. It works as below in xfs. Since xfs does neighbor page flushing via sync_file_range(), there is a possibility that a page is both under writeback in progress and marked dirty.
1. write -> page 1~6 become dirty
2. sync_file_range (page 1~4) -> triggering page 1~4 and 5, 6 for flushing (in xfs)
3. write -> page 6~11 become dirty
4. sync_file_range (page 5..8) -> waiting for page 6 to be flushed to disk

 Note that if write volume (and overall disk busy rate) is lower enough than disk speed, page 6 should be flushed to disk before starting second sync_file_range(). In that case it shouldn't wait anything.

Stall example 3: Aligned page writes

 The main reason why async sync_file_range() was blocked is that write() was not aligned by page size. What if we are doing fully aligned page write (writing 4KB multiple)?
 With aligned page write, async sync_file_range() does not wait shown at Example 1 and 2, and gives much better throughput. But, even with aligned page write, sometimes async sync_file_range() waits for disk i/o.
 sync_file_range() submits page write i/o requests to disks. If there are many outstanding i/o read/write requests in a disk queue, new i/o requests are blocked until there is a free slot available in the queue. This blocks sync_file_range() too.
 Queue size is managed under /sys/block/sdX/queue/nr_requests. You may increase to larger values.
echo 1024 > /sys/block/sda/queue/nr_requests
 This mitigates stalls at sync_file_range() on busy disks. But this won't solve problems entirely. If you submit many more write i/o requests, read requests take more time to serve (write-starving-reads) which very negatively affects user-facing query latency.

Solution for the stalls

 Make sure use Linux kernels supporting disabling stable page write. Otherwise write() would be blocked. My previous post covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to by asynchronous, but is actually blocked for writeback in many patterns, so it's not recommended calling sync_file_range() from user-facing thread, if you really care about latency. Calling sync_file_range() from a background (not user-facing) thread would be better solution here.
 Buffered write and sync_file_range() are important for some databases like HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well known best practices. HLog writes are buffered, and not flushed to disk per write(put operation). There are some HBase/Hadoop distributions supporting sync_file_range() to reduce outstanding dirty pages. From Operating System point of view, HLog files are appended, and file size is not small (64MB by default). This means all HLog writes go to a single disk with JBOD configuration, which means the single disk tends to be overloaded. An overloaded disk takes longer time for flushing dirty pages (via sync_file_range or bdflush), which may block further sync_file_range(). To get better latency, using Linux Kernel supporting to disable stable page write, and calling sync_file_range() from background threads (not from user-facing thread) are important.


liang xie said...

Nice post ! Yoshinori, could you share your SystemTap script to grab the kernel stack trace? I modified watchdog.stp then tried to get the kernel stack trace if sys_write cost >= xx ms be found, but got nothing:) Thanks in advance!

Yoshinori Matsunobu said...

@liang, Here is an example.

global PROCNAME = "benchmark"
global i=0
global tb=0
global te=0
probe kernel.function("sync_file_range") {
if(execname() == PROCNAME) {
probe kernel.function("sync_file_range").return {
if(execname() == PROCNAME) {
if(te-tb > 10000) {
printf("time_us: %d\n", te-tb);
if(i > 5) {

mim shams said...

All branches, with the exception of Mira Mesa, offer check cashing, cash orders, utility bill payment, sell and loading Netspend Pre-Paid Debit Cards, offer Direct Deposit on to your Netspend Pre-Paid personal line payday loans direct lender costa mesa of credit, get gift cards, and choose stores offer official Services. The downtown metropolis Branch on exchanges foreign currency.

md sukria said...

This is the foremost instant very have} glimpsed you’re secured and do love declared you – it's very persuading to seem at that i'm appreciative for your toil gift card mall. however if you presumably did it in associate exceedingly} extremely easy procedure that's ready to be extremely gracious. but over all i actually elective you and positive will keep for additional mails like this. articulate feeling you most.

Sohidul Islam said...

This is the first instant I even have glimpsed your content and do would like to provide notice you – it's extremely nice to glimpse which i appreciate your exertions. but if you most likely did it throughout a simple methodology which will be terribly pleasant payday loans. but over all I extraordinarily prompt you and certain will delay for lots of mails like this. several thanks most.

md sukria said...

To start with the foremost necessary line of my statement – I do yearn to produce a vast as a results of the periodical esteem. terribly it's pleasing laurels wondrous work by him that i detected out laurels dependable facilitate by his/her Brobdingnagian detail and figures. I merely have to be compelled to be compelled to be compelled to be compelled to broadcast car title loans in montgomery, Brobdingnagian delight comprise it up your work. usually I’ll sit down at the facet of your posting and alter. esteem ahead to your a allotment of mails.

stephane hooks said...

I simply have a glimpse here and feel pleasant to hunt out out this journal. flush content writing hand and intensely cooperative science system. i might like most of we've a inclination to organization quadrilateral assess to appear out these types of things, here we've got got got a bent to tend to tend to face persist the brink of acknowledge everything payday loans chicago. i am with the content respect associate degreed do respect him as associate honest provider. Thanks for your diligence and you too.

Post a Comment