Thursday, March 13, 2014

How sync_file_range() really works

 sync_file_range() is a relatively new, platform-dependent (Linux-only) flushing function. Some databases (not MySQL) use sync_file_range() internally.
  I recently investigated stall issues caused by buffered writes and sync_file_range(). I learned a lot during the investigation, and I don't think these behaviors are widely known, so I summarize my understanding here.

Understanding differences between sync_file_range() and fsync()/fdatasync()

 sync_file_range() has some important behavior differences from fsync().
  • sync_file_range() has a flag to flush to disk asynchronously, while fsync() always flushes to disk synchronously. sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes (async sync_file_range()), and sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes (sync sync_file_range()). With async sync_file_range(), the call *usually* returns very quickly and Linux flushes the pages to disk later. As I describe later, async sync_file_range() is actually not always asynchronous, and is sometimes blocked waiting for writeback. It is also important that I/O errors are not reported to the caller when using async sync_file_range().
  • sync_file_range() lets you specify the file range (starting offset and size) to flush to disk. fsync() always flushes all dirty pages of the file. Ranges are rounded to page boundaries. For example, sync_file_range(fd, 100, 300) flushes offsets 0 to 4096 (page #1), not just the 300 bytes starting at offset 100. This is because the minimum I/O unit is a page.
  • sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing. fsync() waits until both data and metadata have been flushed. fdatasync() skips flushing metadata if the file size does not change (fsync() also skips flushing metadata in that case, depending on the filesystem). sync_file_range() does not wait for metadata flushing even when the file size changes. If a file is appended to rather than overwritten in place, sync_file_range() does not guarantee that the file can be recovered after a crash, while fsync()/fdatasync() do guarantee that.
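The page rounding described in the second bullet can be sketched as plain arithmetic. This is only an illustration of the rounding rule, not kernel code: the names flush_span and page_span are made up for this sketch, and the 4096-byte page size is an assumption (the usual x86 value).

```c
#include <assert.h>
#include <stdint.h>

/* Byte range [start, end) that a page-granular flush would cover.
 * Hypothetical helper type, not part of any kernel or libc API. */
struct flush_span {
    uint64_t start; /* inclusive, multiple of page_size */
    uint64_t end;   /* exclusive, multiple of page_size */
};

/* Round a request of nbytes starting at offset outward to whole pages.
 * page_size is passed explicitly; 4096 is assumed in the usage note. */
static struct flush_span page_span(uint64_t offset, uint64_t nbytes,
                                   uint64_t page_size)
{
    assert(page_size > 0 && nbytes > 0);
    struct flush_span s;
    /* Round the start down to the page that contains it. */
    s.start = (offset / page_size) * page_size;
    /* Round the end up to the next page boundary. */
    s.end = ((offset + nbytes + page_size - 1) / page_size) * page_size;
    return s;
}
```

With a 4096-byte page, page_span(100, 300, 4096) covers bytes 0 to 4096, which is the whole first page, matching the example in the bullet.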
 sync_file_range() behavior highly depends on kernel version and filesystem.
  • xfs does neighbor page flushing, in addition to the specified range. For example, sync_file_range(fd, 8192, 8192) (offset 8192, length 8192) does not only trigger flushing of pages #3 to #4, but also flushes many more dirty pages (e.g. up to page #16). This works very well for HDD because the I/O unit size becomes bigger. In general, synchronously writing 1MB * 1000 times is much faster than writing 4KB * 256,000 times. ext3 and ext4 don't do neighbor page flushing.

  sync_file_range() is generally faster than fsync() because it can limit flushing to specific dirty page ranges and skips waiting for metadata flushing. But sync_file_range() can't be used to guarantee durability, especially when the file size changes.

  sync_file_range() is practical where you don't need full durability but you want to control (reduce) the number of dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write because HBase relies on HDFS, and HDFS can recover from HDFS replicas. Compactions write a huge volume of data, so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() on 1MB * 1000 times periodically gives a more stable workload than flushing 1GB at one time. RocksDB also uses sync_file_range().
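The "flush 1MB at a time instead of 1GB at once" pattern can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not how HBase or RocksDB actually implement it: write_with_periodic_hint and its parameters are hypothetical names, failed hints are only counted (not treated as fatal) because the call gives no durability guarantee anyway, and EINTR/short-write handling is kept minimal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `total` bytes to fd by repeating a `chunk_len`-byte buffer, and ask
 * the kernel to start writeback each time `sync_every` bytes have piled up.
 * Returns bytes written, or -1 on a write/lseek error. The number of
 * sync_file_range() calls that failed is stored in *failed_hints. */
static ssize_t write_with_periodic_hint(int fd, const char *chunk,
                                        size_t chunk_len, size_t total,
                                        size_t sync_every, int *failed_hints)
{
    assert(chunk_len > 0 && sync_every > 0);
    off_t pos = lseek(fd, 0, SEEK_CUR);   /* where the next write lands */
    if (pos < 0)
        return -1;
    off_t unsynced_from = pos;            /* start of the not-yet-hinted range */
    size_t done = 0;
    *failed_hints = 0;

    while (done < total) {
        size_t want = total - done < chunk_len ? total - done : chunk_len;
        ssize_t n = write(fd, chunk, want);
        if (n <= 0)
            return -1;
        done += (size_t)n;
        pos += n;

        if ((size_t)(pos - unsynced_from) >= sync_every) {
            /* Async hint: start writeback for the accumulated range only.
             * No WAIT_AFTER flag, so no durability is implied. */
            if (sync_file_range(fd, unsynced_from, pos - unsynced_from,
                                SYNC_FILE_RANGE_WRITE) != 0)
                (*failed_hints)++;
            unsynced_from = pos;
        }
    }
    return (ssize_t)done;
}
```

A caller writing a large compaction output might pass sync_every = 1MB. If the file must survive a crash, an fsync() or fdatasync() is still needed at the end, since the hints above do not wait for data or metadata.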

Async sync_file_range is not always asynchronous

 Sometimes you might want to flush pages/files earlier than the kernel threads (bdflush) would, in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both take a long time (~10ms) on HDD if the RAID write cache is disabled. You probably don't want to execute them from a user-facing thread.
 How about using async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from a user-facing thread? It is not supposed to wait for I/O, so latency should be minimal. But I don't recommend using sync_file_range() from a user-facing thread like that. It is actually not always asynchronous, and there are many cases where it spends time waiting for disk I/O.
  Below are a couple of examples where async sync_file_range() takes a long time. In the following examples, I assume stable page writes are already disabled.

Stall Example 1: Small range sync_file_range()

single thread
  // requires #define _GNU_SOURCE and #include <fcntl.h>, <unistd.h>
  fd = open("aaa.dat", O_WRONLY);
  for (i = 0; i < 100000; i++) {
    write(fd, buf, 1000); // 1000 bytes, so not a page-aligned write
    sync_file_range(fd, i * 1000, 1000, SYNC_FILE_RANGE_WRITE); // async
  }
 In example 1, with the stable page write fix, write() won't wait for dirty pages to be written to disk (writeback). But sync_file_range() actually waits for writeback.
 When stable page write is disabled, it is possible for a page to be both under writeback and marked dirty. Below is an example scenario.
1. write() -> marks page 1 dirty
2. sync_file_range(SYNC_FILE_RANGE_WRITE) -> sends a writeback request for page 1
3. write() -> marks page 1 dirty again (does not wait, thanks to the stable page write fix)
4. sync_file_range(SYNC_FILE_RANGE_WRITE) -> waits until page 1 is written back

 In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until the flush to disk triggered by the first sync_file_range() is done, which may take tens of milliseconds.
 Here is an example stack trace taken while sync_file_range() was blocked.
                sleep_on_page
                __wait_on_bit
                wait_on_page_bit
                write_cache_pages
                generic_writepages
                xfs_vm_writepages
                do_writepages
                __filemap_fdatawrite_range
                filemap_fdatawrite_range
                SyS_sync_file_range
                tracesys
                sync_file_range
                __libc_start_main
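One way to observe these stalls from user space is to time each async call. The sketch below repeats the Example 1 pattern and counts calls slower than a threshold. The function names are made up for illustration, it assumes fd refers to an empty file opened for writing with the file position at 0, and the 10ms threshold in the usage note is just an assumed starting point.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Issue one async writeback hint and report how long the call itself took.
 * Returns elapsed microseconds (>= 0), or -1 if sync_file_range() failed. */
static int64_t timed_async_flush_us(int fd, off_t offset, off_t nbytes)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (rc != 0)
        return -1;
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000 +
           (t1.tv_nsec - t0.tv_nsec) / 1000;
}

/* Repeat Example 1's pattern: an unaligned 1000-byte write followed by an
 * async hint on the same range. Returns how many hints took longer than
 * threshold_us, or -1 on a write or sync_file_range() error. */
static int count_slow_hints(int fd, int iterations, int64_t threshold_us)
{
    assert(iterations >= 0);
    char buf[1000] = {0};
    int slow = 0;
    for (int i = 0; i < iterations; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            return -1;
        int64_t us = timed_async_flush_us(fd, (off_t)i * 1000, 1000);
        if (us < 0)
            return -1;
        if (us > threshold_us)
            slow++;
    }
    return slow;
}
```

For instance, count_slow_hints(fd, 100000, 10000) would report how many of the 100,000 calls blocked for more than 10ms. The result depends heavily on the filesystem, kernel version, and how busy the disk is.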

Stall example 2: Bulk sync_file_range()

 What happens if you call write() multiple times and then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? The example below calls write() 21 times and then triggers a flush with sync_file_range().
  fd = open("aaa.dat", O_WRONLY);
  for (i = 0; i < 21; i++) {
    write(fd, buf, 1000); // not a page-aligned write; dirties pages 1~6
  }
  sync_file_range(fd, 0, 16384, SYNC_FILE_RANGE_WRITE); // pages 1~4
  for (i = 21; i < 41; i++) {
    write(fd, buf, 1000); // dirties pages 6~11
  }
  sync_file_range(fd, 16384, 16384, SYNC_FILE_RANGE_WRITE); // offset, length: pages 5~8
 Unfortunately, sync_file_range() may take time in this case too. It works as follows on xfs. Since xfs does neighbor page flushing via sync_file_range(), a page can be both under writeback and marked dirty.
1. write -> pages 1~6 become dirty
2. sync_file_range (pages 1~4) -> triggers flushing of pages 1~4 and also pages 5, 6 (in xfs)
3. write -> pages 6~11 become dirty
4. sync_file_range (pages 5~8) -> waits for page 6 to be flushed to disk

 Note that if the write volume (and overall disk busy rate) is sufficiently lower than what the disk can sustain, page 6 should already be flushed to disk before the second sync_file_range() starts. In that case it shouldn't wait for anything.
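The page numbers in the steps above follow from simple arithmetic, which can be checked with a small sketch. It assumes 4096-byte pages and counts pages from 1 as this post does; page_of, page_range, and pages_dirtied are names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based page number containing byte `pos` (the post counts pages from 1). */
static uint64_t page_of(uint64_t pos, uint64_t page_size)
{
    assert(page_size > 0);
    return pos / page_size + 1;
}

/* First and last 1-based page in a range. Hypothetical helper type. */
struct page_range {
    uint64_t first;
    uint64_t last;
};

/* Pages touched by `count` back-to-back writes of `write_size` bytes
 * starting at byte `start`. */
static struct page_range pages_dirtied(uint64_t start, uint64_t write_size,
                                       uint64_t count, uint64_t page_size)
{
    assert(write_size > 0 && count > 0);
    struct page_range r;
    r.first = page_of(start, page_size);
    /* The last byte written is at start + write_size * count - 1. */
    r.last = page_of(start + write_size * count - 1, page_size);
    return r;
}
```

The first 21 writes of 1000 bytes cover bytes 0 to 20999, so pages_dirtied(0, 1000, 21, 4096) gives pages 1 to 6. The next 20 writes start at byte 21000, so pages_dirtied(21000, 1000, 20, 4096) gives pages 6 to 11. Page 6 appears in both ranges because the writes are not page-aligned, and that shared page is the one the second sync_file_range() can end up waiting on.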

Stall example 3: Aligned page writes

 The main reason async sync_file_range() was blocked in the previous examples is that write() was not aligned to the page size. What if we do fully aligned page writes (writing multiples of 4KB)?
 With aligned page writes, async sync_file_range() does not wait in the way shown in Examples 1 and 2, and gives much better throughput. But even with aligned page writes, async sync_file_range() sometimes waits for disk I/O.
 sync_file_range() submits page write I/O requests to disks. If there are many outstanding read/write I/O requests in a disk queue, new I/O requests are blocked until a free slot becomes available in the queue. This blocks sync_file_range() too.
 The queue size is controlled by /sys/block/sdX/queue/nr_requests. You may increase it to a larger value.
echo 1024 > /sys/block/sda/queue/nr_requests
 This mitigates stalls in sync_file_range() on busy disks, but it won't solve the problem entirely. If you submit many more write I/O requests, read requests take more time to be served (write-starving-reads), which very negatively affects user-facing query latency.

Solution for the stalls

 Make sure to use a Linux kernel that supports disabling stable page writes. Otherwise write() would be blocked. My previous post covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to be asynchronous, but it is actually blocked for writeback in many patterns, so calling sync_file_range() from a user-facing thread is not recommended if you really care about latency. Calling sync_file_range() from a background (not user-facing) thread is a better solution here.
 Buffered writes and sync_file_range() are important for some databases like HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well-known best practices. HLog writes are buffered, and are not flushed to disk per write (put operation). Some HBase/Hadoop distributions support sync_file_range() to reduce outstanding dirty pages. From the operating system's point of view, HLog files are appended, and the file size is not small (64MB by default). This means all HLog writes go to a single disk in a JBOD configuration, so that single disk tends to be overloaded. An overloaded disk takes longer to flush dirty pages (via sync_file_range or bdflush), which may block further sync_file_range() calls. To get better latency, it is important to use a Linux kernel that supports disabling stable page writes, and to call sync_file_range() from background threads (not from user-facing threads).


Comments:

liang xie said...

Nice post! Yoshinori, could you share your SystemTap script for grabbing the kernel stack trace? I modified watchdog.stp and tried to get the kernel stack trace when sys_write takes >= xx ms, but got nothing :) Thanks in advance!

Yoshinori Matsunobu said...

@liang, Here is an example.

global PROCNAME = "benchmark"
global i=0
global tb=0
global te=0
probe kernel.function("sync_file_range") {
  if(execname() == PROCNAME) {
    tb=gettimeofday_us();
  }
}
probe kernel.function("sync_file_range").return {
  if(execname() == PROCNAME) {
    te=gettimeofday_us();
    if(te-tb > 10000) {
      i++;
      printf("time_us: %d\n", te-tb);
      print_backtrace();
      if(i > 5) {
        exit();
      }
    }
  }
}
