Monday, March 10, 2014

Why buffered writes are sometimes stalled

 Many people think a buffered write (write()/pwrite()) is fast because it does not access the disk. This is not always true. A buffered write sometimes reads from disk itself, or waits for disk I/O issued by other threads. Here are three common cases where write() takes longer than expected (that is, it stalls).

1. Read Modify Write

Consider the following logic: open aaa.dat without O_DIRECT/O_SYNC, write 1000 bytes sequentially 100,000 times, then flush with fsync().
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 100000; i++) {
    write(fd, buf, 1000);
  }
  fsync(fd);
 You might expect each write() to finish quickly (well under 0.1ms) because it shouldn't do any disk access. But that is not always the case.
 The operating system manages I/O in pages, which are 4KB in most Linux environments. To modify 1000 bytes of a 4KB page starting at offset 0, Linux first needs the whole 4KB page in memory, then modifies the first 1000 bytes, and sooner or later writes the page back to disk. So yes, reading the page is needed. This is called RMW (Read Modify Write). If the page is not in the filesystem cache (page cache), it must be read from disk, which may take tens of milliseconds on an HDD.
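 To make the rule concrete, here is a minimal sketch that counts how many pages a given write covers only partially. The function name partial_pages and the fixed 4096-byte page size are my assumptions for illustration. It only applies to pages that already exist in the file; pages newly allocated by appending do not need a read.

```c
#include <stddef.h>
#include <sys/types.h>

#define PAGE_SIZE_BYTES 4096 /* assumed page size; real code can ask sysconf(_SC_PAGESIZE) */

/* Returns the number of pages that a write of `len` bytes at `offset`
 * covers only partially (0, 1 or 2). Each such page must already be in the
 * page cache, or be read from disk, before it can be modified. */
int partial_pages(off_t offset, size_t len)
{
    if (len == 0)
        return 0;

    off_t end = offset + (off_t)len;              /* one past the last byte written */
    off_t first_page = offset / PAGE_SIZE_BYTES;
    off_t last_page = (end - 1) / PAGE_SIZE_BYTES;
    int head_partial = (offset % PAGE_SIZE_BYTES) != 0; /* write starts mid-page */
    int tail_partial = (end % PAGE_SIZE_BYTES) != 0;    /* write ends mid-page */

    if (first_page == last_page)
        return head_partial || tail_partial;      /* a single page, count it once */
    return head_partial + tail_partial;
}
```

 A 1000-byte write at offset 0 touches one partial page, while a 4096-byte write at offset 0 touches none, so only the former can trigger a disk read.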
 This problem often happens when overwriting large files. You can easily reproduce it with the following steps.
  • 1. Create a large file (it will be cached in the fs cache):
  dd if=/dev/zero of=aaa bs=4096 count=1000000
  • 2. Drop the file from the cache (e.g. sync; echo 3 > /proc/sys/vm/drop_caches). Run sync first, because drop_caches only frees clean pages.
  • 3. Write to the file using write()/pwrite(). The target page is not in the fs cache, so it is read from disk. You can verify this with iostat.
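 For step 3, a small timing helper is enough to see the stalls. The following is a sketch under my own assumptions: the names now_ns and slowest_write_ns are hypothetical, and the file must already exist, because the point is to overwrite existing pages.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Overwrites an existing file from offset 0 with `count` sequential writes
 * of `chunk` bytes (1..4096) and returns the slowest write() in nanoseconds,
 * or -1 on error. O_CREAT is deliberately not used. */
int64_t slowest_write_ns(const char *path, size_t chunk, int count)
{
    char buf[4096];
    if (chunk == 0 || chunk > sizeof(buf))
        return -1;
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    int64_t worst = 0;
    for (int i = 0; i < count; i++) {
        int64_t t1 = now_ns();
        ssize_t n = write(fd, buf, chunk);
        int64_t t2 = now_ns();
        if (n != (ssize_t)chunk) {   /* treat errors and short writes as failure */
            close(fd);
            return -1;
        }
        if (t2 - t1 > worst)
            worst = t2 - t1;
    }
    close(fd);
    return worst;
}
```

 Calling slowest_write_ns("aaa", 1000, 100000) right after dropping the cache, and then again with a chunk size of 4096, should show the difference between unaligned and aligned overwrites on an HDD.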
 There are several ways to avoid slow Read Modify Write.

Appending a file, not updating in place

 When appending to a file, the newly allocated pages are always in the cache, so the slow Read Modify Write issue doesn't happen.
 In MySQL, binary logs are appended, not overwritten. InnoDB log files are always overwritten in place, so this workaround can't be used for them.
 Note that if you need full durability (calling fsync()/fdatasync() after each write()), appending is much more expensive than overwriting on most filesystems, largely because each append also changes the file size, so file metadata has to be flushed as well. It is well known that sync_binlog=1 is very slow in MySQL up to 5.5; the main reasons were that group commit was broken and that appending + fsync() was not fast. In 5.6 group commit was supported, so multi-threaded write throughput improved a lot.

Always cache target files within filesystem cache

 If the target pages are in the filesystem cache (page cache), write() doesn't trigger disk reads.
The obvious disadvantage is that this approach wastes memory, and RAM is expensive. If you have 128GB of InnoDB log files, you won't want to dedicate 128GB of RAM to them. That RAM should go to the InnoDB buffer pool.
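 One naive way to get a file into the page cache is simply to read it from start to end. The sketch below does that; prewarm_file is a hypothetical helper name, and nothing here pins the pages, so the kernel may still evict them under memory pressure.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads the whole file once so that its pages are loaded into the page cache.
 * Returns the number of bytes read, or -1 on error. */
long long prewarm_file(const char *path)
{
    static char buf[1 << 20];   /* 1MB scratch buffer; the contents are discarded */

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;  /* n == 0 means end of file was reached */
}
```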

Aligning write() I/O unit size by 4KB multiple

 If you always write in units aligned to the Linux page size (multiples of 4KB, at 4KB-aligned offsets), the Read Modify Write issue can be avoided.
 One way to do aligned writes is zero-filling. Below is an example.
  memset(buf, 0, 4096); // zerofill
  memcpy(buf, data1, 1000); // set application data (1000 bytes)
  pwrite(fd, buf, 4096, 0); // write 4KB at offset 0

  memset(buf2, 0, 4096);
  memcpy(buf2, data2, 1000);
  pwrite(fd, buf2, 4096, 4096); // write 4KB at offset 4096
  ....
 In this example, you write 1000 bytes of application data twice, but actually write 4KB twice, 8KB in total. 8192-2000=6192 bytes are zero-filled padding. No disk reads happen with this approach, even if the target pages are not in the filesystem cache.
 The downside is that this approach needs more space. In the above case 6192 bytes are wasted.
 Another approach is to keep track of the application data offset, but still write in 4KB-aligned units.
  memset(buf, 0, 4096); // zerofill
  memcpy(buf, data1, 1000); // set application data (1000 bytes)
  pwrite(fd, buf, 4096, 0);

  // buf still holds data1; do not zero-fill again, or data1 would be overwritten with zeros on disk
  memcpy(buf+1000, data2, 1000); // set next application data from real offset
  pwrite(fd, buf, 4096, 0); // rewrite the same 4KB page
  ....
This approach doesn't waste space.
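 The idea can be generalized by keeping the current 4KB block in memory and rewriting the whole block on every append. The following is only a sketch of that idea, not how InnoDB or any MySQL fork implements it. The struct and function names are hypothetical, and for simplicity the writer always starts at offset 0 and treats a short pwrite() as an error.

```c
#define _POSIX_C_SOURCE 200809L /* for pwrite() */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_BLOCK_SIZE 4096 /* assumed page size */

/* Hypothetical writer that only ever issues whole, 4KB-aligned block writes. */
struct aligned_writer {
    int fd;
    off_t block_start;          /* file offset of the block held in `block` */
    size_t used;                /* bytes of application data already in `block` */
    char block[LOG_BLOCK_SIZE]; /* in-memory copy of the current block, zero-padded */
};

/* Opens (or creates) the file and starts writing at offset 0. */
int aligned_writer_open(struct aligned_writer *w, const char *path)
{
    w->fd = open(path, O_RDWR | O_CREAT, 0644);
    if (w->fd < 0)
        return -1;
    w->block_start = 0;
    w->used = 0;
    memset(w->block, 0, LOG_BLOCK_SIZE);
    return 0;
}

/* Appends `len` bytes of application data. Every touched block is written
 * in full from memory, so the kernel never needs to read it from disk.
 * Returns 0 on success, -1 on error. */
int aligned_writer_append(struct aligned_writer *w, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        size_t n = LOG_BLOCK_SIZE - w->used;   /* free space in the current block */
        if (n > len)
            n = len;
        memcpy(w->block + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;

        if (pwrite(w->fd, w->block, LOG_BLOCK_SIZE, w->block_start) != LOG_BLOCK_SIZE)
            return -1;

        if (w->used == LOG_BLOCK_SIZE) {       /* block is full: start a fresh one */
            w->block_start += LOG_BLOCK_SIZE;
            w->used = 0;
            memset(w->block, 0, LOG_BLOCK_SIZE);
        }
    }
    return 0;
}
```

 Appending two 1000-byte records this way issues two 4KB writes to offset 0, and the file ends up 4096 bytes long with the records stored back to back.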
InnoDB's log file write unit is not aligned to 4KB, so the Read Modify Write issue exists. Some MySQL forks fix the problem by writing 4KB-aligned data. Facebook MySQL takes the latter approach, and we use it in production. Percona Server also supports aligned writes, but the feature is described as beta quality.

2. write() may be blocked for "stable page writes"

Dirty pages

 Most writes are not flushed to disk immediately. write()/pwrite() write to the Linux page cache and mark the pages dirty, unless the target file is opened with O_DIRECT/O_SYNC. write() and pwrite() are basically the same, except that pwrite() takes an explicit file offset. In this post, write() means a buffered write that dirties pages.
 Dirty pages are sooner or later flushed to disk. This is triggered in several ways, such as by the kernel's background flusher (bdflush), fsync(), and sync_file_range(). Flushing behavior depends heavily on the filesystem. I consider only XFS here.
 While a dirty page is being written to disk, write() to the same page is blocked until the flush completes. This is called "Stable Page Write". It may cause write() stalls, especially on slower disks. Without a write cache, flushing to disk usually takes ~10ms, and ~100ms in bad cases.

 Consider the following example. There are two clients. One writes 10 bytes repeatedly via write(); the other calls fsync() on the same file.
Thread 1:
  fd=open("aaa.dat", O_WRONLY);
  while(1) {
    t1 = get_timestamp(); // pseudocode: any high-resolution clock
    write(fd, buf, 10);
    t2 = get_timestamp(); // t2-t1 is the write() latency
  }

Thread 2:
  fd=open("aaa.dat", O_WRONLY);
  while(1) {
    fsync(fd);
    sleep(1);
  }
 If you run this on a slower disk (an HDD with write cache disabled), you may notice that write() (t2-t1) sometimes takes more than 10ms. It is expected that fsync() takes time, because fsync() flushes dirty pages and metadata to disk, but the buffered write also sometimes takes time because of stable page writes.
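 A bounded, runnable variant of this experiment might look like the sketch below. It is my own adaptation: the two clients are separate processes (fork()) rather than threads, the loop count and fsync interval are parameters, and the names monotonic_ns and slowest_write_during_fsync are hypothetical. The file must already exist.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
int64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Child process: fsync() the file every `interval_ms` until it is killed. */
static void fsync_loop(const char *path, long interval_ms)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        _exit(1);
    struct timespec delay = { interval_ms / 1000, (interval_ms % 1000) * 1000000L };
    for (;;) {
        fsync(fd);
        nanosleep(&delay, NULL);
    }
}

/* Writes 10 bytes `iterations` times while a child process keeps calling
 * fsync() on the same file. Returns the slowest write() in nanoseconds,
 * or -1 on error. */
int64_t slowest_write_during_fsync(const char *path, int iterations, long interval_ms)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        close(fd);
        fsync_loop(path, interval_ms);  /* loops until killed by the parent */
        _exit(0);
    }

    char buf[10] = "123456789";
    int64_t worst = 0;
    int failed = 0;
    for (int i = 0; i < iterations; i++) {
        int64_t t1 = monotonic_ns();
        ssize_t n = write(fd, buf, sizeof(buf));
        int64_t t2 = monotonic_ns();
        if (n != (ssize_t)sizeof(buf)) {
            failed = 1;
            break;
        }
        if (t2 - t1 > worst)
            worst = t2 - t1;
    }

    kill(child, SIGKILL);       /* stop the fsync loop */
    waitpid(child, NULL, 0);
    close(fd);
    return failed ? -1 : worst;
}
```

 With a large iteration count and interval_ms set to 1000, this matches the pseudocode above; whether you actually see stalls depends on the disk, the filesystem, and the kernel version.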

 Another annoying issue is that while write() is blocked, it holds an exclusive inode mutex. This blocks all writes and disk reads to/from the same file (all pages within the same file).

Disabling Stable Page Writes

 Fortunately, there is a patch to disable Stable Page Writes, and some Linux distributions support it.
With this patch, on most filesystems write() no longer waits for dirty page writeback. This helps avoid write() latency spikes.

3. Waiting for journal block allocation in ext3/4

 If you are using ext3 or ext4, you may still suffer from occasional write() stalls, even after disabling stable page writes. This happens when write() waits for journal block allocation. An example stack trace is shown below. An easier workaround is to use xfs, which does not have such stalls.

  Returning from: 0xffffffff81167660 : __wait_on_buffer+0x0/0x30 [kernel]
  Returning to : 0xffffffff811ff920 : jbd2_log_do_checkpoint+0x490/0x4b0 [kernel]
  0xffffffff811ff9bf : __jbd2_log_wait_for_space+0x7f/0x190 [kernel]
  0xffffffff811fab00 : start_this_handle+0x3b0/0x4e0 [kernel]
  0xffffffff811faceb : jbd2__journal_start+0xbb/0x100 [kernel]
  0xffffffff811fad3e : jbd2_journal_start+0xe/0x10 [kernel]
  0xffffffff811db37e : ext4_journal_start_sb+0x7e/0x1d0 [kernel]
  0xffffffff811bd757 : ext4_da_write_begin+0x77/0x210 [kernel]
  0xffffffff810deaea : generic_file_buffered_write+0x10a/0x290 [kernel]
  0xffffffff810e0171 : __generic_file_aio_write+0x231/0x440 [kernel]
  0xffffffff810e03ed : generic_file_aio_write+0x6d/0xe0 [kernel]
  0xffffffff811b732f : ext4_file_write+0xbf/0x260 [kernel]
  0xffffffff8113907a : do_sync_write+0xda/0x120 [kernel]
  0xffffffff8113993e : vfs_write+0xae/0x180 [kernel]
  0xffffffff81139df2 : sys_pwrite64+0xa2/0xb0 [kernel]
  0xffffffff8156a57b : system_call_fastpath+0x16/0x1b [kernel]

Summary

 Question: Why does buffered write() sometimes stall? It just writes to a kernel buffer and doesn't hit the disk.
 Answer:
  1. write() does a disk read when needed. To avoid this, append to a file instead of overwriting, or use OS page aligned writes.
  2. write() may be blocked by "stable page writes". To avoid this, use a newer Linux kernel that supports disabling stable page writes.
  3. write() may wait for journal block allocation on ext3/4. If you really care about latency, I don't recommend using ext. Use xfs instead.

 Many of these issues can be mitigated by using a battery/flash backed write cache on the RAID controller, but this is not always possible, and batteries often expire.
 In the next post, I'll describe why sync_file_range(SYNC_FILE_RANGE_WRITE) sometimes stalls.
