Yoshinori Matsunobu's blog: March 2014

Titles in this page

Monday, March 31, 2014

MHA 0.56 is now available

I released MHA version 0.56 today. Downloads are available here. MHA 0.56 includes below features.

Supporting MySQL 5.6 GTID. If GTID and auto position is enabled, MHA automatically does failover with GTID SQL syntax, not using traditional relay log based failover. You don't need any explicit configuration within MHA to use GTID based failover.
Supporting MySQL 5.6 Multi-Threaded slave
Supporting MySQL 5.6 binlog checksum
MHA supports new section [binlogN]. In binlog section, you can define mysqlbinlog streaming servers. When MHA does GTID based failover, MHA checks binlog servers, and if binlog servers are ahead of other slaves, MHA applies differential binlog events from the binlog server to the new master before recovery. When MHA does non-GTID based (traditional) failover, MHA ignores binlog servers. More details can be found on documentation.
Supporting custom mysql and mysqlbinlog location
Adding ping_type=INSERT for checking connectivity for the master. This is useful if master does not accept any writes (i.e. disk error)
Added --orig_master_is_new_slave, --orig_master_ssh_user and --new_master_ssh_user for master_ip_online_change_script
Added --skip_change_master, --skip_disable_read_only, --wait_until_gtid_in_sync on masterha_manager and masterha_master_switch (failover mode).

Thursday, March 27, 2014

Speaking about MySQL5.6 and WebScaleSQL at Percona Live

At Percona Live, Steaphan Greene and I will talk about MySQL 5.6 and WebScaleSQL at Facebook.

2 April 1:20PM - 2:10PM @ Ballroom E
http://www.percona.com/live/mysql-conference-2014/sessions/mysql-56-facebook-2014-edition

In addition to that, I have two more talks this year.

Performance Monitoring at Scale
3 April 2:00PM - 2:50PM @ Ballroom G
http://www.percona.com/live/mysql-conference-2014/sessions/performance-monitoring-scale

Global Transaction ID at Facebook
4 April 12:50PM - 1:40PM @ Ballroom E
http://www.percona.com/live/mysql-conference-2014/sessions/global-transaction-id-facebook

Many people from Facebook speak at Percona Live this year. Please take a look at an interview from Percona to see what we are going to speak.

I assume many of my blog subscribers have already heard about WebScaleSQL that was announced this morning. MySQL Conference in April is the biggest MySQL conference in the world so it's a perfect timing to release something and collaborate with experts. I hope to meet with many people there.

Thursday, March 13, 2014

How sync_file_range() really works

There is a relatively new and platform dependent flushing function called sync_file_range(). Some databases (not MySQL) use sync_file_range() internally.
Recently I investigated stall issues caused by buffered write and sync_file_range(). I learned a lot during investigation but I don't think these behaviors are well known to the public. Here I summarize my understandings.

Understanding differences between sync_file_range() and fsync()/fdatasync()

sync_file_range() has some important behavior differences from fsync().

sync_file_range() has a flag to flush to disk asynchronously. fsync() always flushes to disk synchronously. sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes (async sync_file_range()), sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes (sync sync_file_range()). With async sync_file_range(), you can *usually* call sync_file_range() very quickly and let Linux flush pages to disk later. As I describe later, async sync_file_range() is actually not always asynchronous, and is sometimes blocked for writeback. It is also important that I/O errors can't be notified when using async sync_file_range().
sync_file_range() allows to set file ranges (starting offset and size) to flush to disk. fsync() always flushes all dirty pages of the file. Ranges are rounded to page unit size. For example, sync_file_range(fd, 100, 300) will flush from offset 0 to 4096 (flushing page#1), not limited from offset 100 to 300. This is because minimum I/O unit is page.
sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing. fsync() waits until flushing both data and metadata are done. fdatasync() skips flushing metadata if file size does not change (fsync() also skips flushing metadata in that case, depending on filesystem). sync_file_range() does not wait metadata flushing even though file size changes. If a file is not overwritten (=appended), sync_file_range() does not guarantee the file can be recovered after crash, while fsync()/fdatasync() guarantee that.

sync_file_range() behavior highly depends on kernel version and filesystem.

xfs does neighbor page flushing, in addition to specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing page #3 to #4, but also flushing many more dirty pages (i.e. up to page#16). This works very well for HDD because I/O unit size becomes bigger. In general, synchronously writing 1MB * 1000 times is much faster than writing 4KB * 256,000 times. ext3 and ext4 don't do neighbor page flushing.

sync_file_range() is generally faster than fsync() because it can control dirty page ranges and skips waiting for metadata flushing. But sync_file_range() can't be used for guaranteeing durability, especially when file size changes.

Practical usage of the sync_file_range() is where you don't need full durability but you want to control(reduce) dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write because HBase relies on HDFS and HDFS can recover from HDFS replicas. Compactions write huge volume of data so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() 1MB * 1000 times periodically gives more stable workloads than flushing 1GB at one time. RocksDB also uses sync_file_range().

Async sync_file_range is not always asynchronous

Sometimes you might want to flush pages/files more earlier than relying on kernel threads (bdflush), in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both takes longer time (~10ms) on HDD if RAID write cache is disabled. You probably don't want to execute from user-facing thread.
How about using async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from user-facing thread? It's supposed not to wait for i/o, so latency should be minimal. But I don't recommend using sync_file_range() from user facing thread like that. This is actually not always asynchronous, and there are many cases it takes time for waiting for disk i/o.
I'm showing a couple of examples where async sync_file_range() takes longer time. In the following examples, I assume stable page writes are already disabled.

Stall Example 1: Small range sync_file_range()

single thread
  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 100000; i++) {
    write(fd, buf, 1000); // not aligned page write
    sync_file_range(fd, i*1000, 1000, SYNC_FILE_RANGE_WRITE); // async
  }

In example 1, with stable page write fix, write() won't wait for dirty pages written to disk(writeback). But sync_file_range() actually waits for writeback.
When stable page write is disabled, there is a possibility that a page is both writeback in progress and marked dirty. Below is an example scenario.

1. write() -> marking page 1 dirty
2. sync_file_range(SYNC_FILE_RANGE_WRITE) -> sending writeback request on page 1
3. write() -> making page 1 dirty (not waiting with stable page write fix)
4. sync_file_range(SYNC_FILE_RANGE_WRITE) -> waiting until page 1 is written back

In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until flushing to disk triggered by the first sync_file_range() is done, which may take tens of milliseconds.
Here is an example stack trace when sync_file_range() is blocked.

                sleep_on_page
                __wait_on_bit
                wait_on_page_bit
                write_cache_pages
                generic_writepages
                xfs_vm_writepages
                do_writepages
                __filemap_fdatawrite_range
                filemap_fdatawrite_range
                SyS_sync_file_range
                tracesys
                sync_file_range
                __libc_start_main

Stall example 2: Bulk sync_file_range()

What happens if calling write() multiple times then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? In below example, calling write() 21 times then triggering flush by sync_file_range().

  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 21; i++) {
    write(fd, buf, 1000); // not aligned page write
  }
  sync_file_range(fd, 0, 16384, SYNC_FILE_RANGE_WRITE);
  for(i=22; i< 42; i++) {
    write(fd, buf, 1000);
  }
  sync_file_range(fd, 16384, 32768, SYNC_FILE_RANGE_WRITE);

Unfortunately, sync_file_range() also may take time in this case too. It works as below in xfs. Since xfs does neighbor page flushing via sync_file_range(), there is a possibility that a page is both under writeback in progress and marked dirty.

1. write -> page 1~6 become dirty
2. sync_file_range (page 1~4) -> triggering page 1~4 and 5, 6 for flushing (in xfs)
3. write -> page 6~11 become dirty
4. sync_file_range (page 5..8) -> waiting for page 6 to be flushed to disk

Note that if write volume (and overall disk busy rate) is lower enough than disk speed, page 6 should be flushed to disk before starting second sync_file_range(). In that case it shouldn't wait anything.

Stall example 3: Aligned page writes

The main reason why async sync_file_range() was blocked is that write() was not aligned by page size. What if we are doing fully aligned page write (writing 4KB multiple)?
With aligned page write, async sync_file_range() does not wait shown at Example 1 and 2, and gives much better throughput. But, even with aligned page write, sometimes async sync_file_range() waits for disk i/o.
sync_file_range() submits page write i/o requests to disks. If there are many outstanding i/o read/write requests in a disk queue, new i/o requests are blocked until there is a free slot available in the queue. This blocks sync_file_range() too.
Queue size is managed under /sys/block/sdX/queue/nr_requests. You may increase to larger values.

echo 1024 > /sys/block/sda/queue/nr_requests

This mitigates stalls at sync_file_range() on busy disks. But this won't solve problems entirely. If you submit many more write i/o requests, read requests take more time to serve (write-starving-reads) which very negatively affects user-facing query latency.

Solution for the stalls

Make sure use Linux kernels supporting disabling stable page write. Otherwise write() would be blocked. My previous post covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to by asynchronous, but is actually blocked for writeback in many patterns, so it's not recommended calling sync_file_range() from user-facing thread, if you really care about latency. Calling sync_file_range() from a background (not user-facing) thread would be better solution here.
Buffered write and sync_file_range() are important for some databases like HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well known best practices. HLog writes are buffered, and not flushed to disk per write(put operation). There are some HBase/Hadoop distributions supporting sync_file_range() to reduce outstanding dirty pages. From Operating System point of view, HLog files are appended, and file size is not small (64MB by default). This means all HLog writes go to a single disk with JBOD configuration, which means the single disk tends to be overloaded. An overloaded disk takes longer time for flushing dirty pages (via sync_file_range or bdflush), which may block further sync_file_range(). To get better latency, using Linux Kernel supporting to disable stable page write, and calling sync_file_range() from background threads (not from user-facing thread) are important.

Monday, March 10, 2014

Why buffered writes are sometimes stalled

Many people think buffered write (write()/pwrite()) is fast because it does not do disk access. But this is not always true. Buffered write sometimes does disk access by itself, or waits for some disk accesses by other threads. Here are three common cases where write() takes longer time (== causing stalls).

1. Read Modify Write

Suppose the following logic. Opening aaa.dat without O_DIRECT/O_SYNC, writing 1000 bytes sequentially for 100,000 times, then flushing by fsync().

  fd=open("aaa.dat", O_WRONLY);
  for(i=0; i< 100000; i++) {
    write(fd, buf, 1000);
  }
  fsync(fd);

You might think each write() will finish fast enough (at least less than 0.1ms) because it shouldn't do any disk access. But it is not always true.
Operating System manages I/O by page. It's 4KB for most Linux environments. If you'd modify 1000 bytes of the 4KB page from offset 0, Linux first needs to read the 4KB page, modify first 1000 bytes, then write the page back. The page will be sooner or later written to disk. Yes, reading the page is needed. This is called RMW (Read Modify Write). If the page was not cached in filesystem cache (page cache), reading the page from disk is needed, which may take tens of milliseconds on HDD.
This problem often happens when overwriting large files. You can easily repeat the problem by the following steps.

1. Creating a large file (cached in fs cache)

 dd if=/dev/zero of=aaa bs=4096 count=1000000

2. Uncache the file (i.e. echo 3 > /proc/sys/vm/drop_caches)
3. Writing to the file (using write()/pwrite()) => the target page does not exist in fs cache. So reading from disk. You can verify that by iostat.

There are a couple of solutions to avoid slow Read Modify Write.

Appending a file, not updating in place

Appending a file means newly allocated pages are always cached, so slow Read Modify Write issue doesn't happen.
In MySQL, binary logs are appended, not overwritten. InnoDB log files are always overwritten so this workaround can't be used.
Note that if you need full durability (calling fsync/fdatasync() per each write()), appending is much more expensive than overwriting for most filesystems. It is well known that sync_binlog=1 is very slow in MySQL ~5.5, and the main reasons were group commit was broken and appending + fsync() was not fast. In 5.6 group commit was supported so multi-threaded write throughput improved a lot.

Always cache target files within filesystem cache

If target pages are cached in filesystem cache (page cache), write() doesn't hit disk reads.
The obvious disadvantage is that this approach wastes memory. RAM is expensive. If you have 128GB InnoDB log files, you won't like to give 128GB RAM for InnoDB log files. The RAM should be allocated for InnoDB buffer pool.

Aligning write() I/O unit size by 4KB multiple

If you always write with Linux page size aligned (4KB multiple), Read Modify Write issue can be avoided.
One approach to do aligned write is zero-filling. Below is an example.

  memset(buf, 0, 4096); // zerofill
  memset(buf, data1, 1000); // set application data (1000 bytes)
  pwrite(fd, buf, 4096, 0); // write 4KB

  memset(buf2, 0, 4096);
  memset(buf2, data2, 1000);
  pwrite(fd, buf2, 4096, 4096);
  ....

In this example, you write 1000 bytes of application data twice, but actually writing 4KB data twice, 8KB in total. 8192-2000=6192 bytes are zero-filled data. Disk reads don't happen by this approach, even if they are not cached in filesystem cache.
This approach needs more space. In the above case you wasted 6192 bytes.
Another approach is remembering application data offset, but writing by 4KB aligned.

  memset(buf, 0, 4096); // zerofill
  memset(buf, data1, 1000); // set application data (1000 bytes)
  pwrite(fd, buf, 4096, 0);

  memset(buf, 0, 4096);
  memset(buf+1000, data2, 1000); // set next application data from real offset
  pwrite(fd, buf, 4096, 0);
  ....

This approach doesn't waste space.
InnoDB log file write unit is not aligned by 4KB, so Read Modify Write issue exists. Some MySQL fork distributions fix the problem by writing 4KB aligned data. Facebook MySQL takes the latter approach, and we're using in production. Percona Server also supports aligned write but it's stated as beta quality.

2. write() may be blocked for "stable page writes"

Dirty pages

Most write commands don't flush to disk immediately. write()/pwrite() functions write to Linux page cache and mark them dirty, unless the target file is opened with O_DIRECT/O_SYNC. write() and pwrite() are basically the same except that pwrite() has an option to set offset. I mean write() here as dirty page write.
Dirty pages are sooner or later flushed to disk. This is done by many processes/functions, such as bdflush, fsync(), sync_file_range(). Behavior to flush to disk highly depends on filesystem. I consider only XFS here.
When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done. This is called "Stable Page Write". This may cause write() stalls, especially when using slower disks. Without write cache, flushing to disk takes ~10ms usually, ~100ms in bad cases.

Suppose the following example. There are two clients. One is writing 10 bytes repeatedly via write(), the other is calling fsync() to the same file.

Thread 1:
  fd=open("aaa.dat", O_WRONLY);
  while(1) {
    t1 = get_timestamp
    write(fd, buf, 10);
    t2 = get_timestamp
  }

Thread 2:
  fd=open("aaa.dat", O_WRONLY);
  while(1) {
    fsync(fd);
    sleep(1);
  }

If you run on a slower disks (HDD with write cache disabled), you may notice sometimes write() (t2-t1) takes more than 10ms. Taking time for fsync() is expected because fsync() flushes dirty pages and metadata into disk, but buffered write also sometimes takes time due to stable page write.

Another annoying issue is while write() is blocked, the write() holds an exclusive inode mutex. It blocks all writes and disk reads to/from the same file (all pages within the same file).

Disabling Stable Page Writes

Hopefully there is a patch to disable Stable Page Write, and some Linux distributions support it.
With this patch, on most filesystems write() no longer waits for dirty page writeback. It helps not to cause write() latency spikes.

3. Waiting for journal block allocation in ext3/4

If you are using ext3 or ext4, you may still suffer from occasional write() stalls, even if disabling stable page writes. This happens when write() waits for journal block allocation. Example stack trace is as follows. Easier workaround is using xfs, which gives no such stalls.

  Returning from: 0xffffffff81167660 : __wait_on_buffer+0x0/0x30 [kernel]
  Returning to : 0xffffffff811ff920 : jbd2_log_do_checkpoint+0x490/0x4b0 [kernel]
  0xffffffff811ff9bf : __jbd2_log_wait_for_space+0x7f/0x190 [kernel]
  0xffffffff811fab00 : start_this_handle+0x3b0/0x4e0 [kernel]
  0xffffffff811faceb : jbd2__journal_start+0xbb/0x100 [kernel]
  0xffffffff811fad3e : jbd2_journal_start+0xe/0x10 [kernel]
  0xffffffff811db37e : ext4_journal_start_sb+0x7e/0x1d0 [kernel]
  0xffffffff811bd757 : ext4_da_write_begin+0x77/0x210 [kernel]
  0xffffffff810deaea : generic_file_buffered_write+0x10a/0x290 [kernel]
  0xffffffff810e0171 : __generic_file_aio_write+0x231/0x440 [kernel]
  0xffffffff810e03ed : generic_file_aio_write+0x6d/0xe0 [kernel]
  0xffffffff811b732f : ext4_file_write+0xbf/0x260 [kernel]
  0xffffffff8113907a : do_sync_write+0xda/0x120 [kernel]
  0xffffffff8113993e : vfs_write+0xae/0x180 [kernel]
  0xffffffff81139df2 : sys_pwrite64+0xa2/0xb0 [kernel]
  0xffffffff8156a57b : system_call_fastpath+0x16/0x1b [kernel]

Summary

Question: Why does buffered write() sometimes stall? It just writes to kernel buffer and doesn't hit disk.
Answer:
  1. write() does disk read when needed. To avoid this issue you need to append a file, not overwrite. Or use OS page aligned writes.
  2. write() may be blocked for "stable page writes". To avoid this issue you need to use newer Linux kernel supporting disabling stable page writes.
  3. If you really care about latency, I don't recommend using ext. Use xfs instead.

Many of the issues can be mitigated by using battery/flash backed write cache on raid controller, but this is not always possible, and battery often expires.
In the next post, I'll describe about why sync_file_range(SYNC_FILE_RANGE_WRITE) sometimes stalls.

Yoshinori Matsunobu's blog