Wednesday, May 27, 2009

Overwriting is much faster than appending

Writing small volume of data (Bytes-MBs) with sync (fsync()/fdatasync()/O_SYNC/O_DSYNC) is very common for RDBMS and is needed to guarantee durability. For transactional log files, sync happens per commit. For data files, sync happens at checkpoint etc. Typically RDBMS does syncing data very frequently. In this case, overwriting is much faster than appending for most filesystems/storages. Overwriting does not change file size, while appending does. Increasing file size requires a lot of overheads such as allocating space within the filesystem, updating & flushing metadata. This really matters when you writes data with fsync() very frequently. The following are simple benchmarking results on ext3 RHEL5.3.


1. creating an empty file, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 3085.94321
(Emulating current binlog (sync-binlog=1) behavior)

2. creating a 1GB data file, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 12330.47268
(Emulating current InnoDB log file behavior)

3. zero-filling 1GB, then writing 8KB 128*1024 times with fdatasync() immediately
fdatasync per second: 6569.00072
(Zero-filling causes massive disk writes so killing application performance)

4. zero-filling 1GB, sleeping 20 seconds, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 11669.81157
(Zero-filling finished within 20 seconds. This is actually does the same thing with no.2)


Apparently no.2(and no.4) are much faster than no.1 and no.3. The difference between no.1 and no.2 is just appending or overwriting. Four times difference is amazing but this is real and I got similar results on other filesystems except zfs (I tested xfs, reiserfs and zfs). (Updated in May 28: I got about 7,000 fsync/sec for both appending/overwriting on zfs. There is no outstanding difference because zfs is Copy On Write filesystem as comment #3)

This is one of the reason why sync-binlog=1 is very slow. Binlog is appended, not overwritten. Default sync-binlog value is 0 (no sync happens at commit) so appending does not cause serious performance drop. But there are cases that sync-binlog=1 is absolutely needed. I am currently directly working on this and implementing "preallocating binlog" functionality.

The difference between no.3 and no.4 is also interesting. Overwriting requires preallocation : allocating space before writing. If preallocation happens *dynamically during heavy loads* (no.3), application performance is seriously degraded.

No.3 is close to current InnoDB system tablespace (ibdata) mechanism. InnoDB extends tablespace size by innodb_autoextend_increment MBs dynamically, then doing overwriting. But as you see above, dynamically preallocating with zero is not good for performance.

Using posix_fallocate() instead of zero-filling will fix this issue. posix_fallocate() preallocates space without any overhead. Unfortunately currently most of enterprise Linux distributions/filesystems/glibc don't behave as expected, but internally doing zero-filling instead(including RHEL5.3).
Preallocating large enough space before going into production or low-load hours (midnight etc) is a current workaround.

Talking about InnoDB deployment, innodb_file_per_table is popular and easy-to-manage, but it's currently not overwriting architecture(Updated in May 28: Preallocating GBs of data beforehand is not possible unless you explicitly load & delete data into/from tables. The maximum (auto)extension of an .ibd file is 4MB at one time, regardless of innodb_autoextend_increment setting. See bug#31592 for detail.) Not using innodb_file_per_table, but preallocating large InnoDB tablespaces before going into production (i.e. ibdata1,2,3.., over 100GB in total) can often get better performance.

8 comments:

Baron said...

"innodb_file_per_table is popular and easy-to-manage, but it's currently not overwriting architecture" -- I think you are mixing things a little. It does preallocation in blocks of innodb_autoextend_increment size.

There is some javascript in this comment form that is driving me crazy. Arrow keys don't work, can't ctrl-v to paset, can't right click to paste...

Anonymous said...

Baron, almost... there's a bug:
http://bugs.mysql.com/bug.php?id=31592

qu1j0t3 said...

The generalisation is probably not valid for COW filesystems such as ZFS.

Yoshinori Matsunobu said...

Baron,

I needed to explain more. Thanks. What I meant was preallocating GBs of data beforehand is not possible unless you explicitly load & delete data into/from tables. The maximum (auto)extension of an .ibd file is currently 4MB at one time, regardless of innodb_autoextend_increment setting, as Morgan says.

I don't do anything special setting in this comment form. I can copy & paste..

Yoshinori Matsunobu said...

qu1j0t3,

You are right. I did the same test on ZFS last August then got very close numbers (about 7,000 fsync/sec in both appending and overwriting). I forgot to mention this. I have updated the post. Thanks.

Bradley C. Kuszmaul said...

I don't get it. For ordinary disk drives how can you do anything like 7000 fsync/sec? A disk can do a couple of hundred. That's assuming that you actually want the fsync to write data to disk. Even SSD can only do a few thousand fsync/sec.

Yoshinori Matsunobu said...

Bradley,

I tested on H/W RAID 1, two SAS 15,000RPM HDDs, write cache enabled, protected by battery. Using battery backed up write cache (BBWC) enables to get thousands of fsync/sec with durability and this is one of the best practices for DBAs. You also might be interested in my previous posts:
http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html
http://yoshinorimatsunobu.blogspot.com/2009/05/make-sure-write-cache-is-enabled-on.html

Sildenafil Citrate said...

I have overwrite some and you are very right, overwriting is much faster than appending, but sometimes you need to do some appending too, it is slower but you can't avoid it, anyway nice post!

Post a Comment