Writing small volumes of data (bytes to MBs) with sync (fsync()/fdatasync()/O_SYNC/O_DSYNC) is very common for RDBMSs and is needed to guarantee durability. For transaction log files, a sync happens per commit. For data files, a sync happens at checkpoint, etc. Typically an RDBMS syncs data very frequently. In this case, overwriting is much faster than appending on most filesystems/storage. Overwriting does not change the file size, while appending does. Increasing the file size requires a lot of overhead, such as allocating space within the filesystem and updating & flushing metadata. This really matters when you write data with fsync() very frequently. The following are simple benchmarking results on ext3 (RHEL5.3).
1. creating an empty file, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 3085.94321
(Emulating current binlog (sync-binlog=1) behavior)
2. creating a 1GB data file, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 12330.47268
(Emulating current InnoDB log file behavior)
3. zero-filling 1GB, then writing 8KB 128*1024 times with fdatasync() immediately
fdatasync per second: 6569.00072
(Zero-filling causes massive disk writes, killing application performance)
4. zero-filling 1GB, sleeping 20 seconds, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 11669.81157
(Zero-filling finished within the 20 seconds, so this actually does the same thing as no.2)
Apparently no.2 (and no.4) are much faster than no.1 and no.3. The only difference between no.1 and no.2 is appending vs. overwriting. A four-times difference is amazing, but it is real, and I got similar results on other filesystems except zfs (I tested xfs, reiserfs and zfs). (Updated on May 28: I got about 7,000 fsync/sec for both appending and overwriting on zfs. There is no outstanding difference because zfs is a Copy On Write filesystem, as noted in comment #3.)
This is one of the reasons why sync-binlog=1 is very slow. The binlog is appended to, not overwritten. The default sync-binlog value is 0 (no sync happens at commit), so appending does not cause a serious performance drop. But there are cases where sync-binlog=1 is absolutely needed. I am currently working directly on this, implementing a "preallocating binlog" functionality.
The difference between no.3 and no.4 is also interesting. Overwriting requires preallocation: allocating space before writing. If preallocation happens *dynamically during heavy load* (no.3), application performance is seriously degraded.
No.3 is close to the current InnoDB system tablespace (ibdata) mechanism. InnoDB extends the tablespace size by innodb_autoextend_increment MBs dynamically, then overwrites the space. But as you see above, dynamically preallocating with zeroes is not good for performance.
Using posix_fallocate() instead of zero-filling will fix this issue. posix_fallocate() preallocates space without the zero-filling overhead. Unfortunately, most current enterprise Linux distributions/filesystems/glibc versions don't behave as expected, but internally do zero-filling instead (including RHEL5.3).
Preallocating large enough space before going into production, or during low-load hours (midnight, etc.), is the current workaround.
Talking about InnoDB deployment, innodb_file_per_table is popular and easy to manage, but it currently does not use an overwriting architecture. (Updated on May 28: Preallocating GBs of data beforehand is not possible unless you explicitly load & delete data into/from tables. The maximum (auto)extension of an .ibd file is 4MB at a time, regardless of the innodb_autoextend_increment setting. See bug#31592 for details.) Not using innodb_file_per_table, but preallocating large InnoDB tablespaces before going into production (i.e. ibdata1,2,3..., over 100GB in total), can often give better performance.