Wednesday, August 12, 2009

Great performance effect of fixing broken group commit

Yesterday InnoDB Plugin 1.0.4 was released by Innobase. This version contains one of the most important performance fixes - "Fix for broken group commit". After MySQL5.0, InnoDB breaks group commit when using with binary log (or with other transactional storage engines), even though setting innodb_support_xa=0. This was really serious because fsync() (called at transaction commit when setting innodb_flush_log_at_trx_commit=1) is very expensive. The initial bug report about this was submitted by Peter four years ago. Last year David Lutz submitted a bug report with his prototype patch. It is great news that this bug fix has been finally implemented in the official InnoDB release.
I did a simple benchmarking by mysqlslap. mysqlslap has functionality to run concurrent inserts from multiple connections. The result is as follows.
mysqlslap --concurrency=1,5,10,20,30,50 --iterations=1 --engine=innodb \
--auto-generate-sql --auto-generate-sql-load-type=write \
--number-of-queries=100000


H/W is Sun Fire X4450, Intel Xeon X5560 Nehalem 2.80GHz * 16cores, 12GB RAM, SAS HDD with write cache. my.cnf configuration is as follows. log-bin is enabled.

[mysqld]
basedir=/usr/mysql5137
datadir=/data/mysql5137/data
ignore_builtin_innodb
plugin-load=innodb=ha_innodb.so;innodb_trx=ha_innodb.so;
innodb_locks=ha_innodb.so;innodb_lock_waits=ha_innodb.so;
innodb_cmp=ha_innodb.so;innodb_cmp_reset=ha_innodb.so;
innodb_cmpmem=ha_innodb.so;innodb_cmpmem_reset=ha_innodb.so
innodb_log_files_in_group=2
innodb_buffer_pool_size=2G
innodb_flush_method=O_DIRECT
innodb_log_file_size=512M
innodb_data_file_path=ibdata1:500M:autoextend
innodb_file_per_table
log-bin
table_cache=8192

Apparently InnoDB Plugin 1.0.4 outperforms normal InnoDB (6.1 times faster on 30 connections, innodb_support_xa=1). Normal InnoDB doesn't scale with connections but InnoDB Plugin 1.0.4 does. What is the difference? Normal InnoDB does the following at transaction commit.
pthread_mutex_lock(&prepare_commit_mutex)
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
writing into binlog
writing into InnoDB logfile for commit, then fsync
pthread_mutex_unlock(&prepare_commit_mutex)

Under the critical section protected by prepare_commit_mutex, only one thread can do operation. So when 100 threads do commit at the same time, fsync() is called 100 times for prepare, 100 times for commit (200 in total). Group commit is totally broken. As you see the above graph, innodb_support_xa=0 is effective (though it still breaks group commit), but in general innodb_support_xa=0 is not recommended because it will break consistency between binlog and InnoDB in case of a crash.
In InnoDB Plugin 1.0.4, the behavior has changed as follows.
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
pthread_mutex_lock(&prepare_commit_mutex)
writing into binlog
writing into InnoDB logfile for commit
pthread_mutex_unlock(&prepare_commit_mutex)
fsync to the InnoDB logfile

fsync, the most expensive operation, is called outside the critical section, so group commit is possible and concurrency is much more improved. The following graph shows how much Innodb_data_fsyncs was increased after executing mysqlslap(committing 100,000 transactions).


In 5.1.37+Builtin(support_xa=1), 2 fsyncs happens per transaction, regardless of # of concurrent connections. In 5.1.37+Builtin(support_xa=0), 1 fsync happens per transaction, regardless of # of concurrent connections. These mean group commit is broken. In both cases about 10,000 fsyncs were executed per second, which seems upper limit for regular HDD with BBU. On the other hand, InnoDB Plugin 1.0.4 greatly reduces the number of fsyncs(i.e. 200251 to 26092 on 30 connections(innodb_support_xa=1): 87% decreased). This shows group commit works well.

Write ordering between binlog and InnoDB logfile is still guaranteed. Write ordering for InnoDB prepare is not same as the ordering of binlog, but this is fine. Prepared entries are used only for recovery and not visible to applications. When doing crash recovery, mysqld reads binlog at first(picking up xids), then checking prepared but not committed entries(xids) in InnoDB logfile, then applying these entries in the order of binlog xids. So in the end write ordering is guaranteed.
Note that if you set sync_binlog=1, it is still very slow because writing into binlog is protected by mutex (prepare_commit_mutex and LOCK_log).
This is my understanding. InnoDB Plugin 1.0.4 also has other awesome features (i.e. multiple i/o threads) so it's worth trying.

2 comments:

Stewart Smith said...

Awesome news. We're really hoping to merge the new plugin into Drizzle very shortly too.

Sildenafil Citrate said...

when I tried InnoDB Plugin 1.0.4 I did notice the great improvement because I was using the other old versions and in this one the major bugs that were a headache int the previous releases were fixed and I was very happy about it!

Post a Comment