Titles in this page

Wednesday, August 26, 2009

Accessing MySQL tables from UDF + storage engine API

A couple of weeks ago, at a MySQL study group in Tokyo I presented about various kinds of MySQL hacking techniques, such as debugging with MS Visual Studio or gdb, tracing with DTrace, writing information schema plugins, accessing tables from UDF and storage engine API, and extending MySQL server code itself. This 90-minute session went successful. Nearly 100 people attended, including famous MySQL developers such as Kazuho Oku-san and Kentoku Shiba-san, having productive discussions with attendees and quite a lot of people seemed interested in MySQL hacking. The slides are written in Japanese, but sample codes can be understandable and can be downloaded here.
What audiences were most interested in was accessing MySQL tables directly from Plugins(currently UDFs) and storage engine API. I showed a sample UDF code, based on what Kazuho-san released over one year ago, which is over 10 times faster than SQL or stored procedures for complex operations. Excerpted codes are as follows(skipping error handling).

#define MYSQL_SERVER 1
/* initializing tables, fields and indexes */
THD *thd = current_thd;
TABLE_LIST tables;
TABLE *table;
tables.init_one_table("test","message", TL_READ);
simple_open_n_lock_tables(thd, &tables);
table= tables.table;
message_id_fld = get_field(table, "id");
message_user_id_fld = get_field(table, "user_id");
message_key = index_init(table, "user_id", true);

/* accessing table. Equivalent query is
SELECT id FROM message WHERE user_id=? ORDER BY id DESC LIMIT 1; */
uchar* key_buff= (uchar*) thd->alloc(message_key->key_length);
bzero(key_buff, message_key->key_length);
int null_offset= message_user_id_fld->null_bit;
int4store(key_buff + null_offset, user_id);
err= table->file->index_read_last_map(table->record[0], key_buff, 1);
fprintf(stderr, "id=%lld\n", message_id_fld->val_int());

/* closing objects */

By defining MYSQL_SERVER to 1, you can access to MySQL internal structures.
You can build the sample UDF on development environment as follows. MySQL source code is required to build when accessing to internal structures.

-Wall -fno-rtti -fno-exceptions -fPIC -shared \
-I/debug-built-mysql-src/include \
-I/debug-built-mysql-src/regex \
-I/debug-built-mysql-src/sql \
-o udf_sample.so udf_sample.cc

On production environment, skip using -DSAFE_MUTEX -DSAFEMALLOC and include MySQL source built with configurations for production. When using some macros(i.e. current_thd) that behave differently between debug and prod configurations, you need to include appropriate source directly.

g++ -g \
-Wall -fno-rtti -fno-exceptions -fPIC -shared \
-I/prod-built-mysql-src/include \
-I/prod-built-mysql-src/regex \
-I/prod-built-mysql-src/sql \
-o udf_sample.so udf_sample.cc

This snippet skips error handling etc for good legibility, but you need to carefully handle errors, otherwise mysqld crashes. Please check bench_udf.cc, build_udf_debug.sh, and build_udf_prod.shin my sample code.

MySQL is fast, but it still has many expensive CPU operations such as parsing SQL statements, making execution plans every time (MySQL doesn't have optimizer-plan caching mechanism), a lot of memcpy/malloc operations, etc. If you write a UDF which directly accesses tables through storage engine API, these expensive operations can be skipped so it can be much faster than SQL statements in many cases.
The above code looks like using key value store APIs (i.e. TokyoCabinet APIs). Most KVSs are faster than MySQL, and one of the reasons is they don't need to do extra operations such as making execution plans etc. But it is not impossible to use MySQL like fast KVS as above. You can not only get benefits about better performance, but also utilize traditional RDBMS implementations such as B+Tree (range scan), perfect durability, schema awareness. Data is stored in one place, and you can get the same data regardless of accessing methods. For example, you can use UDF for some special parts (i.e. picking up all friends' latest messages, which is difficult for MySQL to fetch quickly), and use normal SQL statements for others. I assume such two-way client interfaces (fast API and traditional SQL) can be a common implementation for next-generation RDBMS. Third, this is currently not common hacks so there is almost no document. For example, handling two or more tables requires additional codes, but this is not documented (I am still fully not understand to safely handle them).
Of course there are many disadvantages about using UDFs and storage engine APIs directly. First, it highly depends on MySQL versions. Storage engine APIs are standardized but client side operations such as calling table open/lock functions are different from MySQL versions. Second, mysqld easily crashes if UDF has bugs.
But there are a lot of possibilities here and this certainly attracts developers. For example, Kazuho-san yesterday published a very interesting UDF: mycached (memcached server udf). It listens memcached protocol, then directly accessing a table through storage engine API (primary key lookup). According to his benchmark this is twice as fast as SELECT with primary key lookup (if records are cached in buffer pool). I assume this approach will work very well as a durable key-value store, for example storing session data, last access timestamp etc.

Wednesday, August 12, 2009

Great performance effect of fixing broken group commit

Yesterday InnoDB Plugin 1.0.4 was released by Innobase. This version contains one of the most important performance fixes - "Fix for broken group commit". After MySQL5.0, InnoDB breaks group commit when using with binary log (or with other transactional storage engines), even though setting innodb_support_xa=0. This was really serious because fsync() (called at transaction commit when setting innodb_flush_log_at_trx_commit=1) is very expensive. The initial bug report about this was submitted by Peter four years ago. Last year David Lutz submitted a bug report with his prototype patch. It is great news that this bug fix has been finally implemented in the official InnoDB release.
I did a simple benchmarking by mysqlslap. mysqlslap has functionality to run concurrent inserts from multiple connections. The result is as follows.
mysqlslap --concurrency=1,5,10,20,30,50 --iterations=1 --engine=innodb \
--auto-generate-sql --auto-generate-sql-load-type=write \

H/W is Sun Fire X4450, Intel Xeon X5560 Nehalem 2.80GHz * 16cores, 12GB RAM, SAS HDD with write cache. my.cnf configuration is as follows. log-bin is enabled.


Apparently InnoDB Plugin 1.0.4 outperforms normal InnoDB (6.1 times faster on 30 connections, innodb_support_xa=1). Normal InnoDB doesn't scale with connections but InnoDB Plugin 1.0.4 does. What is the difference? Normal InnoDB does the following at transaction commit.
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
writing into binlog
writing into InnoDB logfile for commit, then fsync

Under the critical section protected by prepare_commit_mutex, only one thread can do operation. So when 100 threads do commit at the same time, fsync() is called 100 times for prepare, 100 times for commit (200 in total). Group commit is totally broken. As you see the above graph, innodb_support_xa=0 is effective (though it still breaks group commit), but in general innodb_support_xa=0 is not recommended because it will break consistency between binlog and InnoDB in case of a crash.
In InnoDB Plugin 1.0.4, the behavior has changed as follows.
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
writing into binlog
writing into InnoDB logfile for commit
fsync to the InnoDB logfile

fsync, the most expensive operation, is called outside the critical section, so group commit is possible and concurrency is much more improved. The following graph shows how much Innodb_data_fsyncs was increased after executing mysqlslap(committing 100,000 transactions).

In 5.1.37+Builtin(support_xa=1), 2 fsyncs happens per transaction, regardless of # of concurrent connections. In 5.1.37+Builtin(support_xa=0), 1 fsync happens per transaction, regardless of # of concurrent connections. These mean group commit is broken. In both cases about 10,000 fsyncs were executed per second, which seems upper limit for regular HDD with BBU. On the other hand, InnoDB Plugin 1.0.4 greatly reduces the number of fsyncs(i.e. 200251 to 26092 on 30 connections(innodb_support_xa=1): 87% decreased). This shows group commit works well.

Write ordering between binlog and InnoDB logfile is still guaranteed. Write ordering for InnoDB prepare is not same as the ordering of binlog, but this is fine. Prepared entries are used only for recovery and not visible to applications. When doing crash recovery, mysqld reads binlog at first(picking up xids), then checking prepared but not committed entries(xids) in InnoDB logfile, then applying these entries in the order of binlog xids. So in the end write ordering is guaranteed.
Note that if you set sync_binlog=1, it is still very slow because writing into binlog is protected by mutex (prepare_commit_mutex and LOCK_log).
This is my understanding. InnoDB Plugin 1.0.4 also has other awesome features (i.e. multiple i/o threads) so it's worth trying.

Tuesday, July 28, 2009

iostat: (r/s + w/s) * svctm = %util on Linux

iostat -x is very useful to check disk i/o activities. Sometimes it is said that "check %util is less than 100%" or "check svctm is less than 50ms", but please do not fully trust these numbers. For example, the following two cases (DBT-2 load on MySQL) used same disks (two HDD disks, RAID1) and reached almost 100% util, but performance numbers were very different (no.2 was about twice as fast as no.1).
# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
21.16 0.00 6.14 29.77 0.00 42.93

Device: rqm/s wrqm/s r/s w/s rMB/s wMB/s
sdb 2.60 389.01 283.12 47.35 4.86 2.19
avgrq-sz avgqu-sz await svctm %util
43.67 4.89 14.76 3.02 99.83

# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
40.03 0.00 16.51 16.52 0.00 26.94

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
sdb 6.39 368.53 543.06 490.41 6.71 3.90
avgrq-sz avgqu-sz await svctm %util
21.02 3.29 3.20 0.90 92.66

100% util does not mean disks can not be faster anymore. For example, command queuing (TCQ/NCQ) or battery backed up write cache can often boosts performance significantly. For random i/o oriented applications(in most cases), I pay attention to r/s and w/s. r/s is the number of read requests that were issued to the device per second. w/s is the number of write requests that were issued to the device per second (copied from man). r/s + w/s is the total number of i/o requests per second (IOPS) so it is easier to check whether disks work as expected or not. For example, a few thousands of IOPS can be expected on single Intel SSD drive. For sequential i/o operations, r/s and w/s can be significantly affected by Linux parameters such as max_sectors_kb even though throughput is not different, so I check different iostat status variables such as rrqm/s, rMB/s.

What about svctm? Actually Linux's iostat calculates svctm automatically from r/s, w/s and %util. Here is an excerpt from iostat.c .

nr_ios = sdev.rd_ios + sdev.wr_ios;
tput = ((double) nr_ios) * HZ / itv;
util = ((double) sdev.tot_ticks) / itv * HZ;
svctm = tput ? util / tput : 0.0;
/* rrq/s wrq/s r/s w/s rsec wsec rkB wkB rqsz qusz await svctm %util */
printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",
((double) sdev.rd_merges) / itv * HZ,
((double) sdev.wr_merges) / itv * HZ,
((double) sdev.rd_ios) / itv * HZ,
((double) sdev.wr_ios) / itv * HZ,

The latter means the following.
r/s = sdev.rd_ios / itv * HZ
w/s = sdev.wr_ios / itv * HZ

The former means the following.
svctm = util / ((sdev.rd_ios + sdev.wr_ios) * HZ / itv)

If %util is 100%, svctm is just 1 / (r/s + w/s) seconds, 1000/(r/s+w/s) milliseconds. This is an inverse number of IOPS. In other words, svctm * (r/s+w/s) is always 1000 if %util is 100%. So checking svctm is practically as same as checking r/s and w/s (as long as %util is close to 100%). The latter (IOPS) is much easier, isn't it?

Wednesday, May 27, 2009

Overwriting is much faster than appending

Writing small volume of data (Bytes-MBs) with sync (fsync()/fdatasync()/O_SYNC/O_DSYNC) is very common for RDBMS and is needed to guarantee durability. For transactional log files, sync happens per commit. For data files, sync happens at checkpoint etc. Typically RDBMS does syncing data very frequently. In this case, overwriting is much faster than appending for most filesystems/storages. Overwriting does not change file size, while appending does. Increasing file size requires a lot of overheads such as allocating space within the filesystem, updating & flushing metadata. This really matters when you writes data with fsync() very frequently. The following are simple benchmarking results on ext3 RHEL5.3.

1. creating an empty file, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 3085.94321
(Emulating current binlog (sync-binlog=1) behavior)

2. creating a 1GB data file, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 12330.47268
(Emulating current InnoDB log file behavior)

3. zero-filling 1GB, then writing 8KB 128*1024 times with fdatasync() immediately
fdatasync per second: 6569.00072
(Zero-filling causes massive disk writes so killing application performance)

4. zero-filling 1GB, sleeping 20 seconds, then writing 8KB 128*1024 times with fdatasync()
fdatasync per second: 11669.81157
(Zero-filling finished within 20 seconds. This is actually does the same thing with no.2)

Apparently no.2(and no.4) are much faster than no.1 and no.3. The difference between no.1 and no.2 is just appending or overwriting. Four times difference is amazing but this is real and I got similar results on other filesystems except zfs (I tested xfs, reiserfs and zfs). (Updated in May 28: I got about 7,000 fsync/sec for both appending/overwriting on zfs. There is no outstanding difference because zfs is Copy On Write filesystem as comment #3)

This is one of the reason why sync-binlog=1 is very slow. Binlog is appended, not overwritten. Default sync-binlog value is 0 (no sync happens at commit) so appending does not cause serious performance drop. But there are cases that sync-binlog=1 is absolutely needed. I am currently directly working on this and implementing "preallocating binlog" functionality.

The difference between no.3 and no.4 is also interesting. Overwriting requires preallocation : allocating space before writing. If preallocation happens *dynamically during heavy loads* (no.3), application performance is seriously degraded.

No.3 is close to current InnoDB system tablespace (ibdata) mechanism. InnoDB extends tablespace size by innodb_autoextend_increment MBs dynamically, then doing overwriting. But as you see above, dynamically preallocating with zero is not good for performance.

Using posix_fallocate() instead of zero-filling will fix this issue. posix_fallocate() preallocates space without any overhead. Unfortunately currently most of enterprise Linux distributions/filesystems/glibc don't behave as expected, but internally doing zero-filling instead(including RHEL5.3).
Preallocating large enough space before going into production or low-load hours (midnight etc) is a current workaround.

Talking about InnoDB deployment, innodb_file_per_table is popular and easy-to-manage, but it's currently not overwriting architecture(Updated in May 28: Preallocating GBs of data beforehand is not possible unless you explicitly load & delete data into/from tables. The maximum (auto)extension of an .ibd file is 4MB at one time, regardless of innodb_autoextend_increment setting. See bug#31592 for detail.) Not using innodb_file_per_table, but preallocating large InnoDB tablespaces before going into production (i.e. ibdata1,2,3.., over 100GB in total) can often get better performance.

Monday, May 18, 2009

Make sure write cache is enabled on your RAID controller

Using a write cache protected by battery (BBWC) is well known and one of the best practices in RDBMS world. But I have frequently seen situations that people do not set write cache properly. Sometimes they just forget to enable write cache. Sometimes write cache is disabled even though they say they set properly.

Make sure that BBWC is enabled. If not enabled, you will be able to easily get better performance by just enabling it. The following is a DBT-2 example.

Write cache is disabled:

# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
21.16 0.00 6.14 29.77 0.00 42.93

Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 2.60 389.01 283.12 47.35 4.86
wMB/s avgrq-sz avgqu-sz await svctm %util
2.19 43.67 4.89 14.76 3.02 99.83

Write cache is enabled:

# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
40.03 0.00 16.51 16.52 0.00 26.94

Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 6.39 368.53 543.06 490.41 6.71
wMB/s avgrq-sz avgqu-sz await svctm %util
3.90 21.02 3.29 3.20 0.90 92.66

Both run same applications(DBT-2), but the server activity was significantly different each other. I got 85% better result when write cache is enabled.

Checking write cache and battery status

BBWC is mostly equipped with H/W raid controller so operational command depends on products. Here is an example of "arcconf" command result.

# /usr/StorMan/arcconf GETCONFIG 1 AL
Controller Battery Information
Status : Optimal
Over temperature : No
Capacity remaining : 99 percent
Time remaining (at current draw): 3 days, 1 hours, 11 minutes

Logical device information
Logical device number 0
Read-cache mode : Disabled
Write-cache mode : Enabled (write-back)
Write-cache setting: Enabled (write-back) when
protected by battery

Physical Device information
Device #0
Device is a Hard drive
Size : 140009 MB
Write Cache : Disabled (write-through)

Write cache should be enabled only when battery backup is working. In other words:
- Write cache on logical device (H/W raid controller) is enabled when protected by battery
- Write cache on physical device is disabled
- The battery has enough capacity and long enough remaining time

I recommend DBAs to monitor write cache status regularly (adding this to your monitoring scripts), including battery status checking. Long time ago I was asked for urgent help to fix a problem that application performance suddenly went down. I looked into problems then found that write cache unexpectedly turned off because a battery was expired. If you successfully detected that battery capacity was decreased before write cache was disabled, you would be able to take an action before server performance suddenly goes down (i.e. allocating scheduled down time in order to replace the battery).

Quick health check with mysql commands

If you are not familiar with H/W raid controller specific command but want to check write cache status quickly, using mysqlslap or stored procedure is easy.

$ mysql -e "set global innodb_flush_log_at_trx_commit=1"
$ mysqlslap --concurrency=1 --iterations=1 --engine=innodb \
--auto-generate-sql --auto-generate-sql-load-type=write \

stored procedure:
create table t (c1 int) engine=innodb;
delimiter //
create procedure sp1(IN i INTEGER)
WHILE i > 0 DO
insert into t values (1);
SET i = i - 1;
delimiter ;

set global innodb_flush_log_at_trx_commit=1;
call sp1(100000);

You will be able to insert thousands of records per second if write cache is enabled. If disabled, only hundreds of inserts per second is possible, so you can easily check.

Monday, May 11, 2009

Tables on SSD, Redo/Binlog/SYSTEM-tablespace on HDD

I recently did a disk bound DBT-2 benchmarking on SSD/HDD (MySQL 5.4.0, InnoDB). Now I'm pretty confident that storing tables on SSD, redo/Binlog/SYSTEM-tablespace on HDD will be one of the best practices for the time being.

This post is a detailed benchmarking report.
(This post is very long and focusing on InnoDB only. If you are familiar with HDD/SSD/InnoDB architecture and understand what my blog title means, skipping section 1 (general theory) then reading from section 2 (benchmarking results) would be fine. )

1. General Theory of HDD, SSD and InnoDB

SSD is often called as a disruptive storage technology. Currently storage capacity is much smaller and unit price is much higher than HDD, but the situation is very rapidly changing. In the near future many people will use SSD instead of HDD.

From DBA's standpoint, you have a couple of choices for storage allocation.
- Storing all files on SSD, not using HDD at all
- Storing all files on HDD, not using SSD at all
- Using SSD and HDD altogether (some files on SSD, others on HDD).

Which is the best approach? My favorite approach is storing tables on SSD, storing Redo Log files, Binary Log files, and SYSTEM-tablespace(ibdata) on HDD. I describe a detailed reason and some DBT-2 benchmarking results below.

1.1 HDD is very good at sequential writes if write cache is enabled

Using battery backed up write cache(BBWC) is one of the best practices for RDBMS world. BBWC is normally equipped with hardware raid controller. Normally InnoDB flushes data to disks (redo log files) per each commit to guarantee durability. You can change this behavior by changing innodb_flush_log_at_trx_commit parameter to 0 or 2, but default setting (1) is recommended if you need durability.
Without write cache, flushing to disks require disk rotation and disk seek. Redo logs are sequentially written, so disk seek doesn’t happen when the disk is dedicated for redo log files, but disk rotation still happens. Average disk rotation overhead is 1/2 round. Theoretical maximum throughput is only 500 flushes per second when using single 15000RPM drive (15,000 * 2 / 60 seconds = 500).
If using write cache, the situation is greatly improved. Flushing to write cache does not require disk seek/rotation so finishes very quickly. The data in write cache will be written to disk with optimal order (i.e. writing many data at one time), so total throughput is highly improved. Over 10,000 fsync() per second is not impossible.

Sometimes people say that write cache is not effective because the total write volume is the same. This is not correct. On HDD, disk seek & rotation overhead is huge. By using write cache, internally a lot of write operations are converted into small number of write operations. Then the total number of disk seeks & rotations can be much smaller. For random-write oriented files (i.e. index files), disk seeks & rotations still happen (but reduced so still very effective) , but for sequential-write oriented files (i.e. REDO log files) write cache is very effective.
Write cache needs to be protected by battery in order not to be invalidated by power failure etc.
Using H/W RAID + BBWC + HDD is used for years and now it’s a proven technology. You can setup with pretty reasonable cost.
Note that HDD/SSD storage devices also have write cache themselves, but don't turn on. It's very dangerous because it's not protected by battery so data in cache is destroyed by power failure etc.

1.2 SSD is very good at random reads, good at random writes, not so good at sequential writes, compared to HDD

Currently read/write performance on SSD highly depends on products and device drivers. SSD is very fast for random reads. I could get over 5,000 *random reads* on single Intel X25-E SSD drive. For HDD, normally only a few hundreads of random reads is possible. Over 10 times difference is huge.
For sequential reads, performance difference is smaller but there is still a significant difference on some SSDs(When I tested, single X25-E drive was two times faster than two SAS 15,000RPM RAID1 drives).
For writes, I heard both positive and negative information. I tested Intel X25-E SSD then it worked very well (I got over 2,000 random writes on single drive with write cache). But I heard from many people who tested different SSDs that some SSDs don't perform well for writes. Some cases it was much slower than HDD, some cases write performance gradually dropped after a few months, some cases it freezed for a while. You would also be concerned with a "Write Endurance" issue on SSD when running on production environments.
On HDD, there is a huge difference between random writes and sequential writes. Sequential writes is very fast on HDD with write cache. On the other hand, on SSD, there is not so much difference between random writes and sequential writes. Though performance highly depends on SSD drives themselves, you probably notice that sequential write performance is not so different between HDD and SSD (or even faster on HDD). Since HDD has a longer history and is cheaper than SSD, currently there is not a strong reason to use SSD for sequential write oriented files.

1.3 MySQL/InnoDB files

From these perspectives, it would make sense to locate random i/o oriented files on SSD, sequential write oriented files on HDD. Let's classify MySQL/InnoDB files as follows.

Random i/o oriented:
- Table files (*.ibd)
- UNDO segments (ibdata)

Sequential write oriented:
- REDO log files (ib_logfile*)
- Binary log files (binlog.XXXXXX)
- Doublewrite buffer (ibdata)
- Insert buffer (ibdata)
- Slow query logs, error logs, general query logs, etc

By default, table files (*.ibd) are not created but included in InnoDB's SYSTEM-tablespace(ibdata). By using "innodb_file_per_table" parameter, *.ibd files are created then table/index data are stored there. Table files are of course randomly read/written so storing on SSD is better. Note that write cache is also very effective on SSD so using H/W raid with BBWC + SSD would be nice.

REDO log files and Binary log files are transactional logs. They are sequentially written so you can get very good performance on HDD with write cache. Sequential disk read will happen on recovery, but normally this will not cause a performance problem because log file size is normally much smaller than data files and sequential reads are much faster than random reads (happening on data files).

Doublewrite buffer is a special feature for InnoDB. InnoDB first writes flushed pages to the "doublewrite buffer", then writing the pages to their correct positions on data files. This is to avoid page corruption (Without doublewrite buffer, page might get corrupted if power failure etc happens during writing to disks). Writing to doublewrite buffer is sequential so highly optimized for HDD. Sequential read will happen on recovery. MySQL has a parameter "skip_innodb_doublewrite" to disable doublewrite buffer, but disabling is dangerous. The amount of write data to doublewrite buffer is equivalent to that to data areas, so this is not negligible.

Insert buffer is also a special feature for InnoDB. If non-unique, secondary index blocks are not in memory, InnoDB inserts entries to a "insert buffer" to avoid random disk i/o operations. Periodically, the insert buffer is merged into the secondary index trees in the database. Insert buffer enables to reduce the number of disk i/o operations by merging i/o requests to the same block, and random i/o operations can be sequential. So insert buffer is also highly optimized for HDD. Both sequential writes and reads will happen in normal operations.

UNDO segments are random i/o oriented. To guarantee MVCC, innodb needs to register old images in UNDO segments. Reading previous images from UNDO segments on disk requires random reads. If you run a very long transaction with repeatable read (i.e. mysqldump --single-transaction) or running a long query, a lot of random reads might happen, so storing UNDO segments on SSD would be better in that case. If you run only short transactions/queries, this will not be an issue.

Based on the above, I want to use "innodb_file_per_table", then storing *.ibd files on SSD, storing ib_logfile* (REDO log files), binary logs, ibdata (SYSTEM-tablespace) on HDD.

2 Benchmarking

I ran DBT-2(disk i/o intensive) to verify whether my assumption was correct or not.

2.1 Benchmarking environment

Sun Fire X4150
CPU: Intel Xeon 3.3GHz, quad core, 8 cores in total
HDD: SAS 15,000RPM, 2 disks, RAID 1, write cache enabled on H/W raid controller
SSD: Intel X25-E, Single drive, write cache enabled on drive
(Note that this is dangerous and not recommended for production at all. I didn't have a H/W raid controller working well on Intel X25-E so I just turned on for benchmarking purpose only)
OS: RedHat Enterprise Linux 5.3 (2.6.18-128.el5)
Filesystem: ext3
I/O Scheduler: noop
MySQL version: 5.4.0

DBT-2 option:
./run_workload.sh -n -s 100 -t 1 -w 100 -d 600 -c 20
(Approximate datafile size is 9.5 GB)

(Binlog is disabled)

This is a disk-bound configuration. If you increase innodb_buffer_pool_size higher, you can get better results but I intentionally set lower to do disk i/o intensive loads.

2.2 Benchmarking results

2.2.1 HDD vs SSD

All files on HDD(two drives, RAID1): 3447.25
All files on SSD(single drive) : 14842.44
(The number is NOTPM, higher is better)

I got 4.3 times better result on SSD. As Vadim, Matt, and other many people have already shown, just replacing HDD with SSD works very well for DBT-2 benchmarking.
The following is iostat result.

HDD 3447.25 NOTPM
Device: rrqm/s wrqm/s r/s w/s rMB/s
sdf1 4.60 80.03 363.48 377.30 6.52
wMB/s avgrq-sz avgqu-sz await svctm %util
10.14 46.06 17.89 24.18 1.35 99.98
us sy id wa st
3 1 50 46 0

SSD 14842.44 NOTPM
Device: rrqm/s wrqm/s r/s w/s rMB/s
sda1 0.00 11.90 1738.65 1812.33 30.73
wMB/s avgrq-sz avgqu-sz await svctm %util
46.42 44.49 4.01 1.13 0.27 95.03
us sy id wa st
18 5 57 20 0

2.2.2 Storing some files on SSD, others on HDD

All files on SSD : 14842.44
Redo log on HDD: 15539.8
Redo log and ibdata on HDD: 23358.63
Redo log and ibdata on tmpfs: 24076.43
Redo log and ibdata on separate SSD drives: 20450.78

These are quite interesting results. Let's see one by one. Redo log file on HDD

Storing just redo log files on HDD did not have so good effect (just 4.7% improvement). DBT-2 internally executes 14.5 DML(insert/update/delete) statements per transaction on average, so commit frequency is less than typical web applications. If you execute commit more frequently, performance difference will be bigger.The following is a statistics about how many bytes were written during single DBT-2 benchmarking. You can get this information by iostat -m.

MB written on SSD: 31658
MB written on HDD: 1928

So 31658MB were written to SSD but only 1928MB were written to HDD. 20 times difference is huge so just moving redo logs from SSD to HDD would not have big impact. Note that this statistics highly depends on applications. For DBT-2, this doesn't work anyway. Redo log and ibdata on HDD

By storing redo log and ibdata on HDD, I got much better result (14842.44->23358.63, 57% improvement!). Dirty pages are written twice (one for doublewrite buffer, the other for actual data area) so by moving ibdata, in which doublewrite buffer is allocated, to different drives, you can decrease the amount of writes to SSD by half. Since doublewrite buffer is sequentially written, it fits for HDD very well. Here is a result of iostat -m.

MB written on SSD : 23151
MB written on HDD : 25761

iostat & vmstat results are as follows. Apparently HDD was not so busy so working very well for sequential writes.

Device: rrqm/s wrqm/s r/s w/s rMB/s
sda1(SATA SSD) 0.03 0.00 2807.15 1909.00 49.48
sdf1(SAS HDD) 0.00 547.08 0.38 737.22 0.01
wMB/s avgrq-sz avgqu-sz await svctm %util
35.91 37.08 2.90 0.61 0.18 84.69
40.03 111.17 0.13 0.18 0.11 7.79

us sy id wa st
28 9 49 14 0 Redo log and ibdata on tmpfs

I was interested in whether storing redo log and ibdata on HDD can get the "best" performance or not. To verify this, I tested to store these files on tmpfs. This is not useful in production environment at all, but it's fine to check performance higher limit. tmpfs should be faster than any other high-end SSD(including PCIe SSD). If there is not so big performance difference, moving these files from HDD to very fast SSD is not needed. Here is a result.

Redo log and ibdata on HDD(NOTPM): 23358.63
Redo log and ibdata on tmpfs(NOTPM): 24076.43

I got only 3% performance improvement. Actually this is not suprise because HDD was not maxed out when testing "redo log and ibdata on HDD". Redo log and ibdata on separate SSD drive

Some people might be interested in how performance is different if using two SSD drives, one for *ibd, the other for Redo/ibdata. The following is a result.

Redo log and ibdata on HDD(NOTPM): 23358.63
Redo log and ibdata on separate SSD drives(NOTPM): 20450.78

So I got 12.5% worse result compared to HDD. Does this mean Intel X25-E is worse for sequential writes than SAS HDD? Actually the situation seemed not so simple. When I did a very simple sequential write oriented benchmarking (mysqlslap insert) on single drive, I got very close numbers compared to HDD. So there must be other reasons for performance drop. The following is iostat result.

Device: rrqm/s wrqm/s r/s w/s rMB/s
sda1(SATA SSD) 0.00 0.00 2430.32 1657.15 43.32
sdb1(SATA SSD) 0.00 12.60 0.75 1017.43 0.01
wMB/s avgrq-sz avgqu-sz await svctm %util
31.06 37.27 2.29 0.56 0.16 67.28
34.46 69.33 0.57 0.56 0.27 27.02

Device: rrqm/s wrqm/s r/s w/s rMB/s
sda1(SATA SSD) 0.03 0.00 2807.15 1909.00 49.48
sdf1(SAS HDD) 0.00 547.08 0.38 737.22 0.01
wMB/s avgrq-sz avgqu-sz await svctm %util
35.91 37.08 2.90 0.61 0.18 84.69
40.03 111.17 0.13 0.18 0.11 7.79

sda activity on SSD-SSD test was about 15% lower than on SSD-HDD test even though sdb was not fully busy. I have not identified the reason yet,but currently I assume that SATA interface (Host Bus Adapter port) got a bit saturated. I'll try other H/W components (SATA HDD, using different HBAs which has many more ports, etc) and see what will happen in the near future.
Anyway, any high end SSD won't beat tmpfs numbers(24076.43NOTPM) so HDD(23358.63NOTPM) would be enough.

3 Conclusion

Now I'm pretty confident that for many InnoDB based applications it would be a best practice to store *.ibd files on SSD, and store Redo log files, binary log files, and ibdata (SYSTEM-tablespace) on HDD. I intentionally disabled binary logs for these benchmarking because currently binlog has concurrency issues (breaking InnoDB group-commit). Hopefully this issue will be fixed soon in future 5.4 so I'll be able to show you detailed results at that time. In theory same principles as redo logs(storing on HDD) can be applied.

I have done many other benchmarks (disabling doublewrite buffer, running with AUTOCOMMIT, UNDO log performance difference between HDD and SSD, etc) so I'd like to share some interesting results in other posts.

A very interesting point I have seen is that "The amount of disk reads/writes" really matters on SSD, which does not matter on HDD. This is because disk seeks & rotation overhead is very small on SSD, so data transfer speed relatively got important. For example, doublewrite buffer really influenced performance on SSD, which did not have impact on HDD. This point might force architecture changes for RDBMS. From this perspective, I am highly impressed by PBXT. I'd like to do intensive tests on PBXT in the near future.

Wednesday, April 29, 2009

Linux I/O scheduler queue size and MyISAM performance

At MySQL Conference and Expo 2009, I explained how Linux I/O scheduler queue size affects MyISAM insert performance.

It is well known that Linux implemented four types of I/O schedulers (noop/deadline/anticipatory/cfq) in Linux kernel 2.6.10. The default is cfq in most distributions including RHEL, which is not so good then noop normally outperforms, but I'll talk this in other posts.

Linux I/O scheduler also has a functionality to sort incoming I/O requests in its request-queue for optimization. Queue size is configurable. Queue size is defined as "/sys/block/sdX/queue/nr_requests" then you can change queue length as follows.

# cat /sys/block/sda/queue/nr_requests
# echo 100000 > /sys/block/sda/queue/nr_requests

Changing queue size is even effective for noop scheduler.

Here are benchmarking results about changing i/o scheduler queue size for MyISAM insert-intensive loads. Detailed are written in the slides at the UC.

Apparently increasing queue size was very helpful for HDD, but not helpful for SSD.
Let me explain about backgrounds.

On Linux side, I/O requests are handled by the following order:
system calls(pwrite/etc)
-> Filesystem
-> I/O scheduler
-> Device Driver/Disks

I/O scheduler sorts incoming I/O requests by logical block addresses, then sending them to a device driver.
I/O scheduler does not depend on storage devices so is helpful for some parts (i.e. minimizing disk seek overheads), not helpful for other parts (i.e. minimizing disk rotation overheads).

On the other hand, Command Queuing (TCQ/NCQ) also sorts I/O requests for optimization. SAS and recent SATA II disks support command queuing. The goal is partly duplicate from I/O scheduler. But TCQ can minimize not only disk seeks but also disk rotation overhead (See the link to wikipedia). The disadvantage of TCQ is that queue size is very limited (normally 32-64).

Based on the above, sorting almost all random I/O requests on I/O sheculer then sending them to TCQ would be nice.

Suppose 100,000 random read I/O requests are coming.
When I/O scheduler queue size is 128 (default in many cases), TCQ gets I/O requests by almost random order, so pretty high disk seek overhead happens for each action (requests within single queue is dispersed).

When I/O scheduler queue size is 100,000, TCQ gets I/O requests by fully sorted order, so seek overhead can be much smaller.

Increasing queue size does not have any effect on SSD because no disk seek happens.

This would explain my benchmarking results.

I/O scheduler queue size settings is not helpful for InnoDB because InnoDB internally sorts I/O requests to optimize disk seek overheads, and sending limited number of i/o requests controlled by InnoDB internal i/o threads. So the role is duplicate between InnoDB itself and I/O scheduler queue. Note that TCQ improves InnoDB throughput because disk rotation overheads are significantly reduced and such optimizations can not be done from application/kernel side.

MyISAM does nothing special (highly depending on OS) so this helps.

Updated in Apr 30: Added detailed benchmark conditions for people who are interested..

Here is a test script. I ran a single-threaded stored procedure on a same machine.
create table aa (id int auto_increment primary key,
b1 int,
b2 int,
b3 int,
c varchar(100),
index(b1), index(b2), index(b3)) engine=myisam;

drop procedure sp_aa;
delimiter //
create procedure sp_aa(IN count INTEGER)
DECLARE time_a, time_b BIGINT DEFAULT 0;
WHILE done != 1 DO
insert into aa values (i,rand()*count,rand()*count,rand()*count,repeat(1,40));
SET i = i + 1;
IF i % 1000000 = 1 THEN
SELECT unix_timestamp() into time_a from dual;
SELECT i, from_unixtime(time_a), time_a - time_b from dual;
SET time_b = time_a;
IF i > count THEN
SET done = 1;
delimiter ;

mysql test -vvv -e "call sp_aa(300000000)"

Then wait for a long long time...
# Default insert. no insert-delayed, no disable-keys, no delay-key-write, no mmap

H/W, OS, MySQL settings
Sun Fire X4150
CPU: Intel Xeon, 8 cores
RAM: 32GB (but limit filesystem cache size up to 5GB, no swapping happened)
HDD: SAS 15,000RPM, 2 disks, RAID 1, write cache enabled
SSD: Intel X25-E, Single drive, write cache enabled
OS: RedHat Enterprise Linux 5.3 (2.6.18-128.el5)
Filesystem: ext3
I/O Scheduler: deadline
MySQL 5.1.33
key_buffer_size: 2G

i/o stats:
queue size=128 (default)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.01 0.00 0.08 24.69 0.00 75.22

Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 0.00 0.87 0.00 575.60 0.00

wMB/s avgrq-sz avgqu-sz await svctm %util
2.25 8.01 142.44 247.03 1.74 100.00
(At running 12 hours, 13 mil rows were inserted)

queue size=100000
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
2.06 0.00 5.32 29.66 0.00 62.96

Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 0.00 2487.33 0.00 2042.60 0.00

wMB/s avgrq-sz avgqu-sz await svctm %util
35.11 35.20 84402.36 43582.33 0.49 100.01
(At running 1.5 hours, 41 mil rows were inserted)

avg-cpu: %user %nice %system %iowait %steal %idle
0.07 0.00 0.19 24.82 0.00 74.91

Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 0.00 9.03 1.70 756.03 0.01

wMB/s avgrq-sz avgqu-sz await svctm %util
5.56 15.04 31981.72 127560.45 1.32 100.00
(At running 12 hours, 77 mil rows were inserted,
index size was 4.5 GB)

If running many more hours(days) so that index size exceeds filesystem cache, a lot of disk reads will happen. (I didn't have time to run that test before UC)

Friday, April 24, 2009

Mastering the Art of Indexing - slides available

At the MySQL Conference & Expo 2009, I did a presentation "Mastering the Art of Indexing" , and now you can get the slides from MySQL Conference site .
I was excited that so many people attended to my session. I hope the session will help you to improve indexing techniques. Feedbacks are welcome.

I was asked from some people at the conference where my blog is. I am mostly publishing MySQL technical information in Japanese, but now I believe it's a good time to start publishing in English. I plan to post mainly benchmarking & performance analysis, high availability solutions, internationalization (which sometimes people have problems), etc.