Yoshinori Matsunobu's blog: October 2013

At MySQL Connect 2013, I talked about how we used MySQL 5.6 at Facebook, and explained some of new features we added to our Facebook MySQL 5.6 source tree. In this post, I'm going to talk about how we made full table scan faster in InnoDB.

Faster full table scan in InnoDB

In general, almost all queries from applications are using indexes, and reading very few rows (0..1 on primary key lookups and 0..hundreds on range scans). But sometimes we run full table scans. Typical full table scan examples are logical backups (mysqldump) and online schema changes (SELECT ... INTO OUTFILE).

We take logical backups by mysqldump at Facebook. As you know MySQL offers both physical and logical backup commands/utilities. Logical backup has some advantages against physical backup. For example:

Logical backup size is much smaller. 3x-10x size difference is not uncommon.
Relatively easier to parse backups. On physical backup, if we can't restore a database by some serious reasons such as checksum failure, it is very difficult to dig into InnoDB internal data structures and fix corruptions. We trust logical backups rather than physical backups

Major drawbacks of the logical backup are that full backup and full restore are much slower than physical backup/restore.

The slowness of full logical backup often causes problems. It may take very long time if database size grows a lot and tables are fragmented. At Facebook, we suffered from mysqldump performance problem that we couldn't finish full logical backup within reasonable amount of time on some HDD and Flashcache based servers. We knew that InnoDB wasn't efficient at full table scan because InnoDB did not actually do sequential reads, but did random reads mostly. This was a known issue for long years. As database storage capacity has been growing, the slow full table scan issue has been getting serious to us. So we decided to enhance InnoDB to do faster sequential reads. Finally our Database Engineering team implemented "Logical Readahead" feature in InnoDB. With logical readahead, our full table scan speed improved 9~10 times than before under usual production workloads. Under heavy production workloads, full table scan speed became 15~20 times faster.

Issues of full table scan on large, fragmented tables

When doing full table scan, InnoDB scans pages and rows by primary key order. This applies to all InnoDB tables, including fragmented tables. If primary key pages (pages where primary keys and rows are stored) are not fragmented, full table scan is pretty fast because reading order is close to physical storage order. This is similar to reading files by OS command (dd/cat/etc) like below.

# dd if=/data/mysql/dbname/large_table.ibd of=/dev/null bs=16k iflag=direct

You may find that even on commodity HDD servers, you can read more than 100MB/s multiplied by "number of drives" from disks. Over 1GB/s is not uncommon.

Unfortunately, in many cases primary key pages are fragmented. For example, if you need to manage user_id and object_id mappings, primary key will be (user_id, object_id). Insert ordering does not match with user_id ordering, so new inserts/updates very often cause page splits. New split pages will be allocated at far from current pages. This means pages get fragmented.

If primary key pages are fragmented, full table scan becomes very slow. Fig 1 illustrates the problem. After InnoDB reads leaf page #3, it has to read page #5230, and after that it has to read page #4. Page #5230 is far from page #3 and #4, so disk read ordering becomes almost random, not sequential. It is very well known that random reads on HDD is much slower than sequential reads. One very effective approach to improve random reads is using SSD, but per-GB cost on SSD is still more expensive than HDD so using SSD is not always possible.

Fig 1. Full table scan is not actually doing sequential reads

Does Linear Read Ahead really help?

InnoDB supports prefetching feature called "Linear Read Ahead". With linear read ahead, InnoDB reads an extent (64 consecutive pages: 1MB if not compressed) at one time if N pages are accessed sequentially (N can be configured by innodb_read_ahead_threshold parameter, default is 56). But actually this does not help so much. One extent (64 pages) is very small range. For most large fragmented tables, next page does not exist in the same extent. See above fig 1 for example. After reading page#3, InnoDB needs to read page#5230. Page#3 does not exist in the same extent as #5230, so linear read ahead won't help here. This is pretty common case for large fragmented tables. That's why Linear read ahead does not help much to improve full table scan performance.

Optimization approach 1: Physical Read Ahead

As described above, the problem of slow full table scan was because InnoDB did mostly random reads. To make it faster, making InnoDB do sequential reads was needed. The first approach I came up with was creating an UDF (Use Defined Function) to read ibd file (InnoDB data file) sequentially. After executing the UDF, pages in the ibd file should be within InnoDB buffer pool, so no random read happens when doing full table scan. Here is an example usage.

mysql> SELECT buf_warmup ("db1", "large_table"); /* loading into buf pool */
mysql> SELECT * FROM large_application_table; /* in-memory select */

buf_warmup() is a udf that reads entire ibd file of database "db1", table "large_table". It takes time to read the entire ibd file from disk, but reads are sequential so much faster than random reads. When I tested, I could get ~5x overall faster time than normal linear read ahead.
This proved that sequentially reading ibd files helped to improve throughput, but there were a couple of disadvantages:

If table size is bigger than InnoDB buffer pool size, this approach does not work
Reading entire ibd file means that not only primary key pages, but also secondary index pages and unused pages have to be read and cached into InnoDB buffer pool, even though only primary key pages are needed for full table scan. This is very inefficient if you have many secondary indexes.
Applications have to be changed to call UDF.

This looked "good enough" solution, but our database engineering team came up with a better solution called "Logical Read Ahead", so we didn't choose UDF approach.

Logical Read Ahead

Logical Read Ahead (LRA) works as below.

Reading many branch pages of the primary key
Collecting leaf page numbers
Reading many (configurable amount of) leaf pages by page number order (mostly sequential disk reads)
Reading rows by primary key order (same as usual full table scan, but buffer pool hit rate should be very high)

This is illustrated at fig 2 below.

Fig 2: Logical Read Ahead

Logical Read Ahead (LRA) solved issues of Physical Read Ahead. LRA enables InnoDB to read only primary key pages (not reading seconday index pages), and to prefetch configurable number of pages (not entire table) at one time. And LRA does not require any SQL syntax changes.
We added two new session variables to make LRA work. One is "innodb_lra_size" to control how many MBs to prefetch leaf pages. The other is "innodb_lra_sleep" session variable to control how many milliseconds to sleep per prefetch. We tested around 512MB ~ 4096MB prefetch size and 50 milliseconds sleep, and so far we haven't encountered any serious (such as crash/stall/inconsistency) problem. These session variables should be set only when doing full table scan. In our case, mysqldump and some utility scripts (i.e. online schema change tool) turn logical read ahead on.

Submitting multiple async I/O requests at once

Another performance issue we noticed was that i/o unit size in InnoDB was only one page, even if doing readahead. 16KB i/o unit size is too small for sequential reads, and much less efficient than larger I/O unit size.
In 5.6, InnoDB uses Linux Native I/O (aio) by default. If submitting multiple consecutive 16KB read requests at once, Linux internally can merge requests and reads can be done efficiently. But unfortunately InnoDB submitted only one page i/o request at once. I filed a bug report #68659. As written in the bug report, on modern HDD RAID 1+0 environment, I could get more than 1000MB/s disk reads by submitting 64 consecutive pages requests at once, while I got only 160MB/s disk reads by submitting one page request.
To make Logical Read Ahead work better on our environments, we fixed this issue. On our MySQL, InnoDB submits many more page i/o requests before calling io_submit().

Benchmark

In both cases, our production tables (fragmented tables) were used.

1. Full table scan on pure HDD (basic benchmark, no other workload)

Table size	Without LRA	With LRA	Improvement
10GB	10 min 55.00 sec	1 min 15.20 sec	8.71x
50GB	52 min 44.42 sec	6 min 36.98 sec	7.97x

2. Online schema change under heavy workload

Table size	Without LRA	With LRA	Improvement
1GB	7 min 33 sec	24 sec	18.8x

* dump time only, not counting data loading time

Source code

All of our enhancements are available at GitHub.
- Logical read ahead implementation : diff
- Submitting multiple i/o requests at once : diff
- Enabling logical read ahead on mysqldump : diff

Conclusion

InnoDB was not efficient for full table scan, so we fixed it. We did two enhancements, one was implementing logical read ahead, the other was submitting multiple async read i/o requests at once. We have seen 8 to 18 times performance improvements on our production tables, and this has been very helpful to reduce our backup time, schema change time, etc. I hope this feature will be supported in InnoDB by Oracle officially, or at least by major MySQL fork products.

Yoshinori Matsunobu's blog

Thursday, October 10, 2013

Making full table scan 10x faster in InnoDB