Thursday, October 10, 2013

Making full table scan 10x faster in InnoDB

At MySQL Connect 2013, I talked about how we used MySQL 5.6 at Facebook, and explained some of new features we added to our Facebook MySQL 5.6 source tree. In this post, I'm going to talk about how we made full table scan faster in InnoDB.

Faster full table scan in InnoDB

 In general, almost all queries from applications are using indexes, and reading very few rows (0..1 on primary key lookups and 0..hundreds on range scans). But sometimes we run full table scans. Typical full table scan examples are logical backups (mysqldump) and online schema changes (SELECT ... INTO OUTFILE).

 We take logical backups by mysqldump at Facebook. As you know MySQL offers both physical and logical backup commands/utilities. Logical backup has some advantages against physical backup. For example:
  • Logical backup size is much smaller. 3x-10x size difference is not uncommon.
  • Relatively easier to parse backups. On physical backup, if we can't restore a database by some serious reasons such as checksum failure, it is very difficult to dig into InnoDB internal data structures and fix corruptions. We trust logical backups rather than physical backups
 Major drawbacks of the logical backup are that full backup and full restore are much slower than physical backup/restore.

 The slowness of full logical backup often causes problems. It may take very long time if database size grows a lot and tables are fragmented. At Facebook, we suffered from mysqldump performance problem that we couldn't finish full logical backup within reasonable amount of time on some HDD and Flashcache based servers. We knew that InnoDB wasn't efficient at full table scan because InnoDB did not actually do sequential reads, but did random reads mostly. This was a known issue for long years. As database storage capacity has been growing, the slow full table scan issue has been getting serious to us. So we decided to enhance InnoDB to do faster sequential reads. Finally our Database Engineering team implemented "Logical Readahead" feature in InnoDB. With logical readahead, our full table scan speed improved 9~10 times than before under usual production workloads. Under heavy production workloads, full table scan speed became 15~20 times faster.

Issues of full table scan on large, fragmented tables

 When doing full table scan, InnoDB scans pages and rows by primary key order. This applies to all InnoDB tables, including fragmented tables. If primary key pages (pages where primary keys and rows are stored) are not fragmented, full table scan is pretty fast because reading order is close to physical storage order. This is similar to reading files by OS command (dd/cat/etc) like below.
# dd if=/data/mysql/dbname/large_table.ibd of=/dev/null bs=16k iflag=direct
 You may find that even on commodity HDD servers, you can read more than 100MB/s multiplied by "number of drives" from disks. Over 1GB/s is not uncommon.

 Unfortunately, in many cases primary key pages are fragmented. For example, if you need to manage user_id and object_id mappings, primary key will be (user_id, object_id). Insert ordering does not match with user_id ordering, so new inserts/updates very often cause page splits. New split pages will be allocated at far from current pages. This means pages get fragmented.

 If primary key pages are fragmented, full table scan becomes very slow. Fig 1 illustrates the problem. After InnoDB reads leaf page #3, it has to read page #5230, and after that it has to read page #4. Page #5230 is far from page #3 and #4, so disk read ordering becomes almost random, not sequential. It is very well known that random reads on HDD is much slower than sequential reads. One very effective approach to improve random reads is using SSD, but per-GB cost on SSD is still more expensive than HDD so using SSD is not always possible.


Fig 1. Full table scan is not actually doing sequential reads


Does Linear Read Ahead really help?

 InnoDB supports prefetching feature called "Linear Read Ahead". With linear read ahead, InnoDB reads an extent (64 consecutive pages: 1MB if not compressed) at one time if N pages are accessed sequentially (N can be configured by innodb_read_ahead_threshold parameter, default is 56). But actually this does not help so much. One extent (64 pages) is very small range. For most large fragmented tables, next page does not exist in the same extent. See above fig 1 for example. After reading page#3, InnoDB needs to read page#5230. Page#3 does not exist in the same extent as #5230, so linear read ahead won't help here. This is pretty common case for large fragmented tables. That's why Linear read ahead does not help much to improve full table scan performance.

Optimization approach 1: Physical Read Ahead

 As described above, the problem of slow full table scan was because InnoDB did mostly random reads. To make it faster, making InnoDB do sequential reads was needed. The first approach I came up with was creating an UDF (Use Defined Function) to read ibd file (InnoDB data file) sequentially. After executing the UDF, pages in the ibd file should be within InnoDB buffer pool, so no random read happens when doing full table scan. Here is an example usage.
mysql> SELECT buf_warmup ("db1", "large_table"); /* loading into buf pool */
mysql> SELECT * FROM large_application_table; /* in-memory select */
 buf_warmup() is a udf that reads entire ibd file of database "db1", table "large_table". It takes time to read the entire ibd file from disk, but reads are sequential so much faster than random reads. When I tested, I could get ~5x overall faster time than normal linear read ahead.
 This proved that sequentially reading ibd files helped to improve throughput, but there were a couple of disadvantages:
  • If table size is bigger than InnoDB buffer pool size, this approach does not work
  • Reading entire ibd file means that not only primary key pages, but also secondary index pages and unused pages have to be read and cached into InnoDB buffer pool, even though only primary key pages are needed for full table scan. This is very inefficient if you have many secondary indexes.
  • Applications have to be changed to call UDF.
 This looked "good enough" solution, but our database engineering team came up with a better solution called "Logical Read Ahead", so we didn't choose UDF approach.

Logical Read Ahead

 Logical Read Ahead (LRA) works as below.
  1. Reading many branch pages of the primary key
  2. Collecting leaf page numbers
  3. Reading many (configurable amount of) leaf pages by page number order (mostly sequential disk reads)
  4. Reading rows by primary key order (same as usual full table scan, but buffer pool hit rate should be very high)
 This is illustrated at fig 2 below.


Fig 2: Logical Read Ahead

 Logical Read Ahead (LRA) solved issues of Physical Read Ahead. LRA enables InnoDB to read only primary key pages (not reading seconday index pages), and to prefetch configurable number of pages (not entire table) at one time. And LRA does not require any SQL syntax changes.
 We added two new session variables to make LRA work. One is "innodb_lra_size" to control how many MBs to prefetch leaf pages. The other is "innodb_lra_sleep" session variable to control how many milliseconds to sleep per prefetch. We tested around 512MB ~ 4096MB prefetch size and 50 milliseconds sleep, and so far we haven't encountered any serious (such as crash/stall/inconsistency) problem. These session variables should be set only when doing full table scan. In our case, mysqldump and some utility scripts (i.e. online schema change tool) turn logical read ahead on.

Submitting multiple async I/O requests at once

 Another performance issue we noticed was that i/o unit size in InnoDB was only one page, even if doing readahead. 16KB i/o unit size is too small for sequential reads, and much less efficient than larger I/O unit size.
 In 5.6, InnoDB uses Linux Native I/O (aio) by default. If submitting multiple consecutive 16KB read requests at once, Linux internally can merge requests and reads can be done efficiently. But unfortunately InnoDB submitted only one page i/o request at once. I filed a bug report #68659. As written in the bug report, on modern HDD RAID 1+0 environment, I could get more than 1000MB/s disk reads by submitting 64 consecutive pages requests at once, while I got only 160MB/s disk reads by submitting one page request.
 To make Logical Read Ahead work better on our environments, we fixed this issue. On our MySQL, InnoDB submits many more page i/o requests before calling io_submit().

Benchmark

In both cases, our production tables (fragmented tables) were used.

1. Full table scan on pure HDD (basic benchmark, no other workload)
Table sizeWithout LRAWith LRAImprovement
10GB10 min 55.00 sec1 min 15.20 sec8.71x
50GB52 min 44.42 sec6 min 36.98 sec7.97x

2. Online schema change under heavy workload
Table sizeWithout LRAWith LRAImprovement
1GB7 min 33 sec24 sec18.8x
* dump time only, not counting data loading time

Source code

 All of our enhancements are available at GitHub.
 - Logical read ahead implementation : diff
 - Submitting multiple i/o requests at once : diff
 - Enabling logical read ahead on mysqldump : diff

 

Conclusion

 InnoDB was not efficient for full table scan, so we fixed it. We did two enhancements, one was implementing logical read ahead, the other was submitting multiple async read i/o requests at once. We have seen 8 to 18 times performance improvements on our production tables, and this has been very helpful to reduce our backup time, schema change time, etc. I hope this feature will be supported in InnoDB by Oracle officially, or at least by major MySQL fork products.


51 comments:

Clay Ferguson said...

Dude did you just implement a Bayer Tree in database tables ?

sunbains said...

The second diff doesn't look right, see:

+ errno = -total_submitted;

Earlier in the code you have:

+ total_submitted += io_submit(array->aio_ctx[i], count,
+ &(array->pending[iocb_index]));

What if io_submit() fails after the 1st iteration?

Sunny Bains said...

In the third diff you have:

+#ifdef LINUX_NATIVE_AIO
+ os_aio_linux_dispatch_read_array_submit();
+#endif

What happens if LINUX_NATIV_AIO is not defined? Did I miss the IO call?

Rongrong Zhong said...

Sunny: You are right, it won't set the errno correctly if io_submit fail in later iterations. Thanks for pointing that out!

Tom K said...

Great article! Really impressed with Sunny's comment too.

Tim Callaghan said...

Yoshinori: How does this improvement in full table scan performance for your mysqldumps affect the other running workload on the server? I assume that bringing the data 18x faster into the InnoDB cache would have a measurable impact.

Justin Swanhart said...

Tim that is what innodb_old_blocks_time and innodb_old_blocks_pct are for. They are meant to prevent FTS from "blowing out" good blocks in the BP.

Justin Swanhart said...

That being said, buffer pool pinning is probably something else that should be added. This would keep some objects in the BP no matter what, like important lookup tables.

Tim Callaghan said...

Justin, fair enough. But I was hoping to see some actual evidence in either direction from Yoshinori. I don't trust anything without measurement.

Yoshinori Matsunobu said...

Tim, we set innodb_old_blocks_time and innodb_old_blocks_pct so that pages read by backups are quickly evicted.

Another important point I should note is that we use flashcache, and we do some optimizations here. In general, cache miss in InnoDB is not so much painful if cache hits in flashcache, because flash is fast. It's very painful if cache misses in flashcache, because hdd is slow. On backups, when reading ibd file from hdd (cache miss), our mysqld does not write to flash, but directly caches into buffer pool. This means existing cached pages in flash(flashcache) are not evicted. This is more important optimization for us.

Tim Callaghan said...

Thanks for the explanation. Well done on this clever optimization!

Rock said...

I Agree with this post, we have best option to buy twitter followers. get more twitter followers cheap from here.
purchase twitter followers

Web Designing Bangalore said...
This comment has been removed by the author.
Web Designing Bangalore said...

Never the less ,you efforts are very marvelous , wonderful. Thanks for useful information.
Website Development Company in Bangalore

zinavo shankar said...

Thanks for sharing the information with us.Its really good information.
Web Designing Bangalore

Fred Duffler said...

I'd like to thank you for tables and figures. http://online-essay-writer.org/ It was very useful paper for me indeed.

Manoranjan Sahoo said...

Nice tutorial, thanks for share this post. I hope peoples are really help after reading this blog.
Website designing company India

Manoj Kusshwaha said...

Nice article got a good knowledge you have

Matthew Corgan said...

Have you looked into the possibility of actually defragmenting the table, either in background threads or during writes?

rakesh said...

Great Information,thanks a lot for sharing this interesting article,really its very impressive..
web designing company

Sudha S said...

very informative article, thank you for sharing this great article...

Care

Sagheer said...

Online Jobs of Data Entry, Copy Pasting, Add Posting, Clicking, Web Surfing, Website Visiting, Article Sharing, Data Sharing, Google Business Plans, Investment Plans, Genuine earnings from home.
www.jobzcorner.com

Sagheer said...

All Kind of Current Affairs, Latest Hot News updates, Classified Sites, Pakistani Classified Sites, USA Classifieds, Indian Classifieds, Entertainment Articles, Entertainment News, Entertainment Pictures, Bollywood, Hollywood and Lollywood Pictures and Videos, Entertainment Latest updates, Hot Entertainment News and Pictures Funny Entertainment Pictures, lol Pictures, Funny Pictures and Much More Fun Only on 1 Current Affairs Network
hotcurrentaffairs.com

Faheem Zia said...

For All Latest Hot Current Affairs
www.hotcurrentaffairs.com

Web Design Company said...

Informative blog post. Website is the key to present business in internet.

Web Designing Companies in Bangalore | Web Development Company in Bangalore

chaithanya m said...
This comment has been removed by the author.
chaithanya m said...

B school in Bangalore enriches the reserved as entire widening of the learners by portion them put up the speculative acquaintance learnt in class into a waged environment

chaithanya m said...

Sherwani for men in Bangalore is grouping of royal sophistication and perfect style with clean cuts, fine craftsmanship and elaborate thread work.

sherwani for men in bangalore

chaithanya m said...

Mind Bird Solutions is a proven website design and Development company that is customer centric, is focused on long term stability, and has a leadership team of seasoned web services professionals..
web application development bangalore

chaithanya m said...

B school in Bangalore is sponsoring by IZee Education Trust. B school in Bangalore is assisted by IZee Manpower in India's Leading acceptance And Working out association........
b school in bangalore

Ahsan Afsar said...

World Most Popular and Top Amazing Speedy Cars
TopAmazingCars.BlogSpot.Com

muqtiyar ahmed said...

associated mostly with weddings ................

Sherwanis in Bangalore

muqtiyar ahmed said...
This comment has been removed by the author.
muqtiyar ahmed said...

guarantee that web application


web design bangalore

muqtiyar ahmed said...

will grow up the leaps and bounds

http://mindbirdsolutions.com/Web-Design-Bangalore.aspx

muqtiyar ahmed said...

very imperative that the clothes are easy enough to be worn and provide highlighting with the most elegant and comfortable form of wedding clothing..............


http://www.sudarshansilk.com/men-sherwani-shopping-for-marriage/online-men-sherwani-for-marriage-shopping-india.html

chaithanya m said...

Upgrade consumers with inexperienced help every step of the way from design and to development………….
Website design Bangalore

chaithanya m said...

to satisfy client with a creative website to meet his business goals
Web design Bangalore

chaithanya m said...

empower the online presence of a particular product or service in more identifiable way
Web designing Bangalore

chaithanya m said...

Creating affordable and professional website designs
Website designing Bangalore

chaithanya m said...

Achieve an online presence that delivers the very best
Web application development

chaithanya m said...

ease of use of Realtycompass.
Real estate Bangalore

Ahsan Afsar said...

Online Jobs of Data Entry and Copy Pasting with Jobz Corner
JobzCorner.com

muqtiyar ahmed said...

Nitesh Chelsea Grand entrance gate........

Nitesh Chelsea

muqtiyar ahmed said...

Top mba in bangalore entrance examination.............

Top Mba in Bangalore

muqtiyar ahmed said...

Bring the immediacy in career prospects to students……
Top mba in Bangalore

muqtiyar ahmed said...

Provide decorate and fluting is kind you look very stylish and graceful….
Online sherwani buy

muqtiyar ahmed said...

From fabrics ranging from brocade to terry wool…
Online shopping for marriage sherwani

muqtiyar ahmed said...

Provide highlighting with the most elegant and comfortable form of wedding clothing…
Sherwani for men

muqtiyar ahmed said...

Organism a home landlord is one of the most looked-for dreams of waged professionals….
Prestige kew garden

muqtiyar ahmed said...

Give a solid foundation, hassle-free implementation, and provide maximal extensibility.
Prestige kew garden

Post a Comment