<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3043493300793589377</id><updated>2012-01-29T02:53:14.870+09:00</updated><category term='linux'/><category term='handlersocket'/><category term='5.5'/><category term='HDD'/><category term='mysql'/><category term='SSD'/><category term='high availability'/><category term='mha'/><category term='mysqlconf'/><category term='nosql'/><category term='storage'/><category term='conference'/><category term='replication'/><category term='partitioning'/><category term='sharding'/><category term='oracle'/><category term='InnoDB'/><title type='text'>Yoshinori Matsunobu's blog</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>24</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-7423732978444881746</id><published>2012-01-09T10:40:00.002+09:00</published><updated>2012-01-09T10:40:09.602+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='mha'/><title type='text'>MHA for MySQL 0.53 released</title><content type='html'>&amp;nbsp;MHA for MySQL (Master High Availability Manager and tools for MySQL) version 0.53 has been released. Downloads are available &lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;Here are &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes"&gt;release notes&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;New features:&lt;br /&gt;* Supporting RESET SLAVE ALL from MySQL 5.5.16&lt;br /&gt;* Supporting "skip_reset_slave" parameter to avoid running CHANGE MASTER TO on the promoted slave&lt;br /&gt;* Doing master's health checks optionally via MySQL CONNECT, in addition to SELECT&lt;br /&gt;* Supporting "mysql --binary-mode" from MySQL 5.6.3&lt;br /&gt;* Supporting ssh_host and ssh_port parameters&lt;br /&gt;* Supporting ssh_options parameter&lt;br /&gt;* Supporting --check_only for online master switch (dry-run)&lt;br /&gt;* When doing online master switch, MHA checks whether long queries are running on the new master. This is important to reduce workloads on the new master. Query time limit can be controlled via --running_seconds_limit.&lt;br /&gt;When executing SIGINT on online master switch, MHA tries to disconnect established connections via MHA.&lt;br /&gt;* Additionally checking replication filtering rules on online master switch&lt;br /&gt;&lt;br /&gt;Bug fixes:&lt;br /&gt;* MHA Manager looks for relay-log.info in wrong location when setting relay_log_info_file &lt;br /&gt;* Wrong option for master_ip_failover_script&lt;br /&gt;* Timeout settings for SSH connection health check does not always work&lt;br /&gt;* Modifying a rpm spec file to create valid rpm package for 64bit RHEL6&lt;br /&gt;* Forcing more strict ssh checking. Originally MHA checks master's reachability by just connecting via SSH and exiting with return code 0. This in some cases does not work especially if SSH works but data files are not accessible. In this fix, MHA checks master's ssh reachability by executing save_binary_logs command (dry run). MHA Client also needs to be updated to 0.53.&lt;br /&gt;* Zombie process remains on master ping timeout&lt;br /&gt;* Do not execute SET GLOBAL read_only=(0|1) if not needed&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How to upgrade&lt;/h3&gt;&lt;br /&gt;&amp;nbsp;Upgrading MHA from older version to 0.53 is easy. Simply stop masterha_manager program, replace all MHA Node and MHA Manager packages to 0.53, and restart masterha_manager. You do not need to stop mysqld or MySQL Replication.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;MHA for MySQL, project Info&lt;/h3&gt;&lt;br /&gt; Project top page: &lt;a href="http://code.google.com/p/mysql-master-ha/"&gt;http://code.google.com/p/mysql-master-ha/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Documentation: &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6"&gt;http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Source tarball, deb and rpm packages (stable release): &lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;http://code.google.com/p/mysql-master-ha/downloads/list&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; The latest source repository (development tree, url changed): &lt;a href="https://github.com/yoshinorim/mha4mysql-manager"&gt;https://github.com/yoshinorim/mha4mysql-manager&lt;/a&gt; (Manager source) and &lt;a href="https://github.com/yoshinorim/mha4mysql-node"&gt;https://github.com/yoshinorim/mha4mysql-node&lt;/a&gt; (Per-MySQL server source)&lt;br /&gt;&lt;br /&gt; Commercial support for MHA is available from &lt;a href="http://www.skysql.com/"&gt;SkySQL&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-7423732978444881746?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/7423732978444881746/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2012/01/mha-for-mysql-053-released.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/7423732978444881746'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/7423732978444881746'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2012/01/mha-for-mysql-053-released.html' title='MHA for MySQL 0.53 released'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-1433769035266948395</id><published>2011-10-25T03:35:00.000+09:00</published><updated>2011-10-25T03:35:34.954+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><title type='text'>Making slave pre-fetching work better with SSD</title><content type='html'>In the recent few weeks I have spent some time for creating yet another slave prefetching tool named "Replication Booster (for MySQL)", written in C/C++ and using &lt;a href="http://www.oscon.com/oscon2011/public/schedule/detail/18785"&gt;Binlog API&lt;/a&gt;. Now I'm happy to say that I have released &lt;a href="https://github.com/yoshinorim/replication-booster-for-mysql"&gt;an initial version at GitHub repository&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The objective of Replication Booster is same as &lt;a href="http://www.maatkit.org/doc/mk-slave-prefetch.html"&gt;mk-slave-prefetch&lt;/a&gt;: avoiding or reducing replication delay under disk i/o bound workloads. This is done by prefetching relay logs events, converting to SELECT, and executing SELECT before SQL Thread executes the events. Then SQL thread can be much faster because target blocks are already cached.&lt;br /&gt;&lt;br /&gt;On my benchmarking environment Replication Booster works pretty well.&lt;br /&gt;&lt;br /&gt;On HDD bound benchmarks(update by pk), SQL thread's qps was:&lt;br /&gt;- Normal Slave (without prefetch):      400 update/s&lt;br /&gt;- With prefetch:                      1,500 update/s&lt;br /&gt;&lt;br /&gt;On SSD (SAS SSD) bound benchmarks(update by pk), SQL thread's qps was:&lt;br /&gt;- Normal Slave:  1,780 update/s&lt;br /&gt;- With prefetch: 5,400 update/s&lt;br /&gt;&lt;br /&gt;It is great that slave could handle many more updates per second without replication delay on disk i/o bound benchmarks. It works very well on SSD, too. The below is a graph that shows Seconds_Behind_Master is gradually decreasing by using Replication Booster.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-mQ6mjg2rgyI/TqLl8OA7IkI/AAAAAAAAAHU/WMYSkqgCsN4/s1600/repl_delay.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 353px;" src="http://4.bp.blogspot.com/-mQ6mjg2rgyI/TqLl8OA7IkI/AAAAAAAAAHU/WMYSkqgCsN4/s600/repl_delay.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5666344103902388802" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In this benchmark I executed 4,000 updates per second on master. On the slave server, by default slave delayed continuously because the slave could handle only 1,779 updates per second. When starting using Replication Booster on the slave, the slave could execute 5,418 updates per second. This was higher than the master's qps so Seconds_Behind_Master gradually decreased. After the slave caught up with the master, the slave could execute as same volume of updates as the master (4,000 update/s), so no replication delay happened. This means on this environment we can raise maximum update traffics many more (1,780 update/s -&gt; 4,000-5,400 update/s) without investing for new H/W.&lt;br /&gt;&lt;br /&gt;I also tested on some of our production slaves (not used for services) and it showed good results, too. I could get 30-300% improvements, depending on cache hit ratio. If data was fully cached, of course I didn't get any benefit, but it didn't cause negative impacts either.&lt;br /&gt;&lt;br /&gt;I hope this tool is interesting to you.&lt;br /&gt;&lt;br /&gt;In this blog post, I'll explain backgrounds for developing this tool and basic design. I believe many more optimizations can be done in future. Your feedbacks are welcome.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Good concurrency, bad single threaded performance&lt;/h3&gt;&lt;br /&gt;I mentioned at Percona Live London that using SSD on slaves is a good practice to reduce replication delay, and SATA/SAS SSD is practical enough because unit price is much cheaper than PCI-E SSD, and SATA/SAS SSD shows not bad concurrency with many drives when using recent RAID controller(most applications actually do not need 30,000-50,000 read iops, even though running many MySQL instances on the same machine). It is certainly an advantage that many SATA/SAS drives (6-10) can be installed on 1U box.&lt;br /&gt;&lt;br /&gt;The biggest thing I'm concerned about using SATA/SAS SSD is single thread read iops. You can get only 2,000 read iops from SATA/SAS SSD with RAID controller. If you do not use RAID controller, it is not impossible to get 3,000+ read iops, but this is still much lower than using PCI-Express SSD. You can get 10,000 signle thread read iops from PCI-Express SSD.&lt;br /&gt;&lt;br /&gt;When using SATA/SAS SSD, it is easy to predict that slave delays much earlier than using PCI-E SSD. Especially if running multiple MySQL instances per single server, innodb_buffer_pool_size has to be small (i.e. 4GB-12GB), so lots of disk reads will happen. By using 6-10 SATA/SAS drives, maximum throughput can be competitive enough against PCI-Express SSD, but single thread read iops is not improved. This is an issue.&lt;br /&gt;&lt;br /&gt;"Slave prefetching" is a well known, great approach to make SQL Thread faster.&lt;br /&gt;&lt;h3&gt;What is slave prefetching?&lt;/h3&gt;&lt;br /&gt;The concept of "slave prefetching" is (I think) well known, but I briefly describe here in case you don't know..&lt;br /&gt;&lt;br /&gt;SQL Thread is single threaded. When SQL thread has to do lots of disk i/o by itself, replication is easily delayed. In almost all cases of slave lagging, I/O thread has received all binary log events (and saved as relay logs), but SQL thread delays execution due to massive random disk i/o. So there are many relay log events from SQL thread's current position (Relay_Log_Pos) to the end of relay log (EOF of relay logs).&lt;br /&gt;&lt;br /&gt;Random disk reads happen when target blocks(records/indexes) are not cached. If they are cached, random reads won't happen. If you can cache all entries before SQL Thread executes, SQL thread does not have to do random disk reads. Then SQL thread can be much faster. &lt;br /&gt;&lt;br /&gt;How can you do that? Read relay logs before SQL Thread executes, covert DML statements (especially UPDATE) to SELECT, then execute SELECT on the slave in parallel.&lt;br /&gt;&lt;br /&gt;I believe this concept was introduced to MySQL community by Paul Tackfield at YouTube 4-5 years ago. mk-slave-prefetch is an open source implementation.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Desire for C/C++ based, raw relay log event hanlding tool&lt;/h3&gt;&lt;br /&gt;At first I tested mk-slave-prefetch on my benchmarks. But as far as I tested, unfortunately it didn't work as I expected. I think the main reasons are as below:&lt;br /&gt;&lt;br /&gt;* mk-slave-prefetch uses mysqlbinlog to parse relay logs. But mysqlbinlog is not as flexible and fast as reading raw relay log events. For example, mysqlbinlog output events have to go through the file to the main prefetching program. mysqlbinlog is an external command line tool, so the main prefetching program has to fork a new process to run mysqlbinlog, which opens and closes relay logs every time.&lt;br /&gt;&lt;br /&gt;* mk-slave-prefetch is written in Perl. In general, a prefetching tool has to be fast enough to read, convert and execute SELECT statements before SQL thread executes. The tool has to be multi-threaded. The tool probably has to run on the same machine as MySQL slave, in order to minimize network overheads. The resource consumption (CPU and memory) should be small enough so that it doesn't hurt MySQL server performance.&lt;br /&gt; I don't believe Perl is a good programming language for developing such a tool.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I believe C/C++ is the best for programming language for this purpose. And I believe handling raw relay log events is much more efficient than using mysqlbinlog.&lt;br /&gt;&lt;br /&gt;Based on the above reasons, I decided to develop a new slave prefeching tool by myself. I had some experiences for parsing binary/relay logs when developing MHA, so at first I planned to create a simple relay log parser program. But immediately I changed my mind, and tried &lt;a href="https://code.launchpad.net/mysql-replication-listener"&gt;mysql-replication-listener (Binlog API)&lt;/a&gt;. Binlog API is a newly released utility tool from Oracle MySQL team. Binlog API has a "file driver" interface, which enables to parse binary log or relay log file and handle events one by one. By using Binlog API, handling raw binlog events becomes much easier. For example, it's easy to parse binlog events, get updated entries, store to external software such as Lucene/Hadoop, etc.&lt;br /&gt;&lt;br /&gt;Oracle says Binlog API is pre-alpha. But as far as I have tested for slave prefetching purpose, it works very well. It's fast enough, and I didn't encounter any crashing or memory leak issues. So I decided to develop a new slave prefetching tool using Binlog API.&lt;br /&gt;&lt;h3&gt;Introduction to Replication Booster for MySQL&lt;/h3&gt;&lt;br /&gt; I named the new slave prefetching tool as "Replication Booster". Keywords "slave" and "prefetch" were already used by mk-slave-prefetch, so I used different words.&lt;br /&gt;&lt;br /&gt; The below figure is a basic architecture of Replication Booster.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-qz8aKmd2bqM/TqManyAtewI/AAAAAAAAAHs/sAZ3anB_UtA/s1600/repl_booster.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 386px;" src="http://2.bp.blogspot.com/-qz8aKmd2bqM/TqManyAtewI/AAAAAAAAAHs/sAZ3anB_UtA/s600/repl_booster.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5666402026904189698" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Design notes&lt;/h4&gt;- Replication Booster is a separated tool (runs as a MySQL client). It works with normal MySQL 5.0/5.1/5.5/5.6. Starting/stopping Replication Booster is possible without doing anything on MySQL server side.&lt;br /&gt;&lt;br /&gt;- Replication Booster is written in C/C++, and using boost::regex for converting UPDATE/DELETE to SELECT. Binlog API also uses boost.&lt;br /&gt;&lt;br /&gt;- Using Binlog API to parse relay logs, not using mysqlbinlog&lt;br /&gt;&amp;nbsp;&amp;nbsp;- Using file driver, not tcp driver. file driver does not connect to MySQL server, and just reading relay log files. So even if file driver has bugs, impacts are rather limited (If it has memory leak, that's serious, but I haven't encountered yet).&lt;br /&gt;&lt;br /&gt;- Main thread parses relay log events, picking up query log events, passing to internal FIFO queues&lt;br /&gt;&amp;nbsp;&amp;nbsp;- Binlog API has an interface to get a binlog event header (event type, timestamp, server-id, etc) and an event body. So it is easy to pick up only query log events.&lt;br /&gt;&amp;nbsp;&amp;nbsp;- Parsing row based events is not supported yet. It should be worth implementing in the near future.&lt;br /&gt;&lt;br /&gt;- Multiple worker threads pop query events from queues, and convert query events to SELECT statements&lt;br /&gt;&lt;br /&gt;- A dedicated thread (monitoring thread) keeps track of current SQL Thread's position (Relay_Log_Pos)&lt;br /&gt;&lt;br /&gt;- Worker threads do not execute a SELECT statement if the query's position is behind current SQL Thread's position. This is because it's not needed (too late).&lt;br /&gt;&lt;br /&gt;- Main thread stops reading relay log events if the event's timestamp is N seconds (default 3) ahead of SQL Thread's timestamp&lt;br /&gt;&amp;nbsp;&amp;nbsp;- This is for cache efficiency. If reading too many events than needed, it causes negative impacts. In the worst case cache entries that SQL thread needs now are wiped out by newly selected blocks.&lt;br /&gt;&lt;br /&gt;- When slave is not delayed, Replication Booster should not cause negative impacts. It shouldn't use noticeable CPU/Disk/Memory resources. It shouldn't prevent MySQL server activities by holding locks, either. Of course, it shouldn't execute converted SELECT statements because they are not useful anymore. The last one is not easy to work on various kinds of environments (i.e. HDD/SSD/etc), but should be controllable by some external configuration parameters&lt;br /&gt;&lt;br /&gt;- Bugs on Replication Booster should not result in MySQL server outage.&lt;br /&gt;&lt;br /&gt;- Replication Booster works locally. It doesn't allow to connect to remote MySQL servers. This is for performance reasons. Executing tens of thousands of queries per second from this tool remotely will cause massive &lt;a href="http://yoshinorimatsunobu.blogspot.com/2011/10/testing-mysql-563-network-performance.html"&gt;fcntl() contentions&lt;/a&gt; and use high network resources (both bandwidth and CPU time). I don't like that.&lt;br /&gt;&lt;h4&gt;Configuration Parameters&lt;/h4&gt;--threads: Number of worker threads. Each worker thread converts binlog events and executes SELECT statements. Default is 10 (threads).&lt;br /&gt;&lt;br /&gt;--offset-events: Number of binlog events that main thread (relay log reader thread) skips initially when reading relay logs. This number should be high when you have faster storage devices such as SSD. Default is 500 (events).&lt;br /&gt;&lt;br /&gt;--seconds-prefetch: Main thread stops reading relay log events when the event's timestamp is --seconds-prefetch seconds ahead of current SQL thread's timestamp. After that the main thread starts reading relay logs from SQL threads's position again. If this value is too high, worker threads will execute many more SELECT statements than necessary. Default value is 3 (seconds).&lt;br /&gt;&lt;br /&gt;--millis-sleep: If --seconds-prefetch condition is met, main thread sleeps --millis-sleep milliseconds before starting reading relay log. Default is 10 milliseconds.&lt;br /&gt;&lt;br /&gt;- MySQL connection parameters: MySQL slave user, password, socket file or local ip/port&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;How to verify Replication Booster works on your environments&lt;/h4&gt;You may want to run Replication Booster where Seconds_Behind_Master is sometimes growing. If Replication Booster works as expected, you can get the following benefits.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;- Seconds_Behind_Master gets decreased, or growth rate of Seconds_Behind_Master has decreased&lt;br /&gt;&amp;nbsp;&amp;nbsp;- Update speed has improved (i.e. Com_update per second has increased) by this tool&lt;br /&gt;&lt;br /&gt;Replication Booster has some statistics variables, and prints these statistics when terminating the script (Ctrl+C) like below. If slave delays but "Executed SELECT queries" is almost zero, something is wrong.&lt;br /&gt;&lt;pre&gt;Running duration:    847.846 seconds&lt;br /&gt;Statistics:&lt;br /&gt; Parsed binlog events: 60851473&lt;br /&gt; Skipped binlog events by offset: 8542280&lt;br /&gt; Unrelated binlog events: 17444340&lt;br /&gt; Queries discarded in front: 17431937&lt;br /&gt; Queries pushed to workers: 17431572&lt;br /&gt; Queries popped by workers: 5851025&lt;br /&gt; Old queries popped by workers: 3076&lt;br /&gt; Queries discarded by workers: 0&lt;br /&gt; Queries converted to select: 5847949&lt;br /&gt; Executed SELECT queries: 5847949&lt;br /&gt; Error SELECT queries: 0&lt;br /&gt; Number of times to read relay log limit: 1344&lt;br /&gt; Number of times to reach end of relay log: 261838&lt;/pre&gt;&lt;br /&gt;I haven't spent so much time on this project yet (just started a few weeks ago). Current algorithm is simple. I believe many more optimizations can be done in future, but even so current benchmark numbers are pretty good. I hope we can use this tool on many places where we want to avoid replication delay but don't want to spend too much money for faster storage devices.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-1433769035266948395?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/1433769035266948395/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/10/making-slave-pre-fetching-work-better.html#comment-form' title='14 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1433769035266948395'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1433769035266948395'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/10/making-slave-pre-fetching-work-better.html' title='Making slave pre-fetching work better with SSD'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-mQ6mjg2rgyI/TqLl8OA7IkI/AAAAAAAAAHU/WMYSkqgCsN4/s72-c/repl_delay.png' height='72' width='72'/><thr:total>14</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-7802522842394691249</id><published>2011-10-05T04:37:00.000+09:00</published><updated>2011-10-05T04:37:59.121+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Testing MySQL 5.6.3 network performance improvements</title><content type='html'>I'm excited to see the new features in MySQL 5.6.3. Replication enhancements such as parallel SQL threads, crash safe slave and binlog checksum have been desired for years. I really appreciate that MySQL development team has released 5.6.3 in timely manner.&lt;br /&gt;&lt;br /&gt; In this blog entry, I'd like to pick up one of my most favorite performance enhancements in MySQL 5.6.3: "network performance improvements". This was reported by Mark Callaghan as &lt;a href="http://bugs.mysql.com/bug.php?id=54790"&gt;bug#54790&lt;/a&gt; over one year ago, and finally it's fixed (I assume it's mainly developed by Davi Arnaut at MySQL dev team) in 5.6.3. The root cause of this problem was that prior to 5.6.3, under high loads MySQL internally had to call fcntl() lots of times to change non-blocking sockets to blocking (with timeout and alarm) and vice versa. fcntl() is serialized on Linux, so it doesn't scale with many CPU cores. You may suffer from this performance issue when clients are not locally based (= in most cases) and query per second is quite high (hundreds of thousands or even more). In many real world environments qps is not so much high, but in the near future these numbers will not be unrealistic, with faster H/W components.&lt;br /&gt;&lt;br /&gt; I have tested some simple read-only benchmarks on 16-core and 24-core box, and results in 5.6.3 are great.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-XNc59cmmM10/Tos2RCYM-lI/AAAAAAAAAGc/YXX6zlJmzVU/s1600/5.5vs5.6_1.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="http://2.bp.blogspot.com/-XNc59cmmM10/Tos2RCYM-lI/AAAAAAAAAGc/YXX6zlJmzVU/s600/5.5vs5.6_1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659677023045220946" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-snxSHzdVc_Q/Tos2kioZO8I/AAAAAAAAAGk/N_9TQHGFx74/s1600/5.5vs5.6_2.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="http://1.bp.blogspot.com/-snxSHzdVc_Q/Tos2kioZO8I/AAAAAAAAAGk/N_9TQHGFx74/s600/5.5vs5.6_2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659677358120582082" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; These benchmarks are in-memory PK lookup from a remote client. Connections are persistent. The client server never became bottleneck. Apparently the maximum throughput of 5.6.3 is much higher than 5.5 (75% higher on 24 cores, 34% higher on 16 cores). It's interesting that 16 core vs 24 core performance difference is not so much in 5.5 (only 10%), but it's huge in 5.6 (45%). 45% improvement is close to theoretical maximum (16core to 24core: 50%), so for these kinds of benchmarks, 5.6 is really scalable at least up to 24 cores.&lt;br /&gt;&lt;br /&gt; Though many MySQL scalability issues could be mitigated by running multiple instances on a single server, you couldn't avoid this performance problem because calling fcntl() is serialized on Linux kernel side. So this performance fix is really important. With MySQL 5.6, I think we can take more scaling-up strategies rather than scaling-out.&lt;br /&gt;&lt;br /&gt; How about disk i/o bound benchmarks? On disk bound workloads, network performance improvements do not cause so much impacts of course (since qps is much lower), but as below, 5.6 showed good results compared to 5.5. These performance improvements mainly come from reducing InnoDB internal kernel mutex contentions in 5.6. I'd like to investigate more time to understand internal differences.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-SF8UAKx-dbI/Tos2obXAAuI/AAAAAAAAAGs/BOxbt4EothQ/s1600/5.5vs5.6_3.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="http://3.bp.blogspot.com/-SF8UAKx-dbI/Tos2obXAAuI/AAAAAAAAAGs/BOxbt4EothQ/s600/5.5vs5.6_3.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659677424888054498" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-EnqePNtuDHg/TotIT_qdvxI/AAAAAAAAAG8/0qyvght243Y/s1600/5.5vs5.6_4.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="http://2.bp.blogspot.com/-EnqePNtuDHg/TotIT_qdvxI/AAAAAAAAAG8/0qyvght243Y/s600/5.5vs5.6_4.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659696865065418514" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; These benchmarks were SATA SSD bound (8 drives, RAID1+0) benchmarks. InnoDB buffer pool size was 1GB and active data size was 2.5GB. 5.6 shows better results at higher concurrency.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-7802522842394691249?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/7802522842394691249/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/10/testing-mysql-563-network-performance.html#comment-form' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/7802522842394691249'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/7802522842394691249'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/10/testing-mysql-563-network-performance.html' title='Testing MySQL 5.6.3 network performance improvements'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-XNc59cmmM10/Tos2RCYM-lI/AAAAAAAAAGc/YXX6zlJmzVU/s72-c/5.5vs5.6_1.png' height='72' width='72'/><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-370780031520599702</id><published>2011-09-16T19:54:00.000+09:00</published><updated>2011-09-16T19:54:29.896+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='mha'/><title type='text'>MHA for MySQL 0.52 released</title><content type='html'>&amp;nbsp;I'm happy to announce that MHA for MySQL (Master High Availability Manager and tools for MySQL) version 0.52 has been released. Downloads are available &lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;here&lt;/a&gt;.&lt;br /&gt;&amp;nbsp;This is a stable release. I have been using MHA on our(DeNA's) cross-datacenter environments. We had tens of failover opportunities and MHA worked successfully in all cases.&lt;/br&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;Here are &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes"&gt;release notes&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;* multi-master configuration is supported&lt;br /&gt;&amp;nbsp;See &lt;a href="http://yoshinorimatsunobu.blogspot.com/2011/08/mysql-mha-support-for-multi-master.html"&gt;my previous blog entry&lt;/a&gt; for details.&lt;br /&gt;&lt;br /&gt;* Package name changed&lt;br /&gt;&amp;nbsp;To avoid potential trademark issues, I decided to change MHA package names (downloadable file names and GitHub repository name) from "MySQL-MasterHA-Manager" to "mha4mysql-manager", and from "MySQL-MasterHA-Node" to "mha4mysql-node". &lt;br /&gt;&lt;br /&gt;&amp;nbsp;- URLs of MHA development repositories on GitHub changed as below.&lt;br /&gt;&amp;nbsp;&amp;nbsp;Manager: &lt;a href="https://github.com/yoshinorim/mha4mysql-manager"&gt;https://github.com/yoshinorim/mha4mysql-manager&lt;/a&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;Node: &lt;a href="https://github.com/yoshinorim/mha4mysql-node"&gt;https://github.com/yoshinorim/mha4mysql-node&lt;/a&gt;&lt;br /&gt;&amp;nbsp;If you currently follow previous branches, please modify .git/config and point to new URLs to pull newer versions.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;- If you installed MHA node rpm packages (version 0.50) provided from our &lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;download site&lt;/a&gt;, when upgrading to newer versions, please explicitly uninstall the current rpm package (rpm -e MySQL-MasterHA-Node-0.50-...) and install newer version.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;All source codes, test cases, command line programs (i.e. masterha_manager), parameters etc are not changed. So product quality of MHA should not be affected by this change.&lt;br /&gt;&lt;br /&gt;* rpm and deb packages are now provided&lt;br /&gt;&amp;nbsp;Prior to 0.52, rpm package for MHA Manager was not provided, and deb package was not provided at all. Now both rpm and deb packages are provided for both Manager and Node(running on MySQL server).&lt;br /&gt;&amp;nbsp;See &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/Installation"&gt;how to install&lt;/a&gt; page for details.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;MHA for MySQL, project Info&lt;/h3&gt;&lt;br /&gt; Project top page: &lt;a href="http://code.google.com/p/mysql-master-ha/"&gt;http://code.google.com/p/mysql-master-ha/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Documentation: &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6"&gt;http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Source tarball, deb and rpm packages (stable release): &lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;http://code.google.com/p/mysql-master-ha/downloads/list&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; The latest source repository (development tree, url changed): &lt;a href="https://github.com/yoshinorim/mha4mysql-manager"&gt;https://github.com/yoshinorim/mha4mysql-manager&lt;/a&gt; (Manager source) and &lt;a href="https://github.com/yoshinorim/mha4mysql-node"&gt;https://github.com/yoshinorim/mha4mysql-node&lt;/a&gt; (Per-MySQL server source)&lt;br /&gt;&lt;br /&gt; Commercial support for MHA is available from &lt;a href="http://www.skysql.com/"&gt;SkySQL&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-370780031520599702?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/370780031520599702/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/09/mha-for-mysql-052-released.html#comment-form' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/370780031520599702'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/370780031520599702'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/09/mha-for-mysql-052-released.html' title='MHA for MySQL 0.52 released'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-6057628564107644426</id><published>2011-09-08T17:52:00.001+09:00</published><updated>2011-09-08T17:53:45.471+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conference'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Speaking at Oracle Open World and Percona Live London</title><content type='html'>In October, I'll speak at two big conferences - &lt;a href="http://www.oracle.com/openworld/index.html"&gt;Oracle Open World&lt;/a&gt; and &lt;a href="http://www.percona.com/live/london-2011/"&gt;Percona Live London&lt;/a&gt;. I appreciate Oracle and Percona to give me opportunities to talk. It's always fun for me to speak at overseas conferences and meet friends.&lt;br /&gt;&lt;p/&gt;&lt;br /&gt; At Oracle Open World, I have one session: "Scaling and Monitoring MySQL for Rapidly Growing Social Gaming" (here is a &lt;a href="http://technocation.org/files/doc/2011_OOW_MySQL_Content.html"&gt;timetable&lt;/a&gt;).&lt;br /&gt;&lt;p/&gt;&lt;br /&gt; At Percona Live London, I have one 3-hour tutorial and one 30-minute session .. "&lt;a href="http://www.percona.com/live/london-2011/tutorial/linux-and-hw-optimizations-for-mysql/"&gt;Linux and H/W optimizations for MySQL&lt;/a&gt;" and "&lt;a href="http://www.percona.com/live/london-2011/session/mha-introducing-automated-mysql-master-failover-solution/"&gt;MHA: Introducing automated MySQL master failover solution&lt;/a&gt;". &lt;br /&gt;&lt;p/&gt;&lt;br /&gt; I'll make sure that all of these talks will be practical enough. Basic theories and generic benchmarks will be covered since it's important to understand, but I'll more focus on how we use in production (on 1000+ MySQL servers).&lt;br /&gt;&lt;p/&gt;&lt;br /&gt; BTW, this is the first time for me to visit London and I'm looking forward to watching Premiere League, though I've heard that it's very difficult to get tickets.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-6057628564107644426?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/6057628564107644426/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/09/speaking-at-oracle-open-world-and.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/6057628564107644426'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/6057628564107644426'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/09/speaking-at-oracle-open-world-and.html' title='Speaking at Oracle Open World and Percona Live London'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-1532024166661306497</id><published>2011-08-29T23:31:00.010+09:00</published><updated>2011-08-30T09:31:10.232+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='high availability'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>MySQL-MHA: Support for multi-master configuration</title><content type='html'>&amp;nbsp;After publishing &lt;a href="http://code.google.com/p/mysql-master-ha/"&gt;MySQL MHA&lt;/a&gt; in the end of July, I received a few requests for supporting multi-master configuration. So I spent time for extending MHA for supporting multi-master, and now it's published at &lt;a href="https://github.com/yoshinorim/MySQL-MasterHA-Manager/tree/multimaster"&gt;a separated GitHub branch&lt;/a&gt; (a new development tree). I'll take some time for people to evaluate the new feature, and after verifying it's stable enough, I'll merge it to the main branch and will release as a new version (&lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;tarball release&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;The below is procedures to install MHA Manager multi-master tree.&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;$ git clone git://github.com/yoshinorim/MySQL-MasterHA-Manager.git&lt;br /&gt;$ cd MySQL-MasterHA-Manager&lt;br /&gt;$ git checkout -b multimaster origin/multimaster&lt;br /&gt;$ perl Makefile.PL&lt;br /&gt;$ make&lt;br /&gt;$ sudo make install&lt;/span&gt;&lt;/pre&gt; If you have not installed Module::Install Perl module and have difficulties to install it, you can simply copy the module from MHA Manager tarball package like below.&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;$ cd MySQL-MasterHA-Manager(from github)&lt;br /&gt;$ cp -rp /path/to/MHA-Manager-Tarball/inc ./&lt;br /&gt;$ perl Makefile.PL&lt;br /&gt;...&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;Here are some notes to make MHA work with multi-master.&lt;br/&gt;&lt;br /&gt;&lt;br/&gt;&lt;br /&gt;* Only one primary master (writable) is allowed. MySQL global variable "read-only=1" must be set on other MySQL masters.&lt;br/&gt;&lt;br /&gt;* By default, all managed servers should be in two-tier replication channel. If you want to manage three or more tier replication channel, "multi-tier-slave=1" parameter must be set in the MHA configuration file.&lt;br/&gt;&lt;br /&gt;&lt;br/&gt;&lt;br /&gt;&lt;br /&gt;Example replication settings are as follows.&lt;br/&gt;&lt;br /&gt;&lt;br/&gt;&lt;br /&gt;&lt;br /&gt;1) Simple two-tier multi-master configuration&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;  M1(host1,RW) &lt;----&gt; M2(host2,read-only)&lt;br /&gt;       |&lt;br /&gt; +-----+--------+&lt;br /&gt;S1(host3,R)    S2(host4,R)&lt;br /&gt;&lt;br /&gt;=&gt; After failover&lt;br /&gt;&lt;br /&gt;  M2(host2,RW)&lt;br /&gt;       |&lt;br /&gt; +-----+--------+&lt;br /&gt;S1(host3,R)    S2(host4,R)&lt;br /&gt;&lt;br /&gt;Configuration example:&lt;br /&gt;&lt;br /&gt;[server1]&lt;br /&gt;hostname=host1&lt;br /&gt;candidate_master=1&lt;br /&gt;&lt;br /&gt;[server2]&lt;br /&gt;hostname=host2&lt;br /&gt;candidate_master=1&lt;br /&gt;&lt;br /&gt;[server3]&lt;br /&gt;hostname=host3&lt;br /&gt;&lt;br /&gt;[server4]&lt;br /&gt;hostname=host4&lt;/span&gt;&lt;/pre&gt;On host2, SET GLOBAL read_only=1 must be set.&lt;br/&gt;&lt;br /&gt;&lt;br/&gt;&lt;br /&gt;&lt;br/&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;2) Three tier multi-master configuration&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;  M1(host1,RW) &lt;-----------------&gt; M2(host2,read-only)&lt;br /&gt;       |                                |&lt;br /&gt; +-----+--------+                       +&lt;br /&gt;S1(host3,R)    S2(host4,R)             S3(host5,R)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;=&gt; After failover&lt;br /&gt;&lt;br /&gt;          M2(host2,RW)&lt;br /&gt;                |&lt;br /&gt; +--------------+--------------------------+&lt;br /&gt;S1(host3,R)    S2(host4,R)             S3(host5,R)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Configuration example:&lt;br /&gt;[server default]&lt;br /&gt;multi_tier_slave=1&lt;br /&gt;&lt;br /&gt;[server1]&lt;br /&gt;hostname=host1&lt;br /&gt;candidate_master=1&lt;br /&gt;&lt;br /&gt;[server2]&lt;br /&gt;hostname=host2&lt;br /&gt;candidate_master=1&lt;br /&gt;&lt;br /&gt;[server3]&lt;br /&gt;hostname=host3&lt;br /&gt;&lt;br /&gt;[server4]&lt;br /&gt;hostname=host4&lt;br /&gt;&lt;br /&gt;[server5]&lt;br /&gt;hostname=host5&lt;/span&gt;&lt;/pre&gt;In this case, host5 is a third-tier slave, so MHA does not manage host5(MHA does not execute CHANGE MASTER on host5 when the primary master host1 fails). When current master host1 is down, host2 will be new master, so host5 can keep replication from host2 without doing anything.&lt;br/&gt;&lt;br /&gt;&lt;br/&gt;&lt;br /&gt;If you want to try MHA, &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/Tutorial"&gt;tutorials&lt;/a&gt; will be good start.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-1532024166661306497?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/1532024166661306497/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/08/mysql-mha-support-for-multi-master.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1532024166661306497'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1532024166661306497'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/08/mysql-mha-support-for-multi-master.html' title='MySQL-MHA: Support for multi-master configuration'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-1034954243287388684</id><published>2011-07-23T23:49:00.001+09:00</published><updated>2011-07-23T23:49:43.288+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='high availability'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Announcing MySQL-MHA: "MySQL Master High Availability manager and tools"</title><content type='html'>&lt;span style="font-weight:bold;"&gt;I have published "&lt;a href="http://code.google.com/p/mysql-master-ha/"&gt;MySQL MHA&lt;/a&gt;" that fully automates MySQL master failover. You can also get commercial support from &lt;a href="http://www.skysql.com/en/index"&gt;SkySQL&lt;/a&gt;. Let's try MHA today!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; Today I'm happy to announce that I have released &lt;a href="http://code.google.com/p/mysql-master-ha/"&gt;MySQL-MHA: MySQL Master High Availability manager and tools&lt;/a&gt; as an open source software (GPL v2 license). The below is a part of &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6"&gt;documentation of MHA&lt;/a&gt;. I'm glad if you are interested in MHA.&lt;br /&gt;&lt;br /&gt; A primary objective of MHA is automating master failover and slave promotion within short (usually 10-30 seconds) downtime, without suffering from replication consistency problems, without spending money for lots of new servers, without performance penalty, without complexity (easy-to-install), and without changing existing deployments. MHA also provides a way for scheduled online master switch: changing currently running master to a new master safely, within a few seconds (normally 0.5-2 seconds) of downtime (blocking writes only).&lt;br /&gt;&lt;br /&gt; Difficulties of master failover is one of the biggest issues in MySQL. Many people have been aware of this issue, but in most cases there were not practical solutions. I created MHA to make our (&lt;a href="http://dena.jp/en/"&gt;DeNA&lt;/a&gt;'s) existing 100+ 5.0/5.1/5.5 and future MySQL applications highly available. I think many outside people can also use MHA pretty easily.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Project Info&lt;/h3&gt;&lt;br /&gt; Project top page: &lt;a href="http://code.google.com/p/mysql-master-ha/"&gt;http://code.google.com/p/mysql-master-ha/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Documentation: &lt;a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6"&gt;http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Source tarball and rpm package (stable release): &lt;a href="http://code.google.com/p/mysql-master-ha/downloads/list"&gt;http://code.google.com/p/mysql-master-ha/downloads/list&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; The latest source repository (dev release): &lt;a href="https://github.com/yoshinorim/MySQL-MasterHA-Manager"&gt;https://github.com/yoshinorim/MySQL-MasterHA-Manager&lt;/a&gt; (Manager source) and &lt;a href="https://github.com/yoshinorim/MySQL-MasterHA-Node"&gt;https://github.com/yoshinorim/MySQL-MasterHA-Node&lt;/a&gt; (Per-MySQL server source)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Overview&lt;/h3&gt;&lt;br /&gt; Master Failover is not as trivial as you might think. Suppose you run single master and multiple slaves, which is the most common MySQL deployments. If the master crashes, you need to pick one of the latest slaves, promote it to the new master, and let other slaves start replication from the new master. This is actually not trivial. Even though you could identify the latest slave, other slaves might have not received all binary log events. If you let other slaves connect to the new master and start replication, these slaves lose some transactions. This will cause consistency problems. To avoid consistency problems, you need to identify which binlog events are not sent to each slave, and need to apply lost events to each slave before starting replication. This is very complex approach and manually doing recovery correctly is very difficult. This is illustrated in the &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover"&gt;slides&lt;/a&gt; (especially in &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover/10"&gt;p.10&lt;/a&gt; as below) that I presented at the MySQL Conference and Expo 2011.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-vz3yQMLk5Eo/TilhRCi5ZqI/AAAAAAAAAFo/y08RHXsituc/s1600/mha-problem.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 372px;" src="http://1.bp.blogspot.com/-vz3yQMLk5Eo/TilhRCi5ZqI/AAAAAAAAAFo/y08RHXsituc/s600/mha-problem.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5632139754372425378" /&gt;&lt;/a&gt;&lt;br /&gt;Fig: Master Failover: What makes it difficult?&lt;br /&gt;&lt;br /&gt; Currently most of MySQL Replication users are forced to do manual failover on master crashes. But it is not uncommon to result in more than one hour downtime to complete failover. Each slave is not likely to have received the same relay log events, so you may need to fix consistency problems later. Though master crash does not happen so often, it is really serious once it happens. &lt;br /&gt;&lt;br /&gt; MHA is invented to fix these issues. MHA provides the following functionality, and can be useful in many deployments where requirements such as high availability, data integrity, almost non-stop master maintenance are desired.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Automated master monitoring and failover&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; MHA has a functionality to monitor MySQL master in an existing replication environment, detecting master failure, and doing master failover automatically. Even though some of slaves have not received the latest relay log events, MHA automatically identifies differential relay log events from the latest slave, and applies differential events to other slaves. So all slaves can be consistent. MHA normally can do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, a few seconds for applying differential relay logs to the new master, so total downtime is normally 10-30 seconds). In addition, you can define a specific slave as a candidate master (setting priorities) in a configuration file. Since MHA fixes consistencies between slaves, you can promote any slave to a new master and consistency problems (which might cause sudden replication failure) will not happen.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Interactive (manual) Master Failover&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; You can also use MHA for just failover, not for monitoring master. You can use MHA for master failover interactively.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Non-interactive master failover&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; Non-interactive master failover (not monitoring master, but doing failover automatically) is also supported. This feature is useful especially when you have already used a software that monitors MySQL master. For example, you can use &lt;a href="http://www.linux-ha.org/wiki/Pacemaker"&gt;Pacemaker(Heartbeat)&lt;/a&gt; for detecting master failure and virtual ip address takeover, and use MHA for master failover and slave promotion.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Online switching master to a different host&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; In many cases, it is necessary to migrate an existing master to a different machine (i.e. the current master has H/W problems on RAID controller or RAM, you want to replace with faster machine, etc). This is not a master crash, but scheduled master maintenance is needed to do that. Scheduled master maintenance causes downtime (at least you can not write master) so should be done as quickly as possible. On the other hand, you should block/kill current running sessions very carefully because consistency problems between different masters might happen (i.e "updating master1, updating master 2, committing master1, getting error on committing master 2" will result in data inconsistency). Both fast master switch and graceful blocking writes are required.&lt;br /&gt; MHA provides a way to do that. You can switch master gracefully within 0.5-2 seconds of writer block. In many cases 0.5-2 seconds of writer downtime is acceptable and you can switch master even without allocating scheduled maintenance window. This means you can take actions such as upgrading to higher versions, faster machine, etc much more easily.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Architecture&lt;/h3&gt;&lt;br /&gt; When a master crashes, MHA recovers rest slaves as below.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-AIInutsrkAk/TiF5kAcIzQI/AAAAAAAAAFg/jW1SnzM8T_E/s1600/mha_recovery_procedure.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 399px;" src="http://4.bp.blogspot.com/-AIInutsrkAk/TiF5kAcIzQI/AAAAAAAAAFg/jW1SnzM8T_E/s600/mha_recovery_procedure.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5629914668690099458" /&gt;&lt;/a&gt;&lt;br /&gt;Fig: Steps for recovery&lt;br /&gt;&lt;br /&gt; Basic algorithms are described in the &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover"&gt;slides&lt;/a&gt; presented at the MySQL Conference and Expo 2011, especially from page  &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover/13"&gt;no.13&lt;/a&gt; to &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover/34"&gt;no.34&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt; In relay log files on slaves, master's binary log positions are written at "end_log_pos" sections (&lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover/18"&gt;example&lt;/a&gt;). By comparing the latest end_log_pos between slaves, we can identify which relay log events are not sent to each slave. MHA internally recovers slaves (fixes consistency issues) by using this mechanism. In addition to basic algorithms covered in the &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover"&gt;slides&lt;/a&gt; at the MySQL Conf 2011, MHA does some optimizations and enhancements, such as generating differential relay logs very quickly (indenpendent from relay log file size), making recovery work with row based formats, etc.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;MHA Components&lt;/h4&gt;&lt;br /&gt; MHA consists of MHA Manager and MHA Node as below. &lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-wCKF5uopGNI/TiF4dA2df6I/AAAAAAAAAFY/AHScjL84JF4/s1600/components_of_mha.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 397px;" src="http://1.bp.blogspot.com/-wCKF5uopGNI/TiF4dA2df6I/AAAAAAAAAFY/AHScjL84JF4/s600/components_of_mha.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5629913449029795746" /&gt;&lt;/a&gt;&lt;br /&gt;Fig: MHA components&lt;br /&gt;&lt;br /&gt; MHA Manager has manager programs such as monitoring MySQL master, controlling master failover, etc.&lt;br /&gt;&lt;br /&gt; MHA Node has failover helper scripts such as parsing MySQL binary/relay logs, identifying relay log position from which relay logs should be applied to other slaves, applying events to the target slave, etc. MHA Node runs on each MySQL server.&lt;br /&gt;&lt;br /&gt; When MHA Manager does failover, MHA manager connects MHA Node via SSH and executes MHA Node commands when needed.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Advantages&lt;/h3&gt;&lt;span style="font-weight:bold;"&gt;* Master failover and slave promotion can be done very quickly&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; MHA normally can do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, a few seconds or more for applying differential relay logs to the new master, so total downtime is normally 10-30 seconds), unless all slaves delay replication seriously. After recovering the new master, MHA recovers the rest slaves in parallel. Even though you have tens of slaves, it does not affect master recovery time, and you can recover slaves very quickly.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Master crash does not result in data inconsistency&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; When the current master crashes, MHA automatically identifies differential relay log events between slaves, and applies to each slave. So finally all slaves can be in sync, as long as all slave servers are alive.&lt;br /&gt; By using together with Semi-Synchronous Replication, (almost) no data loss can also be guaranteed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No need to modify current MySQL settings (MHA works with regular MySQL (5.0 or later))&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; One of the most important design principles of MHA is to make MHA easy to use as long as possible. MHA works with existing traditional MySQL 5.0+ master-slaves replication environments. Though many other HA solutions require to change MySQL deployment settings, MHA does not force such tasks for DBAs. MHA works with the most common two-tier single master and multiple slaves environments. MHA works with both asynchronous and semi-synchronous MySQL replication. Installing/Uninstalling/Starting/Stopping/Upgrading/Downgrading MHA can be done without changing (including starting/stopping) MySQL replication. When you need to upgrade MHA to newer versions, you don't need to stop MySQL. Just replace with newer MHA versions and restart MHA Manager is fine.&lt;br /&gt;&lt;br /&gt; MHA works with normal MySQL versions starting from MySQL 5.0. Some HA solutions require special MySQL versions (i.e. MySQL Cluster, MySQL with Global Transaction ID, etc), but you may not like to migrate applications just for master HA. In many cases people have already deployed many legacy MySQL applications and they don't want to spend too much time to migrate to different storage engines or newer bleeding edge distributions just for master HA. MHA works with normal MySQL versions including 5.0/5.1/5.5 so you don't need to migrate.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No need to increase lots of servers&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; MHA consists of MHA Manager and MHA Node. MHA Node runs on the MySQL server when failover/recovery happens so it doesn't require additional server. MHA Manager normally runs on a dedicated server so you need to add one (or two for HA) server(s), but MHA Manager can monitor lots of (even 100+) masters from single server, so the total number of servers is not increased so much. Note that it is even possible to run MHA Manager on one of slave servers. In this case total number of servers is not increased at all.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No performance penalty&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; MHA works with regular asynchronous or semi-synchronous MySQL replication. When monitoring master server, MHA just sends ping packets to master every N seconds (default 3) and it does not send heavy queries. You can expect as fast performance as regular MySQL replication.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Works with any storage engine&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; MHA works with any storage engines as long as MySQL replication works, not limited to InnoDB (crash-safe, transactional storage engine). Even though you use legacy MyISAM environments that are not easy to migrate, you can use MHA.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Production case study&lt;/h3&gt;&lt;br /&gt; I'm using MHA on our (DeNA's) production environments. We manage more than 100 MySQL applications (master/slave pairs) from a few old (32bit, 3GB RAM) manager servers (one manager per datacenter), and so far working very well. MHA does not spend resources at monitoring stage so managing hundreds of MySQL applications from single manager running on an old machine is totally possible (CPU util is 0-3% in total).&lt;br /&gt; We have been frequently using MHA for online master switch. Some popular social games grow more rapidly than we expect. In many cases scaling out (sharding) is chosen, but scaling up (increasing RAM, replacing HDD with SSD, etc) is sometimes better than scaling out. We switch master from a slower machine to a faster machine (and vice versa) by using MHA (MHA has a separated online master switch command), and we have been able to switch more than 10 masters with only 0.5-1 second of downtime (not being able to connect to master) each. 0.5-1 second downtime is acceptable in our cases. Social game users (especially paying users) tend to be very strict on performance and availability, but we haven't received any inquiries/complaints when switching masters with MHA.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SkySQL provides commercial support for MHA&lt;/h3&gt;&lt;br /&gt; After I presented about MHA at the MySQL Conference in April, many people told me that they were interested in trying MHA. I'm happy if many people use my software and satisfied with it. On the other hand, I'm a full time employee at DeNA, and DeNA does not provide software support/consulting business so I can't provide 24x7 support/consulting by myself. What if you want such services? Hopefully &lt;a href="http://www.skysql.com/en/index"&gt;SkySQL&lt;/a&gt; has decided to offer that. You can get 24x7 support of MHA (and of course, MySQL) from SkySQL! I have many ex-MySQL friends at SkySQL and they have excellent expertise to provide MySQL related support services. If you are interested, go to SkySQL website and talk with their sales representatives. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt; I'm attending OSCON and introduce MHA at &lt;a href="http://www.oscon.com/oscon2011/public/schedule/detail/18774"&gt;my session&lt;/a&gt;, so if you are interested and staying at OSCON, I'd like to talk with you.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-1034954243287388684?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/1034954243287388684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/07/announcing-mysql-mha-mysql-master-high.html#comment-form' title='32 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1034954243287388684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1034954243287388684'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/07/announcing-mysql-mha-mysql-master-high.html' title='Announcing MySQL-MHA: &quot;MySQL Master High Availability manager and tools&quot;'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-vz3yQMLk5Eo/TilhRCi5ZqI/AAAAAAAAAFo/y08RHXsituc/s72-c/mha-problem.png' height='72' width='72'/><thr:total>32</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-8801289030543738999</id><published>2011-06-10T23:27:00.001+09:00</published><updated>2011-06-11T14:36:29.201+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='oracle'/><title type='text'>New Oracle ACE Director</title><content type='html'>A few days ago Oracle offered me a position of &lt;a href="http://www.oracle.com/technetwork/community/oracle-ace/index.html"&gt;Oracle ACE Director&lt;/a&gt;, and I accepted it. Thanks Oracle. I'm glad to see that Oracle has evaluated my activities and we can have a good relationships each other. Oracle has been great for MySQL so far and I'd be happy to keep contributions.&lt;br /&gt;&lt;br /&gt;Since I live in Tokyo and my native language is Japanese, I usually do community activities in Japanese (speaking at conferences in Tokyo, publishing books written in Japanese, etc). I'd like to do more in English, and my employer DeNA thankfully supports it. Right now it's confirmed that I'll speak at OSCON.Data in July, and at Oracle Open World in October. Personally I haven't traveled to Europe for years so it's great if I have an opportunity to speak at MySQL/Linux conferences in Europe.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-8801289030543738999?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/8801289030543738999/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/06/new-oracle-ace-director.html#comment-form' title='20 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/8801289030543738999'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/8801289030543738999'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/06/new-oracle-ace-director.html' title='New Oracle ACE Director'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>20</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-133502879448417557</id><published>2011-05-20T19:48:00.004+09:00</published><updated>2011-06-07T09:00:50.214+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='partitioning'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='sharding'/><title type='text'>Proper handling of insert-mostly, select-recently datasets</title><content type='html'>Some kinds of large tables such as chat messages, blog entries, etc have the following characteristics.&lt;br /&gt;&lt;br /&gt;* huge number of records, huge data and index size&lt;br /&gt;* insert and select mostly&lt;br /&gt;* select from only recent data&lt;br /&gt;* select by secondary index (i.e. user_id)&lt;br /&gt;* secondary index entries are inserted in random order&lt;br /&gt;&lt;br /&gt;What are optimal ways to handle these tables? The below single large table does not perform well.&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;CREATE TABLE message (&lt;br /&gt; message_id BINGINT UNSIGNED PRIMARY KEY,&lt;br /&gt; user_id INT UNSIGNED,&lt;br /&gt; body VARCHAR(255),&lt;br /&gt; ...&lt;br /&gt; created DATETIME,&lt;br /&gt; INDEX(user_id)&lt;br /&gt;) ENGINE=InnoDB;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;The cause of poor performance is a secondary index on user_id. user_id is inserted in random order. Index size grows, and sooner or later it will exceed RAM size. Once index size on user_id exceeds RAM size, inserting into message table causes massive random read disk i/o, which reduces throughput significantly.&lt;br /&gt; The below is a simple insert benchmark. Once random read disk i/o starts happening, throughput drops hugely. "Sequential order" means index entries are inserted sequentially, "Random order" means randomly.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-gP5RjvUz3SQ/TdYuxRF75vI/AAAAAAAAADQ/0hW0dmYMQJI/s1600/index1.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 385px;" src="http://1.bp.blogspot.com/-gP5RjvUz3SQ/TdYuxRF75vI/AAAAAAAAADQ/0hW0dmYMQJI/s600/index1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5608721809873299186" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This figure was what &lt;a href="http://www.mysqlconf.com/mysql2009/public/schedule/detail/6661"&gt;I presented at the MySQL Conference and Expo 2009&lt;/a&gt;. It's pretty old, but basic principles have not changed.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Using Range Partitioning&lt;/h3&gt;&lt;br /&gt;How can we make it faster? One of the best approaches in MySQL is using &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html"&gt;range partitioning&lt;/a&gt;, partitioned by date or primary key. This is one of my favorite features in MySQL. By using range partitioning, only the latest partition is actively accessed. Data/indexes in the rest partitions are much less accessed so they don't occupy buffer pool. Each partition size can be small enough to fit in memory, so insert performance does not drop.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-KknHKVu0tbM/TdYu6N3JNbI/AAAAAAAAADY/wZpfILYpdwI/s1600/5.1_partitioning.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 219px;" src="http://2.bp.blogspot.com/-KknHKVu0tbM/TdYu6N3JNbI/AAAAAAAAADY/wZpfILYpdwI/s600/5.1_partitioning.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5608721963624773042" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;CREATE TABLE message (&lt;br /&gt; message_id BIGINT UNSIGNED,&lt;br /&gt; user_id INT UNSIGNED,&lt;br /&gt; body VARCHAR(255),&lt;br /&gt; ...&lt;br /&gt; created DATETIME,&lt;br /&gt; INDEX(message_id)&lt;br /&gt; INDEX(user_id)&lt;br /&gt;) engine=InnoDB&lt;br /&gt; PARTITION BY RANGE(to_days(d1)) (&lt;br /&gt; PARTITION p201103 VALUES LESS THAN (to_days('2011-03-01')),&lt;br /&gt; PARTITION p201104 VALUES LESS THAN (to_days('2011-04-01')),&lt;br /&gt; PARTITION p201105 VALUES LESS THAN (to_days('2011-05-01')),&lt;br /&gt; PARTITION p201106 VALUES LESS THAN (to_days('2011-06-01')),&lt;br /&gt; PARTITION p201107 VALUES LESS THAN (to_days('2011-07-01')),&lt;br /&gt;...&lt;br /&gt;);&lt;/span&gt;&lt;/pre&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-QUDHaz77DtE/TdYvHfb8sMI/AAAAAAAAADg/9JGfUWUzcx0/s1600/index2.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 265px;" src="http://2.bp.blogspot.com/-QUDHaz77DtE/TdYvHfb8sMI/AAAAAAAAADg/9JGfUWUzcx0/s600/index2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5608722191680843970" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;As long as INSERT statements do inserts order by partition key and other SQL statements fetch only the recent data, no random disk reads will happen. Partitioning itself has some CPU overheads, but it's almost negligible in the real workloads, compared to disk i/o overheads.&lt;br /&gt;&lt;br /&gt;(update:) Index and data size on each partition can be measured from information schema.&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;mysql&gt; SELECT partition_name, index_length, data_length, table_rows FROM &lt;br /&gt;information_schema.partitions WHERE table_name='message';&lt;br /&gt;+----------------+--------------+-------------+------------+&lt;br /&gt;| partition_name | index_length | data_length | table_rows |&lt;br /&gt;+----------------+--------------+-------------+------------+&lt;br /&gt;| p201103        |  15565062144 | 15527313408 |  145146231 |&lt;br /&gt;| p201104        |  15522070528 | 15507390464 |  205873280 |&lt;br /&gt;| p201105        |   9736028160 |  9945743360 |   88653190 |&lt;br /&gt;| p201106        |        32768 |       16384 |          0 |&lt;br /&gt;+----------------+--------------+-------------+------------+&lt;br /&gt;6 rows in set (0.13 sec)&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;In MySQL 5.0 or earlier versions where range partitioning is not supported, creating daily/weekly/monthly tables is a good way as a workaround, though applications have to aware of table name differences.&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;&lt;br /&gt;CREATE TABLE message_201103 ..&lt;br /&gt;CREATE TABLE message_201104 ..&lt;br /&gt;CREATE TABLE message_201105 ..&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h3&gt;How about Database Sharding?&lt;/h3&gt;&lt;br /&gt;As you know, database sharding is very common approach for handling huge data. Is sharding good for handling these tables? Probably not. Database sharding is mainly used to reduce slow disk i/o by reducing data size per server. In the above case, inserts can be done in memory regardless of data size(10,000+ inserts/second), so from performance point of view splitting tables is not needed as long as applications can keep up with in-memory insert speed. From database size vs storage size point of view (disk capacity point of view), you'll need to archive or purge older data.&lt;br /&gt;&lt;br /&gt;Actually I have seen a couple of times that people use NoSQLs supporting transparent sharding(Auto-Sharding) for these kinds of tables: such as MongoDB, Cassandra. Data size will sooner or later exceed disk size, so using unlimited horizontally scaling database sounds reasonable. But if the database products don't support range partitioning, sharding becomes much less optimal for handling these tables. Suppose you have 3000GB datasets and only recent 30GB data are mostly accessed. With MySQL 5.1+ range partitioning, you can simply manage one large(3000GB) table with weekly/daily partitions (as long as disk space is available). Only the latest partitions (30GB) are actively accessed and the rest partitions (2970GB) are less likely accessed. Single commodity database server can probably handle enough workloads.&lt;br /&gt;&lt;br /&gt; On the other hand, if you shard 3000GB database without range partitioning support, you might need 300GB x 10 shards because the whole secondary indexes are accessed. This means you need 10 times more servers.&lt;br /&gt;&lt;br /&gt; Transparent sharding is good for application developers, but I believe range partitioning is a mandatory feature for handling insert-mostly, select-recently huge datasets.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-133502879448417557?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/133502879448417557/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/05/proper-handling-of-insert-mostly-select.html#comment-form' title='23 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/133502879448417557'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/133502879448417557'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/05/proper-handling-of-insert-mostly-select.html' title='Proper handling of insert-mostly, select-recently datasets'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-gP5RjvUz3SQ/TdYuxRF75vI/AAAAAAAAADQ/0hW0dmYMQJI/s72-c/index1.png' height='72' width='72'/><thr:total>23</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-8272978029579925486</id><published>2011-04-14T00:09:00.000+09:00</published><updated>2011-04-14T00:10:16.597+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Slides: Automated, Non-Stop MySQL operations and failover</title><content type='html'>&amp;nbsp;On Tuesday I presented "Automated, Non-Stop MySQL Operations and Failover" at the MySQL Conference and Expo 2011, and &lt;a href="http://www.slideshare.net/matsunobu/automated-master-failover"&gt;published the slides at SlideShare&lt;/a&gt;. I thought this talk was very complicated and it was not easy to understand in 45-minute session. Now slides are online so I assume attendees will be easier to understand what steps are needed for master failover and slave promotion.&lt;br /&gt;&amp;nbsp;As I mentioned during the talk, I'm planning to release the tool (monitoring master failure, promoting slave automatically or manually, and optionally switching alive master manually) as an open source software soon. The tool does all the steps what I covered at the talk so actually you don't need to do the steps manually. If you find any missing steps that need to be covered, I'd appreciate if you point out. &lt;br /&gt;&amp;nbsp;I'm also writing English documentation (user's guide) so please stay tuned if you're interested.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-8272978029579925486?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/8272978029579925486/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/slides-automated-non-stop-mysql_14.html#comment-form' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/8272978029579925486'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/8272978029579925486'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/slides-automated-non-stop-mysql_14.html' title='Slides: Automated, Non-Stop MySQL operations and failover'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-4996786226277982179</id><published>2011-04-14T00:08:00.002+09:00</published><updated>2011-04-14T00:09:04.958+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><title type='text'>Slides: Linux and H/W optimizations for MySQL</title><content type='html'>On Monday I presented 3-hour tutorial "Linux and H/W optimizations for MySQL" at the MySQL Conference and Expo 2011, and published &lt;a href="http://www.slideshare.net/matsunobu/linux-and-hw-optimizations-for-mysql-7614520"&gt;the slides at SlideShare&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-4996786226277982179?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/4996786226277982179/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/slides-linux-and-hw-optimizations-for_14.html#comment-form' title='24 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/4996786226277982179'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/4996786226277982179'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/slides-linux-and-hw-optimizations-for_14.html' title='Slides: Linux and H/W optimizations for MySQL'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>24</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-4736619133061139983</id><published>2011-04-06T19:20:00.001+09:00</published><updated>2011-04-06T19:24:58.729+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Tracking long running transactions in MySQL</title><content type='html'>&amp;nbsp;Sometimes I want to trace transactions that have taken long time to execute. If a transaction is holding row locks for 30 seconds, all other clients requesting locks for the same rows will be blocked for up to 30 seconds (or will get Lock Wait Timeout errors if you set innodb_lock_wait_timeout lower). This is serious.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;The important step to solve such issues is identifying what kinds of queries are executed by problematic transactions. But tracking long running transactions is not as easy as tracking just slow queries. Suppose you execute the following transaction.&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;[client1]$ mysql --host=mysql_server db1&lt;br /&gt;mysql&gt; BEGIN;&lt;br /&gt;mysql&gt; SELECT * FROM t1 WHERE id=2 FOR UPDATE;&lt;br /&gt;... (do nothing for 10 seconds)&lt;br /&gt;mysql&gt; UPDATE t1 SET value=20 WHERE id=2;&lt;br /&gt;mysql&gt; COMMIT;&lt;/span&gt;&lt;/pre&gt;&amp;nbsp;This transaction takes more than 10 seconds to complete, but each query finishes very quickly. So nothing is written to the slow query log. Analyzing slow query logs is not helpful in this case.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;If you enable general query log, all queries including above are written. But this hurts performance (writing to general query log is serialized inside MySQL), and logging all queries makes log file size huge. Analyzing too big log files is not so fun. In addition, neither query execution time nor error code is written to the general query log. So just analyzing general query log is not enough to verify whether the above SELECT .. FOR UPDATE took 10 seconds or finished quickly.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;How about analyzing statement based binary logs? Unfortunately this is not always helpful. First, uncommitted transactions are not written to the binary log. There are many cases that transactions take long time and finally do rollback. Second, SELECT statements (including SELECT ... FOR UPDATE/LOCK IN SHARE MODE) are not written to the binary log. When you use locking reads, lots of transactions will be likely to be blocked. You probably want to identify which SELECT statement has caused the problem.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;Using SHOW ENGINE INNODB STATUS? SHOW ENGINE INNODB STATUS prints active transactions like below. &lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;---TRANSACTION 1B43B50, ACTIVE 24 sec, process no 7388, OS thread id 1235609920 starting index read&lt;br /&gt;mysql tables in use 1, locked 1&lt;br /&gt;LOCK WAIT 2 lock struct(s), heap size 376, 1 row lock(s)&lt;br /&gt;MySQL thread id 23, query id 140396660 client2.example.com 192.168.0.2 root Updating&lt;br /&gt;update t1 set value=100 where id=2&lt;br /&gt;------- TRX HAS BEEN WAITING 24 SEC FOR THIS LOCK TO BE GRANTED:&lt;br /&gt;RECORD LOCKS space id 0 page no 213041 n bits 80 index `PRIMARY` of table `test`.`t1` &lt;br /&gt;trx id 1B43B50 lock_mode X locks rec but not gap waiting&lt;br /&gt;Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0&lt;br /&gt; 0: len 4; hex 80000002; asc     ;;&lt;br /&gt; 1: len 6; hex 000001b43b4e; asc     ;N;;&lt;br /&gt; 2: len 7; hex 810000012d011c; asc     -  ;;&lt;br /&gt; 3: len 4; hex 80000002; asc     ;;&lt;br /&gt;------------------&lt;br /&gt;&lt;br /&gt;---TRANSACTION 1B43B4F, ACTIVE 34 sec, process no 7388, OS thread id 1096673600&lt;br /&gt;2 lock struct(s), heap size 376, 1 row lock(s)&lt;br /&gt;MySQL thread id 22, query id 140396658 client1.example.com 192.168.0.1 root&lt;/span&gt;&lt;/pre&gt;&amp;nbsp;You can identify that a client from 192.168.0.2 was executing "update t1 set value=100 where id=2" and was waiting for 24 seconds to get some locks. But there is no information about what queries held the lock. A client from 192.168.0.1 (1B43B4F) was running a transaction for 34 seconds, so this was probably the one holding the lock. But there is no information about why it was running for 34 seconds. In this case I want to get all queries that the 1B43B4F executed so far.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;If you can modify application programs, tracing long running transactions is not so difficult. Adding below logics will help.&lt;br /&gt;&lt;br /&gt;1. Checking transaction start time (t1)&lt;br /&gt;2. Pushing queries (and current timestamp) by the transaction into some in-memory queue&lt;br /&gt;3. Checking transaction end time (t2)&lt;br /&gt;4. If (t2 - t1) exceeds N seconds, printing all queries that the transaction executed&lt;br /&gt;5. Deleting the in-memory queue&lt;br /&gt;&lt;br /&gt;&amp;nbsp;The problem is that this approach is not feasible in many cases. Because long running transactions may be executed from every client, you have to implement tracing logic on all application servers. Modifying core database access codes and re-deplying on all application servers is not fun. If you use O/R mappers, things become more difficult. Modifying database driver programs on all application servers? I don't want to do that.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;Based on the above reasons, I thought that the most practical approach is tracing slow transactions on MySQL servers, without modifying any existing program (including client programs and MySQL servers). Hopefully there is a way to do that: capturing MySQL packets and tracking transactions and queries. The above 1-5 tracing algorithm should work. &lt;br /&gt; I implemented the tool &lt;a href="https://github.com/yoshinorim/MySlowTranCapture"&gt;"MySlowTranCapture" and published at GitHub&lt;/a&gt;. Here is an example output.&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;[mysql-server]# myslowtrancapture -i eth0&lt;br /&gt;Monitoring eth0 interface..&lt;br /&gt;Listening port 3306..&lt;br /&gt;Logging transactions that take more than 4000 milliseconds..&lt;br /&gt;&lt;br /&gt;From 192.168.0.1:24441&lt;br /&gt;2011/02/23 09:12:17.258307 -&gt;&lt;br /&gt;begin&lt;br /&gt;2011/02/23 09:12:17.258354 &lt;-&lt;br /&gt;GOT_OK&lt;br /&gt;2011/02/23 09:12:17.264797 -&gt;&lt;br /&gt;select * from diary where diary_id=100 for update&lt;br /&gt;2011/02/23 09:12:17.265087 &lt;-&lt;br /&gt;GOT_RES&lt;br /&gt;2011/02/23 09:12:17.277622 -&gt;&lt;br /&gt;select 1&lt;br /&gt;2011/02/23 09:12:17.277713 &lt;-&lt;br /&gt;GOT_RES&lt;br /&gt;2011/02/23 09:13:01.232620 -&gt;&lt;br /&gt;update diary set diary_date=now() where diary_id=100&lt;br /&gt;2011/02/23 09:13:01.232960 &lt;-&lt;br /&gt;GOT_OK&lt;br /&gt;2011/02/23 09:13:17.360993 -&gt;&lt;br /&gt;commit&lt;br /&gt;&lt;br /&gt;From 192.168.0.2:24442&lt;br /&gt;2011/02/23 09:12:20.969288 -&gt;&lt;br /&gt;begin&lt;br /&gt;2011/02/23 09:12:20.969483 &lt;-&lt;br /&gt;GOT_OK&lt;br /&gt;2011/02/23 09:12:20.977699 -&gt;&lt;br /&gt;update diary set diary_date=now() where diary_id=100&lt;br /&gt;2011/02/23 09:13:11.300935 &lt;-&lt;br /&gt;GOT_ERR:Lock wait timeout exceeded; try restarting transaction&lt;br /&gt;2011/02/23 09:13:13.136967 -&gt;&lt;br /&gt;rollback&lt;br /&gt;--------------------&lt;/span&gt;&lt;/pre&gt;&amp;nbsp;It is easy to identify that the first transaction caused problems. The first transaction held an exclusive lock for diary_id=100 at 09:12:17.264797, and didn't release until 09:13:17.360993. All queries by the transaction completed within a millisecond, so the first transaction was probably sleeping, or took long time to do other logics (i.e. accessing to remote servers and taking long time there) before closing the transaction.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;MySlowTranCapture uses libpcap to capture network packets, like tcpdump. Since libpcap loses packets sometimes, it is not guaranteed to capture 100% transactions. &lt;br /&gt;&amp;nbsp;MySlowTranCapture approximately uses 10-30% CPU resources of mysqld process. This is not bad for short-term analyzing purposes (when I tested tcpdump, it used 140% CPU resources of mysqld, and most of CPU time seemed to be spent for writing network packets to local files). &lt;br /&gt;&amp;nbsp;I sometimes use this tool for debugging, too. Sometimes developers execute unnecessarily many SQL statements within single transaction. MySlowTranCapture is useful to trace such transactions on development servers, by setting -t lower (i.e. 50ms).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-4736619133061139983?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/4736619133061139983/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/tracking-long-running-transactions-in.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/4736619133061139983'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/4736619133061139983'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/tracking-long-running-transactions-in.html' title='Tracking long running transactions in MySQL'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-4616943067262461553</id><published>2011-04-05T11:10:00.002+09:00</published><updated>2011-04-05T11:16:02.605+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysqlconf'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Speaking at MySQL Conference and Expo 2011</title><content type='html'>Next week I'll stay at Hyatt Santa Clara to attend the MySQL Conference and Expo 2011. This year I'll present one tutorial and two sessions.&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17111"&gt;Linux and H/W optimizations for MySQL&lt;/a&gt;  (Apr 11, 9:00-12:30)&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17265"&gt;Automated, Non-Stop MySQL Operations and Failover&lt;/a&gt;  (Apr 12, 15:05-15:50)&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17226"&gt;Using MySQL As NoSQL - Introduction to HandlerSocket Plugin&lt;/a&gt; (Apr 13, 17:15-18:00) (Co-present with Kazuho Oku. He is a former employee at Cybozu Labs, and recently joined DeNA. He is an author of Q4M and MyCached(origin of HandlerSocket))&lt;br /&gt;&lt;br /&gt; These talks will be based on my experiences as a former MySQL consultant and our experiences from running real-world social gaming platforms (DeNA is not well-known in US yet, but our service has 2-3 billion page views per day, so it's not small). Speaking 4.5 hours is not easy for a non-native English speaker, but I'll do the best to make the sessions will be interesting and beneficial.&lt;br /&gt; Due to the recent disasters and ongoing rolling blackouts in Japan, most of my colleagues had to cancel the flight, and I also have to go back to Tokyo just after the UC (I booked the flight leaving at Apr 14 6:55pm),so I can't attend to some interesting events scheduled in Apr 15 or later. But it's confirmed that I can stay from Apr 10 to 14 so I look forward to seeing ex-MySQL colleagues and friends!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-4616943067262461553?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/4616943067262461553/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/speaking-at-mysql-conference-and-expo.html#comment-form' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/4616943067262461553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/4616943067262461553'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2011/04/speaking-at-mysql-conference-and-expo.html' title='Speaking at MySQL Conference and Expo 2011'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-114397352581357996</id><published>2010-12-16T21:05:00.001+09:00</published><updated>2010-12-16T21:05:19.641+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='5.5'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Thanks for releasing MySQL 5.5 GA!</title><content type='html'>&amp;nbsp; I am excited to see that MySQL 5.5 GA (5.5.8) has been finally released . All of my ex-colleagues at MySQL/Oracle have done amazing jobs. I was also positively surprised that 5.5 was released on schedule:). So far I have felt that Oracle is leading MySQL development and product management very well. &lt;br /&gt;&amp;nbsp; 5.5 has lots of practical features not only for web services providers, but also for enterprise users. Especially I like the following features.&lt;br /&gt;&lt;h4&gt;Improved concurrency&lt;/h4&gt;&amp;nbsp; Very often lots of concurrent sessions access to the same table. In such cases, one of MySQL internal global mutexes (LOCK_open) becomes very hot and serious concurrency problems have happened. Increasing table_cache have caused even negative impacts (because MySQL has to do linear search here. See &lt;a href="http://bugs.mysql.com/bug.php?id=33948"&gt;bug#33948&lt;/a&gt; for details). In 5.5, the problem has been fixed. Here is a very simple concurrent PK lookup benchmark result(using mysqlslap). It's good to see that the worst performance line has been significantly improved in 5.5. &lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/__ybECuKG5tc/TQnveow1iHI/AAAAAAAAACw/76-Nfn3OJ8o/s1600/55.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 210px;" src="http://3.bp.blogspot.com/__ybECuKG5tc/TQnveow1iHI/AAAAAAAAACw/76-Nfn3OJ8o/s400/55.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5551231325328214130" /&gt;&lt;/a&gt;&lt;br /&gt;&amp;nbsp; Other global mutexes such as InnoDB kernel mutex are still hot, but overall performance becomes much better.&lt;br /&gt;&amp;nbsp; It's also great that &lt;a href="http://bugs.mysql.com/bug.php?id=26590"&gt;the limit of 1023 concurrent data-modifying transactions&lt;/a&gt; has been raised in 5.5. &lt;br /&gt;&lt;h4&gt;4-byte UTF-8&lt;/h4&gt;&amp;nbsp; I'm not sure how many people are aware of this issue, but this is important for users in APAC regions such as China/Japan/Korea. Most of multi-byte characters consume 2 or 3 bytes in UTF-8 per character, but some characters (a few hundred Japanese characters) consume 4 bytes. Prior to MySQL 5.5, 4-byte UTF-8 characters were not supported. If people need to handle these characters, they have to use local character sets such as cp932, gbk5. Now we can use UTF-8 safely (You need to use "utf8mb4" charset). &lt;br /&gt;&lt;h4&gt;Audit Plugin Interface&lt;/h4&gt;&amp;nbsp; Lots of users in financial industries have wanted this feature for a long time. They frequently want to audit following information. &lt;br /&gt; - Login/Access Timestamp&lt;br /&gt; - Failed login/access information&lt;br /&gt; - Database username&lt;br /&gt; - Client IP/Hostname or terminal name&lt;br /&gt; - Accessed table/view/other objects name and action&lt;br /&gt; - Full executed query string&lt;br /&gt; - The number of affected/retrieved rows&lt;br /&gt;&amp;nbsp; MySQL general query log has most of the above information, but it does not have any filtering feature so you have to parse huge general logs, which is not efficient. By writing audit plugins, you can log only specific events.&lt;br /&gt;&lt;h4&gt;Semi-Synchronous Replication&lt;/h4&gt;&amp;nbsp; By using Semi-Synchronous replication, the likelihood of the data loss on slaves in case of master crash will be much more reduced. Since this is not fully synchronous approach, you still have risks to lose data in case of crash. But in practice this can be very helpful for less strict users (i.e. web services providers). &lt;br /&gt;&lt;h4&gt;"Change buffering" in InnoDB (speeding up DELETEs)&lt;/h4&gt;&amp;nbsp;InnoDB have supported "&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/innodb-insert-buffering.html"&gt;Insert Buffering&lt;/a&gt;" feature for a long time. Insert Buffering can significantly improve insert performance if you have lots of non-unique secondary indexes. &lt;br /&gt;&amp;nbsp; On the other hand, insert buffering does not have any effect for delete-marking or purging index entries. In general, modifying indexes cause random disk reads when target index leaf blocks are not cached, which significantly slows down response time. That's one of the reasons why bulk deletes(and updates that modifies lots of secondary indexes) were not fast in InnoDB. In 5.5, &lt;a href="http://blogs.innodb.com/wp/2010/09/mysql-5-5-innodb-change-buffering/"&gt;"Insert Buffering" feature was extended to "Change Buffering"&lt;/a&gt;, which is effective not only for inserts, but also for deletes. Here is &lt;a href="http://www.innodb.com/wp/wp-content/uploads/2010/04/InnoDB_Performance_benchmarks_2010.pdf"&gt;a slide&lt;/a&gt; (p.18) from MySQL team that shows the effect of Change Buffering (160x faster response time for bulk deleting 100k rows). &lt;br /&gt;&lt;br /&gt;&amp;nbsp; I am also interested in &lt;a href="http://dev.mysql.com/doc/refman/5.5/en/performance-schema.html"&gt;Performance Schema&lt;/a&gt;. I like FILE_SUMMARY_BY_INSTANCE table which enables to fetch per-file disk i/o statistics. If Performance Schema can also be used to fetch per-index statistics (like userstats patch) and can be disabled/enabled dynamically, that would be great.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-114397352581357996?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/114397352581357996/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2010/12/thanks-for-releasing-mysql-55-ga.html#comment-form' title='15 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/114397352581357996'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/114397352581357996'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2010/12/thanks-for-releasing-mysql-55-ga.html' title='Thanks for releasing MySQL 5.5 GA!'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/__ybECuKG5tc/TQnveow1iHI/AAAAAAAAACw/76-Nfn3OJ8o/s72-c/55.png' height='72' width='72'/><thr:total>15</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-509393264964815357</id><published>2010-11-08T23:58:00.003+09:00</published><updated>2010-11-09T10:48:03.233+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Handling long texts/blobs in InnoDB - 1 to 1 relationship, covering index</title><content type='html'>&amp;nbsp; I have seen that 1 to 1 relationship is sometimes used for MySQL(InnoDB) to avoid significant performance slowdown. To understand the performance difference, it is necessary to understand how InnoDB stores column values to data blocks. Suppose you have the following "diary" table (for storing blog entries).&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;CREATE TABLE diary (&lt;br /&gt;  diary_id INT UNSIGNED  AUTO_INCREMENT,&lt;br /&gt;  user_id INT UNSIGNED NOT NULL,&lt;br /&gt;  post_date TIMESTAMP NOT NULL,&lt;br /&gt;  status TINYINT UNSIGNED NOT NULL,&lt;br /&gt;  rating FLOAT NOT NULL,&lt;br /&gt;  title VARCHAR(100) NOT NULL,&lt;br /&gt;  body TEXT,&lt;br /&gt;  PRIMARY KEY (diary_id), &lt;br /&gt;  INDEX user_date(user_id, post_date),&lt;br /&gt;  INDEX user_rating(user_id, rating)&lt;br /&gt;) CHARSET utf8 ENGINE=InnoDB;&lt;/span&gt;&lt;/pre&gt;&amp;nbsp; "body" column contains diary text, which is pretty large (1KB/row). On the other hand, the rest columns are small enough (less than 50 bytes per row in total). This table is mostly used from applications for fetching titles, not fetching body. 90% SELECT queries will be like this:&lt;br /&gt;1) SELECT user_id, post_date, title FROM diary WHERE diary_id=?&lt;br /&gt;&lt;br /&gt;&amp;nbsp; And the rest 10% SELECT queries will be:&lt;br /&gt;2) SELECT body FROM diary  WHERE diary_id=?&lt;br /&gt;&lt;br /&gt;&amp;nbsp; You probably guess that 1) is much faster than 2) because 1) fetches only &lt;50 bytes while 2) fetches +1KB data. But unfortunately, in most cases 1) is as slow as 2). Even though 100% SELECT queries are 1), overall throughput will be poor if diary table is much larger than innodb buffer pool size. The reason is how InnoDB stores column values. The below is a rough image.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/__ybECuKG5tc/TNgNmYdHO0I/AAAAAAAAACo/p-yLePL7UMc/s1600/innodb_block.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 345px;" src="http://3.bp.blogspot.com/__ybECuKG5tc/TNgNmYdHO0I/AAAAAAAAACo/p-yLePL7UMc/s600/innodb_block.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5537190694902774594" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp; InnoDB stores large texts/blobs separetely from other columns if no more space is available within the same block. The separated page is called "Overflow Page". But this doesn't work for the diary table. This is mainly because diary body(1KB) is much smaller than InnoDB block size(16KB). When inserting diary entries, InnoDB block space is normally available enough to store all columns including body. As a result, body is stored next to the rest columns, and InnoDB blocks are occupied mostly by body.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/__ybECuKG5tc/TNgL-unI3HI/AAAAAAAAACg/XwSXsbN_Yvk/s1600/innodb_block2.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 360px;" src="http://1.bp.blogspot.com/__ybECuKG5tc/TNgL-unI3HI/AAAAAAAAACg/XwSXsbN_Yvk/s600/innodb_block2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5537188914144009330" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp; InnoDB's I/O unit is per block basis. Even though you do not fetch body, InnoDB internally has to read blocks that contain body. In other words, single InnoDB block can not have many diary entries. So a lot of random disk reads happen here.&lt;br /&gt;&amp;nbsp; I talked about this topic &lt;a href="http://www.slideshare.net/matsunobu/more-mastering-the-art-of-indexing"&gt;at the MySQL Conference &amp; Expo this year&lt;/a&gt;. I explained two solutions, one is using 1 to 1 relationship and the other is using covering index.&lt;br /&gt;&lt;br /&gt;&amp;nbsp; 1 to 1 relationship solution is creating below two tables:&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;&lt;br /&gt;CREATE TABLE diary_head (&lt;br /&gt;  diary_id INT UNSIGNED  AUTO_INCREMENT,&lt;br /&gt;  user_id INT UNSIGNED NOT NULL,&lt;br /&gt;  post_date TIMESTAMP NOT NULL,&lt;br /&gt;  status TINYINT UNSIGNED NOT NULL,&lt;br /&gt;  rating FLOAT NOT NULL,&lt;br /&gt;  title VARCHAR(100) NOT NULL,&lt;br /&gt;  PRIMARY KEY (diary_id), &lt;br /&gt;  INDEX user_date(user_id, post_date),&lt;br /&gt;  INDEX user_rating(user_id, rating)&lt;br /&gt;) CHARSET utf8 ENGINE=InnoDB;&lt;br /&gt;&lt;br /&gt;CREATE TABLE diary_body (&lt;br /&gt;  diary_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,&lt;br /&gt;  body TEXT&lt;br /&gt;) CHARSET utf8 ENGINE=InnoDB;&lt;/span&gt;&lt;/pre&gt;&amp;nbsp; diary_head table does not have the largest text column(body). So diary_head table size can be much smaller than the original diary table. InnoDB blocks that have diary_head's column values are frequently accessed, but the number of the blocks can be smaller. So these InnoDB blocks will be cached within InnoDB buffer pool very well. Most of SELECT queries do not require random disk i/o, so total throughput can be increased.&lt;br /&gt;&lt;br /&gt;&amp;nbsp; Another approach is creating a covering index that covers all columns except body. You do not need to create 1 to 1 tables in this case.&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;mysql&gt; ALTER TABLE diary ADD INDEX diary_covering (diary_id, user_id, post_date, &lt;br /&gt;    -&gt; status, rating, title); &lt;/span&gt;&lt;/pre&gt;&amp;nbsp; By fetching these columns through diary_covering index instead of the PRIMARY key, InnoDB does not read blocks that contain body. diary_covering index size will be as small as the above diary_head table, so it will be cached very well. &lt;br /&gt;&lt;br /&gt;&amp;nbsp; Both 1 to 1 relationship and an additional covering index perform pretty well(The benchmarking results are written in the &lt;a href="http://www.slideshare.net/matsunobu/more-mastering-the-art-of-indexing"&gt;conference slides&lt;/a&gt;). But neither is straightforward for developers. Using 1 to 1 relationship forces developers to use joins to fetch all diary columns, and forces them to maintain consistency (or use foreign key constraints). Using additional covering index sometimes increases data size significantly. And more, you need to add FORCE INDEX in the SELECT statement like below so that MySQL can use the covering index insted of the PRIMARY key. &lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;mysql&gt; SELECT user_id, post_date, title FROM diary &lt;br /&gt;    -&gt; FORCE INDEX (diary_covering)  WHERE diary_id=?&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&amp;nbsp; This is needed because MySQL optimizer chooses PRIMARY KEY for PK lookups because the optimizer guesses PK lookup (unique lookup) is the fastest than any other type of index lookup, even though it is not true in this case. Using SQL hints is sometimes not easy, especially when you use O/R mapping tools.&lt;br /&gt;&lt;br /&gt;&amp;nbsp; Personally I hope that InnoDB team or outside experts supports new InnoDB data format which is optimal for handling such large data types. PBXT handles these types of columns pretty well. PBXT stores large texts separately from other smaller columns, and it won't read large text data when SQL statement doesn't touch the large columns. If such kinds of data formats (storing specified columns in DDL into separated blocks) are supported in InnoDB, that will be nice.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-509393264964815357?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/509393264964815357/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2010/11/handling-long-textsblobs-in-innodb-1-to.html#comment-form' title='35 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/509393264964815357'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/509393264964815357'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2010/11/handling-long-textsblobs-in-innodb-1-to.html' title='Handling long texts/blobs in InnoDB - 1 to 1 relationship, covering index'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/__ybECuKG5tc/TNgNmYdHO0I/AAAAAAAAACo/p-yLePL7UMc/s72-c/innodb_block.png' height='72' width='72'/><thr:total>35</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-765226865821283814</id><published>2010-10-20T00:27:00.004+09:00</published><updated>2011-05-05T17:04:10.941+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='handlersocket'/><title type='text'>Using MySQL as a NoSQL - A story for exceeding 750,000 qps on a commodity server</title><content type='html'>&lt;span style="font-weight:bold;"&gt;UPDATE: Oracle officially released &lt;a href="http://blogs.innodb.com/wp/2011/04/nosql-to-innodb-with-memcached/"&gt;memcached daemon plugin&lt;/a&gt; that talks with InnoDB. I'm glad to see that NoSQL+MySQL has become an official solution. It's still preview release but will be very promising. Let's try it to make it better!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp; Most of high scale web applications use MySQL + memcached. Many of them use also NoSQL like TokyoCabinet/Tyrant. In some cases people have dropped MySQL and have shifted to NoSQL. One of the biggest reasons for such a movement is that it is said that NoSQL performs better than MySQL for simple access patterns such as primary key lookups. Most of queries from web applications are simple so this seems like a reasonable decision.&lt;br /&gt;&amp;nbsp; Like many other high scale web sites, we at &lt;a href="http://www.dena.jp/en/index.html"&gt;DeNA&lt;/a&gt;(*) had similar issues for years. But we reached a different conclusion. We are using "only MySQL". We still use memcached for front-end caching (i.e. preprocessed HTML, count/summary info), but we do not use memcached for caching rows. We do not use NoSQL, either. Why? Because we could get much better performance from MySQL than from other NoSQL products. In our benchmarks, we could get 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients. We also have got excellent performance on production environments. &lt;br /&gt;&amp;nbsp; Maybe you can't believe the numbers, but this is a real story. In this long blog post, I'd like to share our experiences.&lt;br /&gt;(*) For those who do not know.. I left Oracle in August 2010. Now I work at DeNA, one of the largest social game platform providers in Japan.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Is SQL really good for fast PK lookups?&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; How many times do you need to run PK lookups per second? Our applications at DeNA need to execute lots of PK lookups, such as fetching user info by user id, fetching diary info by diary id. memcached and NoSQL certainly fit very well for such requirements. When you run simple multi-threaded "memcached get" benchmarks, you can probably execute 400,000+ get operations per second, even though memcached clients are located on remote servers. When I tested with the latest libmemcached and memcached, I could get 420,000 get per sec on a 2.5GHz x 8 core Nehalem box with a quad-port Broadcom Gigabit Ethernet card.&lt;br /&gt;&lt;br /&gt;&amp;nbsp; How frequently can MySQL execute PK lookups? Benchmarking is easy. Just run concurrent queries from sysbench, super-smack, mysqlslap, etc.&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace; line-height: 1px"&gt;[matsunobu@host ~]$ mysqlslap --query="select user_name,..  &lt;br /&gt;from test.user where user_id=1" \ &lt;br /&gt;--number-of-queries=10000000 --concurrency=30 --host=xxx -uroot &lt;/span&gt;&lt;/pre&gt; You can easily check how many InnoDB rows are read per second.&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;[matsunobu@host ~]$ mysqladmin extended-status -i 1 -r -uroot \&lt;br /&gt;| grep -e "Com_select"&lt;br /&gt;...&lt;br /&gt;| Com_select                            | 107069     |&lt;br /&gt;| Com_select                            | 108873     |&lt;br /&gt;| Com_select                            | 108921     |&lt;br /&gt;| Com_select                            | 109511     |&lt;br /&gt;| Com_select                            | 108084     |&lt;br /&gt;| Com_select                            | 108483     |&lt;br /&gt;| Com_select                            | 108115     |&lt;br /&gt;...&lt;/span&gt;&lt;/pre&gt;&amp;nbsp; 100,000+ queries per second seems not bad, but much slower than memcached. What is MySQL actually doing? From vmstat output, both %user and %system were high.&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;[matsunobu@host ~]$ vmstat 1&lt;br /&gt; r  b  swpd   free   buff  cache      in     cs us sy id wa st&lt;br /&gt;23  0     0 963004 224216 29937708 58242 163470 59 28 12  0  0&lt;br /&gt;24  0     0 963312 224216 29937708 57725 164855 59 28 13  0  0&lt;br /&gt;19  0     0 963232 224216 29937708 58127 164196 60 28 12  0  0&lt;br /&gt;16  0     0 963260 224216 29937708 58021 165275 60 28 12  0  0&lt;br /&gt;20  0     0 963308 224216 29937708 57865 165041 60 28 12  0  0&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&amp;nbsp; Oprofile output told more about where CPU resources were spent.&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;samples  %        app name                 symbol name&lt;br /&gt;259130    4.5199  mysqld                   MYSQLparse(void*)&lt;br /&gt;196841    3.4334  mysqld                   my_pthread_fastmutex_lock&lt;br /&gt;106439    1.8566  libc-2.5.so              _int_malloc&lt;br /&gt;94583     1.6498  bnx2                     /bnx2&lt;br /&gt;84550     1.4748  ha_innodb_plugin.so.0.0.0 ut_delay&lt;br /&gt;67945     1.1851  mysqld                   _ZL20make_join_statistics&lt;br /&gt;P4JOINP10TABLE_LISTP4ItemP16st_dynamic_array&lt;br /&gt;63435     1.1065  mysqld                   JOIN::optimize()&lt;br /&gt;55825     0.9737  vmlinux                  wakeup_stack_begin&lt;br /&gt;55054     0.9603  mysqld                   MYSQLlex(void*, void*)&lt;br /&gt;50833     0.8867  libpthread-2.5.so        pthread_mutex_trylock&lt;br /&gt;49602     0.8652  ha_innodb_plugin.so.0.0.0 row_search_for_mysql&lt;br /&gt;47518     0.8288  libc-2.5.so              memcpy&lt;br /&gt;46957     0.8190  vmlinux                  .text.elf_core_dump&lt;br /&gt;46499     0.8111  libc-2.5.so              malloc&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&amp;nbsp; MYSQLparse() and MYSQLlex() are called during SQL parsing phase. make_join_statistics() and JOIN::optimize() are called during query optimization phase. These are "SQL" overhead. It's obvious that performance drops were caused by mostly SQL layer, not by "InnoDB(storage)" layer. MySQL has to do a lot of things like below while memcached/NoSQL do not neeed to do.&lt;br /&gt;&lt;br /&gt;* Parsing SQL statements&lt;br /&gt;* Opening, locking tables&lt;br /&gt;* Making SQL execution plans&lt;br /&gt;* Unlocking, closing tables&lt;br /&gt;&lt;br /&gt;&amp;nbsp; MySQL also has to do lots of concurrency controls. For example, fcntl() are called lots of times during sending/receiving network packets. Global mutexes such as LOCK_open, LOCK_thread_count are taken/relesed very frequently. That's why my_pthread_fastmutex_lock() were ranked #2 in the oprofile output and %system were not small. &lt;br /&gt;&amp;nbsp; Both MySQL development team and external community are aware of concurrency issues. Some issues have already been solved in 5.5. I'm glad to see that lots of fixes have been done so far.&lt;br /&gt;&amp;nbsp; But it is also important that %user reached 60%. Mutex contentions result in %system increase, not %user increase. Even though all mutex issues inside MySQL are fixed, we can not expect 300,000 queries per second.&lt;br /&gt;&amp;nbsp; You may be heard about HANDLER statement. Unfortunately HANDLER statement was not so much helpful to improve throughput because query parsing, opening/closing tables still be needed. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;CPU efficiency is important for in-memory workloads&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; If little active data fit in memory, SQL overheads become relatively negligible. This is simply because disk i/o costs are much much higher. We do not need to care so much about SQL costs in this case. &lt;br /&gt;&amp;nbsp; But, on some of our hot MySQL servers, almost all data fit in memory and they became completely CPU bound. Profiling results were similar to what I described above: SQL layer spent most of resources. We needed to execute lots of primary key lookups(i.e. SELECT x FROM t WHERE id=?) or limited range scans. Even though 70-80% of queries were simple PK lookups from the same table (difference was just values in WHERE), every time MySQL had to parse/open/lock/unlock/close, which seemed not efficient for us.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Have you heard about NDBAPI?&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; Is there any good solution to reduce CPU resources/contentions around SQL layer in MySQL? If you are using MySQL Cluster, &lt;a href="http://dev.mysql.com/doc/ndbapi/en/index.html"&gt;NDBAPI &lt;/a&gt;would be the best solution. When I worked at MySQL/Sun/Oracle as a consultant, I had seen lots of customers who were dissapointed at SQL Node + NDB performance, then became happy after they could get N times bettern performance by writing NDBAPI clients. You can use both NDBAPI and SQL in MySQL Cluster. It's recommended using NDBAPI for frequent access patterns, and using SQL + MySQL + NDB for ad-hoc or infrequent query patterns.&lt;br /&gt;&amp;nbsp; This was what we wanted. We wanted faster access APIs, but we also wanted to use SQL for ad-hoc or complex queries. But DeNA is using InnoDB, like many other web services. Switching to NDB is not trivial. &lt;a href="http://www.innodb.com/products/embedded-innodb/"&gt;Embedded InnoDB&lt;/a&gt; does neither support SQL nor network interface so it's not an option for us.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Developing "HandlerSocket Plugin" - a MySQL plugin that speaks NoSQL network protocols&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; We thought that the best approach was implementing a NoSQL network server inside MySQL. That is, writing a network server as a MySQL plugin (daemon plugin) which listens on specific ports, accepting NoSQL protocols/APIs, then accessing to InnoDB directly by using &lt;a href="http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine"&gt;MySQL internal storage engine APIs&lt;/a&gt;. This approach is similar to NDBAPI, but it can talk with InnoDB. &lt;br /&gt;&amp;nbsp; This concept was initially invented and prototyped by &lt;a href="http://developer.cybozu.co.jp/kazuho/in_english/"&gt;Kazuho Oku&lt;/a&gt; at &lt;a href="http://labs.cybozu.co.jp/en/"&gt;Cybozu Labs&lt;/a&gt; last year. He wrote &lt;a href="http://developer.cybozu.co.jp/kazuho/2009/08/mycached-memcac.html"&gt;MyCached&lt;/a&gt; UDF that speaks memcached protocols. My colleague Akira Higuchi implemented another plugin: &lt;a href="http://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL"&gt;HandlerSocket&lt;/a&gt;. The below picture shows about what Hanldersocket can do.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/__ybECuKG5tc/TLqWDlJxDNI/AAAAAAAAACQ/bOq6w5Q5nYc/s1600/mysql_HandlerSocket.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 359px;" src="http://4.bp.blogspot.com/__ybECuKG5tc/TLqWDlJxDNI/AAAAAAAAACQ/bOq6w5Q5nYc/s600/mysql_HandlerSocket.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5528896480807554258" /&gt;&lt;/a&gt;&lt;br /&gt;&lt; Fig 1 &gt; What is Hanldersocket?&lt;br /&gt;&lt;br /&gt;&amp;nbsp; HandlerSocket is a MySQL daemon plugin so that applications can use MySQL like NoSQL. The biggest purpose of the HandlerSocket is that it talks with storage engines like InnoDB without SQL-related overheads. To access MySQL tables, of course HandlerSocket needs to open/close tables. But HandlerSocket does not open/close tables every time. It keeps tables opened for reuse. Opening/closing tables is very costly and causes serious mutex contentions so it's very helpful to improve performance. Of course HandlerSocket closes tables when traffics become small etc so that it won't block administrative commands (DDL) forever.&lt;br /&gt;&amp;nbsp; What is different from using MySQL + memcached? By comparing Fig 1 with Fig 2, I think you'll notice lots of differences. Fig 2 shows typical memcached and MySQL usage. memcached is aggressively used for caching database records. This is mainly because memcached get operation is much faster than in-memory / on-disk PK lookups in MySQL. If HandlerSocket can fetch records as fast as memcached, we don't need memcached for caching records.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/__ybECuKG5tc/TLqV2t9GeeI/AAAAAAAAACI/e8knOF_enPM/s1600/mysql_memcached.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 414px;" src="http://1.bp.blogspot.com/__ybECuKG5tc/TLqV2t9GeeI/AAAAAAAAACI/e8knOF_enPM/s600/mysql_memcached.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5528896259832052194" /&gt;&lt;/a&gt;&lt;br /&gt;&lt; Fig 2 &gt; Common architecture pattern for MySQL + memcached&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Using HandlerSocket&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; As an example, here is a "user" table. Suppose we need to fetch user information by user_id. &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;CREATE TABLE user (&lt;br /&gt;  user_id INT UNSIGNED PRIMARY KEY,&lt;br /&gt;  user_name VARCHAR(50),&lt;br /&gt;  user_email VARCHAR(255),&lt;br /&gt;  created DATETIME &lt;br /&gt;) ENGINE=InnoDB; &lt;/span&gt;&lt;/pre&gt; In MySQL, fetching user info can be done by, of course, SELECT statements.&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;mysql&gt; SELECT user_name, user_email, created FROM user WHERE user_id=101;&lt;br /&gt;+---------------+-----------------------+---------------------+&lt;br /&gt;| user_name     | user_email            | created             |&lt;br /&gt;+---------------+-----------------------+---------------------+&lt;br /&gt;| Yukari Takeba | yukari.takeba@dena.jp | 2010-02-03 11:22:33 |&lt;br /&gt;+---------------+-----------------------+---------------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;/span&gt;&lt;/pre&gt;&amp;nbsp; Let me show how we can do the same thing with HandlerSocket.&lt;br /&gt;&lt;br /&gt;* Installing HandlerSocket&lt;br /&gt;&amp;nbsp; Installation steps are described &lt;a href="http://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL/blob/master/docs-en/installation.en.txt"&gt;here&lt;/a&gt;. Basic steps are as below:&lt;br /&gt;&lt;br /&gt;1. Download HandlerSocket &lt;a href="http://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL"&gt;here&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;2. Building HandlerSocket (both client and server codes)&lt;br /&gt;./configure --with-mysql-source=...  --with-mysql-bindir=... ; make; make install&lt;br /&gt;3. Installing HandlerSocket into MySQL &lt;br /&gt;mysql&gt; INSTALL PLUGIN handlersocket soname 'handlersocket.so';&lt;br /&gt;&lt;br /&gt;&amp;nbsp; Since HandlerSocket is a MySQL plugin, you can use it like other plugins such as InnoDB Plugin, Q4M, Spider, etc. That is, you do not need to modify MySQL source code itself. MySQL version has to be 5.1 or later. You need both MySQL source code and MySQL binary to build HandlerSocket.&lt;br /&gt;&lt;br /&gt;* Writing HandlerSocket client code&lt;br /&gt;&amp;nbsp; We provice C++ and Perl client libraries. Here is a sample Perl code to fetch a row by pk lookup. &lt;br /&gt;&lt;pre name="code" class="c"&gt;&lt;br /&gt;#!/usr/bin/perl&lt;br /&gt;&lt;br /&gt;use strict;&lt;br /&gt;use warnings;&lt;br /&gt;use Net::HandlerSocket;&lt;br /&gt;&lt;br /&gt;#1. establishing a connection&lt;br /&gt;my $args = { host =&gt; 'ip_to_remote_host', port =&gt; 9998 };&lt;br /&gt;my $hs = new Net::HandlerSocket($args);&lt;br /&gt;&lt;br /&gt;#2. initializing an index so that we can use in main logics.&lt;br /&gt; # MySQL tables will be opened here (if not opened)&lt;br /&gt;my $res = $hs-&gt;open_index(0, 'test', 'user', 'PRIMARY',&lt;br /&gt;    'user_name,user_email,created');&lt;br /&gt;die $hs-&gt;get_error() if $res != 0;&lt;br /&gt;&lt;br /&gt;#3. main logic&lt;br /&gt; #fetching rows by id&lt;br /&gt; #execute_single (index id, cond, cond value, max rows, offset)&lt;br /&gt;$res = $hs-&gt;execute_single(0, '=', [ '101' ], 1, 0);&lt;br /&gt;die $hs-&gt;get_error() if $res-&gt;[0] != 0;&lt;br /&gt;shift(@$res);&lt;br /&gt;for (my $row = 0; $row &lt; 1; ++$row) {&lt;br /&gt;  my $user_name= $res-&gt;[$row + 0];&lt;br /&gt;  my $user_email= $res-&gt;[$row + 1];&lt;br /&gt;  my $created= $res-&gt;[$row + 2];&lt;br /&gt;  print "$user_name\t$user_email\t$created\n";&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;#4. closing the connection&lt;br /&gt;$hs-&gt;close();&lt;br /&gt;&lt;/pre&gt;&amp;nbsp; The above code fetches user_name, user_email and created columns from user table, looking by user_id=101. So you'll get the same results as the above SELECT statement.&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;[matsunobu@host ~]$ perl sample.pl&lt;br /&gt;Yukari Takeba   yukari.takeba@dena.jp   2010-02-03 11:22:33&lt;/span&gt;&lt;/pre&gt;&amp;nbsp; For most web applications, it's a good practice to keep lightweight HandlerSocket connections established (persistent connections), so that lots of requests can focus on main logic (the #3 in the above code).&lt;br /&gt;&amp;nbsp; HandlerSocket protocol is a small-sized text based protocol. Like memcached text protocol, you can use telnet to get rows through HandlerSocket.&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;[matsunobu@host ~]$ telnet 192.168.1.2 9998&lt;br /&gt;Trying 192.168.1.2...&lt;br /&gt;Connected to xxx.dena.jp (192.168.1.2).&lt;br /&gt;Escape character is '^]'.&lt;br /&gt;&lt;span style="color:#008800"&gt;P       0       test    user    PRIMARY user_name,user_email,created&lt;/span&gt;&lt;br /&gt;0       1&lt;br /&gt;&lt;span style="color:#008800"&gt;0       =       1       101&lt;/span&gt;&lt;br /&gt;0       3       Yukari Takeba   yukari.takeba@dena.jp   2010-02-03 11:22:33&lt;/span&gt;&lt;/pre&gt;(&lt;span style="color:#008800"&gt;Green&lt;/span&gt; lines are request packets, fields must be separated by TAB)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Benchmarking&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; Now it's good time to show our benchmarking results. I used the above user table, and tested how many PK lookup operations can be done from multi-threaded remote clients. All user data fit in memory (I tested 1,000,000 rows). I also tested memcached with similar data (I used libmemcached and memcached_get() to fetch a user data). In MySQL via SQL tests, I used a traditional SELECT statement: "SELECT user_name, user_email, created FROM user WHERE user_id=? ". Both memcached and HandlerSocket client codes were written in C/C++. All client programs were located on remote hosts, connecting to MySQL/memcached via TCP/IP. &lt;br /&gt;&lt;br /&gt;&amp;nbsp; The highest throughput was as follows:&lt;br /&gt;&lt;pre&gt;&lt;span style="font-family: monospace;"&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;                           approx qps     server CPU util&lt;br /&gt;MySQL via SQL                105,000      %us 60%  %sy 28%&lt;br /&gt;memcached                    420,000      %us  8%  %sy 88%&lt;br /&gt;MySQL via HandlerSocket      750,000      %us 45%  %sy 53%&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp; MySQL via HandlerSocket could get over 7.5 times higher throughput than traditional MySQL via SQL statements, even though %us was 3/4. This shows that SQL-layer in MySQL is very costly and skipping the layer certainly improves performance dramatically.  It is also interesting that MySQL via HandlerSocket was 178% faster than memcached, and memcached spent too much %system resources. Though memcached is an excellent product, there are still rooms for optimizations.&lt;br /&gt;&lt;br /&gt;&amp;nbsp; The below is oprofile outputs, gathered during MySQL via HandlerSocket tests. CPU resources were spent on core operations such as network packets handling, fetching rows, etc (bnx2 is a network device driver program). &lt;pre&gt;&lt;span style="font-family: monospace;"&gt;samples  %        app name                 symbol name&lt;br /&gt;984785    5.9118  bnx2                     /bnx2&lt;br /&gt;847486    5.0876  ha_innodb_plugin.so.0.0.0 ut_delay&lt;br /&gt;545303    3.2735  ha_innodb_plugin.so.0.0.0 btr_search_guess_on_hash&lt;br /&gt;317570    1.9064  ha_innodb_plugin.so.0.0.0 row_search_for_mysql&lt;br /&gt;298271    1.7906  vmlinux                  tcp_ack&lt;br /&gt;291739    1.7513  libc-2.5.so              vfprintf&lt;br /&gt;264704    1.5891  vmlinux                  .text.super_90_sync&lt;br /&gt;248546    1.4921  vmlinux                  blk_recount_segments&lt;br /&gt;244474    1.4676  libc-2.5.so              _int_malloc&lt;br /&gt;226738    1.3611  ha_innodb_plugin.so.0.0.0 _ZL14build_template&lt;br /&gt;P19row_prebuilt_structP3THDP8st_tablej&lt;br /&gt;206057    1.2370  HandlerSocket.so         dena::hstcpsvr_worker::run_one_ep()&lt;br /&gt;183330    1.1006  ha_innodb_plugin.so.0.0.0 mutex_spin_wait&lt;br /&gt;175738    1.0550  HandlerSocket.so         dena::dbcontext::&lt;br /&gt;cmd_find_internal(dena::dbcallback_i&amp;, dena::prep_stmt const&amp;, &lt;br /&gt;ha_rkey_function, dena::cmd_exec_args const&amp;)&lt;br /&gt;169967    1.0203  ha_innodb_plugin.so.0.0.0 buf_page_get_known_nowait&lt;br /&gt;165337    0.9925  libc-2.5.so              memcpy&lt;br /&gt;149611    0.8981  ha_innodb_plugin.so.0.0.0 row_sel_store_mysql_rec&lt;br /&gt;148967    0.8943  vmlinux                  generic_make_request&lt;/span&gt;&lt;/pre&gt;&amp;nbsp; Since MySQL via HandlerSocket runs inside MySQL and goes to InnoDB, you can get statistics from regular MySQL commands such as SHOW GLOBAL STATUS. It's worth to see 750,000+ Innodb_rows_read. &lt;pre&gt;&lt;span style="font-family: monospace;"&gt;$ mysqladmin extended-status -uroot -i 1 -r | grep "InnoDB_rows_read"&lt;br /&gt;...&lt;br /&gt;| Innodb_rows_read                      | 750192     |&lt;br /&gt;| Innodb_rows_read                      | 751510     |&lt;br /&gt;| Innodb_rows_read                      | 757558     |&lt;br /&gt;| Innodb_rows_read                      | 747060     |&lt;br /&gt;| Innodb_rows_read                      | 748474     |&lt;br /&gt;| Innodb_rows_read                      | 759344     |&lt;br /&gt;| Innodb_rows_read                      | 753081     |&lt;br /&gt;| Innodb_rows_read                      | 754375     |&lt;br /&gt;...&lt;/span&gt;&lt;/pre&gt;Detailed specs were as follows. &lt;br /&gt;&amp;nbsp; &amp;nbsp; Model: Dell PowerEdge R710&lt;br /&gt;&amp;nbsp; &amp;nbsp; CPU: Nehalem 8 cores, E5540  @ 2.53GHz&lt;br /&gt;&amp;nbsp; &amp;nbsp; RAM: 32GB (all data fit in the buffer pool)&lt;br /&gt;&amp;nbsp; &amp;nbsp; MySQL Version: 5.1.50 with InnoDB Plugin&lt;br /&gt;&amp;nbsp; &amp;nbsp; memcached/libmemcached version: 1.4.5(memcached), 0.44(libmemcached)&lt;br /&gt;&amp;nbsp; &amp;nbsp; Network: Broadcom NetXtreme II BCM5709 1000Base-T (Onboard, quad-port, using three ports)&lt;br /&gt;&amp;nbsp; &amp;nbsp; * Both memcached and HandlerSocket were network i/o bound. When I tested with a single port, I got around 260,000 qps on MySQL via HandlerSocket, 220,000 qps on memcached.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Features and Advantages of HandlerSocket&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; HandlerSocket has lots of features and advantages like below. Some of them are really beneficial for us.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Supporting lots of query patterns&lt;/span&gt;&lt;br /&gt;&amp;nbsp; HandlerSocket supports PK/unique lookups, non-unique index lookups, range scan, LIMIT, and INSERT/UPDATE/DELETE. Operations that do not use any index are not supported. multi_get operations (similar to IN(1,2,3..), fetching multiple rows via single network round-trip) are also supported.&lt;br /&gt;&amp;nbsp; See &lt;a href="http://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL/tree/master/docs-en/"&gt;documentation&lt;/a&gt; for details.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Can handle lots of concurrent connections&lt;/span&gt;&lt;br /&gt;&amp;nbsp; HandlerSocket connection is light. Since HandlerSocket employs epoll() and worker-thread/thread-pooling architecture, the number of MySQL internal threads is limited (can be controlled by handlersocket_threads parameter in my.cnf). So you can establish thousands or tens of thousands of network connections to HandlerSocket, without losing stability(consuming too much memory, causing massive mutex contentions, etc: such as &lt;a href="http://bugs.mysql.com/bug.php?id=26590"&gt;bug#26590&lt;/a&gt;, &lt;a href="http://bugs.mysql.com/bug.php?id=33948"&gt;bug#33948&lt;/a&gt;, &lt;a href="http://bugs.mysql.com/bug.php?id=49169"&gt;bug#49169&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;*  Extremely high performance&lt;/span&gt;&lt;br /&gt;&amp;nbsp; HandlerSocket is possible to gain competitive enough performance against other NoSQL lineups, as already described. Actually I have not seen any NoSQL product that can execute 750,000+ queries on a commodity server from remote clients via TCP/IP.&lt;br /&gt;&amp;nbsp; Not only HandlerSocket eliminates SQL related function calls, but also it optimizes around network/concurrency issues. &lt;br /&gt;&lt;br /&gt;** Smaller network packets&lt;br /&gt;&amp;nbsp; HandlerSocket protocol is much simpler and smaller than normal MySQL protocols. So overall network transfer size can be much smaller.&lt;br /&gt;&lt;br /&gt;** Running limited number of MySQL internal threads&lt;br /&gt;&amp;nbsp; See above.&lt;br /&gt;&lt;br /&gt;** Grouping client requests&lt;br /&gt;&amp;nbsp; When lots of concurrent requests come to HandlerSocket, each worker thread gathers as many requests as possible, then executing gathered requests at one time, and sending back results. This can improve performance greatly, by sacrificing response time a bit. For example, you can gain the following benefits. I'll explain them in depth in later posts, if anybody is interested.&lt;br /&gt;*** Can reduce the number of fsync() calls&lt;br /&gt;*** Can reduce replication delay&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;br /&gt;* No duplicate cache&lt;/span&gt;&lt;br /&gt;&amp;nbsp; When you use memcached to cache MySQL/InnoDB records, records are cached in both memcached and InnoDB buffer pool. They are duplicate so less efficient (Memory is still expensive!). Since HandlerSocket plugin accesses to InnoDB storage engine, records are cached inside InnoDB buffer pool, which can be reused by other SQL statements.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No data inconsistency&lt;/span&gt;&lt;br /&gt;&amp;nbsp; Since data is stored at one place (inside InnoDB), data consistency check between memcached and MySQL is not needed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Crash-safe&lt;/span&gt;&lt;br /&gt;&amp;nbsp; Backend storage is InnoDB. It's transactional and crash safe. Even though you use innodb-flush-log-at-trx-commit!=1, you lose only &lt; 1s of data on server crash.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* SQL can be used from mysql clients&lt;/span&gt;&lt;br /&gt;&amp;nbsp; In many cases people still want to use SQL (i.e to generate summary reports). This is why we can't use Embedded InnoDB. Most NoSQL products don't support SQL interface, either.&lt;br /&gt;&amp;nbsp; HandlerSocket is just a plugin for MySQL. You can usually send SQL statements from MySQL clients, and use HandlerSocket protocols when you need high throughput. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* All operational benefits from MySQL&lt;/span&gt;&lt;br /&gt;&amp;nbsp; Again, HandlerSocket runs inside MySQL, so all MySQL operations such as SQL, online backups, replication, monitoring by Nagios / EnterpriseMonitor, etc are supported. HandlerSocket activities can be monitored by regular MySQL command such as SHOW GLOBAL STAUTS, SHOW ENGINE INNODB STATUS, SHOW PROCESSLIST, etc.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No need to modify/rebuild MySQL&lt;/span&gt;&lt;br /&gt;&amp;nbsp; Since it's a plugin, it runs on both MySQL Community and MySQL Enterprise Servers. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Independent from storage engines&lt;/span&gt;&lt;br /&gt;&amp;nbsp; HandlerSocket is developed so that it can talk with any storage engine, though we have tested and used with 5.1 and 5.5 InnoDB Plugin only.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Notes and Limitations&lt;/h3&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* Need to learn HandlerSocket APIs&lt;/span&gt;&lt;br /&gt;&amp;nbsp; You need to write a program to talk with HandlerSocket, though it's pretty easy to use. We provide C++ API and Perl bindings.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No security&lt;/span&gt;&lt;br /&gt;&amp;nbsp; Like other NoSQL databases, HandlerSocket does not provide any security feature. HandlerSocket's worker threads run with system user privileges, so applications can access to all tables through HandlerSocket protocols. Of course you can use firewalls to filter packets, like other NoSQL products.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;* No benefit for HDD bound workloads&lt;/span&gt;&lt;br /&gt;&amp;nbsp; For HDD i/o bound workloads, a database instance can not execute thousands of queries per second, which normally results in only 1-10% CPU usage. In such cases, SQL execution layer does not become bottleneck, so there is no benefit to use Hanldersocket. We use HandlerSocket on servers that almost all data fit in memory.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;DeNA is using HandlerSocket in production&lt;/h3&gt;&lt;br /&gt;&amp;nbsp; We already use HandlerSocket plugin in our production environments. The results are great. We could have reduced lots of memcached and MySQL slave servers. Overall network traffics have been reduced, too. We haven't seen any performance problem (slow response time, stalls etc) so far. We've been very satisfied with the results.&lt;br /&gt;&amp;nbsp; I think MySQL has been underrated from NoSQL/Database communities. MySQL actually has much longer history than most of other products, and lots of unique and great enhancements have been done so far by excellent my ex-colleagues. I know from NDBAPI that MySQL has very strong potentials as a NoSQL. Storage engine API and daemon plugin interface are completely unique, and they made Akira and DeNA develop HandlerSocket possible. As an ex-employee at MySQL and a long-time fun for MySQL, I'd like to see MySQL becomes better and more popular, not only as an RDBMS, but also as Yet Another NoSQL lineup.&lt;br /&gt;&amp;nbsp; Since HandlerSocket plugin is Open Source, feel free to try. We'd be appreciated if you give us any feedback.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-765226865821283814?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/765226865821283814/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html#comment-form' title='229 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/765226865821283814'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/765226865821283814'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html' title='Using MySQL as a NoSQL - A story for exceeding 750,000 qps on a commodity server'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/__ybECuKG5tc/TLqWDlJxDNI/AAAAAAAAACQ/bOq6w5Q5nYc/s72-c/mysql_HandlerSocket.png' height='72' width='72'/><thr:total>229</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-6163585886432486807</id><published>2009-08-27T01:25:00.008+09:00</published><updated>2009-08-27T08:14:35.829+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Accessing MySQL tables from UDF + storage engine API</title><content type='html'>A couple of weeks ago, at a MySQL study group in Tokyo I presented about various kinds of MySQL hacking techniques, such as debugging with MS Visual Studio or gdb, tracing with DTrace, writing information schema plugins, accessing tables from UDF and storage engine API, and extending MySQL server code itself. This 90-minute session went successful. Nearly 100 people attended, including famous MySQL developers such as Kazuho Oku-san and Kentoku Shiba-san, having productive discussions with attendees and quite a lot of people seemed interested in MySQL hacking. The slides are written in Japanese, but sample codes can be understandable and can be downloaded &lt;a href="http://www.mysql.gr.jp/frame/modules/bwiki/index.php?plugin=attach&amp;refer=matsunobu&amp;openfile=demo.tar.gz"&gt;here&lt;/a&gt;. &lt;br /&gt; What audiences were most interested in was accessing MySQL tables directly from Plugins(currently UDFs) and storage engine API. I showed a sample UDF code, based on what Kazuho-san &lt;a href="http://labs.cybozu.co.jp/blog/kazuhoatwork/2008/06/optimizing_mysql_performance_u.php"&gt;released over one year ago&lt;/a&gt;, which is over 10 times faster than SQL or stored procedures for complex operations. Excerpted codes are as follows(skipping error handling).&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="c"&gt;&lt;br /&gt;#define MYSQL_SERVER 1&lt;br /&gt;#include &lt;mysql_priv.h&gt;&lt;br /&gt;...&lt;br /&gt;/* initializing tables, fields and indexes */&lt;br /&gt;  THD *thd = current_thd;&lt;br /&gt;  TABLE_LIST tables;&lt;br /&gt;  TABLE *table;&lt;br /&gt;  tables.init_one_table("test","message", TL_READ);&lt;br /&gt;  simple_open_n_lock_tables(thd, &amp;tables);&lt;br /&gt;  table= tables.table;&lt;br /&gt;  message_id_fld = get_field(table, "id");&lt;br /&gt;  message_user_id_fld = get_field(table, "user_id");&lt;br /&gt;  message_key = index_init(table, "user_id", true);&lt;br /&gt;&lt;br /&gt;/* accessing table. Equivalent query is &lt;br /&gt;   SELECT id FROM message WHERE user_id=? ORDER BY id DESC LIMIT 1; */&lt;br /&gt;  uchar*  key_buff= (uchar*) thd-&gt;alloc(message_key-&gt;key_length);&lt;br /&gt;  bzero(key_buff, message_key-&gt;key_length);&lt;br /&gt;  int null_offset= message_user_id_fld-&gt;null_bit; &lt;br /&gt;  int4store(key_buff + null_offset, user_id);&lt;br /&gt;  err= table-&gt;file-&gt;index_read_last_map(table-&gt;record[0], key_buff, 1);&lt;br /&gt;  fprintf(stderr, "id=%lld\n", message_id_fld-&gt;val_int());&lt;br /&gt;&lt;br /&gt;/* closing objects */&lt;br /&gt;  table-&gt;file-&gt;ha_index_end();&lt;br /&gt;  close_thread_tables(thd);&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;By defining MYSQL_SERVER to 1, you can access to MySQL internal structures. &lt;br /&gt;You can build the sample UDF on development environment as follows. MySQL source code is required to build when accessing to internal structures. &lt;br /&gt;&lt;pre name="code" class="c"&gt;&lt;br /&gt;g++ -g -DSAFE_MUTEX -DSAFEMALLOC \&lt;br /&gt;-Wall -fno-rtti -fno-exceptions -fPIC -shared \&lt;br /&gt;-I/debug-built-mysql-src/include \&lt;br /&gt;-I/debug-built-mysql-src/regex \&lt;br /&gt;-I/debug-built-mysql-src/sql \&lt;br /&gt;-o udf_sample.so  udf_sample.cc&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;On production environment, skip using -DSAFE_MUTEX -DSAFEMALLOC and include MySQL source built with configurations for production. When using some macros(i.e. current_thd) that behave differently between debug and prod configurations, you need to include appropriate source directly. &lt;br /&gt;&lt;pre name="code" class="c"&gt;&lt;br /&gt;g++ -g \&lt;br /&gt;-Wall -fno-rtti -fno-exceptions -fPIC -shared \&lt;br /&gt;-I/prod-built-mysql-src/include \&lt;br /&gt;-I/prod-built-mysql-src/regex \&lt;br /&gt;-I/prod-built-mysql-src/sql \&lt;br /&gt;-o udf_sample.so  udf_sample.cc&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This snippet skips error handling etc for good legibility, but you need to carefully handle errors, otherwise mysqld crashes. Please check bench_udf.cc, build_udf_debug.sh, and build_udf_prod.sh&lt;a href="http://www.mysql.gr.jp/frame/modules/bwiki/index.php?plugin=attach&amp;refer=matsunobu&amp;openfile=demo.tar.gz"&gt;in my sample code&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt; MySQL is fast, but it still has many expensive CPU operations such as parsing SQL statements, making execution plans every time (MySQL doesn't have optimizer-plan caching mechanism), a lot of memcpy/malloc operations, etc. If you write a UDF which directly accesses tables through storage engine API, these expensive operations can be skipped so it can be much faster than SQL statements in many cases. &lt;br /&gt; The above code looks like using key value store APIs (i.e. TokyoCabinet APIs). Most KVSs are faster than MySQL, and one of the reasons is they don't need to do extra operations such as making execution plans etc. But it is not impossible to use MySQL like fast KVS as above. You can not only get benefits about better performance, but also utilize traditional RDBMS implementations such as B+Tree (range scan), perfect durability, schema awareness. Data is stored in one place, and you can get the same data regardless of accessing methods. For example, you can use UDF for some special parts (i.e. picking up all friends' latest messages, which is difficult for MySQL to fetch quickly), and use normal SQL statements for others. I assume such two-way client interfaces (fast API and traditional SQL) can be a common implementation for next-generation RDBMS. Third, this is currently not common hacks so there is almost no document. For example, handling two or more tables requires additional codes, but this is not documented (I am still fully not understand to safely handle them). &lt;br /&gt; Of course there are many disadvantages about using UDFs and storage engine APIs directly. First, it highly depends on MySQL versions. Storage engine APIs are standardized but client side operations such as calling table open/lock functions are different from MySQL versions. Second, mysqld easily crashes if UDF has bugs. &lt;br /&gt; But there are a lot of possibilities here and this certainly attracts developers. For example, Kazuho-san yesterday published a very interesting UDF: &lt;a href="http://developer.cybozu.co.jp/kazuho/2009/08/mycached-memcac.html"&gt;mycached (memcached server udf)&lt;/a&gt;. It listens memcached protocol, then directly accessing a table through storage engine API (primary key lookup). According to his benchmark this is twice as fast as  SELECT with primary key lookup (if records are cached in buffer pool). I assume this approach will work very well as a durable key-value store, for example storing session data, last access timestamp etc.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-6163585886432486807?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/6163585886432486807/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/08/accessing-mysql-tables-from-udf-storage.html#comment-form' title='25 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/6163585886432486807'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/6163585886432486807'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/08/accessing-mysql-tables-from-udf-storage.html' title='Accessing MySQL tables from UDF + storage engine API'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>25</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-6291429043739839196</id><published>2009-08-12T20:11:00.026+09:00</published><updated>2009-08-13T06:25:40.600+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='InnoDB'/><title type='text'>Great performance effect of fixing broken group commit</title><content type='html'>Yesterday &lt;a href="http://www.innodb.com/wp/2009/08/11/innodb-plugin-104-released/"&gt;InnoDB Plugin 1.0.4&lt;/a&gt; was released by Innobase. This version contains one of the most important performance fixes - "Fix for broken group commit". After MySQL5.0, InnoDB breaks group commit when using with binary log (or with other transactional storage engines), even though setting innodb_support_xa=0. This was really serious because fsync() (called at transaction commit when setting innodb_flush_log_at_trx_commit=1) is very expensive. &lt;a href="http://bugs.mysql.com/bug.php?id=13669"&gt;The initial bug report&lt;/a&gt; about this was submitted by Peter four years ago. Last year David Lutz submitted &lt;a href="http://bugs.mysql.com/bug.php?id=38501"&gt;a bug report with his prototype patch&lt;/a&gt;. It is great news that this bug fix has been finally implemented in the official InnoDB release.&lt;br /&gt;I did a simple benchmarking by mysqlslap. mysqlslap has functionality to run concurrent inserts from multiple connections. The result is as follows.&lt;br /&gt;&lt;pre&gt;mysqlslap --concurrency=1,5,10,20,30,50 --iterations=1 --engine=innodb \&lt;br /&gt;--auto-generate-sql --auto-generate-sql-load-type=write \&lt;br /&gt;--number-of-queries=100000&lt;/pre&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/__ybECuKG5tc/SoLRkFG_DQI/AAAAAAAAABg/s3Kkprdnwts/s1600-h/fix_broken_group_commit.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 155px;" src="http://4.bp.blogspot.com/__ybECuKG5tc/SoLRkFG_DQI/AAAAAAAAABg/s3Kkprdnwts/s400/fix_broken_group_commit.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5369084123556023554" /&gt;&lt;/a&gt;&lt;br /&gt;H/W is Sun Fire X4450, Intel Xeon X5560 Nehalem 2.80GHz * 16cores, 12GB RAM, SAS HDD with write cache. my.cnf configuration is as follows. log-bin is enabled.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[mysqld]&lt;br /&gt;basedir=/usr/mysql5137&lt;br /&gt;datadir=/data/mysql5137/data&lt;br /&gt;ignore_builtin_innodb&lt;br /&gt;plugin-load=innodb=ha_innodb.so;innodb_trx=ha_innodb.so;&lt;br /&gt;  innodb_locks=ha_innodb.so;innodb_lock_waits=ha_innodb.so;&lt;br /&gt;  innodb_cmp=ha_innodb.so;innodb_cmp_reset=ha_innodb.so;&lt;br /&gt;  innodb_cmpmem=ha_innodb.so;innodb_cmpmem_reset=ha_innodb.so&lt;br /&gt;innodb_log_files_in_group=2&lt;br /&gt;innodb_buffer_pool_size=2G&lt;br /&gt;innodb_flush_method=O_DIRECT&lt;br /&gt;innodb_log_file_size=512M&lt;br /&gt;innodb_data_file_path=ibdata1:500M:autoextend&lt;br /&gt;innodb_file_per_table&lt;br /&gt;log-bin&lt;br /&gt;table_cache=8192&lt;/pre&gt;&lt;br /&gt;Apparently InnoDB Plugin 1.0.4 outperforms normal InnoDB (6.1 times faster on 30 connections, innodb_support_xa=1). Normal InnoDB doesn't scale with connections but InnoDB Plugin 1.0.4 does. What is the difference? Normal InnoDB does the following at transaction commit.&lt;br /&gt;&lt;pre&gt;pthread_mutex_lock(&amp;amp;prepare_commit_mutex)&lt;br /&gt;  writing into InnoDB logfile for prepare, then fsync&lt;br /&gt;   (skipped if innodb_support_xa=0)&lt;br /&gt;  writing into binlog&lt;br /&gt;  writing into InnoDB logfile for commit, then fsync&lt;br /&gt;pthread_mutex_unlock(&amp;amp;prepare_commit_mutex)&lt;/pre&gt;&lt;br /&gt;Under the critical section protected by prepare_commit_mutex, only one thread can do operation. So when 100 threads do commit at the same time, fsync() is called 100 times for prepare, 100 times for commit (200 in total). Group commit is totally broken. As you see the above graph, innodb_support_xa=0 is effective (though it still breaks group commit), but in general innodb_support_xa=0 is not recommended because it will break consistency between binlog and InnoDB in case of a crash.&lt;br /&gt;In InnoDB Plugin 1.0.4, the behavior has changed as follows.&lt;br /&gt;&lt;pre&gt;writing into InnoDB logfile for prepare, then fsync&lt;br /&gt; (skipped if innodb_support_xa=0)&lt;br /&gt;pthread_mutex_lock(&amp;amp;prepare_commit_mutex)&lt;br /&gt;  writing into binlog&lt;br /&gt;  writing into InnoDB logfile for commit&lt;br /&gt;pthread_mutex_unlock(&amp;amp;prepare_commit_mutex)&lt;br /&gt;fsync to the InnoDB logfile&lt;/pre&gt;&lt;br /&gt;fsync, the most expensive operation, is called outside the critical section, so group commit is possible and concurrency is much more improved. The following graph shows how much Innodb_data_fsyncs was increased after executing mysqlslap(committing 100,000 transactions).&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/__ybECuKG5tc/SoLR5SGu7pI/AAAAAAAAABo/zO3Kz1tmgOk/s1600-h/Innodb_data_fsyncs.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 151px;" src="http://4.bp.blogspot.com/__ybECuKG5tc/SoLR5SGu7pI/AAAAAAAAABo/zO3Kz1tmgOk/s400/Innodb_data_fsyncs.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5369084487821880978" /&gt;&lt;/a&gt;&lt;br /&gt; In 5.1.37+Builtin(support_xa=1), 2 fsyncs happens per transaction, regardless of # of concurrent connections. In 5.1.37+Builtin(support_xa=0), 1 fsync happens per transaction, regardless of # of concurrent connections. These mean group commit is broken. In both cases about 10,000 fsyncs were executed per second, which seems upper limit for regular HDD with BBU. On the other hand, InnoDB Plugin 1.0.4 greatly reduces the number of fsyncs(i.e. 200251 to 26092 on 30 connections(innodb_support_xa=1): 87% decreased). This shows group commit works well.  &lt;br /&gt;&lt;br /&gt; Write ordering between binlog and InnoDB logfile is still guaranteed. Write ordering for InnoDB prepare is not same as the ordering of binlog, but this is fine. Prepared entries are used only for recovery and not visible to applications. When doing crash recovery, mysqld reads binlog at first(picking up xids), then checking prepared but not committed entries(xids) in InnoDB logfile, then applying these entries in the order of binlog xids. So in the end write ordering is guaranteed.&lt;br /&gt;Note that if you set sync_binlog=1, it is still very slow because writing into binlog is protected by mutex (prepare_commit_mutex and LOCK_log).&lt;br /&gt;This is my understanding. InnoDB Plugin 1.0.4 also has other awesome features (i.e. multiple i/o threads) so it's worth trying.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-6291429043739839196?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/6291429043739839196/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/08/great-performance-effect-of-fixing.html#comment-form' title='24 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/6291429043739839196'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/6291429043739839196'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/08/great-performance-effect-of-fixing.html' title='Great performance effect of fixing broken group commit'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/__ybECuKG5tc/SoLRkFG_DQI/AAAAAAAAABg/s3Kkprdnwts/s72-c/fix_broken_group_commit.JPG' height='72' width='72'/><thr:total>24</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-2873864454426272699</id><published>2009-07-28T21:47:00.009+09:00</published><updated>2009-07-28T23:55:33.263+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><title type='text'>iostat: (r/s + w/s) * svctm = %util on Linux</title><content type='html'>iostat -x is very useful to check disk i/o activities.  Sometimes it is said that "check %util is less than 100%" or "check svctm is less than 50ms", but please do not fully trust these numbers. For example, the following two cases (DBT-2 load on MySQL) used same disks (two HDD disks, RAID1) and reached almost 100% util, but performance numbers were very different (no.2 was about twice as fast as no.1).&lt;br /&gt;&lt;pre&gt;# iostat -xm 10&lt;br /&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;       21.16    0.00    6.14   29.77    0.00   42.93&lt;br /&gt;&lt;br /&gt;Device: rqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s&lt;br /&gt;sdb    2.60 389.01  283.12 47.35     4.86     2.19&lt;br /&gt;avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;43.67     4.89   14.76   3.02  99.83&lt;/pre&gt;&lt;br /&gt;&lt;pre&gt;# iostat -xm 10&lt;br /&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;       40.03    0.00   16.51   16.52    0.00   26.94&lt;br /&gt;&lt;br /&gt;Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s&lt;br /&gt;sdb              6.39   368.53 543.06 490.41     6.71     3.90&lt;br /&gt;avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;21.02     3.29    3.20   0.90  92.66&lt;/pre&gt;&lt;br /&gt;100% util does not mean disks can not be faster anymore. For example, command queuing (TCQ/NCQ) or battery backed up write cache can often boosts performance significantly. For random i/o oriented applications(in most cases), I pay attention to r/s and w/s. r/s is the number of read requests that were issued to  the  device per second. w/s is the  number of write requests that were issued to the device per second (copied from man). r/s + w/s is the total number of i/o requests per second (IOPS) so it is easier to check whether disks work as expected or not. For example, a few thousands of IOPS can be expected on single Intel SSD drive. For sequential i/o operations, r/s and w/s can be significantly affected by Linux parameters such as max_sectors_kb even though throughput is not different, so I check different iostat status variables such as rrqm/s, rMB/s.&lt;br /&gt;&lt;br /&gt;What about svctm? Actually Linux's iostat calculates svctm automatically from r/s, w/s and %util. Here is an excerpt from iostat.c .&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;nr_ios = sdev.rd_ios + sdev.wr_ios;&lt;br /&gt;tput   = ((double) nr_ios) * HZ / itv;&lt;br /&gt;util   = ((double) sdev.tot_ticks) / itv * HZ;&lt;br /&gt;svctm  = tput ? util / tput : 0.0;&lt;br /&gt;...&lt;br /&gt;/*       rrq/s wrq/s   r/s   w/s  rsec  wsec   rkB   wkB  rqsz  qusz await svctm %util */&lt;br /&gt;printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",&lt;br /&gt;   ((double) sdev.rd_merges) / itv * HZ,&lt;br /&gt;   ((double) sdev.wr_merges) / itv * HZ,&lt;br /&gt;   ((double) sdev.rd_ios) / itv * HZ,&lt;br /&gt;   ((double) sdev.wr_ios) / itv * HZ,&lt;br /&gt;...&lt;/pre&gt;&lt;br /&gt;The latter means the following.&lt;br /&gt;&lt;pre&gt;r/s = sdev.rd_ios / itv * HZ&lt;br /&gt;w/s = sdev.wr_ios / itv * HZ&lt;/pre&gt;&lt;br /&gt;The former means the following.&lt;br /&gt;&lt;pre&gt;svctm = util / ((sdev.rd_ios + sdev.wr_ios) * HZ / itv)&lt;/pre&gt;&lt;br /&gt;If %util is 100%, svctm is just 1 / (r/s + w/s) seconds, 1000/(r/s+w/s) milliseconds. This is an inverse number of IOPS. In other words, svctm * (r/s+w/s) is always 1000 if %util is 100%. So checking svctm is practically as same as checking r/s and w/s (as long as %util is close to 100%). The latter (IOPS) is much easier, isn't it?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-2873864454426272699?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/2873864454426272699/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/07/iostat-rs-ws-svctm-util-on-linux.html#comment-form' title='27 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/2873864454426272699'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/2873864454426272699'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/07/iostat-rs-ws-svctm-util-on-linux.html' title='iostat: (r/s + w/s) * svctm = %util on Linux'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>27</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-2555307192937321819</id><published>2009-05-28T01:30:00.003+09:00</published><updated>2009-05-28T09:35:04.604+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><title type='text'>Overwriting is much faster than appending</title><content type='html'>Writing small volume of data (Bytes-MBs) with sync (fsync()/fdatasync()/O_SYNC/O_DSYNC) is very common for RDBMS and is needed to guarantee durability. For transactional log files, sync happens per commit. For data files, sync happens at checkpoint etc. Typically RDBMS does syncing data very frequently. In this case, overwriting is much faster than appending for most filesystems/storages. Overwriting does not change file size, while appending does. Increasing file size requires a lot of overheads such as allocating space within the filesystem, updating &amp;amp; flushing metadata. This really matters when you writes data with fsync() very frequently. The following are simple benchmarking results on ext3 RHEL5.3.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1. creating an empty file, then writing 8KB  128*1024 times with fdatasync()&lt;br /&gt;fdatasync per second: 3085.94321&lt;br /&gt;(Emulating current binlog (sync-binlog=1) behavior)&lt;br /&gt;&lt;br /&gt;2. creating a 1GB data file, then writing 8KB 128*1024 times with fdatasync()&lt;br /&gt;fdatasync per second: 12330.47268&lt;br /&gt;(Emulating current InnoDB log file behavior)&lt;br /&gt;&lt;br /&gt;3. zero-filling 1GB, then writing 8KB 128*1024 times with fdatasync() immediately&lt;br /&gt;fdatasync per second: 6569.00072&lt;br /&gt;(Zero-filling causes massive disk writes so killing application performance)&lt;br /&gt;&lt;br /&gt;4. zero-filling 1GB, sleeping 20 seconds, then writing 8KB 128*1024 times with fdatasync()&lt;br /&gt;fdatasync per second: 11669.81157&lt;br /&gt;(Zero-filling finished within 20 seconds. This is actually does the same thing with no.2)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Apparently no.2(and no.4) are much faster than no.1 and no.3. The difference between no.1 and no.2 is just appending or overwriting. Four times difference is amazing but this is real and I got similar results on other filesystems except zfs (I tested xfs, reiserfs and zfs). (Updated in May 28: I got about 7,000 fsync/sec for both appending/overwriting on zfs. There is no outstanding difference because zfs is Copy On Write filesystem as comment #3)&lt;br /&gt;&lt;br /&gt;This is one of the reason why sync-binlog=1 is very slow. Binlog is appended, not overwritten. Default sync-binlog value is 0 (no sync happens at commit) so appending does not cause serious performance drop. But there are cases that sync-binlog=1 is absolutely needed. I am currently directly working on this and implementing "&lt;a href="http://forge.mysql.com/worklog/task.php?id=4925"&gt;preallocating binlog&lt;/a&gt;" functionality.&lt;br /&gt;&lt;br /&gt;The difference between no.3 and no.4 is also interesting. Overwriting requires preallocation : allocating space before writing. If preallocation happens *dynamically during heavy loads* (no.3), application performance is seriously degraded.&lt;br /&gt;&lt;br /&gt;No.3 is close to current InnoDB system tablespace (ibdata) mechanism. InnoDB extends tablespace size by innodb_autoextend_increment MBs dynamically, then doing overwriting. But as you see above, dynamically preallocating with zero is not good for performance.&lt;br /&gt;&lt;br /&gt;Using posix_fallocate() instead of zero-filling will fix this issue. posix_fallocate() preallocates space without any overhead. Unfortunately currently most of enterprise Linux distributions/filesystems/glibc don't behave as expected, but internally doing zero-filling instead(including RHEL5.3).&lt;br /&gt;Preallocating large enough space before going into production or low-load hours (midnight etc) is a current workaround.&lt;br /&gt;&lt;br /&gt;Talking about InnoDB deployment, innodb_file_per_table is popular and easy-to-manage, but it's currently not overwriting architecture(Updated in May 28: Preallocating GBs of data beforehand is not possible unless you explicitly load &amp;amp; delete data into/from tables. The maximum (auto)extension of an .ibd file is 4MB at one time, regardless of innodb_autoextend_increment setting. See &lt;a href="http://bugs.mysql.com/bug.php?id=31592"&gt;bug#31592&lt;/a&gt; for detail.) Not using innodb_file_per_table, but preallocating large InnoDB tablespaces before going into production (i.e. ibdata1,2,3.., over 100GB in total) can often get better performance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-2555307192937321819?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/2555307192937321819/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/05/overwriting-is-much-faster-than_28.html#comment-form' title='23 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/2555307192937321819'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/2555307192937321819'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/05/overwriting-is-much-faster-than_28.html' title='Overwriting is much faster than appending'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>23</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-5741969762726604890</id><published>2009-05-19T00:42:00.006+09:00</published><updated>2009-05-19T01:02:02.819+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='storage'/><title type='text'>Make sure write cache is enabled on your RAID controller</title><content type='html'>Using a write cache protected by battery (BBWC) is well known and one of the best practices in RDBMS world. But I have frequently seen situations that people do not set write cache properly. Sometimes they just forget to enable write cache. Sometimes write cache is disabled even though they say they set properly.&lt;br /&gt;&lt;br /&gt;Make sure that BBWC is enabled. If not enabled, you will be able to easily get better performance by just enabling it. The following is a DBT-2 example.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;Write cache is disabled:&lt;br /&gt;&lt;br /&gt;# iostat -xm 10&lt;br /&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;        21.16    0.00    6.14   29.77    0.00   42.93&lt;br /&gt;&lt;br /&gt;Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s&lt;br /&gt;sdb               2.60   389.01 283.12 47.35     4.86&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;2.19    43.67     4.89   14.76   3.02  99.83&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Write cache is enabled:&lt;br /&gt;&lt;br /&gt;# iostat -xm 10&lt;br /&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;        40.03    0.00   16.51   16.52    0.00   26.94&lt;br /&gt;&lt;br /&gt;Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s&lt;br /&gt;sdb               6.39   368.53 543.06 490.41     6.71&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;3.90    21.02     3.29    3.20   0.90  92.66&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Both run same applications(DBT-2), but the server activity was significantly different each other. I got 85% better result when write cache is enabled.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Checking write cache and battery status&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;BBWC is mostly equipped with H/W raid controller so operational command depends on products. Here is an example of "arcconf" command result.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;# /usr/StorMan/arcconf GETCONFIG 1 AL&lt;br /&gt;...&lt;br /&gt;--------------------------------------------------------&lt;br /&gt;Controller Battery Information&lt;br /&gt;--------------------------------------------------------&lt;br /&gt;Status                          : Optimal&lt;br /&gt;Over temperature                : No&lt;br /&gt;Capacity remaining              : 99 percent&lt;br /&gt;Time remaining (at current draw): 3 days, 1 hours, 11 minutes&lt;br /&gt;&lt;br /&gt;--------------------------------------------------------&lt;br /&gt;Logical device information&lt;br /&gt;--------------------------------------------------------&lt;br /&gt;Logical device number 0&lt;br /&gt;...&lt;br /&gt;Read-cache mode    : Disabled&lt;br /&gt;Write-cache mode   : Enabled (write-back)&lt;br /&gt;Write-cache setting: Enabled (write-back) when&lt;br /&gt;protected by battery&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;--------------------------------------------------------&lt;br /&gt;Physical Device information&lt;br /&gt;--------------------------------------------------------&lt;br /&gt;Device #0&lt;br /&gt;Device is a Hard drive&lt;br /&gt;...&lt;br /&gt;Size        : 140009 MB&lt;br /&gt;Write Cache : Disabled (write-through)&lt;br /&gt;...&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Write cache should be enabled only when battery backup is working. In other words:&lt;br /&gt;- Write cache on logical device (H/W raid controller) is enabled when protected by battery&lt;br /&gt;- Write cache on physical device is disabled&lt;br /&gt;- The battery has enough capacity and long enough remaining time&lt;br /&gt;&lt;br /&gt;I recommend DBAs to monitor write cache status regularly (adding this to your monitoring scripts), including battery status checking. Long time ago I was asked for urgent help to fix a problem that application performance suddenly went down. I looked into problems then found that write cache unexpectedly turned off because a battery was expired. If you successfully detected that battery capacity was decreased before write cache was disabled, you would be able to take an action before server performance suddenly goes down (i.e. allocating scheduled down time in order to replace the battery).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Quick health check with mysql commands&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you are not familiar with H/W raid controller specific command but want to check write cache status quickly, using mysqlslap or stored procedure is easy.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;mysqlslap:&lt;br /&gt;$ mysql -e "set global innodb_flush_log_at_trx_commit=1"&lt;br /&gt;$ mysqlslap --concurrency=1 --iterations=1 --engine=innodb \&lt;br /&gt;--auto-generate-sql --auto-generate-sql-load-type=write \&lt;br /&gt;--number-of-queries=100000&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;stored procedure:&lt;br /&gt;create table t (c1 int) engine=innodb;&lt;br /&gt;delimiter //&lt;br /&gt;create procedure sp1(IN i INTEGER)&lt;br /&gt;BEGIN&lt;br /&gt;DECLARE done INTEGER DEFAULT 0;&lt;br /&gt;WHILE i &gt; 0 DO&lt;br /&gt;insert into t values (1);&lt;br /&gt;SET i = i - 1;&lt;br /&gt;END WHILE;&lt;br /&gt;END;&lt;br /&gt;//&lt;br /&gt;delimiter ;&lt;br /&gt;&lt;br /&gt;set global innodb_flush_log_at_trx_commit=1;&lt;br /&gt;call sp1(100000);&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;You will be able to insert thousands of records per second if write cache is enabled. If disabled, only hundreds of inserts per second is possible, so you can easily check.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-5741969762726604890?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/5741969762726604890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/05/make-sure-write-cache-is-enabled-on.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/5741969762726604890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/5741969762726604890'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/05/make-sure-write-cache-is-enabled-on.html' title='Make sure write cache is enabled on your RAID controller'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-1743104185323086365</id><published>2009-05-12T09:00:00.001+09:00</published><updated>2009-05-12T10:22:48.864+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='SSD'/><category scheme='http://www.blogger.com/atom/ns#' term='HDD'/><category scheme='http://www.blogger.com/atom/ns#' term='InnoDB'/><title type='text'>Tables on SSD, Redo/Binlog/SYSTEM-tablespace on HDD</title><content type='html'>I recently did a disk bound DBT-2 benchmarking on SSD/HDD (MySQL 5.4.0, InnoDB). Now I'm pretty confident that storing tables on SSD, redo/Binlog/SYSTEM-tablespace on HDD will be one of the best practices for the time being.&lt;br /&gt;&lt;br /&gt;This post is a detailed benchmarking report.&lt;br /&gt;(This post is very long and focusing on InnoDB only. If you are familiar with HDD/SSD/InnoDB architecture and understand what my blog title means, skipping section 1 (general theory) then reading from section 2 (benchmarking results) would be fine. )&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1. General Theory of HDD, SSD and InnoDB&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;SSD is often called as a disruptive storage technology. Currently storage capacity is much smaller and unit price is much higher than HDD, but the situation is very rapidly changing. In the near future many people will use SSD instead of HDD.&lt;br /&gt;&lt;br /&gt;From DBA's standpoint, you have a couple of choices for storage allocation.&lt;br /&gt;- Storing all files on SSD, not using HDD at all&lt;br /&gt;- Storing all files on HDD, not using SSD at all&lt;br /&gt;- Using SSD and HDD altogether (some files on SSD, others on HDD).&lt;br /&gt;&lt;br /&gt;Which is the best approach? My favorite approach is storing tables on SSD, storing Redo Log files, Binary Log files, and SYSTEM-tablespace(ibdata) on HDD. I describe a detailed reason and some DBT-2 benchmarking results below.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1.1 HDD is very good at sequential writes if write cache is enabled&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Using battery backed up write cache(BBWC) is one of the best practices for RDBMS world. BBWC is normally equipped with hardware raid controller.  Normally InnoDB flushes data to disks (redo log files) per each commit to guarantee durability. You can change this behavior by changing innodb_flush_log_at_trx_commit parameter to 0 or 2, but default setting (1) is recommended if you need durability.&lt;br /&gt;Without write cache, flushing to disks require disk rotation and disk seek. Redo logs are sequentially written, so disk seek doesn’t happen when the disk is dedicated for redo log files, but disk rotation still happens. Average disk rotation overhead is 1/2 round. Theoretical maximum throughput is only 500 flushes per second when using single 15000RPM drive (15,000 * 2 / 60 seconds = 500).&lt;br /&gt;If using write cache, the situation is greatly improved. Flushing to write cache does not require disk seek/rotation so finishes very quickly. The data in write cache will be written to disk with optimal order (i.e. writing many data at one time), so total throughput is highly improved. Over 10,000 fsync() per second is not impossible.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/__ybECuKG5tc/SggRPdNBiJI/AAAAAAAAAAs/m5ABt61X6AM/s1600-h/bbwc.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 121px;" src="http://1.bp.blogspot.com/__ybECuKG5tc/SggRPdNBiJI/AAAAAAAAAAs/m5ABt61X6AM/s400/bbwc.JPG" alt="" id="BLOGGER_PHOTO_ID_5334532715855317138" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Sometimes people say that write cache is not effective because the total write volume is the same. This is not correct. On HDD, disk seek &amp;amp; rotation overhead is huge. By using write cache, internally a lot of write operations are converted into small number of write operations. Then the total number of disk seeks &amp;amp; rotations can be much smaller. For random-write oriented files (i.e. index files), disk seeks &amp;amp; rotations still happen (but reduced so still very effective) , but for sequential-write oriented files (i.e. REDO log files) write cache is very effective.&lt;br /&gt;Write cache needs to be protected by battery in order not to be invalidated by power failure etc.&lt;br /&gt;Using H/W RAID + BBWC + HDD is used for years and now it’s a proven technology.  You can setup with pretty reasonable cost.&lt;br /&gt;Note that HDD/SSD storage devices also have write cache themselves, but don't turn on. It's very dangerous because it's not protected by battery so data in cache is destroyed by power failure etc.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1.2 SSD is very good at random reads, good at random writes, not so good at sequential writes, compared to HDD&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Currently read/write performance on SSD highly depends on products and device drivers. SSD is very fast for random reads. I could get over 5,000 *random reads* on single Intel X25-E SSD drive. For HDD, normally only a few hundreads of random reads is possible. Over 10 times difference is huge.&lt;br /&gt;For sequential reads, performance difference is smaller but there is still a significant difference on some SSDs(When I tested, single X25-E drive was two times faster than two SAS 15,000RPM RAID1 drives).&lt;br /&gt;For writes, I heard both positive and negative information. I tested Intel X25-E SSD then it worked very well (I got over 2,000 random writes on single drive with write cache). But I heard from many people who tested different SSDs that some SSDs don't perform well for writes. Some cases it was much slower than HDD, some cases write performance gradually dropped after a few months, some cases it freezed for a while. You would also be concerned with a "Write Endurance" issue on SSD when running on production environments.&lt;br /&gt;On HDD, there is a huge difference between random writes and sequential writes. Sequential writes is very fast on HDD with write cache. On the other hand, on SSD, there is not so much difference between random writes and sequential writes. Though performance highly depends on SSD drives themselves, you probably notice that sequential write performance is not so different between HDD and SSD (or even faster on HDD). Since HDD has a longer history and is cheaper than SSD, currently there is not a strong reason to use SSD for sequential write oriented files.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1.3 MySQL/InnoDB files &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;From these perspectives, it would make sense to locate random i/o oriented files on SSD, sequential write oriented files on HDD. Let's classify MySQL/InnoDB files as follows.&lt;br /&gt;&lt;br /&gt;Random i/o oriented:&lt;br /&gt;- Table files (*.ibd)&lt;br /&gt;- UNDO segments (ibdata)&lt;br /&gt;&lt;br /&gt;Sequential write oriented:&lt;br /&gt;- REDO log files (ib_logfile*)&lt;br /&gt;- Binary log files (binlog.XXXXXX)&lt;br /&gt;- Doublewrite buffer (ibdata)&lt;br /&gt;- Insert buffer (ibdata)&lt;br /&gt;- Slow query logs, error logs, general query logs, etc&lt;br /&gt;&lt;br /&gt;By default, table files (*.ibd) are not created but included in InnoDB's SYSTEM-tablespace(ibdata). By using "innodb_file_per_table" parameter, *.ibd files are created then table/index data are stored there. Table files are of course randomly read/written so storing on SSD is better. Note that write cache is also very effective on SSD so using H/W raid with BBWC + SSD would be nice.&lt;br /&gt;&lt;br /&gt;REDO log files and Binary log files are transactional logs. They are sequentially written so you can get very good performance on HDD with write cache. Sequential disk read will happen on recovery, but normally this will not cause a performance problem because log file size is normally much smaller than data files and sequential reads are much faster than random reads (happening on data files).&lt;br /&gt;&lt;br /&gt;Doublewrite buffer is a special feature for InnoDB. InnoDB first writes flushed pages to the "doublewrite buffer", then writing the pages to their correct positions on data files. This is to avoid page corruption (Without doublewrite buffer, page might get corrupted if power failure etc happens during writing to disks). Writing to doublewrite buffer is sequential so highly optimized for HDD. Sequential read will happen on recovery. MySQL has a parameter "skip_innodb_doublewrite" to disable doublewrite buffer, but disabling is dangerous. The amount of write data to doublewrite buffer is equivalent to that to data areas, so this is not negligible.&lt;br /&gt;&lt;br /&gt;Insert buffer is also a special feature for InnoDB. If non-unique, secondary index blocks are not in memory, InnoDB inserts entries to a "insert buffer" to avoid random disk i/o operations. Periodically, the insert buffer is merged into the secondary index trees in the database. Insert buffer enables to reduce the number of disk i/o operations by merging i/o requests to the same block, and random i/o operations can be sequential. So insert buffer is also highly optimized for HDD. Both sequential writes and reads will happen in normal operations.&lt;br /&gt;&lt;br /&gt;UNDO segments are random i/o oriented. To guarantee MVCC, innodb needs to register old images in UNDO segments. Reading previous images from UNDO segments on disk requires random reads. If you run a very long transaction with repeatable read (i.e. mysqldump --single-transaction) or running a long query, a lot of random reads might happen, so storing UNDO segments on SSD would be better in that case. If you run only short transactions/queries, this will not be an issue.&lt;br /&gt;&lt;br /&gt;Based on the above, I want to use "innodb_file_per_table", then storing *.ibd files on SSD, storing ib_logfile* (REDO log files), binary logs, ibdata (SYSTEM-tablespace) on HDD.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2 Benchmarking&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I ran DBT-2(disk i/o intensive) to verify whether my assumption was correct or not.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.1 Benchmarking environment&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Sun Fire X4150&lt;br /&gt;CPU: Intel Xeon 3.3GHz, quad core, 8 cores in total&lt;br /&gt;HDD: SAS 15,000RPM, 2 disks, RAID 1, write cache enabled on H/W raid controller&lt;br /&gt;SSD: Intel X25-E, Single drive, write cache enabled on drive&lt;br /&gt;(Note that this is dangerous and not recommended for production at all. I didn't have a H/W raid controller working well on Intel X25-E so I just turned on for benchmarking purpose only)&lt;br /&gt;OS: RedHat Enterprise Linux 5.3 (2.6.18-128.el5)&lt;br /&gt;Filesystem: ext3&lt;br /&gt;I/O Scheduler: noop&lt;br /&gt;MySQL version: 5.4.0&lt;br /&gt;&lt;br /&gt;DBT-2 option:&lt;br /&gt;./run_workload.sh  -n -s 100 -t 1 -w 100 -d 600 -c 20&lt;br /&gt;(Approximate datafile size is 9.5 GB)&lt;br /&gt;&lt;br /&gt;my.cnf:&lt;br /&gt;[mysqld]&lt;br /&gt;basedir=/root/mysql5400&lt;br /&gt;datadir=/ssd/mysql-data&lt;br /&gt;innodb_data_home_dir=&lt;br /&gt;innodb_data_file_path=/hdd1/ibdata1:500M:autoextend&lt;br /&gt;innodb_file_per_table&lt;br /&gt;innodb_log_group_home_dir=/hdd/log&lt;br /&gt;innodb_log_files_in_group=2&lt;br /&gt;innodb_log_file_size=512M&lt;br /&gt;innodb_flush_log_at_trx_commit=1&lt;br /&gt;innodb_buffer_pool_size=2G&lt;br /&gt;innodb_flush_method=O_DIRECT&lt;br /&gt;innodb_support_xa=0&lt;br /&gt;(Binlog is disabled)&lt;br /&gt;&lt;br /&gt;This is a disk-bound configuration. If you increase innodb_buffer_pool_size higher, you can get better results but I intentionally set lower to do disk i/o intensive loads.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2 Benchmarking results&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2.1 HDD vs SSD  &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;All files on HDD(two drives, RAID1):  3447.25&lt;br /&gt;All files on SSD(single drive)     : 14842.44&lt;br /&gt;(The number is NOTPM, higher is better)&lt;br /&gt;&lt;br /&gt;I got 4.3 times better result on SSD. As &lt;a href="http://www.mysqlperformanceblog.com/2009/05/01/raid-vs-ssd-vs-fusionio/"&gt;Vadim&lt;/a&gt;, &lt;a href="http://www.bigdbahead.com/index.php?s=SSD"&gt;Matt&lt;/a&gt;, and other many people have already shown, just replacing HDD with SSD works very well for DBT-2 benchmarking.&lt;br /&gt;The following is iostat result.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;HDD 3447.25 NOTPM&lt;br /&gt;Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s&lt;br /&gt;sdf1              4.60    80.03  363.48  377.30     6.52&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;10.14    46.06    17.89   24.18   1.35  99.98&lt;br /&gt;-----cpu-----&lt;br /&gt;us sy id wa st&lt;br /&gt;3  1 50 46  0&lt;br /&gt;&lt;br /&gt;SSD 14842.44 NOTPM&lt;br /&gt;Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s&lt;br /&gt;sda1              0.00    11.90 1738.65 1812.33    30.73&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;46.42    44.49     4.01    1.13   0.27  95.03&lt;br /&gt;-----cpu-----&lt;br /&gt;us sy id wa st&lt;br /&gt;18  5 57 20  0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2.2 Storing some files on SSD, others on HDD &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;All files on SSD : 14842.44&lt;br /&gt;Redo log on HDD: 15539.8&lt;br /&gt;Redo log and ibdata on HDD: 23358.63&lt;br /&gt;Redo log and ibdata on tmpfs: 24076.43&lt;br /&gt;Redo log and ibdata on separate SSD drives: 20450.78&lt;br /&gt;&lt;br /&gt;These are quite interesting results. Let's see one by one.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2.2.1 Redo log file on HDD &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Storing just redo log files on HDD did not have so good effect (just 4.7% improvement). DBT-2 internally executes 14.5 DML(insert/update/delete) statements per transaction on average, so commit frequency is less than typical web applications. If you execute commit more frequently, performance difference will be bigger.The following is a statistics about how many bytes were written during single DBT-2 benchmarking. You can get this information by iostat -m.&lt;br /&gt;&lt;br /&gt;MB written on SSD: 31658&lt;br /&gt;MB written on HDD: 1928&lt;br /&gt;&lt;br /&gt;So 31658MB were written to SSD but only 1928MB were written to HDD. 20 times difference is huge so just moving redo logs from SSD to HDD would not have big impact. Note that this statistics highly depends on applications. For DBT-2, this doesn't work anyway.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2.2.2 Redo log and ibdata on HDD&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;By storing redo log and ibdata on HDD, I got much better result (14842.44-&gt;23358.63, 57% improvement!). Dirty pages are written twice (one for doublewrite buffer, the other for actual data area) so by moving ibdata, in which doublewrite buffer is allocated, to different drives, you can decrease the amount of writes to SSD by half. Since doublewrite buffer is sequentially written, it fits for HDD very well. Here is a result of iostat -m.&lt;br /&gt;&lt;br /&gt;MB written on SSD : 23151&lt;br /&gt;MB written on HDD : 25761&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;iostat &amp;amp; vmstat results are as follows. Apparently HDD was not so busy so working very well for sequential writes.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;SSD-HDD  23358.63NOTPM&lt;br /&gt;Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s&lt;br /&gt;sda1(SATA SSD)    0.03     0.00 2807.15 1909.00    49.48&lt;br /&gt;sdf1(SAS HDD)     0.00   547.08    0.38  737.22     0.01&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;35.91    37.08     2.90    0.61   0.18  84.69&lt;br /&gt;40.03   111.17     0.13    0.18   0.11   7.79&lt;br /&gt;&lt;br /&gt;-----cpu-----&lt;br /&gt;us sy id wa st&lt;br /&gt;28  9 49 14  0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2.2.3 Redo log and ibdata on tmpfs&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I was interested in whether storing redo log and ibdata on HDD can get the "best" performance or not. To verify this, I tested to store these files on tmpfs. This is not useful in production environment at all, but it's fine to check performance higher limit. tmpfs should be faster than any other high-end SSD(including PCIe SSD).  If there is not so big performance difference, moving these files from HDD to very fast SSD  is not needed. Here is a result.&lt;br /&gt;&lt;br /&gt;Redo log and ibdata on HDD(NOTPM): 23358.63&lt;br /&gt;Redo log and ibdata on tmpfs(NOTPM): 24076.43&lt;br /&gt;&lt;br /&gt;I got only 3% performance improvement. Actually this is not suprise because HDD was not maxed out when testing "redo log and ibdata on HDD".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2.2.2.4 Redo log and ibdata on separate SSD drive&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Some people might be interested in how performance is different if using two SSD drives, one for *ibd, the other for Redo/ibdata. The following is a result.&lt;br /&gt;&lt;br /&gt;Redo log and ibdata on HDD(NOTPM): 23358.63&lt;br /&gt;Redo log and ibdata on separate SSD drives(NOTPM): 20450.78&lt;br /&gt;&lt;br /&gt;So I got 12.5% worse result compared to HDD. Does this mean Intel X25-E is worse for sequential writes than SAS HDD? Actually the situation seemed not so simple. When I did a very simple sequential write oriented benchmarking (mysqlslap insert) on single drive, I got very close numbers compared to HDD. So there must be other reasons for performance drop. The following is iostat result.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;SSD-SSD  20450.78NOTPM&lt;br /&gt;Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s&lt;br /&gt;sda1(SATA SSD)    0.00     0.00 2430.32 1657.15    43.32&lt;br /&gt;sdb1(SATA SSD)    0.00    12.60    0.75 1017.43     0.01&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;31.06    37.27     2.29    0.56   0.16  67.28&lt;br /&gt;34.46    69.33     0.57    0.56   0.27  27.02&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;SSD-HDD  23358.63NOTPM&lt;br /&gt;Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s&lt;br /&gt;sda1(SATA SSD)    0.03     0.00 2807.15 1909.00    49.48&lt;br /&gt;sdf1(SAS HDD)     0.00   547.08    0.38  737.22     0.01&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;35.91    37.08     2.90    0.61   0.18  84.69&lt;br /&gt;40.03   111.17     0.13    0.18   0.11   7.79&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;sda activity on SSD-SSD test was about 15% lower than on SSD-HDD test even though sdb was not fully busy. I have not identified the reason yet,but currently I assume that SATA interface (Host Bus Adapter port) got a bit saturated. I'll try other H/W components (SATA HDD, using different HBAs which has many more ports, etc) and see what will happen in the near future.&lt;br /&gt;Anyway, any high end SSD won't beat tmpfs numbers(24076.43NOTPM) so HDD(23358.63NOTPM) would be enough.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3 Conclusion &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now I'm pretty confident that for many InnoDB based applications it would be a best practice to store *.ibd files on SSD, and store Redo log files, binary log files, and ibdata (SYSTEM-tablespace) on HDD. I intentionally disabled binary logs for these benchmarking because currently binlog has concurrency issues (breaking InnoDB group-commit). Hopefully this issue will be fixed soon in future 5.4 so I'll be able to show you detailed results at that time. In theory same principles as redo logs(storing on HDD) can be applied.&lt;br /&gt;&lt;br /&gt;I have done many other benchmarks (disabling doublewrite buffer, running with AUTOCOMMIT, UNDO log performance difference between HDD and SSD, etc) so I'd like to share some interesting results in other posts.&lt;br /&gt;&lt;br /&gt;A very interesting point I have seen is that "The amount of disk reads/writes" really matters on SSD, which does not matter on HDD. This is because disk seeks &amp;amp; rotation overhead is very small on SSD, so data transfer speed relatively got important. For example, doublewrite buffer really influenced performance on SSD, which did not have impact on HDD. This point might force architecture changes for RDBMS. From this perspective, I am highly impressed by PBXT. I'd like to do intensive tests on PBXT in the near future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-1743104185323086365?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/1743104185323086365/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html#comment-form' title='22 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1743104185323086365'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1743104185323086365'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html' title='Tables on SSD, Redo/Binlog/SYSTEM-tablespace on HDD'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/__ybECuKG5tc/SggRPdNBiJI/AAAAAAAAAAs/m5ABt61X6AM/s72-c/bbwc.JPG' height='72' width='72'/><thr:total>22</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-2703325631920990989</id><published>2009-04-29T21:18:00.018+09:00</published><updated>2009-05-01T00:02:42.530+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><title type='text'>Linux I/O scheduler queue size and MyISAM performance</title><content type='html'>&lt;a href="http://www.mysqlconf.com/mysql2009/public/schedule/detail/6661"&gt;At MySQL Conference and Expo 2009&lt;/a&gt;, I explained how Linux I/O scheduler queue size affects MyISAM insert performance.&lt;br /&gt;&lt;br /&gt;It is well known that Linux implemented four types of I/O schedulers (noop/deadline/anticipatory/cfq) in Linux kernel 2.6.10. The default is cfq in most distributions including RHEL, which is not so good then noop normally outperforms, but I'll talk this in other posts.&lt;br /&gt;&lt;br /&gt;Linux I/O scheduler also has a functionality to sort incoming I/O requests in its request-queue for optimization. Queue size is configurable. Queue size is defined as "/sys/block/sdX/queue/nr_requests" then you can change queue length as follows.&lt;br /&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;br /&gt;# cat /sys/block/sda/queue/nr_requests&lt;br /&gt;128&lt;br /&gt;# echo 100000 &gt; /sys/block/sda/queue/nr_requests&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/blockquote&gt;Changing queue size is even effective for noop scheduler.&lt;br /&gt;&lt;br /&gt;Here are benchmarking results about changing i/o scheduler queue size for MyISAM insert-intensive loads. Detailed are written &lt;a href="http://assets.en.oreilly.com/1/event/21/Mastering%20the%20Art%20of%20Indexing%20Presentation.pdf"&gt;in the slides at the UC&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/__ybECuKG5tc/SfhITDU6TeI/AAAAAAAAAAc/6dsvxjfg3m8/s1600-h/iosched-myisam-hdd.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 141px;" src="http://1.bp.blogspot.com/__ybECuKG5tc/SfhITDU6TeI/AAAAAAAAAAc/6dsvxjfg3m8/s400/iosched-myisam-hdd.JPG" alt="" id="BLOGGER_PHOTO_ID_5330089651140709858" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/__ybECuKG5tc/SfhIfnmMI6I/AAAAAAAAAAk/qB7bWrP9TAA/s1600-h/iosched-myisam-ssd.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 158px;" src="http://1.bp.blogspot.com/__ybECuKG5tc/SfhIfnmMI6I/AAAAAAAAAAk/qB7bWrP9TAA/s400/iosched-myisam-ssd.JPG" alt="" id="BLOGGER_PHOTO_ID_5330089867035288482" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Apparently increasing queue size was very helpful for HDD, but not helpful for SSD.&lt;br /&gt;Let me explain about backgrounds.&lt;br /&gt;&lt;br /&gt;On Linux side, I/O requests are handled by the following order:&lt;br /&gt;system calls(pwrite/etc)&lt;br /&gt;-&gt; Filesystem&lt;br /&gt;-&gt; I/O scheduler&lt;br /&gt;-&gt; Device Driver/Disks&lt;br /&gt;&lt;br /&gt;I/O scheduler sorts incoming I/O requests by logical block addresses, then sending them to a device driver.&lt;br /&gt;I/O scheduler does not depend on storage devices so is helpful for some parts (i.e. minimizing disk seek overheads), not helpful for other parts (i.e. minimizing disk rotation overheads).&lt;br /&gt;&lt;br /&gt;On the other hand, Command Queuing (TCQ/NCQ) also sorts I/O requests for optimization. SAS and recent SATA II disks support command queuing. The goal is partly duplicate from I/O scheduler. But TCQ can minimize not only disk seeks but also disk rotation overhead (&lt;a href="http://en.wikipedia.org/wiki/Native_Command_Queuing"&gt;See the link to wikipedia&lt;/a&gt;). The disadvantage of TCQ is that queue size is very limited (normally 32-64).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Based on the above, sorting almost all random I/O requests on I/O sheculer then sending them to TCQ would be nice.&lt;br /&gt;&lt;br /&gt;Suppose 100,000 random read I/O requests are coming.&lt;br /&gt;When I/O scheduler queue size is 128 (default in many cases), TCQ gets I/O requests by almost random order, so pretty high disk seek overhead happens for each action (requests within single queue is dispersed).&lt;br /&gt;&lt;br /&gt;When I/O scheduler queue size is 100,000, TCQ gets I/O requests by fully sorted order, so seek overhead can be much smaller.&lt;br /&gt;&lt;br /&gt;Increasing queue size does not have any effect on SSD because no disk seek happens.&lt;br /&gt;&lt;br /&gt;This would explain my benchmarking results.&lt;br /&gt;&lt;br /&gt;I/O scheduler queue size settings is not helpful for InnoDB because InnoDB internally sorts I/O requests to optimize disk seek overheads, and sending limited number of i/o requests controlled by InnoDB internal i/o threads. So the role is duplicate between InnoDB itself and I/O scheduler queue. Note that TCQ improves InnoDB throughput because disk rotation overheads are significantly reduced and such optimizations can not be done from application/kernel side.&lt;br /&gt;&lt;br /&gt;MyISAM does nothing special (highly depending on OS) so this helps.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Updated in Apr 30: Added detailed benchmark conditions for people who are interested..&lt;br /&gt;&lt;br /&gt;Here is a test script. I ran a single-threaded stored procedure on a same machine.&lt;br /&gt;&lt;pre&gt;create table aa (id int auto_increment primary key,&lt;br /&gt;b1 int,&lt;br /&gt;b2 int,&lt;br /&gt;b3 int,&lt;br /&gt;c varchar(100),&lt;br /&gt;index(b1), index(b2), index(b3)) engine=myisam;&lt;br /&gt;&lt;br /&gt;drop procedure sp_aa;&lt;br /&gt;delimiter //&lt;br /&gt;create procedure sp_aa(IN count INTEGER)&lt;br /&gt;BEGIN&lt;br /&gt;DECLARE time_a, time_b BIGINT DEFAULT 0;&lt;br /&gt;DECLARE done INTEGER DEFAULT 0;&lt;br /&gt;DECLARE i INTEGER DEFAULT 1;&lt;br /&gt;WHILE done != 1 DO&lt;br /&gt;insert into aa values (i,rand()*count,rand()*count,rand()*count,repeat(1,40));&lt;br /&gt;SET i = i + 1;&lt;br /&gt;IF i % 1000000 = 1 THEN&lt;br /&gt;SELECT unix_timestamp() into time_a from dual;&lt;br /&gt;SELECT i, from_unixtime(time_a), time_a - time_b from dual;&lt;br /&gt;SET time_b = time_a;&lt;br /&gt;END IF;&lt;br /&gt;IF i &gt; count THEN&lt;br /&gt;SET done = 1;&lt;br /&gt;END IF;&lt;br /&gt;END WHILE;&lt;br /&gt;END;&lt;br /&gt;//&lt;br /&gt;delimiter ;&lt;br /&gt;&lt;br /&gt;mysql test -vvv -e "call sp_aa(300000000)"&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Then wait for a long long time...&lt;br /&gt;# Default insert. no insert-delayed, no disable-keys, no delay-key-write, no mmap&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;H/W, OS, MySQL settings&lt;br /&gt;Sun Fire X4150&lt;br /&gt;CPU: Intel Xeon, 8 cores&lt;br /&gt;RAM: 32GB (but limit filesystem cache size up to 5GB, no swapping happened)&lt;br /&gt;HDD: SAS 15,000RPM, 2 disks, RAID 1, write cache enabled&lt;br /&gt;SSD: Intel X25-E, Single drive, write cache enabled&lt;br /&gt;OS: RedHat Enterprise Linux 5.3 (2.6.18-128.el5)&lt;br /&gt;Filesystem: ext3&lt;br /&gt;I/O Scheduler: deadline&lt;br /&gt;MySQL 5.1.33&lt;br /&gt;key_buffer_size: 2G&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;i/o stats:&lt;br /&gt;queue size=128 (default)&lt;br /&gt;&lt;pre&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;          0.01    0.00    0.08   24.69    0.00   75.22&lt;br /&gt;&lt;br /&gt;Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s&lt;br /&gt;sdb               0.00     0.87  0.00 575.60     0.00 &lt;br /&gt;&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;2.25     8.01   142.44  247.03   1.74 100.00&lt;br /&gt;(At running 12 hours, 13 mil rows were inserted)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;queue size=100000&lt;br /&gt;&lt;pre&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;          2.06    0.00    5.32   29.66    0.00   62.96&lt;br /&gt;&lt;br /&gt;Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s&lt;br /&gt;sdb               0.00  2487.33  0.00 2042.60     0.00&lt;br /&gt;&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;35.11    35.20 84402.36 43582.33   0.49 100.01&lt;br /&gt;(At running 1.5 hours, 41 mil rows were inserted)&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;avg-cpu:  %user   %nice %system %iowait  %steal   %idle&lt;br /&gt;         0.07    0.00    0.19   24.82    0.00   74.91&lt;br /&gt;&lt;br /&gt;Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s  &lt;br /&gt;sdb               0.00     9.03  1.70 756.03     0.01   &lt;br /&gt;&lt;br /&gt;wMB/s avgrq-sz avgqu-sz   await  svctm  %util&lt;br /&gt;5.56    15.04 31981.72 127560.45   1.32 100.00&lt;br /&gt;(At running 12 hours, 77 mil rows were inserted, &lt;br /&gt;index size was 4.5 GB)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;If running many more hours(days) so that index size exceeds filesystem cache, a lot of disk reads will happen. (I didn't have time to run that test before UC)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-2703325631920990989?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/2703325631920990989/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html#comment-form' title='24 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/2703325631920990989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/2703325631920990989'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html' title='Linux I/O scheduler queue size and MyISAM performance'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/__ybECuKG5tc/SfhITDU6TeI/AAAAAAAAAAc/6dsvxjfg3m8/s72-c/iosched-myisam-hdd.JPG' height='72' width='72'/><thr:total>24</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3043493300793589377.post-1401627806265257273</id><published>2009-04-25T03:45:00.000+09:00</published><updated>2009-04-25T03:44:07.018+09:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><title type='text'>Mastering the Art of Indexing - slides available</title><content type='html'>At the MySQL Conference &amp;amp; Expo 2009, I did a presentation "&lt;a href="http://www.mysqlconf.com/mysql2009/public/schedule/detail/6661"&gt;Mastering the Art of Indexing&lt;/a&gt;" , and now you can get the &lt;a href="http://assets.en.oreilly.com/1/event/21/Mastering%20the%20Art%20of%20Indexing%20Presentation.pdf"&gt;slides from MySQL Conference site&lt;/a&gt; .&lt;br /&gt;I was excited that so many people attended to my session. I hope the session will help you to improve indexing techniques. Feedbacks are welcome.&lt;br /&gt;&lt;br /&gt;I was asked from some people at the conference where my blog is.  I am mostly publishing MySQL technical information in Japanese, but now I believe it's a good time to start publishing in English. I plan to post mainly benchmarking &amp;amp; performance analysis, high availability solutions, internationalization (which sometimes people have problems), etc.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3043493300793589377-1401627806265257273?l=yoshinorimatsunobu.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://yoshinorimatsunobu.blogspot.com/feeds/1401627806265257273/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/04/mastering-art-of-indexing-slides.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1401627806265257273'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3043493300793589377/posts/default/1401627806265257273'/><link rel='alternate' type='text/html' href='http://yoshinorimatsunobu.blogspot.com/2009/04/mastering-art-of-indexing-slides.html' title='Mastering the Art of Indexing - slides available'/><author><name>Yoshinori Matsunobu</name><uri>http://www.blogger.com/profile/14180479977952026421</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://2.bp.blogspot.com/-8tKvKJn_lrE/Tjd0c6CdydI/AAAAAAAAAF0/n_dADF7PLrI/s220/yoshinori.jpg'/></author><thr:total>8</thr:total></entry></feed>
