tag:blogger.com,1999:blog-30434933007935893772024-03-22T00:39:05.278-07:00Yoshinori Matsunobu's blogYoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.comBlogger40125tag:blogger.com,1999:blog-3043493300793589377.post-61194276439995555742017-11-09T10:23:00.000-08:002017-12-11T23:30:00.535-08:00Towards Bigger Small Data<div dir="ltr" style="text-align: left;" trbidi="on">
<br/>
In MySQL ecosystems, it has been a very common practice to shard MySQL databases by application keys, and to manage many small MySQL instances. The main reason for sharding was that a single server could not handle so much data. 10 years ago, typical commodity servers had only 16GB of RAM, and a typical storage configuration was RAID10 with 6~12 SATA HDDs. There was no affordable flash storage available at that time. Because such machines could handle only a few hundred random IOPS, and the buffer pool was so limited, we couldn't put much data on a single server: at most a few hundred GBs per server. Even small to mid sized applications easily exceeded single server capacity, so they had to split their databases across multiple servers.<br />
<br/>
<h3 style="text-align: left;">
Disadvantages of Sharding</h3>
<br/>
Sharding by applications has been a common practice to scale MySQL beyond a single machine. But it has several disadvantages, including the following.
<ul>
<li>You have to write application logic to manage shards. You also need to manage many more MySQL instances</li>
<li>Atomic transactions are not supported across multiple shards</li>
<li>Cannot take globally consistent backups across multiple shards</li>
<li>Cross-instance joins are not supported (or are very slow)</li>
<li>Hard to use secondary keys efficiently. Unless secondary keys are part of the sharding key, you need to query all shards, which is very expensive, especially if you need sorting</li>
</ul>
Atomic transactions, joins, and secondary keys are big deals. Not all applications can easily give them up, and they might not be willing to spend time implementing sharding logic, either.
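The secondary key problem in the last bullet can be sketched in a few lines of shell. This is only an illustration (the table, columns, and shard layout are made up): a lookup by the sharding key is routed to exactly one shard, while a lookup by any other column has to fan out to every shard, and the application must merge the results itself.

```shell
#!/bin/sh
# Hypothetical application-side routing over 4 shards, keyed on user_id.

USER_ID=12345
SHARD=$(( USER_ID % 4 ))   # simple modulo routing by the sharding key

# Sharding-key lookup: exactly one shard is queried.
echo "shard$SHARD: SELECT * FROM users WHERE user_id = $USER_ID"

# Secondary-key lookup (email is not the sharding key): every shard
# must be queried, and the application merges (and sorts) the results.
for s in 0 1 2 3
do
  echo "shard$s: SELECT * FROM users WHERE email = 'foo@example.com'"
done
```

With sorting or LIMIT clauses, the merge step on the application side becomes even more expensive, which is exactly why unsharded secondary indexes are so valuable.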
<br/>
<br/>
<h3 style="text-align: left;">
Shared Nothing SQL database is not general purpose</h3>
<br/>
There are a couple of database products that offer transparent sharding inside the database itself. <a href="https://www.mysql.com/products/cluster/">MySQL Cluster</a> (NDB) is one of the MySQL engines that has existed for many years, offering a shared nothing distributed database. You can run atomic transactions and cross-shard joins across multiple instances (data nodes). MySQL Cluster also supports a NoSQL interface (called NDBAPI) to query data nodes directly, which boosts performance significantly. I'm not going to talk about the NoSQL interface here, since I'd like to discuss SQL databases. Note that MySQL Cluster is a product name and it is not InnoDB. If you want to continue to use InnoDB, <a href="http://vitess.io/">Vitess</a> can be a good solution. It helps build a transparent, shared nothing database.<br />
<br />
Transparent shared nothing databases like MySQL Cluster and Vitess solve some of the issues mentioned above, but issues like the following remain, and users may need to redesign tables and/or rewrite queries accordingly.
<ul>
<li>Limited secondary key support. This was the same as I mentioned above. In MySQL Cluster, by default, rows are distributed by primary keys. If you do not explicitly specify primary keys in WHERE clause, queries need to scan all data nodes, which may significantly limit concurrency and increase latency, depending on how many data nodes you have and other query conditions (e.g. sorting).</li>
<li>Queries using JOINs may need lots of network round-trips. Lots of improvements have been done in MySQL Cluster for handling joins, but it's still slower than InnoDB, especially if queries are complex.</li>
</ul>
As a result, to utilize MySQL Cluster, it is generally recommended to avoid secondary keys and frequent joins. There are many good use cases, especially if people are 100% sure how their databases will be used, but it is not recommended as a general purpose SQL database.
<br/>
<br/>
<h3 style="text-align: left;">
Bigger Small Data and its challenges</h3>
<br/>
Several people at Facebook have called our MySQL services "Small Data". The combined size of all MySQL instances was pretty large, but each individual instance was small (normally up to 1TB). 10 years ago, people had to run small instances because of limited hardware support. Nowadays, commodity servers have more than 256GB of RAM and more than 10TB of flash storage. There are many small-mid databases that fit in 10TB. Successful databases grow beyond 10TB, so they will have to shard anyway. But how about experimental projects, and/or the many other applications that are expected to grow only to a limited size? Instead of spending engineering effort to manage shards and rewrite tables/queries, why not just put everything on a single server and take advantage of atomic and consistent transactions, secondary indexes and joins: running "Bigger Small Data"?<br />
<br />
There are a couple of public MySQL services supporting bigger storage sizes. <a href="https://aws.amazon.com/rds/aurora/">Amazon Aurora</a> (AWS) and Alibaba PolarDB (Alibaba Cloud) are both based on MySQL and claim to support more than 60TB instance sizes. It was not surprising to me that they chose bigger small data rather than shared nothing distributed databases, because they have lots of customers who want to do whatever they want. They could not stop customers from using joins.<br />
<br />
But being supported does not necessarily mean working as expected. To make MySQL really work with bigger small data, it needs lots of improvements beyond 8.0, especially around concurrency and long running operations, including the following.
<ul>
<li>Parallel logical dump (mysqldump)</li>
<li>Parallel query</li>
<li>Parallel binary copy</li>
<li>Parallel DDL</li>
<li>Resumable DDL</li>
<li>Better join algorithm</li>
<li>Much faster replication</li>
<li>Handling many more connections</li>
<li>Good resource control, so that some bad users don't eat all resources</li>
</ul>
<br/>
At a minimum, these are needed to answer questions like "how can MySQL handle general operations if our instance grows 10x bigger?"
<br/>
<br/>
I'm not worried about short queries (row look-ups by primary keys or secondary keys). These are where MySQL has been great so far. I'm worried about long running queries. The most common long running query would be a full table scan of a single table. In general, a logical dump of a 10TB table takes (much) more than 10 times as long as scanning a 1TB table. InnoDB needs to keep a history list for consistent reads. Maintaining the history list and reading consistent snapshots from rollback segments get more expensive if rows are heavily updated while long running transactions are open. If you run daily backups with mysqldump, you might not tolerate 10x~20x longer backup times, which might not finish within 24 hours. To make logical dumps shorter, parallel query support is needed, and this is currently not supported by MySQL. Physical backup (binary copy) also needs parallelism, though it can be implemented relatively easily, since physical backup tools are written outside of MySQL and are easily extended.
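MySQL itself cannot parallelize a dump today, but external tooling can approximate a parallel logical dump by splitting the table's primary key range into chunks and running one dump per chunk concurrently. Here is a rough sketch of the chunking logic (the database/table names and the id range are placeholders; a real script would fetch MIN(id)/MAX(id) from the table first, and would run the per-chunk dumps in the background):

```shell
#!/bin/sh
# Split the primary key range into WORKERS chunks so each chunk can be
# dumped by its own mysqldump process running in parallel.
MIN_ID=1
MAX_ID=1000000
WORKERS=4

SIZE=$(( (MAX_ID - MIN_ID + WORKERS) / WORKERS ))   # ceiling division
lo=$MIN_ID
while [ "$lo" -le "$MAX_ID" ]
do
  hi=$(( lo + SIZE - 1 ))
  [ "$hi" -gt "$MAX_ID" ] && hi=$MAX_ID
  # A real script would append '&' here and wait for all chunks.
  echo mysqldump test t1 --where="id BETWEEN $lo AND $hi"
  lo=$(( hi + 1 ))
done
```

This only splits the scan; consistency across chunks still requires a shared snapshot (e.g. one consistent read view), which is exactly the part that needs server-side support.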
<br/>
<br/>
Running ALTER TABLE on a 10TB table is also challenging. Amazon Aurora supports instant DDL: adding a nullable column at the end of a table can be done without rebuilding the table. But there are still many DDL operations that require copying tables. First, you will want to know when the operation ends. MySQL currently does not tell you. If it is expected to take 20 days, you might worry about what will happen if mysqld restarts before finishing the DDL. It would be great if the database remembered DDL state periodically and could resume the operation when mysqld restarts.
<br/>
<br/>
Replication is another technical challenge. MySQL replication is asynchronous. Slaves often lag if the master instance is heavily updated. With bigger small data, update volume on the master can be 10x or even more. How can slaves handle 10x more replication traffic?<br />
<br />
Amazon Aurora does not have the MySQL replication lag issue if you run it in a single region. Aurora keeps six copies of the storage in the same region, across three availability zones, so you can scale reads within the region. But scaling reads across multiple regions requires MySQL Replication, and that is challenging unless MySQL Replication becomes a lot faster. Alibaba PolarDB offers InnoDB physical replication across different datacenters, which is significantly more concurrent and faster than binlog based replication. But you are constrained to one engine (InnoDB, though it's by far the most common engine), and debugging replication issues might be harder, since it's no longer binlog based.
<br/>
<br/>
Reliability improvements should not be ignored either. 10x larger data means the instance serves many more connections and queries. If a handful of bad queries can make the whole instance unavailable, the impact on bigger small data is much higher than on small instances. Good resource management is needed. High priority queries can be useful too, for making important low latency queries finish earlier without being affected by expensive queries.
<br/>
<br/>
There are lots of technical challenges to make Bigger Small Data really work. I expect Amazon Aurora will be ahead in implementing these important features to make Aurora truly support bigger data. But I'm almost sure that AWS won't release them as open source software. For everybody to get such features, somebody else will have to implement them. I hope Oracle will, but I understand that they need to compete with Amazon Aurora and will not be willing to give their features to AWS for free. A more realistic scenario might be multiple companies, including us, implementing features, releasing them as open source, and contributing them to Oracle and/or MariaDB. I think these will be interesting technical projects for MySQL for the next couple of years.
</div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-8512585592579022102016-04-19T09:42:00.000-07:002016-04-19T09:42:15.095-07:00MyRocks Deep Dive -- slides available<div dir="ltr" style="text-align: left;" trbidi="on">
Yesterday I had a 3-hour tutorial about MyRocks at <a href="https://www.percona.com/live/data-performance-conference-2016/">Percona Live</a>. Now <a href="http://www.slideshare.net/matsunobu/myrocks-deep-dive">slides are available</a>.
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-52096957757787875062015-01-26T17:42:00.000-08:002015-01-26T17:42:16.386-08:00Performance issues and fixes -- MySQL 5.6 Semi-Synchronous Replication<div dir="ltr" style="text-align: left;" trbidi="on">
A long time ago I wrote <a href="http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html">a blog post</a> saying that we had started using Semi-Synchronous replication in production at Facebook. We are running our own MySQL <a href="https://github.com/facebook/mysql-5.6/">facebook-5.6</a> branch, which adds many replication (including semisync) patches. Some of them were also merged into <a href="http://webscalesql.org/">WebScaleSQL</a>.<br />
<br />
Recently I heard from people in the community that they were suffering from semisync performance issues in official MySQL 5.6. I took some time to review our previous MySQL bug reports, and realized that some important bugs were either still "verified" or inefficiently fixed. The two most impactful bug reports were <a href="https://bugs.mysql.com/bug.php?id=70342">https://bugs.mysql.com/bug.php?id=70342</a> and <a href="http://bugs.mysql.com/bug.php?id=70669">http://bugs.mysql.com/bug.php?id=70669</a>. We fixed both in our branch, so I hadn't paid much attention since then, but people outside Facebook are certainly affected.<br />
<br />
In this post, I'm going to describe some effective configurations for getting better semisync throughput on masters and slaves, by showing simple benchmark numbers. I used three machines (client, master, and semisync slave), all running on pure flash and located very close to each other. I created 100 databases, enabled Multi-Threaded Slave, and ran 100 mysqlslap processes against the 100 databases, with 30 concurrent connections each (3000 concurrent connections in total). All queries were auto-committed inserts, and I used the InnoDB storage engine on both the master and the slave.<br />
<br />
<br />
<b>1. Set master_info_repository=TABLE</b><br />
MySQL 5.6 and 5.7 have <a href="https://bugs.mysql.com/bug.php?id=70342">a performance bug</a> where writing FILE based master info (and relay log info) files is very expensive. This is especially serious for semisync replication, since this bug slows down the IO thread. With semisync, a slow IO thread takes longer to send the ACK back to the master, so it slows down master throughput as well. The default master_info_repository is FILE, so unless you change this parameter to TABLE, you are affected by this bug.<br />
Here are benchmark results between FILE and TABLE.<br />
<br />
5.6.22 master_info_repository=FILE : 5870/s<br />
5.6.22 master_info_repository=TABLE: 7365/s<br />
<br />
These numbers are commits per second on both the master and the slave instance. The slave didn't lag in most experiments, thanks to the multi-threaded slave feature.<br />
Please don't confuse the master_info_repository and relay_log_info_repository parameters. relay_log_info_repository has to be TABLE, otherwise crash safe slave doesn't work. master_info_repository works with both FILE and TABLE, but I suggest using TABLE for performance reasons.<br />
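Putting the two repository settings together, the relevant my.cnf fragment for a slave would look like this (a sketch; relay_log_recovery is included because it is commonly enabled together with crash safe slave):

```ini
[mysqld]
# TABLE is required for the crash safe slave feature
relay_log_info_repository = TABLE
# FILE also works, but TABLE avoids the slow IO thread described above
master_info_repository    = TABLE
relay_log_recovery        = 1
```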
<br />
<br />
<b>2. Reduce durability on master</b><br />
Older 5.6 had a bug where slaves couldn't continue replication after a crashed master's recovery, even with fully durable configurations. Here is the closed <a href="http://bugs.mysql.com/bug.php?id=70669">bug report</a>.<br />
The bug was fixed, but the fix caused a performance regression: the master now holds the LOCK_log mutex longer, releasing it only after calling fsync(). This certainly fixed the bug, but hurt performance because LOCK_log is a very hot mutex in 5.6; both the Binlog Dump thread and application threads need to hold it. Fortunately, reducing durability (I mean setting sync_binlog=0 and innodb_flush_log_at_trx_commit=0|2) mitigates the regression a lot. When using semisync replication, you will almost certainly promote a slave on master failure, so durability on the master does not matter much, which means you can reduce durability on the master.<br />
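As a concrete example, "less durability" in the numbers below corresponds to a master configuration like this (a sketch; only safe if you always promote a slave instead of reusing a crashed master as-is):

```ini
[mysqld]
# Do not fsync the binlog on every commit
sync_binlog                    = 0
# Write the InnoDB log on commit, but fsync only about once per second
innodb_flush_log_at_trx_commit = 2
```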
<br />
5.6.22 master_info_repository=TABLE, full durability: 7365/s<br />
5.6.22 master_info_repository=TABLE, less durability: 9800/s<br />
<br />
<br />
<b>3. Loss Less Semisync</b><br />
MySQL 5.7 improved replication performance a lot. Probably the most effective improvement was that the <a href="http://my-replication-life.blogspot.jp/2013/09/dump-thread-enhancement.html">Binlog Dump thread no longer holds the LOCK_log mutex</a>. In addition, 5.7 introduced the "<a href="http://my-replication-life.blogspot.com/2013/09/loss-less-semi-synchronous-replication.html">Loss-Less Semisync</a>" feature.<br />
If you read <a href="http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html">my previous semisync blog post</a> carefully, you may have noticed that 1. we backported "Loss-Less Semisync" from 5.7, and 2. we got better throughput with Loss-Less Semisync than with Normal Semisync. This was because Loss-Less Semisync actually reduced mutex contention (on LOCK_commit and LOCK_binlog_). My 5.7 benchmark results were as follows.<br />
<br />
5.7 Normal Semisync: 12302/s<br />
5.7 Loss Less Semisync: 14500/s<br />
(master_info_repository=TABLE, less durable)<br />
<br />
Reducing LOCK_log contention in the Binlog Dump thread and introducing Loss-Less Semisync were the major contributors to the MySQL 5.7 performance improvements. At Facebook, we ported both into our 5.6 branch. It would be interesting for the community if these were available in other distributions, since using 5.7 in production will not happen anytime soon.<br />
<br />
In addition to the above three configurations, there are a few more considerations for keeping semisync throughput acceptable or making it better.<br />
<br />
<br />
<b>4. Semisync mysqlbinlog</b><br />
At Facebook, we taught mysqlbinlog to speak the semisync protocol, and used it as a semisync replication reader. On replication slaves, the IO thread and SQL thread contend on internal mutexes. As I mentioned above, a slow IO thread slows down semisync master throughput. Semisync mysqlbinlog doesn't have such a slowdown, because it has no SQL thread. So using semisync mysqlbinlog instead of a semisync slave can improve master throughput.<br />
<br />
5.7 Loss Less Semisync + Semisync slave: 14500/s (on both master and slave)<br />
5.7 Loss Less Semisync + Semisync mysqlbinlog + async slave: 48814/s on master, 15363/s on slave<br />
<br />
This shows that semisync mysqlbinlog improved master throughput. But it is actually not such good news, because the slave lags a lot. Fundamentally, we need to fix the mutex contention between the IO thread and SQL threads.<br />
<br />
<br />
<b>5. GTID</b><br />
There are some open performance bugs around GTID. The <a href="https://bugs.mysql.com/bug.php?id=74328">5.7 one</a> is especially serious. If you really need high semisync throughput, you need to benchmark carefully with GTID (and ask Oracle to fix the bugs!).<br />
<br />
<br />
Here are the table definitions and mysqlslap commands I used for the benchmark.<br />
<pre>
for i in `seq 1 100`
do
  mysql -e "drop database if exists test$i"
  mysql -e "create database test$i"
  mysql test$i -e "create table t1 (id int auto_increment primary key, \
    value int, value2 varchar(100)) engine=innodb"
done

for i in `seq 1 100`
do
  mysqlslap --host=master --query="insert into test$i.t1 \
    values (null, 1, 'abcdefghijklmn')" --number-of-queries=100000 --concurrency=30 &
done</pre>
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-61642781853349854402014-04-01T11:31:00.001-07:002014-04-01T11:31:10.596-07:00Semi-Synchronous Replication at Facebook<div dir="ltr" style="text-align: left;" trbidi="on">
After intensive testing and hacking, we started using Semi-Synchronous MySQL Replication in Facebook production environments. Semi-Synchronous Replication itself has been available since MySQL 5.5 (GA was released 3.5 years ago!), but I'm pretty sure not many people have used it in production so far. Here is a summary of our objectives, enhancements and usage patterns. If you want to hear more in depth, please feel free to ask me at Percona Live this week.<br />
<br />
<h3 style="text-align: left;">
Objective / Why Semisync?</h3>
<div style="text-align: left;">
The objective of the Semi-Synchronous Replication is simple -- Master Failover without data loss, without full durability.<br />
<br />
First, let me describe why the objective is difficult without semisync.<br />
<br />
There are a couple of fast slave promotion (master failover) solutions. My own <a href="https://code.google.com/p/mysql-master-ha/" target="_blank">MHA</a> covers both fully automated and semi-automated MySQL failover. Fully automated means both failure detection and slave promotion are done automatically. Semi-automated means failure detection is not done, but slave promotion is done with one command. Time to detect failure is approximately 10 seconds, and the actual failover takes around 5 to 20 seconds, depending on what you do during failover (i.e. forcing power off of the crashed master will take at least a few seconds). Total downtime can be less than 30 seconds, if failover works correctly. I use the term "Fast Failover" in this post, which includes both automated and semi-automated master failover.<br />
In MySQL 5.6, GTID based failover is also possible. Oracle's official tool <a href="http://dev.mysql.com/doc/mysql-utilities/1.3/en/mysqlfailover.html" target="_blank">mysqlfailover</a> automates MySQL master failover using GTID. The latest version of MHA also supports GTID.<br />
<br />
Both mysqlfailover and MHA rely on MySQL replication, and MySQL replication is asynchronous. So there is a very serious disadvantage: potential data loss on master failover. If you use normal MySQL replication and do automated master failover with MHA/mysqlfailover, you can fail over quickly (a few seconds with MHA), but you always risk losing recently committed data.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
If you don't want to take any risk of losing data, you can't do fast master failover with normal MySQL replication. You have to do the following in case of master failure.<br />
<br />
- Always use fully durable settings on the master. By fully durable I mean setting innodb_flush_log_at_trx_commit=1 and sync_binlog=1.<br />
- On master crash, wait for a while (10~30 minutes) until the crashed master recovers. Recovery takes a long time because it involves OS reboot, storage and filesystem recovery, and InnoDB crash recovery.<br />
- If the crashed master recovers, you can continue service without losing any data. Since all data exists on the master, slaves can continue replication. BTW, official 5.6 had a bug <a href="http://bugs.mysql.com/bug.php?id=70669" target="_blank">causing all slaves to break in this scenario</a>, but it was fixed in 5.6.17.<br />
- If the crashed master doesn't recover (H/W failure etc), you need to promote one of the slaves to a new master. There is a risk of losing some data, but you don't have any other choice.<br />
<br />
This "safer" approach has two issues.<br />
- Longer downtime, because you have to wait for the master's recovery.<br />
- You still can't eliminate the risk of losing data. If the master is dead and never recovers, your risk of losing data is the same as with fast failover.<br />
<br />
So, in bad cases, you suffer from both longer downtime and data loss.<br />
<br />
Semi-Synchronous Replication helps prevent losing data.<br />
<div style="text-align: left;">
<br />
If you do not care about data loss risk, there is no reason to use Semi-Synchronous replication. You can use normal MySQL replication and do fast failover with mysqlfailover or MHA. Facebook is one of the companies that cares about data loss risk with MySQL, and that's why we were very interested in Semi-Synchronous replication.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Semisync replication originated at <a href="http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign" target="_blank">Google</a> in 2007. Official MySQL has supported it since <a href="https://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html">5.5</a>. The actual implementation algorithm was substantially different from Google's.<br />
<br />
MySQL Cluster and Galera offer synchronous replication protocols in different ways. I do not cover them in this blog post. </div>
<br />
Semi-Synchronous Replication currently has two different algorithms -- Normal Semisync and Loss-Less Semisync. Let me explain the differences.<br />
<br /></div>
<div style="text-align: left;">
<h4 style="text-align: left;">
Differences between Normal Semisync and Loss-Less Semisync</h4>
<a href="http://my-replication-life.blogspot.com/2013/09/loss-less-semi-synchronous-replication.html">Loss-Less Semisync</a> is a new semisync feature supported in official MySQL 5.7. The original implementation was done by Zhou Zhenxing as the "<a href="https://code.google.com/p/enhanced-semi-sync-replication/">Enhanced Semisync</a>" project, and was also filed as <a href="http://bugs.mysql.com/bug.php?id=62174">a bug report</a>. Oracle implemented it based on his idea, and named it Loss-Less Semisync. So Enhanced Semisync and Loss-Less Semisync mean the same thing. I use the term Loss-Less Semisync in this post.<br />
<br />
Normal semisync and loss-less semisync work as follows.<br />
<br />
1. binlog prepare (doing nothing)<br />
2. innodb prepare (fsync)<br />
3. binlog commit (writing to fscache)<br />
4. binlog commit (fsync)<br />
5. loss-less semisync wait (AFTER_SYNC)<br />
6. innodb commit (releasing row locks, changes are visible to other users)<br />
7. normal semisync wait (AFTER_COMMIT)<br />
<br />
With normal semisync (AFTER_COMMIT), committing to InnoDB is done before waiting for the ack from a semisync slave, so the committed rows are visible to applications even though semisync slaves may not have received the data. If the master crashes and none of the slaves received the data, the data is lost but applications may have seen it. This is called phantom reads, and in many cases it's problematic.<br />
<br />
Loss-less semisync (AFTER_SYNC) avoids the problem. Loss-less semisync commits to InnoDB after getting an ack from one of the semisync slaves. So by the time committed data is visible to applications, one of the semisync slaves has received it. The phantom read risk is much smaller: data is lost only if both the master and the latest semisync slave go down at the same time, which is much less likely than with normal semisync.<br />
<br />
To avoid data loss and phantom reads, Normal Semisync can't meet your expectations; you need Loss-Less Semisync.</div>
<div style="text-align: left;">
With Loss-Less Semi-Synchronous replication, committed data should be on one of the slaves, so you can recover from the latest slave. You can always do fast failover here.</div>
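For reference, in MySQL 5.7 the algorithm is chosen with the rpl_semi_sync_master_wait_point variable, which can be set in my.cnf on the master (together with the semisync plugin being installed and enabled):

```ini
[mysqld]
# AFTER_SYNC   = loss-less semisync (wait for ack before InnoDB commit)
# AFTER_COMMIT = normal semisync (wait for ack after InnoDB commit)
rpl_semi_sync_master_wait_point = AFTER_SYNC
```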
<br />
<h4 style="text-align: left;">
Reduced Durability</h4>
<div style="text-align: left;">
When you do fast failover, you can use reduced durability settings on the master as well as the slaves. Reduced durability means innodb_flush_log_at_trx_commit != 1 and sync_binlog != 1. With Semi-Synchronous replication, you can start failover immediately when the master goes down. When promoting a slave to the new master, identify the latest slave (highly likely one of the semisync slaves, but not guaranteed) and apply the differential logs to the new master. The master's durability does not matter here, because there is no way to access the master's data during failover. So you can safely reduce durability. Reducing durability has a lot of benefits.</div>
<div style="text-align: left;">
- Reduced latency on (group) commit, because commits don't wait for fsync().<br />
- Reduced IOPS, because the number of fsync() calls drops significantly: from every commit to every second. Overall disk workload can be reduced. This is especially helpful if you can't rely on a battery/flash backed write cache.<br />
- Reduced write amplification. Write volume can be cut a lot, even to less than half in some cases. This is important especially when using flash devices, because less write volume increases flash life expectancy.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<h4 style="text-align: left;">
Requirements for Semisync Deployment</h4>
To make semisync work, you need at least one semisync reader (a slave with semisync enabled) within the same (or a very close) datacenter as the master. This is for latency. When semisync is enabled, the round-trip time (RTT) between the master and one of the semisync slaves is added to transaction commit latency. If no semisync slave is located in a close datacenter, the RTT may be tens or hundreds of milliseconds, which means you can commit only 10~100 times per second from a single client. For most environments, this will not work. You need a slave within a close datacenter.<br />
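The arithmetic behind that claim is just one round trip per commit: a single connection can commit at most about 1000/RTT(ms) times per second. A quick sanity check in shell (integer math; the RTT values are illustrative):

```shell
#!/bin/sh
# Upper bound on commits/sec from ONE connection: 1000 / RTT in ms.
RTT_LOCAL_MS=1     # slave in the same or a very close datacenter
RTT_REMOTE_MS=50   # slave in a distant region

LOCAL_COMMITS=$(( 1000 / RTT_LOCAL_MS ))
REMOTE_COMMITS=$(( 1000 / RTT_REMOTE_MS ))

echo "local:  $LOCAL_COMMITS commits/sec per connection"
echo "remote: $REMOTE_COMMITS commits/sec per connection"
```

Group commit lets many concurrent connections share one RTT, so aggregate throughput is higher, but per-connection commit latency is still bounded by the RTT.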
<br />
To make fast failover work without data loss, you need to make sure Semi-Synchronous Replication is always enabled. MySQL semisync has a couple of conditions under which semisync is silently disabled:<br />
- Exceeding the timeout (waiting more than rpl_semi_sync_master_timeout milliseconds for an ACK from a semisync slave)<br />
- No semisync slave (can be controlled via rpl_semi_sync_master_wait_no_slave)<br />
- Executing SET GLOBAL rpl_semi_sync_master_enabled=0<br />
<br />
If you want semisync always enabled, you have to make sure these scenarios won't happen: set an infinite or very long timeout, and have at least two semisync readers.</div>
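Concretely, a master configuration along these lines keeps semisync from silently degrading to asynchronous replication (the timeout value is illustrative; pick something effectively infinite for your environment):

```ini
[mysqld]
rpl_semi_sync_master_enabled       = 1
# Effectively infinite timeout, so semisync never silently turns off
rpl_semi_sync_master_timeout       = 4294967295
# Keep waiting for an ack even if no semisync slave is connected
rpl_semi_sync_master_wait_no_slave = 1
```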
<div style="text-align: left;">
</div>
<h3 style="text-align: left;">
Facebook Enhancements to Semi-Synchronous Replication</h3>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
We spent a lot of time testing Semi-Synchronous replication in 2013. We found some S1 bugs, serious performance problems, and some administration issues. Our MySQL Engineering team and Performance team worked on fixing the issues, and finally our Operations team deployed semisync in production.<br />
<br />
Here are our major enhancements.</div>
<h4 style="text-align: left;">
Backporting Loss-Less Semisync from 5.7</h4>
<div style="text-align: left;">
As described above, <a href="http://my-replication-life.blogspot.com/2013/09/loss-less-semi-synchronous-replication.html">Loss-Less Semisync</a> is needed to prevent data loss and phantom reads, so we backported the Loss-Less Semisync patch from official MySQL 5.7 to our Facebook MySQL 5.6 branch. It will be merged into the WebScaleSQL branch soon.<br />
<br />
Interestingly, when we tested semisync performance, loss-less semisync gave better throughput than normal semisync, especially when the number of clients was large. Normal semisync caused more mutex contention, which was alleviated by loss-less semisync. Since loss-less semisync has a better data protection mechanism, we concluded there was no reason to use normal semisync here.<br />
<br /></div>
<div style="text-align: left;">
<h4 style="text-align: left;">
Semisync mysqlbinlog</h4>
Starting from MySQL 5.6, mysqlbinlog supports <a href="http://dev.mysql.com/doc/refman/5.6/en/mysqlbinlog-backup.html">remote binlog backups</a>, using --raw and --read-from-remote-server. For remote binlog backups, mysqlbinlog works like a MySQL slave: it connects to a master, executes the BINLOG DUMP command, then receives binlog events via the MySQL replication protocol. This is useful when you want to take backups of the master's binary logs, because a slave's relay logs and binary logs are not identical to the master's binary logs and can't be used directly as backups of them.<br />
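A remote binlog backup invocation looks roughly like the following (the host, user, and first binlog name are placeholders; --stop-never keeps streaming new events as the master writes them). The sketch only builds and prints the command rather than connecting to a server:

```shell
#!/bin/sh
# Sketch: stream the master's binlogs to local files, much like a
# slave's IO thread would. Adjust host/user/binlog name for real use.
HOST=master1
FIRST_BINLOG=mysql-bin.000001

CMD="mysqlbinlog --raw --read-from-remote-server --stop-never \
--host=$HOST --user=repl $FIRST_BINLOG"

echo "$CMD"
```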
<br />
We extended mysqlbinlog to speak the semisync protocol. The reason for the enhancement is that we wanted to use "semisync mysqlbinlog" as a replacement for local semisync slaves. We usually run slaves in remote datacenters, and we don't always need local slaves to serve read requests or provide redundancy. On the other hand, as described in the "Requirements for Semisync Deployment" section above, in practice at least two local semisync readers are needed to make semisync work. We didn't want to run two additional dedicated slaves per master just for semisync. So we invented semisync mysqlbinlog and use it instead of semisync slaves, as shown in the figure below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9g3UnvC-GtOleY2S49leqWPJTrYvfND7NT3IaOE1g4yuLLWL7zdQVzaNlHiImLe74hIstxqK-JctH4tnA8vZhpos3xpvYI9uFHQDNyOm0T6cvRp216dKo4gUy0eNtlrFOk02d_KuBuAQ/s1600/semisync_mysqlbinlog.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9g3UnvC-GtOleY2S49leqWPJTrYvfND7NT3IaOE1g4yuLLWL7zdQVzaNlHiImLe74hIstxqK-JctH4tnA8vZhpos3xpvYI9uFHQDNyOm0T6cvRp216dKo4gUy0eNtlrFOk02d_KuBuAQ/s1600/semisync_mysqlbinlog.png" height="221" width="400" /></a></div>
<br />
<br />
<br />
Compared to a semisync slave, semisync mysqlbinlog has several efficiency wins.<br />
<ul>
<li>A semisync slave has significant CPU overhead, such as query parsing and building optimizer plans. semisync mysqlbinlog has no such overhead.</li>
<li>A semisync slave writes 2x (relay log and binary log). semisync mysqlbinlog writes a binary log only.</li>
<li>On a semisync slave, the way relay logs are written is not efficient. The IO thread writes to the kernel buffer for each binlog event. A regular auto-committed transaction consists of three binlog events (query BEGIN, query body, and commit XID). When using InnoDB only, writing to the kernel buffer on each XID event is enough (though it does not cover DDL), which cuts the frequency of kernel buffer flushes to less than one third. semisync mysqlbinlog can easily do such optimizations. We have not done it yet, but it is even possible to make mysqlbinlog send back the ACK before writing to a file, and the extension is easy.</li>
<li>A slave has contention between the SQL thread and the IO thread, so the IO thread itself slows down, which slows down semisync master throughput too. semisync mysqlbinlog has no such overhead because there is no SQL thread.</li>
</ul>
<br />
With a mysqlbinlog reader, the master failover procedure becomes a bit tricky. Because mysqlbinlog is not a mysqld process, it doesn't accept MySQL commands such as CHANGE MASTER. When doing master failover, it is highly likely that one of the local mysqlbinlog instances has the latest binary log events, and those events need to be applied to the new master. The new MHA version (0.56) supports this feature.<br />
<br />
In this configuration, mysqlbinlog processes need to be highly available. If all semisync mysqlbinlog processes are down, semisync is stopped, or commits suffer from long wait times. <br />
<br />
<br />
<h4 style="text-align: left;">
Reducing plugin_lock mutex contention</h4>
</div>
Prior to MySQL 5.6.17, there was <a href="http://bugs.mysql.com/bug.php?id=70218">a performance bug</a> where transaction commit throughput dropped significantly when there were many non-semisync slaves or binlog readers, even if there were only a few semisync readers. On typical deployments, there are two or three semisync readers and multiple non-semisync readers, so the performance drop with many non-semisync readers was annoying.<br />
The performance drop was caused by the "plugin_lock" MySQL internal mutex on the master. For those who don't know, semisync is a plugin in MySQL, and it's not installed by default. The plugin_lock mutex was needed by semisync binlog dump threads only, but it was actually held by all binlog dump threads. We looked into the problem further.<br />
First we tried replacing the plugin_lock mutex with a read/write lock. It did not help much: Linux profiling tools showed that plugin_lock still caused contention. During profiling, we learned that most glibc rw-locks have an internal lock (a mutex-like thing) on which threads can stall. The pattern is: get the lock, get exclusive access to the cache line to modify data, release the lock. This is relatively expensive for the plugin_lock mutex, since it doesn't do any expensive I/O inside the critical section.<br />
<br />
So switching plugin_lock to a read/write lock was actually a bad idea. Instead, we needed to avoid taking the plugin-related locks below as much as possible. There are four major plugin-related locks in MySQL.<br />
- plugin_lock<br />
- plugin_lock_list<br />
- plugin_unlock<br />
- plugin_unlock_list<br />
<br />
We also noticed that the Delegate classes had read/write locks that caused very hot contention (especially Binlog_transmit_delegate::lock). The read/write lock protects a list, so switching to a lock-free list was probably possible. BTW, we noticed that performance schema did not collect mutex statistics on the Delegate class locks (<a href="http://bugs.mysql.com/bug.php?id=70577">bug#70577</a>). <br />
<br />
The real problem was that all of the above locks were held not only by semisync binlog readers, but also by non-semisync binlog readers.<br />
<br />
Based on the above factors, we concluded that removing all plugin mutexes was not easy, so we decided to optimize so that these locks are held by semisync binlog readers only, and not by non-semisync binlog readers. Below is a benchmark result.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4-KtYsH2Y1ypeVjoMQ0ajs-v4tD6TI5Uw_qcVa5ZcWw41d0IbMmspf5VVHHvYdwMdM-22w6nAZVy9ELvqx_sKO-CKYzueyOLmt4lpxrnuSPFhZJkw4SXg2mpuRqwHmT-ybmcKZBKkfKs/s1600/semisync_plugin.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4-KtYsH2Y1ypeVjoMQ0ajs-v4tD6TI5Uw_qcVa5ZcWw41d0IbMmspf5VVHHvYdwMdM-22w6nAZVy9ELvqx_sKO-CKYzueyOLmt4lpxrnuSPFhZJkw4SXg2mpuRqwHmT-ybmcKZBKkfKs/s1600/semisync_plugin.png" height="166" width="400" /></a></div>
<br />
<br />
The x-axis is the number of non-semisync binlog readers, and the y-axis is concurrent INSERT throughput from 100 clients. The number of semisync binlog readers was always 1 to 3. Detailed benchmark conditions are described in <a href="http://bugs.mysql.com/bug.php?id=70218">a bug report</a>.<br />
Our patches were finally merged into 5.6.17 and 5.7, so everybody can get the benefits easily.<br />
<br />
<br />
With all of these enhancements, we got pretty good benchmark results with semisync.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqToiCbkF0i1_b_YLGi1lanL-RaeD0RLA4xBGbBwLlcupRO4ui8f2iQlXlrcjOdnM4oEG6qv0FtNF7azzTkf1Epfsund5VZ1CnGktc_A5zO-6LNjLEy3c9i3k5_rRSx7fToedx-0GkcpA/s1600/E.semisync.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqToiCbkF0i1_b_YLGi1lanL-RaeD0RLA4xBGbBwLlcupRO4ui8f2iQlXlrcjOdnM4oEG6qv0FtNF7azzTkf1Epfsund5VZ1CnGktc_A5zO-6LNjLEy3c9i3k5_rRSx7fToedx-0GkcpA/s1600/E.semisync.png" height="193" width="400" /></a></div>
<br />
This is a mysqlslap insert benchmark on the master, with one semisync slave/mysqlbinlog running. The x-axis is the number of clients, and the y-axis is the number of inserts on the master. "Enhanced" means loss-less semisync.<br />
"Normal slave" is a traditional (non-semisync) slave. "Enhanced mysqlbinlog" is our semisync usage pattern. As you can see, loss-less semisync beats normal semisync thanks to the internal mutex contention reductions, and semisync mysqlbinlog beats the semisync slave because of its much lower overhead. This shows that loss-less semisync scales pretty well.<br />
<br />
<h3 style="text-align: left;">
Conclusion and Future Plans</h3>
After several performance improvements, semi-synchronous replication became good enough for us. From a performance point of view, I expect single-threaded application performance to be the next low-hanging fruit. On our benchmarks, we got around ~2500 transaction commits per second with semisync (0.4ms per commit). Without semisync, it was easy to get ~10000 transaction commits per second (0.1ms per commit). Of course semisync adds RTT overhead, but on a local datacenter network, RTT is much lower than 0.3ms. I think there is another source of semisync overhead here, so I will revisit this issue and work with the Oracle replication team and outside experts.<br />
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-14178427207335632352014-03-31T21:50:00.000-07:002014-03-31T21:50:04.971-07:00MHA 0.56 is now available<div dir="ltr" style="text-align: left;" trbidi="on">
I released <a href="https://code.google.com/p/mysql-master-ha/">MHA</a> version 0.56 today. Downloads are available <a href="https://code.google.com/p/mysql-master-ha/wiki/Downloads?tm=2">here</a>. MHA 0.56 includes the following features.<br />
<br />
<ul>
<li>Supporting MySQL 5.6 GTID. If GTID and auto position are enabled, MHA automatically does failover with GTID SQL syntax instead of traditional relay log based failover. You don't need any explicit configuration within MHA to use GTID based failover.</li>
<li>Supporting MySQL 5.6 Multi-Threaded slave </li>
<li>Supporting MySQL 5.6 binlog checksum </li>
<li>MHA supports new section <tt>[binlogN]</tt>. In binlog section, you can define <a href="http://dev.mysql.com/doc/refman/5.6/en/mysqlbinlog-backup.html" rel="nofollow">mysqlbinlog streaming servers</a>.
When MHA does GTID based failover, MHA checks binlog servers, and if
binlog servers are ahead of other slaves, MHA applies differential
binlog events from the binlog server to the new master before recovery. When MHA does non-GTID
based (traditional) failover, MHA ignores binlog servers. More details can be found on <a href="https://code.google.com/p/mysql-master-ha/wiki/Configuration#Binlog_server">documentation</a>.</li>
<li>Supporting custom mysql and mysqlbinlog location </li>
<li>Adding
ping_type=INSERT for checking connectivity to the master. This is
useful if the master does not accept any writes (e.g. due to disk errors) </li>
<li>Added --orig_master_is_new_slave, --orig_master_ssh_user and --new_master_ssh_user for master_ip_online_change_script </li>
<li>Added --skip_change_master, --skip_disable_read_only, --wait_until_gtid_in_sync on masterha_manager and masterha_master_switch (failover mode).</li>
</ul>
<br />
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-12568257121760884532014-03-27T11:38:00.001-07:002014-03-27T11:52:34.955-07:00Speaking about MySQL5.6 and WebScaleSQL at Percona Live<div dir="ltr" style="text-align: left;" trbidi="on">
At Percona Live, Steaphan Greene and I will talk about MySQL 5.6 and <a href="https://code.facebook.com/posts/1474977139392436/webscalesql-a-collaboration-to-build-upon-the-mysql-upstream/" target="_blank">WebScaleSQL</a> at Facebook.<br />
<br />
2 April 1:20PM - 2:10PM @ Ballroom E<br />
<a href="http://www.percona.com/live/mysql-conference-2014/sessions/mysql-56-facebook-2014-edition">http://www.percona.com/live/mysql-conference-2014/sessions/mysql-56-facebook-2014-edition</a><br />
<br />
In addition to that, I have two more talks this year.<br />
<br />
<br />
Performance Monitoring at Scale<br />
3 April 2:00PM - 2:50PM @ Ballroom G<br />
<a href="http://www.percona.com/live/mysql-conference-2014/sessions/performance-monitoring-scale">http://www.percona.com/live/mysql-conference-2014/sessions/performance-monitoring-scale</a><br />
<br />
<br />
Global Transaction ID at Facebook<br />
4 April 12:50PM - 1:40PM @ Ballroom E<br />
<a href="http://www.percona.com/live/mysql-conference-2014/sessions/global-transaction-id-facebook">http://www.percona.com/live/mysql-conference-2014/sessions/global-transaction-id-facebook</a><br />
<br />
Many people from Facebook speak at Percona Live this year. Please take a look at <a href="http://www.mysqlperformanceblog.com/2014/03/27/a-conversation-with-5-facebook-mysql-gurus/" target="_blank">an interview from Percona</a> to see what we are going to speak.<br />
<br />
I assume many of my blog subscribers have already heard about WebScaleSQL, which was announced this morning. The MySQL conference in April is the biggest MySQL conference in the world, so it's perfect timing to release something and collaborate with experts. I hope to meet many people there.<br />
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-88580024531733807702014-03-13T13:31:00.000-07:002014-03-13T13:31:13.777-07:00How sync_file_range() really works<div dir="ltr" style="text-align: left;" trbidi="on">
There is a relatively new and platform dependent flushing function called <a href="http://man7.org/linux/man-pages/man2/sync_file_range.2.html" target="_blank">sync_file_range</a>(). Some databases (not MySQL) use sync_file_range() internally.<br />
Recently I investigated stall issues caused by <a href="http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html" target="_blank">buffered writes</a> and sync_file_range(). I learned a lot during the investigation, but I don't think these behaviors are well known to the public. Here I summarize my understanding.
<br />
<h2 style="text-align: left;">
<span style="font-size: large;"><b>Understanding differences between sync_file_range() and fsync()/fdatasync()</b></span></h2>
sync_file_range() has some important behavior differences from fsync().
<br />
<ul>
<li> sync_file_range() has a flag to flush to disk asynchronously, while fsync() always flushes to disk synchronously.
sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes ("async sync_file_range()"); sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes ("sync sync_file_range()"). With async sync_file_range(), you can *usually* call sync_file_range() very quickly and let Linux flush the pages to disk later. As I describe later, async sync_file_range() is not always asynchronous and is sometimes blocked waiting for writeback. It is also important that I/O errors cannot be reported back to the caller when using async sync_file_range().
</li>
<li> sync_file_range() lets you specify a file range (starting offset and size) to flush to disk, while fsync() always flushes all dirty pages of the file. Ranges are rounded to the page size. For example, sync_file_range(fd, 100, 300) will flush from offset 0 to 4096 (flushing page #1), not just from offset 100 to 300, because the minimum I/O unit is a page.
</li>
<li> sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing. fsync() waits until both data and metadata are flushed. fdatasync() skips flushing metadata if the file size does not change (fsync() also skips flushing metadata in that case, depending on the filesystem). sync_file_range() does not wait for metadata flushing even if the file size changes. So if a file is being appended (rather than overwritten in place), sync_file_range() does not guarantee the data can be recovered after a crash, while fsync()/fdatasync() do.
</li>
</ul>
sync_file_range() behavior also depends heavily on the kernel version and filesystem.
<br />
<ul>
<li> xfs flushes neighboring pages in addition to the specified range. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing pages #3 to #6, but also many more dirty pages (i.e. up to page #16). This works very well for HDD because the I/O unit size becomes bigger: in general, synchronously writing 1MB 1,000 times is much faster than writing 4KB 256,000 times. ext3 and ext4 don't do neighbor page flushing.
</li>
</ul>
<br />
sync_file_range() is generally faster than fsync() because it can restrict the dirty page range and skips waiting for metadata flushing. But sync_file_range() can't be used to guarantee durability, especially when the file size changes.
<br />
<br />
A practical use of sync_file_range() is where you don't need full durability but want to control (reduce) dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write because HBase relies on HDFS, and HDFS can recover from its replicas. Compactions write huge volumes of data, so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() for 1MB 1,000 times periodically gives a more stable workload than flushing 1GB at one time. <a href="http://rocksdb.org/">RocksDB</a> also uses sync_file_range().
<br />
<br />
<h2 style="text-align: left;">
<span style="font-size: large;"><b>Async sync_file_range is not always asynchronous</b></span></h2>
Sometimes you might want to flush pages/files earlier than the kernel writeback threads (bdflush) would, in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both take a long time (~10ms) on HDD if the RAID write cache is disabled, so you probably don't want to execute them from a user-facing thread.
<br />
How about using async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from a user-facing thread? It's not supposed to wait for I/O, so latency should be minimal. But I don't recommend calling sync_file_range() from a user-facing thread like that. It is actually not always asynchronous, and there are many cases where it waits for disk I/O.
<br />
Below are a couple of examples where async sync_file_range() takes a long time. In the following examples, I assume <a href="http://lwn.net/Articles/442355/" target="_blank">stable page writes</a> are already disabled.
<br />
<h3>
Stall Example 1: Small range sync_file_range()</h3>
<pre class="c" name="code">single thread
fd=open("aaa.dat", O_WRONLY);
for(i=0; i< 100000; i++) {
write(fd, buf, 1000); // not aligned page write
sync_file_range(fd, i*1000, 1000, SYNC_FILE_RANGE_WRITE); // async
}
</pre>
In Example 1, with the stable page write fix, write() won't wait for dirty pages being written back to disk. But sync_file_range() does wait for writeback.
<br />
When stable page writes are disabled, it is possible for a page to be under writeback and marked dirty again at the same time. Below is an example scenario.
<br />
<pre>1. write() -> marking page 1 dirty
2. sync_file_range(SYNC_FILE_RANGE_WRITE) -> sending writeback request on page 1
3. write() -> making page 1 dirty (not waiting with stable page write fix)
4. sync_file_range(SYNC_FILE_RANGE_WRITE) -> waiting until page 1 is written back
</pre>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSR8RXSloTX41PQ8KLc4VhiDjQJ2H6IRAEbp7dXJqulntMiYLeGVxPCmjnMRwLgL-VdNTCgqHoR0Z5NjzGUOXMpVsPL_5S21EQpIaKpBnCmF0HQIaw_oHjvQALPEiX0jIkOSiQooRSyGE/s1600/dirty_write_stall_2.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSR8RXSloTX41PQ8KLc4VhiDjQJ2H6IRAEbp7dXJqulntMiYLeGVxPCmjnMRwLgL-VdNTCgqHoR0Z5NjzGUOXMpVsPL_5S21EQpIaKpBnCmF0HQIaw_oHjvQALPEiX0jIkOSiQooRSyGE/s640/dirty_write_stall_2.png" /></a>
<br />
In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until the flush triggered by the first sync_file_range() is done, which may take tens of milliseconds.
<br />
Here is an example stack trace when sync_file_range() is blocked.
<br />
<pre> sleep_on_page
__wait_on_bit
wait_on_page_bit
write_cache_pages
generic_writepages
xfs_vm_writepages
do_writepages
__filemap_fdatawrite_range
filemap_fdatawrite_range
SyS_sync_file_range
tracesys
sync_file_range
__libc_start_main
</pre>
<h3>
Stall example 2: Bulk sync_file_range()</h3>
What happens if we call write() multiple times, then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? In the example below, write() is called 21 times, then a flush is triggered by sync_file_range().
<br />
<pre class="c" name="code"> fd=open("aaa.dat", O_WRONLY);
for(i=0; i< 21; i++) {
write(fd, buf, 1000); // not aligned page write
}
sync_file_range(fd, 0, 16384, SYNC_FILE_RANGE_WRITE);
for(i=22; i< 42; i++) {
write(fd, buf, 1000);
}
sync_file_range(fd, 16384, 16384, SYNC_FILE_RANGE_WRITE);
</pre>
Unfortunately, sync_file_range() may take time in this case too.
On xfs it works as below. Since xfs flushes neighboring pages via sync_file_range(), it is possible for a page to be under writeback and marked dirty again at the same time.
<br />
<pre>1. write -> page 1~6 become dirty
2. sync_file_range (page 1~4) -> triggering page 1~4 and 5, 6 for flushing (in xfs)
3. write -> page 6~11 become dirty
4. sync_file_range (page 5..8) -> waiting for page 6 to be flushed to disk
</pre>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDiy7EvJruOpqohTct9wjYu8T4PIQ2o4fNhpfjceDQVTK0Oz6r7eTH1E3UTBsRU0CkksWRLPkXvp2tjDQiGJsZ77GXL2x7_Aqvboo6jEnofE2vzzko27zybpCTXqHeZjoIKy08983Tw6Y/s1600/dirty_write_stall_3.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDiy7EvJruOpqohTct9wjYu8T4PIQ2o4fNhpfjceDQVTK0Oz6r7eTH1E3UTBsRU0CkksWRLPkXvp2tjDQiGJsZ77GXL2x7_Aqvboo6jEnofE2vzzko27zybpCTXqHeZjoIKy08983Tw6Y/s640/dirty_write_stall_3.png" /></a>
<br />
Note that if the write volume (and overall disk utilization) is low enough relative to disk speed, page 6 should already be flushed to disk before the second sync_file_range() starts. In that case nothing waits.
<br />
<br />
<h3>
<span class="mw-headline" id="Stall_example_3:_Aligned_page_writes">Stall example 3: Aligned page writes </span></h3>
The main reason async sync_file_range() was blocked in the examples above is that write() was not aligned to the page size. What if we do fully aligned page writes (writing in 4KB multiples)?
<br />
With aligned page writes, async sync_file_range() does not wait as shown in Examples 1 and 2, and gives much better throughput. But even with aligned page writes, async sync_file_range() sometimes waits for disk I/O.
<br />
sync_file_range() submits page write I/O requests to the disk. If there are many outstanding read/write requests in the disk queue, new I/O requests are blocked until a free slot is available in the queue. This blocks sync_file_range() too.
<br />
The queue size is managed via /sys/block/sdX/queue/nr_requests. You may increase it to a larger value.
<br />
<pre>echo 1024 > /sys/block/sda/queue/nr_requests
</pre>
This mitigates stalls in sync_file_range() on busy disks, but it won't solve the problem entirely. If you submit many more write I/O requests, read requests take longer to serve (writes starving reads), which very negatively affects user-facing query latency.
<br />
<br />
<h3>
<span class="mw-headline" id="Solution_for_the_stalls">Solution for the stalls </span></h3>
Make sure to use a Linux kernel that supports disabling stable page writes; otherwise write() itself can be blocked. My <a href="http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html" target="_blank">previous post</a> covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to be asynchronous, but it actually blocks on writeback in many patterns, so calling sync_file_range() from a user-facing thread is not recommended if you really care about latency. Calling sync_file_range() from a background (non-user-facing) thread is a better solution here.
<br />
Buffered writes and sync_file_range() are important for some databases like HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well known best practices. HLog writes are buffered and not flushed to disk per write (put operation). Some HBase/Hadoop distributions support sync_file_range() to reduce outstanding dirty pages. From the operating system's point of view, HLog files are appended, and the file size is not small (64MB by default). This means all writes to a given HLog go to a single disk in a JBOD configuration, so that disk tends to be overloaded. An overloaded disk takes longer to flush dirty pages (via sync_file_range() or bdflush), which may block further sync_file_range() calls. To get better latency, it is important to use a Linux kernel that can disable stable page writes, and to call sync_file_range() from background threads rather than user-facing threads.
<br />
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-30859282762418969622014-03-10T15:08:00.001-07:002014-03-10T15:08:47.584-07:00Why buffered writes are sometimes stalled<div dir="ltr" style="text-align: left;" trbidi="on">
Many people think buffered writes (write()/pwrite()) are fast because they do not access the disk. But this is not always true. A buffered write sometimes does disk access by itself, or waits for disk accesses issued by other threads. Here are three common cases where write() takes a long time (i.e. causes stalls).
<br />
<h2 style="text-align: left;">
<span style="font-size: large;"><b>1. Read Modify Write</b></span></h2>
Suppose the following logic: opening aaa.dat without O_DIRECT/O_SYNC, writing 1000 bytes sequentially 100,000 times, then flushing with fsync().
<br />
<pre class="c" name="code">
fd=open("aaa.dat", O_WRONLY);
for(i=0; i< 100000; i++) {
write(fd, buf, 1000);
}
fsync(fd);
</pre>
You might think each write() will finish fast enough (at least in less than 0.1ms) because it shouldn't do any disk access. But that is not always true.
<br />
The operating system manages I/O by page, which is 4KB in most Linux environments. If you modify 1000 bytes of a 4KB page from offset 0, Linux first needs to read the 4KB page, modify the first 1000 bytes, then write the page back. The page will be written to disk sooner or later, but yes, reading the page first is needed. This is called RMW (Read Modify Write). If the page is not cached in the filesystem cache (page cache), it has to be read from disk, which may take tens of milliseconds on HDD.
<br />
This problem often happens when overwriting large files. You can easily reproduce it with the following steps.
<br />
<ul>
<li> 1. Create a large file (cached in fs cache)
</li>
</ul>
<pre> dd if=/dev/zero of=aaa bs=4096 count=1000000</pre>
<ul>
<li> 2. Uncache the file (e.g. echo 3 &gt; /proc/sys/vm/drop_caches)
</li>
<li> 3. Write to the file using write()/pwrite() =&gt; the
target page does not exist in the fs cache, so it is read from disk. You can
verify this with iostat.
</li>
</ul>
There are a few solutions to avoid slow Read Modify Write.
<br />
<h3>
<span class="mw-headline">Appending a file, not updating in place</span></h3>
Appending a file means newly allocated pages are always cached, so slow Read Modify Write issue doesn't happen.
<br />
In MySQL, binary logs are appended, not overwritten. InnoDB log files are always overwritten so this workaround can't be used.
<br />
Note that if you need full durability (calling fsync()/fdatasync() per
write()), <a href="http://yoshinorimatsunobu.blogspot.com/2009/05/overwriting-is-much-faster-than_28.html" target="_blank">appending is much more expensive than overwriting for most filesystems</a>.
It is well known that sync_binlog=1 was very slow in MySQL up to 5.5; the
main reasons were that group commit was broken and that appending + fsync() was
not fast. In 5.6, group commit was supported, so multi-threaded write
throughput improved a lot.
<br />
<h3>
<span class="mw-headline" id="Always_cache_target_files_within_filesystem_cache">Always cache target files within filesystem cache </span></h3>
If target pages are cached in filesystem cache (page cache), write() doesn't hit disk reads.
<br />
The obvious disadvantage is that this approach wastes memory, and RAM is
expensive. If you have 128GB of InnoDB log files, you won't want to give
128GB of RAM to them; the RAM should be allocated to the InnoDB buffer
pool.
<br />
<h3>
<span class="mw-headline" id="Aligning_write.28.29_I.2FO_unit_size_by_4KB_multiple">Aligning write() I/O unit size by 4KB multiple </span></h3>
If you always write aligned to the Linux page size (4KB multiples), the Read Modify Write issue can be avoided.
<br />
One approach to do aligned write is zero-filling. Below is an example.
<br />
<pre class="c" name="code">
memset(buf, 0, 4096);      // zero-fill
memcpy(buf, data1, 1000);  // set application data (1000 bytes)
pwrite(fd, buf, 4096, 0);  // write a full 4KB page
memset(buf2, 0, 4096);
memcpy(buf2, data2, 1000);
pwrite(fd, buf2, 4096, 4096);
....
</pre>
In this example, you write 1000 bytes of application data twice, but actually write 4KB twice, 8KB in total; 8192-2000=6192 bytes are zero-filled padding. Disk reads don't happen with this approach, even if the pages are not cached in the filesystem cache.
<br />
This approach needs more space. In the above case you wasted 6192 bytes.
<br />
Another approach is to remember the application data offset, but keep writing 4KB-aligned pages.
<br />
<pre class="c" name="code">
memset(buf, 0, 4096);          // zero-fill once
memcpy(buf, data1, 1000);      // set application data (1000 bytes)
pwrite(fd, buf, 4096, 0);
/* keep the previously written bytes in buf; only append the new data */
memcpy(buf+1000, data2, 1000); // set next application data at the real offset
pwrite(fd, buf, 4096, 0);      // rewrite the same aligned page
....
</pre>
This approach doesn't waste any space.
<br />
InnoDB log file writes are not 4KB-aligned, so the Read Modify Write issue exists. Some MySQL fork distributions fix the problem by writing 4KB-aligned data. Facebook MySQL takes the latter approach, and we're using it in production. Percona Server also <a href="http://www.percona.com/doc/percona-server/5.6/scalability/innodb_io.html#innodb_log_block_size">supports aligned writes</a>, but the feature is stated as beta quality.
<br /><br/>
<h2 style="text-align: left;">
<span style="font-size: large;"><b>2. write() may be blocked for "stable page writes"</b></span></h2>
<h3>
<span class="mw-headline" id="Dirty_pages">Dirty pages </span></h3>
Most writes are not flushed to disk immediately. The write()/pwrite()
functions write to the Linux page cache and mark the pages dirty, unless the
target file is opened with O_DIRECT/O_SYNC. write() and pwrite() are
basically the same except that pwrite() takes an offset; here I use
write() to mean a buffered (dirty page) write.
<br />
Dirty pages are flushed to disk sooner or later by many
processes/functions, such as bdflush, fsync(), and sync_file_range().
Flushing behavior depends heavily on the filesystem; I consider only
xfs here.
<br />
While a dirty page is being written to disk, a write() to the same page is
blocked until the flush is done. This is called "<a href="http://lwn.net/Articles/442355/" target="_blank">Stable Page Write</a>".
It may cause write() stalls, especially on slower disks:
without a write cache, flushing to disk usually takes ~10ms, and ~100ms in bad
cases.
<br />
<br />
Suppose the following example. There are two clients: one writes
10 bytes repeatedly via write(), the other calls fsync() on
the same file.
<br />
<pre class="c" name="code">
Thread 1:
fd=open("aaa.dat", O_WRONLY);
while(1) {
t1 = get_timestamp
write(fd, buf, 10);
t2 = get_timestamp
}
Thread 2:
fd=open("aaa.dat", O_WRONLY);
while(1) {
fsync(fd);
sleep(1);
}
</pre>
If you run this on a slower disk (HDD with the write cache disabled), you may
notice that write() (t2-t1) sometimes takes more than 10ms. fsync() taking
time is expected, because fsync() flushes dirty pages and metadata
to disk, but the buffered write also sometimes takes time, due to stable
page writes.
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTMN4mEAXuKcxr1TTheSAHpb1LBa5VF9bjj_lZVeeAd_oQSlMfZ3iIUrlMrwCq_H5C51FYSeKgDiIx50nEoj4-g18yBT_Qyzjsuy9Dn70sC40laee3EtGal9lf4j_dxT2FhtOaED2klOI/s1600/dirty_write_stall_1.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTMN4mEAXuKcxr1TTheSAHpb1LBa5VF9bjj_lZVeeAd_oQSlMfZ3iIUrlMrwCq_H5C51FYSeKgDiIx50nEoj4-g18yBT_Qyzjsuy9Dn70sC40laee3EtGal9lf4j_dxT2FhtOaED2klOI/s400/dirty_write_stall_1.png" /></a>
<br />
Another annoying issue is that while write() is blocked, it
holds an exclusive inode mutex, which blocks all writes and disk reads
to/from the same file (all pages within the same file).
<br />
<h3>
<span class="mw-headline" id="Disabling_Stable_Page_Writes">Disabling Stable Page Writes </span></h3>
Fortunately there is <a href="http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0" target="_blank">a patch</a> to disable Stable Page Writes, and some Linux distributions ship it.
<br />
With this patch, on most filesystems write() no longer waits for dirty
page writeback, which helps avoid write() latency spikes.
<br />
<br />
<h2 style="text-align: left;">
<span style="font-size: large;"><b>3. Waiting for journal block allocation in ext3/4</b></span></h2>
If you are using ext3 or ext4, you may still suffer from occasional write() stalls even after disabling stable page writes. This happens when write() waits for journal block allocation. An example stack trace is shown below. The easier workaround is to use xfs, which has no such stalls.<br />
<br />
<pre> Returning from: 0xffffffff81167660 : __wait_on_buffer+0x0/0x30 [kernel]
Returning to : 0xffffffff811ff920 : jbd2_log_do_checkpoint+0x490/0x4b0 [kernel]
0xffffffff811ff9bf : __jbd2_log_wait_for_space+0x7f/0x190 [kernel]
0xffffffff811fab00 : start_this_handle+0x3b0/0x4e0 [kernel]
0xffffffff811faceb : jbd2__journal_start+0xbb/0x100 [kernel]
0xffffffff811fad3e : jbd2_journal_start+0xe/0x10 [kernel]
0xffffffff811db37e : ext4_journal_start_sb+0x7e/0x1d0 [kernel]
0xffffffff811bd757 : ext4_da_write_begin+0x77/0x210 [kernel]
0xffffffff810deaea : generic_file_buffered_write+0x10a/0x290 [kernel]
0xffffffff810e0171 : __generic_file_aio_write+0x231/0x440 [kernel]
0xffffffff810e03ed : generic_file_aio_write+0x6d/0xe0 [kernel]
0xffffffff811b732f : ext4_file_write+0xbf/0x260 [kernel]
0xffffffff8113907a : do_sync_write+0xda/0x120 [kernel]
0xffffffff8113993e : vfs_write+0xae/0x180 [kernel]
0xffffffff81139df2 : sys_pwrite64+0xa2/0xb0 [kernel]
0xffffffff8156a57b : system_call_fastpath+0x16/0x1b [kernel]
</pre>
<br />
<h2 style="text-align: left;">
<span style="font-size: large;"><b>Summary</b></span></h2>
Question: Why does buffered write() sometimes stall? It only writes to kernel buffers and doesn't hit disk.<br/>
Answer:<br/>
1. write() does disk reads when needed. To avoid this, append to the file instead of overwriting it, or use OS-page-aligned writes.<br/>
2. write() may be blocked by "stable page writes". To avoid this, use a newer Linux kernel that supports disabling stable page writes.<br/>
3. If you really care about latency, I don't recommend ext3/ext4. Use xfs instead.<br/>
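The t2-t1 measurement described in this post can be reproduced with a short sketch (shown in Python for brevity; the file name, write size, and loop count are arbitrary). It times each buffered write() call and returns the worst case:

```python
import os
import time

def measure_buffered_writes(path, io_size=4096, count=1000):
    """Time buffered (page cache) write() calls; return the max latency in seconds.

    Appending OS-page-aligned chunks avoids the read-modify-write path:
    overwriting the middle of a file at a non-aligned offset can force the
    kernel to read the page from disk before the write can complete.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    buf = b"\0" * io_size
    max_latency = 0.0
    try:
        for _ in range(count):
            t1 = time.monotonic()
            os.write(fd, buf)   # buffered write: normally just a memcpy into page cache
            t2 = time.monotonic()
            max_latency = max(max_latency, t2 - t1)
    finally:
        os.close(fd)
    return max_latency
```

On a healthy system most calls return in microseconds; a stall caused by a disk read or a stable page write shows up as an outlier in the maximum.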
<br/>
Many of these issues can be mitigated by using a battery- or flash-backed write cache on a RAID controller, but this is not always possible, and batteries often expire.
<br/>
In the next post, I'll describe why sync_file_range(SYNC_FILE_RANGE_WRITE) sometimes stalls.
</div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-80513836082459797182013-12-16T11:12:00.000-08:002013-12-16T11:12:50.182-08:00Single thread performance regression in 5.6 - Replication<div dir="ltr" style="text-align: left;" trbidi="on">
At Facebook, we have upgraded most of our MySQL database tiers to 5.6, except a few tiers that have a special requirement -- very fast single-threaded replication.<br/>
<br />
As Oli <a href="http://www.fromdual.com/mysql-single-query-performance-the-truth" target="_blank">mentioned</a>, single-threaded performance is worse in 5.6. The regression is actually not visible in most cases. For remote clients, it is almost negligible because network latency is longer than the 5.1->5.6 overhead. If clients run locally but the MySQL server is disk i/o bound, the overhead is negligible too, because disk i/o latency is much longer than the 5.1->5.6 overhead.<br />
<br />
But the regression is obvious when clients run locally and queries are CPU bound. The best-known local client program for MySQL is the SQL thread (replication thread). Yes, 5.6 has a slower replication performance problem if the SQL thread is CPU bound.<br />
<br />
If all of the following conditions are met, the SQL thread in 5.6 is significantly slower than in 5.1/5.5.<br />
<ul style="text-align: left;">
<li>Almost all working sets fit in memory, i.e. almost all INSERT/UPDATE/DELETE statements complete without reading from disk</li>
<li>Inserts/updates and commits are very heavy (7000~15000 commits/s)</li>
<li>The 5.6 multi-threaded slave cannot be used (due to application side restrictions, etc.)</li>
<li>You want to use the crash-safe slave feature, and optionally GTID</li>
</ul>
<br/>
Here is a micro-benchmark result. Please refer to <a href="http://bugs.mysql.com/bug.php?id=71130" target="_blank">a bug report</a> for details (how to repeat, etc.).
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZTlCwwE4U9eFIoBB0Rf8rByIODMS7aBU-iu3pjpTeDgJVtF0UZCgquvTDeWvysXHWw_1kDSnAiu_tncYSzAor9ywxs0QWPQ2haCHjJQB7pbUFCGYlYW0pGI_86Os6InrVleDrKlziz3o/s1600/replication_speed_benchmark.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZTlCwwE4U9eFIoBB0Rf8rByIODMS7aBU-iu3pjpTeDgJVtF0UZCgquvTDeWvysXHWw_1kDSnAiu_tncYSzAor9ywxs0QWPQ2haCHjJQB7pbUFCGYlYW0pGI_86Os6InrVleDrKlziz3o/s480/replication_speed_benchmark.png" /></a><br/>Fig 1. Single thread replication speed (commits per second)</div>
<br />
On this graph, 5.1, 5.5, and 5.6 FILE are not crash safe. fb5.1 and 5.6 TABLE are crash safe. 5.6p has some experimental patches. Crash safe means that even if the slave mysqld/OS crashes, you can recover the slave and continue replication without restoring the MySQL databases. To make the crash-safe slave work, you have to use the InnoDB storage engine only, and in 5.6 you need to set relay_log_info_repository=TABLE and relay_log_recovery=1. Durability (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1) is NOT required.<br />
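For reference, the crash-safe slave settings described above can be written as a my.cnf fragment like this (a minimal sketch; the option names are from stock MySQL 5.6):

```
[mysqld]
# Crash-safe slave in 5.6: keep the relay log position inside InnoDB
relay_log_info_repository = TABLE
relay_log_recovery        = 1

# Full durability is NOT required for crash-safe replication:
# sync_binlog = 1
# innodb_flush_log_at_trx_commit = 1
```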
<br />
There are a few observations from the benchmark.<br />
<ul style="text-align: left;">
<li>17% drop from 5.1.69 to 5.5.33. I did not profile 5.5 much, but I suspect the regression was mostly caused by the Metadata Locking feature introduced in 5.5.</li>
<li>From 5.5.33 to 5.6.15(FILE), a 20% drop if log-slave-updates was enabled, 18% if disabled. In 5.6, FILE based replication has an inefficiency bug in that it writes to the OS filesystem cache too often (<a href="http://bugs.mysql.com/bug.php?id=70342" target="_blank">bug#70342</a>). 5.6.15p(FILE) applied the fix for the bug, but was still 15% slower than 5.5.33 with log-slave-updates enabled, and 2.5% slower with it disabled. This shows that writing to the binlog (from the single-threaded SQL thread) in 5.6 takes more time than in 5.1/5.5.</li>
<li>The performance drop for 5.6.15 TABLE was serious. Compared to fb5.1 (crash safe), the drop was 51.2% (14520/s->6103/s). The slave_relay_log_info table is opened/fetched/updated/closed on every transaction commit. By keeping the table open and skipping the row fetch (updating only), performance improved to 7528/s (with log-slave-updates) and 9993/s (without log-slave-updates), but these were still much lower than fb5.1 (12505/s and 17056/s).</li>
</ul>
<br/>
How can we fix and/or avoid the problem? <br/>
<br/>
From a MySQL development point of view, there is some room for performance optimization.
<ul style="text-align: left;">
<li>Writing to the slave_relay_log_info table should be optimized further. Currently the table is opened/fetched/updated/closed via the storage engine API for each commit. It would be more efficient to update the table via the embedded InnoDB API, which is currently used by the InnoDB memcached plugin. slave_relay_log_info is a system table and is updated by the SQL thread (and MTS worker threads) only, so some tricky optimizations can be done here.</li>
<li>Writing to the binlog should be made faster in 5.6. 5.6 added many features such as group commit, so it is not surprising that single-threaded performance is lower without further optimization.</li>
<li>The 5.1 to 5.5 performance drop was likely caused by the Metadata Locking feature. I have not profiled it in depth yet, but I think some functions (i.e. MDL_context::acquire_lock()) can be refactored for better performance.</li>
</ul><br/>
From a DBA point of view, there are a couple of workarounds. First of all, the regression won't happen in most cases. Most application data sets are much larger than RAM, so the SQL thread does a lot of I/O and the CPU overhead is negligible. Even if the working set fits in memory, many applications don't need more than 7000 commits per second on a single instance. If you don't need high commit speed, you won't hit the problem either.<br/>
But some applications may really hit the problem -- queuing applications are a typical example. At Facebook we are continuing to use 5.1 for such applications.<br/>
If you want to use 5.6 for such applications, you may disable the crash-safe slave feature, because usually the database size is very small. If the database size is small enough (i.e. < 200GB), you can easily restore from other running masters/slaves without it taking hours. If you don't need the crash-safe slave feature, you can use FILE based (default) replication, which is still slower than 5.1/5.5 (see "5.6.15 FILE" in the graph above) but much better than TABLE based. <br/>
Also, consider not using log-slave-updates. log-slave-updates is needed when you use GTID. If you want a master failover solution without GTID, <a href="http://code.google.com/p/mysql-master-ha/" target="_blank">MHA</a> is a good enough solution and it does not require log-slave-updates.<br/>
</div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-84397872963208721102013-10-10T11:19:00.000-07:002013-10-10T11:19:24.568-07:00Making full table scan 10x faster in InnoDB<div dir="ltr" style="text-align: left;" trbidi="on">
At MySQL Connect 2013, I talked about how we use MySQL 5.6 at Facebook, and explained some of the new features we added to our <a href="https://github.com/facebook/mysql-5.6" target="_blank">Facebook MySQL 5.6 source tree</a>. In this post, I'm going to talk about how we made full table scans faster in InnoDB.<br />
<br />
<h2 style="text-align: left;">
<span style="font-size: large;"><b>Faster full table scan in InnoDB</b></span></h2>
In general, almost all queries from applications use indexes and read very few rows (0..1 rows on primary key lookups, 0..hundreds on range scans). But sometimes we run full table scans. Typical full table scan examples are logical backups (mysqldump) and online schema changes (SELECT ... INTO OUTFILE).<br />
<br />
We take logical backups with mysqldump at Facebook. As you know,
MySQL offers both physical and logical backup commands/utilities.
Logical backup has some advantages over physical backup. For example:
<br />
<ul style="text-align: left;">
<li>Logical backup size is much smaller. A 3x-10x size difference is not uncommon. </li>
<li>Backups are relatively easy to parse. With a physical backup, if we can't restore a database for some serious reason such as a checksum failure, it is very difficult to dig into InnoDB's internal data structures and fix the corruption. We trust logical backups more than physical backups</li>
</ul>
The major drawbacks of logical backup are that full backup and full
restore are much slower than physical backup/restore. <br />
<br />
The slowness of full logical backups often causes problems. A backup may take a very long time if the database has grown a lot and tables are fragmented. At Facebook, we suffered from mysqldump performance to the point that we couldn't finish a full logical backup within a reasonable amount of time on some HDD and Flashcache based servers. We knew that InnoDB wasn't efficient at full table scans because InnoDB did not actually do sequential reads, but mostly random reads. This had been a known issue for years, and as database storage capacity kept growing, the slow full table scan issue became more and more serious for us.
So we decided to enhance InnoDB to do faster sequential reads.
Finally, our Database Engineering team implemented a "Logical Readahead"
feature in InnoDB. With logical readahead, our full table scan speed
improved 9~10 times under usual production workloads. Under
heavy production workloads, full table scan speed became 15~20 times faster.<br />
<br />
<h3 style="text-align: left;">
Issues of full table scan on large, fragmented tables</h3>
<div style="text-align: left;">
When doing a full table scan, InnoDB scans pages and rows in primary key order. This applies to all InnoDB tables, including fragmented ones. If the primary key pages (pages where primary keys and rows are stored) are not fragmented, a full table scan is pretty fast because the reading order is close to the physical storage order. This is similar to reading a file with an OS command (dd/cat/etc.) like below.<br />
<pre><span style="font-family: monospace; line-height: 1px;"># dd if=/data/mysql/dbname/large_table.ibd of=/dev/null bs=16k iflag=direct</span></pre>
You may find that even on commodity HDD servers, you can read more than 100MB/s multiplied by the number of drives. Over 1GB/s is not uncommon.<br />
<br />
Unfortunately, in many cases primary key pages are fragmented. For example, if you need to manage user_id and object_id mappings, the primary key will be (user_id, object_id). The insert order does not match the user_id order, so new inserts/updates very often cause page splits, and the newly split pages are allocated far from the current pages. This means the pages get fragmented.<br />
<br />
If primary key pages are fragmented, a full table scan becomes very slow. Fig 1 illustrates the problem. After InnoDB reads leaf page #3, it has to read page #5230, and after that page #4. Page #5230 is far from pages #3 and #4, so the disk read ordering becomes almost random, not sequential. It is very well known that random reads on HDDs are much slower than sequential reads. One very effective way to improve random reads is using SSDs, but the per-GB cost of SSDs is still higher than HDDs, so using SSDs is not always possible.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8jOri8mf8FFBQzztvGHOZoE1eP0oqq3XJPM6IOYzzXpp_UU5CHov-nTR_dJh60z_IEYg3zrjwDQLNJeqXbrGz2LypB0FonhGXYfY1Tvfmi8xUpUWNNEux4lrDQ37Jfm52n8l9mn4ww_Q/s1600/innodb_full_table_scan.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8jOri8mf8FFBQzztvGHOZoE1eP0oqq3XJPM6IOYzzXpp_UU5CHov-nTR_dJh60z_IEYg3zrjwDQLNJeqXbrGz2LypB0FonhGXYfY1Tvfmi8xUpUWNNEux4lrDQ37Jfm52n8l9mn4ww_Q/s480/innodb_full_table_scan.png" width="480" /></a></div>
<br />
Fig 1. Full table scan is not actually doing sequential reads<br />
<br />
<br /></div>
<h3 style="text-align: left;">
Does Linear Read Ahead really help?</h3>
<div style="text-align: left;">
InnoDB supports a prefetching feature called "<a href="http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-performance-read_ahead.html" target="_blank">Linear Read Ahead</a>". With linear read ahead, InnoDB reads an extent (64 consecutive pages: 1MB if not compressed) at a time once N pages have been accessed sequentially (N can be configured by the innodb_read_ahead_threshold parameter; the default is 56). But in practice this does not help much, because one extent (64 pages) is a very small range. For most large fragmented tables, the next page does not exist in the same extent. See fig 1 above: after reading page #3, InnoDB needs to read page #5230, which is not in the same extent as page #3, so linear read ahead won't help here. This is a pretty common situation for large fragmented tables, and that's why linear read ahead does not do much to improve full table scan performance.</div>
<br />
<h3 style="text-align: left;">
Optimization approach 1: Physical Read Ahead</h3>
As described above, the slow full table scan problem was caused by InnoDB doing mostly random reads. To make it faster, InnoDB needed to do sequential reads. The first approach I came up with was creating a UDF (User Defined Function) to read an ibd file (InnoDB data file) sequentially. After executing the UDF, the pages of the ibd file should be in the InnoDB buffer pool, so no random reads happen during the full table scan. Here is an example usage.<br />
<pre><span style="font-family: monospace; line-height: 1px;">mysql> SELECT buf_warmup ("db1", "large_table"); /* loading into buf pool */
mysql> SELECT * FROM large_application_table; /* in-memory select */</span></pre>
buf_warmup() is a UDF that reads the entire ibd file of table "large_table" in database "db1". It takes time to read the entire ibd file from disk, but the reads are sequential, so they are much faster than random reads. When I tested it, I got an overall time ~5x faster than with normal linear read ahead.<br />
This proved that sequentially reading ibd files helped improve throughput, but there were a couple of disadvantages:<br />
<ul style="text-align: left;">
<li>If the table size is bigger than the InnoDB buffer pool size, this approach does not work</li>
<li>Reading the entire ibd file means that not only primary key pages, but also secondary index pages and unused pages, have to be read and cached into the InnoDB buffer pool, even though only primary key pages are needed for the full table scan. This is very inefficient if you have many secondary indexes.</li>
<li>Applications have to be changed to call the UDF. </li>
</ul>
This looked like a "good enough" solution, but our database engineering team came up with a better solution called "Logical Read Ahead", so we didn't choose the UDF approach.<br />
<br />
<h3 style="text-align: left;">
Logical Read Ahead</h3>
Logical Read Ahead (LRA) works as below. <br />
<ol style="text-align: left;">
<li>Reading many branch pages of the primary key</li>
<li>Collecting leaf page numbers</li>
<li>Reading many (a configurable number of) leaf pages in page number order (mostly sequential disk reads)</li>
<li>Reading rows in primary key order (same as a usual full table scan, but now with a very high buffer pool hit rate)</li>
</ol>
This is illustrated at fig 2 below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinUA4DCbqthgFphOXpfgg7mDpzE6qQ9JhkqUGFXFQvjUEL9l1mADqdiG17w2zeoZQPMqjcZeccH4yO-4snJCAzlGUK-cYuWUyxenCvkanegRSpaVSBl7Mj67DSR_BwXuKa6BJAZbxlmXQ/s1600/innodb_logical_read_ahead.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinUA4DCbqthgFphOXpfgg7mDpzE6qQ9JhkqUGFXFQvjUEL9l1mADqdiG17w2zeoZQPMqjcZeccH4yO-4snJCAzlGUK-cYuWUyxenCvkanegRSpaVSBl7Mj67DSR_BwXuKa6BJAZbxlmXQ/s480/innodb_logical_read_ahead.png" width="480" /></a></div>
<br />
Fig 2: Logical Read Ahead<br />
<br />
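The key idea in steps 2 and 3 above is to reorder the prefetch reads, not the rows. A toy Python sketch (not InnoDB code; the page numbers and batch size are made up) illustrates the batching:

```python
def build_prefetch_batches(leaf_pages_in_key_order, batch_size):
    """Group leaf page numbers into batches and sort each batch by page number.

    Rows are still returned in primary key order afterwards; only the
    *prefetch* reads are reordered, so the disk sees mostly sequential
    requests instead of random ones.
    """
    batches = []
    for i in range(0, len(leaf_pages_in_key_order), batch_size):
        batches.append(sorted(leaf_pages_in_key_order[i:i + batch_size]))
    return batches

# Fragmented table: primary key order != page number order
pages_in_key_order = [3, 5230, 4, 5231, 5, 6]
print(build_prefetch_batches(pages_in_key_order, batch_size=3))
# → [[3, 4, 5230], [5, 6, 5231]]
```

Within each batch, pages 3 and 4 (and 5230/5231) are now read back to back, which is what turns the scan into mostly sequential disk reads.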
Logical Read Ahead (LRA) solved the issues of Physical Read Ahead. LRA enables InnoDB to read only primary key pages (not secondary index pages), and to prefetch a configurable number of pages (not the entire table) at a time. And LRA does not require any SQL syntax changes.<br />
We added two new session variables to make LRA work. One is "innodb_lra_size", which controls how many MB of leaf pages to prefetch. The other is "innodb_lra_sleep", which controls how many milliseconds to sleep per prefetch. We tested prefetch sizes around 512MB ~ 4096MB with a 50 millisecond sleep, and so far we haven't encountered any serious problems (such as crashes/stalls/inconsistencies). These session variables should be set only when doing a full table scan. In our case, mysqldump and some utility scripts (i.e. our online schema change tool) turn logical read ahead on.<br />
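A dump session using these variables might look like the following (an illustrative sketch; these session variables exist only in our Facebook MySQL 5.6 branch, and the values are examples):

```sql
-- Facebook MySQL 5.6 branch only; not available in stock MySQL
SET SESSION innodb_lra_size = 1024;  -- prefetch 1024MB of leaf pages at a time
SET SESSION innodb_lra_sleep = 50;   -- sleep 50ms per prefetch to limit i/o bursts
SELECT * FROM large_table;           -- full table scan now uses logical read ahead
```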
<br />
<h3 style="text-align: left;">
Submitting multiple async I/O requests at once</h3>
Another performance issue we noticed was that the i/o unit size in InnoDB was only one page, even when doing readahead. A 16KB i/o unit is too small for sequential reads,
and much less efficient than a larger i/o unit size.<br />
In 5.6, InnoDB uses Linux Native I/O (aio) by default. When multiple consecutive 16KB read requests are submitted at once, Linux internally merges the requests and performs the reads efficiently. But unfortunately InnoDB submitted only one page i/o request at a time. I filed bug report <a href="http://bugs.mysql.com/bug.php?id=68659" target="_blank">#68659</a>. As written in the bug report, on a modern HDD RAID 1+0 environment, I could get more than 1000MB/s of disk reads by submitting 64 consecutive page requests at once, while I got only 160MB/s by submitting one page request at a time.<br />
To make Logical Read Ahead work better in our environment, we fixed this issue. In our MySQL, InnoDB submits many more page i/o requests before calling io_submit().<br />
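The effect of i/o unit size can be illustrated with a simple sketch (Python here; the file and chunk sizes are arbitrary). Reading the same file with small or large chunks returns identical bytes, but the large-chunk version issues far fewer read requests:

```python
def read_in_chunks(path, chunk_size):
    """Read a whole file using the given i/o unit size.

    Returns (data, request_count) so the number of read requests issued
    for each unit size can be compared.
    """
    data = bytearray()
    requests = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data.extend(chunk)
            requests += 1
    return bytes(data), requests
```

For a 1MB file, 16KB chunks issue 64 read requests while a 1MB chunk issues one; on a real disk (especially with O_DIRECT, bypassing the page cache) the larger unit is dramatically faster, which is what the fix above achieves by batching page requests before io_submit().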
<br />
<h3 style="text-align: left;">
Benchmark</h3>
In both cases, our production tables (fragmented tables) were used.<br />
<br />
1. Full table scan on pure HDD (basic benchmark, no other workload)<br />
<table border="1"><tbody>
<tr><td>Table size</td><td>Without LRA</td><td>With LRA</td><td>Improvement</td></tr>
<tr><td>10GB</td><td>10 min 55.00 sec</td><td>1 min 15.20 sec</td><td>8.71x</td></tr>
<tr><td>50GB</td><td>52 min 44.42 sec</td><td>6 min 36.98 sec</td><td>7.97x</td></tr>
</tbody></table>
<br />
2. Online schema change under heavy workload<br />
<table border="1"><tbody>
<tr><td>Table size</td><td>Without LRA</td><td>With LRA</td><td>Improvement</td></tr>
<tr><td>1GB</td><td>7 min 33 sec</td><td>24 sec</td><td>18.8x</td></tr>
</tbody></table>
* dump time only, not counting data loading time<br />
<br />
<h3 style="text-align: left;">
Source code</h3>
All of our enhancements are available at GitHub. <br />
- Logical read ahead implementation : <a href="https://github.com/facebook/mysql-5.6/commit/f8e361952612d00979f7cf744f487e48b15cb5a6" target="_blank">diff</a><br />
- Submitting multiple i/o requests at once : <a href="https://github.com/facebook/mysql-5.6/commit/f9d1a5332eb2c82c028638d3b93b5a3592a69ffa" target="_blank">diff</a><br />
- Enabling logical read ahead on mysqldump : <a href="https://github.com/facebook/mysql-5.6/commit/f69a4ea522bce24e4cdcc7696d5fad29587cf87a" target="_blank">diff</a><br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
Conclusion</h3>
InnoDB was not efficient at full table scans, so we fixed it. We made two enhancements: one was implementing logical read ahead, and the other was submitting multiple async read i/o requests at once. We have seen 8 to 18 times performance improvements on our production tables, and this has been very helpful for reducing our backup time, schema change time, etc. I hope this feature will be officially supported in InnoDB by Oracle, or at least by major MySQL forks.<br />
<br /></div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-2496310302635612182013-07-31T15:03:00.000-07:002013-07-31T15:03:29.096-07:00Another reason to disable performance schema<div dir="ltr" style="text-align: left;" trbidi="on">
Here is another micro-benchmark result showing how serious the Performance Schema overhead can be.<br />
<br />
non-persistent connections from a remote machine, 100 concurrency, MySQL 5.6.13<br />
<table border="1"><tbody>
<tr><td>performance_schema</td><td>connections per second</td></tr>
<tr><td>0</td><td>39525.62</td></tr>
<tr><td>1(default)</td><td>26304.53</td></tr>
</tbody></table>
<br />
<br />
The <a href="http://yoshinorimatsunobu.blogspot.com/2012/12/non-persistent-connection-performance.html" target="_blank">benchmark command</a> is the same as what I published a while ago.<br />
<br />
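If you decide to disable it, note that performance_schema is not a dynamic variable, so it has to be set in my.cnf and requires a mysqld restart (a minimal sketch):

```
[mysqld]
# performance_schema cannot be changed at runtime; set it here and restart mysqld
performance_schema = 0
```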
I understand that almost no application needs 20000~30000 connections per second per instance, but the latency increase should be considered as well. It would be better to make the performance_schema parameter dynamic, and to set 0 as the default.</div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-67452582414238918272013-04-09T21:12:00.003-07:002013-04-09T21:14:16.629-07:00Speaking at MySQL Conference and Expo 2013<div dir="ltr" style="text-align: left;" trbidi="on">
This year I have two sessions and one panel discussion.<br />
<br />
- <a href="http://www.percona.com/live/mysql-conference-2013/sessions/practical-failover-design-automated-semi-automated-and-manual-failover" target="_blank">Practical failover design - automated, semi-automated and manual failover</a> (24 April 11:10am - 12:00pm)<br />
- <a href="http://www.percona.com/live/mysql-conference-2013/sessions/practices-reducing-mysql-database-size" target="_blank">Practices for reducing MySQL database size</a> (25 April 12:50pm - 13:40pm)<br />
- <a href="http://www.percona.com/live/mysql-conference-2013/sessions/mysql-facebook-lots-and-lots-small-data" target="_blank">MySQL at Facebook: lots and lots of small data</a> (25 April 1:50pm - 2:40pm, together with Mark, Harrison and Domas)<br />
<br />
At the HA session I'll speak about lessons learned at DeNA and Facebook. I have used <a href="https://github.com/yoshinorim/mha4mysql-manager" target="_blank">MHA</a> at both companies, in both fully automated and semi-automated manners. We are also heavily testing GTID in 5.6, so I can probably share some practices there as well. I'll speak about this at <a href="http://info.skysql.com/roadshow2013-0" target="_blank">MySQL and Cloud Database Solutions Day</a> on April 26, too. <br />
<br />
At the data reduction session, as the title suggests, I'll talk about the many data reduction techniques I have applied and/or am planning to apply at Facebook. Data reduction is very important for us. SSD / pure flash is still expensive, and "big data" costs a lot for backups, network transfers, etc.</div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-89609410780355421972012-12-12T12:52:00.000-08:002012-12-12T12:52:24.416-08:00Non-persistent connection performance improvements in 5.6<div dir="ltr" style="text-align: left;" trbidi="on">
One of my favorite performance improvements in 5.6 is "faster non-persistent connections". I couldn't find this information in the <a href="http://dev.mysql.com/doc/refman/5.6/en/news-5-6.html" target="_blank">release notes</a>, but non-persistent connections really are faster in 5.6.<br />
<br />
Here are sysbench micro-benchmark results. I tested with 100 concurrent clients, running remotely. The clients simply connected and disconnected repeatedly.<br />
<table border="1"><tbody>
<tr><td>version</td><td>connections per second</td></tr>
<tr><td>5.1.65</td><td>20712</td></tr>
<tr><td>5.5.28</td><td>24000</td></tr>
<tr><td>5.5.25Maria</td><td>28073</td></tr>
<tr><td>5.5.25Maria,ThreadPool</td><td>27653</td></tr>
<tr><td>5.6.9</td><td>37800</td></tr>
</tbody></table>
<br />
The whole sysbench command I used is as below.<br />
<pre>[remote_client]$ sysbench --test=oltp --oltp-table-size=2000000
--max-requests=1000000 --mysql-table-engine=innodb --db-ps-mode=disable
--mysql-engine-trx=yes --oltp-read-only --oltp-skip-trx --oltp-dist-type=special
--oltp-reconnect-mode=query --oltp-connect-delay=0 --db-driver=mysql
--mysql-user=root --mysql-host=$server_host --mysql-port=$server_port
--num-threads=100 run</pre>
I also tested with 1000+ connections and got similar numbers.<br />
<br />
As you can see, 5.6 is nearly 2x faster than 5.1.<br />
<br />
There are many performance improvements in 5.6, so I'm not sure which fix contributed the most. I assume the biggest one is that the THD destructor (THD is quite a large C++ class inside MySQL) is no longer called while holding a global mutex. Prior to 5.6, the THD destructor was called under the global LOCK_thread_count mutex, which is not efficient. I experienced some serious connection/disconnection problems caused by this global mutex in MySQL Cluster a few years ago, and there were some fixes for it in the MySQL Cluster source tree. Now in 5.6, the THD destructor is called outside of the LOCK_thread_count mutex, as below. This is great news.
<br />
<pre>5.1/5.5: one_thread_per_connection_end() -> unlink_thd()
(void) pthread_mutex_lock(&LOCK_thread_count);
thread_count--;
delete thd;
...
5.6.9: one_thread_per_connection_end()
...
mysql_mutex_unlock(&LOCK_thread_count);
delete thd;
</pre>
5.6 also fixed some connection performance issues reported by <a href="http://dom.as/2011/08/28/mysql-connection-accept-speed/" target="_blank">Domas</a>. Some of the major issues are still open, so once all of them are fixed, we can expect even better performance:)<br/>
<br/>
Persistent connections are obviously much more efficient than non-persistent ones, but they are not always possible. In addition, <a href="http://yoshinorimatsunobu.blogspot.com/2011/10/testing-mysql-563-network-performance.html">5.6 improves query performance under high concurrency</a> as well. I believe many production engineers will welcome these performance improvements.
</div>
Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-32756274519336369542012-09-18T13:09:00.002-07:002012-09-18T13:16:28.536-07:00Automating master failover is possible but needs careI was asked from a few people about my opinion of the Github's <a href="https://github.com/blog/1261-github-availability-this-week">recent service outage</a>. As a creator of <a href="http://code.google.com/p/mysql-master-ha/">MHA</a>, I have lots of MySQL failover experiences.<br />Here are my points about failover design. Most of them duplicate with <a href="http://scale-out-blog.blogspot.com/2012/09/automated-database-failover-is-weird.html">Robert's points</a>.<br /><br /><br />- "Too Many Connections" is not a reason to start automated failover<br />- Do not repeat failover<br /><br /> I know some unsuccessful failover stories that "1. failover happens because master is unreachable (getting too many connections errors) due to heavy loads 2. failover happens again because the new master is unreachable due to heavy loads 3. failover happens again....". On database servers, newly promoted master is slower because of poor cache hit rate. On traditional active/standby environment, database cache on the new master is empty so you'll suffer from 10x or even worse performance for the time being. On master/slave environment, slave has cache so performance is much better than standby server, but you can't expect better performance than master.<br /> It does not make any sense to repeat failover within short time, and automated failover should not happen just because master is overloaded. If master is overloaded due to H/W problems (i.e. raid battery failure, disk block failure, etc), failover will need to be performed, but I think this can be manually done.<br /><br /> MHA does not start failover if specific error codes are returned (i.e. 1203: ER_TOO_MANY_USER_CONNECTIONS). 
And MHA does not repeat failover if 1. the last failover failed with errors, or 2. the last failover happened within the last N minutes (480 minutes by default).<br /><br /><br />- Do not fail over if it is unclear whether the master is dead<br /><br /> This is very important to avoid split brain. In many cases data inconsistency is more problematic than longer downtime. You need to make sure that no mysqld process is running, or will run, on the master. Even though the master is not reachable via TCP/IP connection attempts, mysqld may just be in crash recovery. Forcing a shutdown of the master (power off) is my favorite approach, but it may take a long time depending on the H/W.<br /> MHA has a helper script to kill (i.e. power off) the master. When I developed MHA, I spent a long time investigating how to speed up shutting down machines.<br /><br /><br />- Prepare tools for manual failover<br /><br /> There are some cases where automating failover is really scary - a typical example is a datacenter failure. If the whole datacenter is unreachable, it is not easy to automatically check the master's status, and remotely shutting down the master is probably not possible. And it would be unclear when the datacenter will recover. In such cases I think automated failover should not be performed; manual failover should be done instead. Proper alerts should be sent immediately, so that DBAs can start analyzing the problem and begin manual failover quickly. In master/slave environments, the slaves' relay log positions might differ from each other. Checking all slaves' status and, if needed, fixing consistency by parsing relay logs is painful. 
MHA is helpful in such situations, and in fact I have used MHA many more times for manual failover than for automated failover.<br /><br />Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-29035171682315930452012-03-26T19:00:00.005-07:002012-03-26T19:02:50.225-07:00Relocating to US and joining FacebookI recently joined Facebook. I relocated to Redwood City (less than 10km from Facebook's Menlo Park office) and I commute by bicycle (or caltrain + shuttle on rainy days). I'm currently taking bootcamp training with Lachlan and other new employees, and will soon join the database team.<br /><br />I really enjoyed my MySQL life at DeNA. It was actually my first experience managing hundreds to thousands of rapidly growing servers. I believe DeNA is the most technically advanced MySQL user in Japan. As you may know, DeNA is the creator of the HandlerSocket NoSQL plugin. I created MHA for automating master failover and semi-online master switches. When I worked as a consultant at MySQL/Sun/Oracle, it was very difficult to allocate enough time for creating new products, since I spent most of my time on 1-5 day consulting deliveries to achieve sales/billing targets. I like short-term consulting, but I like long-term development/enhancement engagements, too. This was the biggest reason why I joined DeNA in 2010. Profitable, large services companies like DeNA have many technically interesting problems, and it makes sense for them to allocate enough time for creating useful tools. I really appreciate all the database and infrastructure team members, and the social gaming developers, at DeNA. I can still recommend that Japanese software geeks join DeNA.<br /><br />As you know, Facebook is the most advanced MySQL user in the world. I'm excited to work on huge global environments and to work with world-famous colleagues. 
I need some time to get accustomed to life in the Bay Area, but I hope it won't take long. I'm looking forward to seeing many ex-MySQL colleagues and friends based in San Francisco and the Bay Area. See you at the upcoming MySQL Conference in Santa Clara!Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-40728687075997157472012-02-27T01:29:00.001-08:002012-02-27T01:38:25.544-08:00Speaking at Game Developer Conference 2012I believe most of my blog readers are not game developers, but I'll give a talk, "<a href="http://schedule.gdconf.com/session/13473372/Scaling_and_Stabilizing_Large_Server_Side_Infrastructure">Scaling and Stabilizing Large Server Side Infrastructure</a>", on March 8th at <a href="http://www.gdconf.com/">Game Developer Conference (GDC) 2012</a>.<br />Social games are very interesting from a devops perspective. They tend to grow (and sometimes shrink) rapidly. To keep ARPU (Average Revenue Per User) high enough, high availability and non-stop master maintenance are important for us. Users demand immediate responses, so performance optimizations (including minimal replication delay) are also a high priority. I'd like to share our experiences with the audience, and I'd like to get feedback.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-74237329784448817462012-01-08T17:40:00.002-08:002012-01-08T17:40:09.602-08:00MHA for MySQL 0.53 released MHA for MySQL (Master High Availability Manager and tools for MySQL) version 0.53 has been released.
Downloads are available <a href="http://code.google.com/p/mysql-master-ha/downloads/list">here</a>.<br /><br /> Here are the <a href="http://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes">release notes</a>.<br /><br />New features:<br />* Support for RESET SLAVE ALL from MySQL 5.5.16<br />* Support for the "skip_reset_slave" parameter to avoid running CHANGE MASTER TO on the promoted slave<br />* Optionally doing the master's health checks via MySQL CONNECT, in addition to SELECT<br />* Support for "mysql --binary-mode" from MySQL 5.6.3<br />* Support for the ssh_host and ssh_port parameters<br />* Support for the ssh_options parameter<br />* Support for --check_only for online master switch (dry run)<br />* When doing an online master switch, MHA checks whether long queries are running on the new master. This is important to reduce workloads on the new master. The query time limit can be controlled via --running_seconds_limit.<br />* When receiving SIGINT during an online master switch, MHA tries to disconnect the connections it established.<br />* Additionally checking replication filtering rules on online master switch<br /><br />Bug fixes:<br />* MHA Manager looked for relay-log.info in the wrong location when relay_log_info_file was set<br />* Wrong option for master_ip_failover_script<br />* Timeout settings for the SSH connection health check did not always work<br />* Modified the rpm spec file to create a valid rpm package for 64-bit RHEL6<br />* More strict SSH checking. Originally MHA checked the master's reachability just by connecting via SSH and exiting with return code 0. In some cases this does not work, especially if SSH works but data files are not accessible. With this fix, MHA checks the master's SSH reachability by executing the save_binary_logs command (dry run).
MHA Client also needs to be updated to 0.53.<br />* Zombie process remained on master ping timeout<br />* Do not execute SET GLOBAL read_only=(0|1) if not needed<br /><br /><h3>How to upgrade</h3><br /> Upgrading MHA from an older version to 0.53 is easy. Simply stop the masterha_manager program, replace all MHA Node and MHA Manager packages with 0.53, and restart masterha_manager. You do not need to stop mysqld or MySQL replication.<br /><br /><br /><h3>MHA for MySQL, project info</h3><br /> Project top page: <a href="http://code.google.com/p/mysql-master-ha/">http://code.google.com/p/mysql-master-ha/</a><br /><br /> Documentation: <a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6">http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6</a><br /><br /> Source tarball, deb and rpm packages (stable release): <a href="http://code.google.com/p/mysql-master-ha/downloads/list">http://code.google.com/p/mysql-master-ha/downloads/list</a><br /><br /> The latest source repository (development tree, url changed): <a href="https://github.com/yoshinorim/mha4mysql-manager">https://github.com/yoshinorim/mha4mysql-manager</a> (Manager source) and <a href="https://github.com/yoshinorim/mha4mysql-node">https://github.com/yoshinorim/mha4mysql-node</a> (per-MySQL-server source)<br /><br /> Commercial support for MHA is available from <a href="http://www.skysql.com/">SkySQL</a>.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-14337690352669483952011-10-24T11:35:00.000-07:002011-10-24T11:35:34.954-07:00Making slave pre-fetching work better with SSDIn the past few weeks I have spent some time creating yet another slave prefetching tool, named "Replication Booster (for MySQL)", written in C/C++ and using the <a href="http://www.oscon.com/oscon2011/public/schedule/detail/18785">Binlog API</a>.
Now I'm happy to say that I have released <a href="https://github.com/yoshinorim/replication-booster-for-mysql">an initial version at GitHub repository</a>.<br /><br />The objective of Replication Booster is same as <a href="http://www.maatkit.org/doc/mk-slave-prefetch.html">mk-slave-prefetch</a>: avoiding or reducing replication delay under disk i/o bound workloads. This is done by prefetching relay logs events, converting to SELECT, and executing SELECT before SQL Thread executes the events. Then SQL thread can be much faster because target blocks are already cached.<br /><br />On my benchmarking environment Replication Booster works pretty well.<br /><br />On HDD bound benchmarks(update by pk), SQL thread's qps was:<br />- Normal Slave (without prefetch): 400 update/s<br />- With prefetch: 1,500 update/s<br /><br />On SSD (SAS SSD) bound benchmarks(update by pk), SQL thread's qps was:<br />- Normal Slave: 1,780 update/s<br />- With prefetch: 5,400 update/s<br /><br />It is great that slave could handle many more updates per second without replication delay on disk i/o bound benchmarks. It works very well on SSD, too. The below is a graph that shows Seconds_Behind_Master is gradually decreasing by using Replication Booster.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcOEOpbkBH6L2bKIQ5in77ACHIY1MK6a7AQJy_U_rS_dF34TBG-4fa_T2k5nchc4fsJYYxGDKRnvlTHKP9FlQB5ZVxstTv09n81gBC8Li2p0WCudDxbWtqDQokxpOriN25p-PoV2byboI/s1600/repl_delay.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 353px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcOEOpbkBH6L2bKIQ5in77ACHIY1MK6a7AQJy_U_rS_dF34TBG-4fa_T2k5nchc4fsJYYxGDKRnvlTHKP9FlQB5ZVxstTv09n81gBC8Li2p0WCudDxbWtqDQokxpOriN25p-PoV2byboI/s600/repl_delay.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5666344103902388802" /></a><br /><br />In this benchmark I executed 4,000 updates per second on master. 
On the slave server, by default the slave lagged continuously because it could handle only 1,779 updates per second. When Replication Booster was started on the slave, the slave could execute 5,418 updates per second. This was higher than the master's qps, so Seconds_Behind_Master gradually decreased. After the slave caught up with the master, it could execute the same volume of updates as the master (4,000 update/s), so no replication delay happened. This means that in this environment we can raise maximum update traffic considerably (1,780 update/s -> 4,000-5,400 update/s) without investing in new H/W.<br /><br />I also tested on some of our production slaves (not used for services) and they showed good results, too. I got 30-300% improvements, depending on the cache hit ratio. If data was fully cached, of course I didn't get any benefit, but it didn't cause negative impacts either.<br /><br />I hope this tool is interesting to you.<br /><br />In this blog post, I'll explain the background for developing this tool and its basic design. I believe many more optimizations can be done in the future. Your feedback is welcome.<br /><br /><br /><h3>Good concurrency, bad single threaded performance</h3><br />I mentioned at Percona Live London that using SSDs on slaves is a good practice to reduce replication delay, and SATA/SAS SSDs are practical enough because the unit price is much cheaper than PCI-E SSDs, and SATA/SAS SSDs show decent concurrency with many drives when using a recent RAID controller (most applications actually do not need 30,000-50,000 read iops, even when running many MySQL instances on the same machine). It is certainly an advantage that many SATA/SAS drives (6-10) can be installed in a 1U box.<br /><br />The thing I'm most concerned about with SATA/SAS SSDs is single thread read iops. You can get only about 2,000 read iops from a SATA/SAS SSD behind a RAID controller.
If you do not use a RAID controller, it is possible to get 3,000+ read iops, but this is still much lower than a PCI-Express SSD, which can deliver 10,000 single thread read iops.<br /><br />When using SATA/SAS SSDs, it is easy to predict that the slave delays much earlier than with a PCI-E SSD. Especially when running multiple MySQL instances on a single server, innodb_buffer_pool_size has to be small (i.e. 4GB-12GB), so lots of disk reads will happen. By using 6-10 SATA/SAS drives, maximum throughput can be competitive against a PCI-Express SSD, but single thread read iops is not improved. This is an issue.<br /><br />"Slave prefetching" is a well known, great approach to make the SQL thread faster.<br /><h3>What is slave prefetching?</h3><br />The concept of "slave prefetching" is (I think) well known, but I'll briefly describe it here in case you don't know it.<br /><br />The SQL thread is single threaded. When the SQL thread has to do lots of disk i/o by itself, replication is easily delayed. In almost all cases of slave lagging, the I/O thread has received all binary log events (and saved them as relay logs), but the SQL thread delays execution due to massive random disk i/o. So there are many relay log events between the SQL thread's current position (Relay_Log_Pos) and the end of the relay logs.<br /><br />Random disk reads happen when target blocks (records/indexes) are not cached. If they are cached, random reads won't happen. If you can cache all entries before the SQL thread executes, the SQL thread does not have to do random disk reads, so it can be much faster.<br /><br />How can you do that? Read relay logs before the SQL thread executes, convert DML statements (especially UPDATE) to SELECT, then execute the SELECTs on the slave in parallel.<br /><br />I believe this concept was introduced to the MySQL community by Paul Tackfield at YouTube 4-5 years ago.
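The core conversion step described above — rewriting DML into a SELECT that touches the same rows — can be sketched with a couple of regular expressions. This is a simplified illustration assuming single-table UPDATE/DELETE statements with a WHERE clause; it is not the actual code of mk-slave-prefetch or Replication Booster:

```python
import re

# Simplified prefetch converter: turn a single-table UPDATE/DELETE into a
# SELECT over the same WHERE clause, so running it ahead of the SQL thread
# pulls the target rows/indexes into the buffer pool.
UPDATE_RE = re.compile(r"^\s*UPDATE\s+(\S+)\s+SET\s+.+?\s+(WHERE\s+.+)$",
                       re.IGNORECASE | re.DOTALL)
DELETE_RE = re.compile(r"^\s*DELETE\s+FROM\s+(\S+)\s+(WHERE\s+.+)$",
                       re.IGNORECASE | re.DOTALL)

def dml_to_select(stmt):
    """Return a prefetching SELECT, or None if the statement is unsupported."""
    for pattern in (UPDATE_RE, DELETE_RE):
        m = pattern.match(stmt)
        if m:
            table, where = m.group(1), m.group(2)
            return "SELECT 1 FROM %s %s" % (table, where)
    return None  # e.g. INSERTs and multi-table statements are left alone

print(dml_to_select("UPDATE user SET money = money + 100 WHERE user_id = 3"))
# SELECT 1 FROM user WHERE user_id = 3
```

A real tool also has to cope with quoted strings containing the word WHERE, multi-table DML, and so on, which is part of why a robust implementation matters.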
mk-slave-prefetch is an open source implementation.<br /><br /><h3>Desire for a C/C++ based, raw relay log event handling tool</h3><br />At first I tested mk-slave-prefetch on my benchmarks. But as far as I tested, unfortunately it didn't work as I expected. I think the main reasons are as follows:<br /><br />* mk-slave-prefetch uses mysqlbinlog to parse relay logs. But mysqlbinlog is not as flexible and fast as reading raw relay log events. For example, mysqlbinlog output events have to go through a file to reach the main prefetching program. mysqlbinlog is an external command line tool, so the main prefetching program has to fork a new process to run mysqlbinlog, which opens and closes the relay logs every time.<br /><br />* mk-slave-prefetch is written in Perl. In general, a prefetching tool has to be fast enough to read, convert and execute SELECT statements before the SQL thread executes the originals. The tool has to be multi-threaded. The tool probably has to run on the same machine as the MySQL slave, in order to minimize network overheads. Its resource consumption (CPU and memory) should be small enough that it doesn't hurt MySQL server performance.<br /> I don't believe Perl is a good programming language for developing such a tool.<br /><br /><br />I believe C/C++ is the best programming language for this purpose. And I believe handling raw relay log events is much more efficient than using mysqlbinlog.<br /><br />For the above reasons, I decided to develop a new slave prefetching tool by myself. I had some experience parsing binary/relay logs from developing MHA, so at first I planned to create a simple relay log parser program. But I immediately changed my mind, and tried <a href="https://code.launchpad.net/mysql-replication-listener">mysql-replication-listener (Binlog API)</a>. Binlog API is a newly released utility from the Oracle MySQL team. Binlog API has a "file driver" interface, which makes it possible to parse a binary log or relay log file and handle events one by one.
By using Binlog API, handling raw binlog events becomes much easier. For example, it's easy to parse binlog events, get updated entries, store to external software such as Lucene/Hadoop, etc.<br /><br />Oracle says Binlog API is pre-alpha. But as far as I have tested for slave prefetching purpose, it works very well. It's fast enough, and I didn't encounter any crashing or memory leak issues. So I decided to develop a new slave prefetching tool using Binlog API.<br /><h3>Introduction to Replication Booster for MySQL</h3><br /> I named the new slave prefetching tool as "Replication Booster". Keywords "slave" and "prefetch" were already used by mk-slave-prefetch, so I used different words.<br /><br /> The below figure is a basic architecture of Replication Booster.<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp2B_NacYV1RdEQ6WhtHpUHqHQ2C2WLN4QhGhyQA32j2RJssN6uoDNVzDSv-CCny93rMmU7S_v61orO4FwPX6o9D6CyYNsWyAW_M954OixypK2gUBS0qItZQFBBHlM7mfTFngPX3xlOMw/s1600/repl_booster.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 386px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp2B_NacYV1RdEQ6WhtHpUHqHQ2C2WLN4QhGhyQA32j2RJssN6uoDNVzDSv-CCny93rMmU7S_v61orO4FwPX6o9D6CyYNsWyAW_M954OixypK2gUBS0qItZQFBBHlM7mfTFngPX3xlOMw/s600/repl_booster.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5666402026904189698" /></a><br /><br /><h4>Design notes</h4>- Replication Booster is a separated tool (runs as a MySQL client). It works with normal MySQL 5.0/5.1/5.5/5.6. Starting/stopping Replication Booster is possible without doing anything on MySQL server side.<br /><br />- Replication Booster is written in C/C++, and using boost::regex for converting UPDATE/DELETE to SELECT. Binlog API also uses boost.<br /><br />- Using Binlog API to parse relay logs, not using mysqlbinlog<br /> - Using file driver, not tcp driver. 
The file driver does not connect to the MySQL server; it just reads relay log files. So even if the file driver has bugs, the impact is rather limited (if it had a memory leak, that would be serious, but I haven't encountered one yet).<br /><br />- The main thread parses relay log events, picks up query log events, and passes them to internal FIFO queues<br /> - Binlog API has an interface to get a binlog event header (event type, timestamp, server-id, etc) and an event body. So it is easy to pick up only query log events.<br /> - Parsing row based events is not supported yet. It should be worth implementing in the near future.<br /><br />- Multiple worker threads pop query events from the queues, and convert them to SELECT statements<br /><br />- A dedicated thread (monitoring thread) keeps track of the SQL thread's current position (Relay_Log_Pos)<br /><br />- Worker threads do not execute a SELECT statement if the query's position is behind the SQL thread's current position, because it's no longer needed (too late).<br /><br />- The main thread stops reading relay log events if the event's timestamp is N seconds (default 3) ahead of the SQL thread's timestamp<br /> - This is for cache efficiency. Reading many more events than needed causes negative impacts. In the worst case, cache entries that the SQL thread needs now are wiped out by newly selected blocks.<br /><br />- When the slave is not delayed, Replication Booster should not cause negative impacts. It shouldn't use noticeable CPU/disk/memory resources. It shouldn't block MySQL server activities by holding locks, either. Of course, it shouldn't execute converted SELECT statements, because they are not useful anymore. The last point is not easy to get right across various kinds of environments (i.e. HDD/SSD/etc), but should be controllable by some external configuration parameters<br /><br />- Bugs in Replication Booster should not result in a MySQL server outage.<br /><br />- Replication Booster works locally. It does not allow connecting to remote MySQL servers.
This is for performance reasons. Executing tens of thousands of queries per second from this tool remotely would cause massive <a href="http://yoshinorimatsunobu.blogspot.com/2011/10/testing-mysql-563-network-performance.html">fcntl() contention</a> and use significant network resources (both bandwidth and CPU time). I don't like that.<br /><h4>Configuration Parameters</h4>--threads: Number of worker threads. Each worker thread converts binlog events and executes SELECT statements. Default is 10 (threads).<br /><br />--offset-events: Number of binlog events that the main thread (relay log reader thread) skips initially when reading relay logs. This number should be high when you have faster storage devices such as SSDs. Default is 500 (events).<br /><br />--seconds-prefetch: The main thread stops reading relay log events when the event's timestamp is --seconds-prefetch seconds ahead of the SQL thread's current timestamp. After that the main thread starts reading relay logs from the SQL thread's position again. If this value is too high, worker threads will execute many more SELECT statements than necessary. Default value is 3 (seconds).<br /><br />--millis-sleep: If the --seconds-prefetch condition is met, the main thread sleeps --millis-sleep milliseconds before it starts reading the relay log again. Default is 10 milliseconds.<br /><br />- MySQL connection parameters: MySQL slave user, password, socket file or local ip/port<br /><br /><h4>How to verify Replication Booster works on your environment</h4>You may want to run Replication Booster where Seconds_Behind_Master sometimes grows. If Replication Booster works as expected, you get the following benefits.<br /><br /> - Seconds_Behind_Master decreases, or its growth rate decreases<br /> - Update speed improves (i.e. Com_update per second increases)<br /><br />Replication Booster keeps some statistics, and prints them when the program is terminated (Ctrl+C), like below.
If the slave lags but "Executed SELECT queries" is almost zero, something is wrong.<pre>Running duration: 847.846 seconds<br />Statistics:<br /> Parsed binlog events: 60851473<br /> Skipped binlog events by offset: 8542280<br /> Unrelated binlog events: 17444340<br /> Queries discarded in front: 17431937<br /> Queries pushed to workers: 17431572<br /> Queries popped by workers: 5851025<br /> Old queries popped by workers: 3076<br /> Queries discarded by workers: 0<br /> Queries converted to select: 5847949<br /> Executed SELECT queries: 5847949<br /> Error SELECT queries: 0<br /> Number of times to read relay log limit: 1344<br /> Number of times to reach end of relay log: 261838</pre><br />I haven't spent much time on this project yet (I just started a few weeks ago). The current algorithm is simple. I believe many more optimizations can be done in the future, but even so the current benchmark numbers are pretty good. I hope we can use this tool in many places where we want to avoid replication delay but don't want to spend too much money on faster storage devices.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-78025228423946912492011-10-04T12:37:00.000-07:002011-10-04T12:37:59.121-07:00Testing MySQL 5.6.3 network performance improvementsI'm excited to see the new features in MySQL 5.6.3. Replication enhancements such as parallel SQL threads, crash-safe slaves and binlog checksums have been desired for years. I really appreciate that the MySQL development team has released 5.6.3 in a timely manner.<br /><br /> In this blog entry, I'd like to pick up one of my favorite performance enhancements in MySQL 5.6.3: "network performance improvements". This was reported by Mark Callaghan as <a href="http://bugs.mysql.com/bug.php?id=54790">bug#54790</a> over one year ago, and it's finally fixed (I assume mainly by Davi Arnaut at the MySQL dev team) in 5.6.3.
The root cause of this problem was that prior to 5.6.3, under high loads MySQL internally had to call fcntl() lots of times to change non-blocking sockets to blocking (with timeout and alarm) and vice versa. fcntl() is serialized on Linux, so it doesn't scale with many CPU cores. You may suffer from this performance issue when clients are not locally based (= in most cases) and query per second is quite high (hundreds of thousands or even more). In many real world environments qps is not so much high, but in the near future these numbers will not be unrealistic, with faster H/W components.<br /><br /> I have tested some simple read-only benchmarks on 16-core and 24-core box, and results in 5.6.3 are great.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUBxKvgJwF9QJGeTBdNvRA7kDqUzjQslrXiKnnRtZNRSKVyOe1ibJCPXSKca0gm2fYV06FsiRmMKPCuZCEsTE1j-fB-VLtqjPhA1MaqB7Am9qi3lDA5SgcilyByL69xPj80xcY2-hbBHY/s1600/5.5vs5.6_1.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUBxKvgJwF9QJGeTBdNvRA7kDqUzjQslrXiKnnRtZNRSKVyOe1ibJCPXSKca0gm2fYV06FsiRmMKPCuZCEsTE1j-fB-VLtqjPhA1MaqB7Am9qi3lDA5SgcilyByL69xPj80xcY2-hbBHY/s600/5.5vs5.6_1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659677023045220946" /></a><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjteiINp4lTGGAuCSSwehoWzNtR3V_5SaeMssjrmQMI_E9K8pl4g66lGwBqOCR9WMefacarknLeFkH7LJknY0Fls560RoNSQ1zCnTz2Ulu1_u2Am-yCBTNMlHQ2JCUoHOILFSTqnxu1sso/s1600/5.5vs5.6_2.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjteiINp4lTGGAuCSSwehoWzNtR3V_5SaeMssjrmQMI_E9K8pl4g66lGwBqOCR9WMefacarknLeFkH7LJknY0Fls560RoNSQ1zCnTz2Ulu1_u2Am-yCBTNMlHQ2JCUoHOILFSTqnxu1sso/s600/5.5vs5.6_2.png" 
border="0" alt="" id="BLOGGER_PHOTO_ID_5659677358120582082" /></a><br /><br /> These benchmarks are in-memory PK lookups from a remote client. Connections are persistent. The client server never became a bottleneck. Apparently the maximum throughput of 5.6.3 is much higher than 5.5 (75% higher on 24 cores, 34% higher on 16 cores). It's interesting that the 16 core vs 24 core performance difference is not so large in 5.5 (only 10%), but it's huge in 5.6 (45%). A 45% improvement is close to the theoretical maximum (16 cores to 24 cores: 50%), so for these kinds of benchmarks, 5.6 is really scalable at least up to 24 cores.<br /><br /> Though many MySQL scalability issues can be mitigated by running multiple instances on a single server, you couldn't avoid this performance problem that way, because fcntl() calls are serialized on the Linux kernel side. So this performance fix is really important. With MySQL 5.6, I think we can take more scaling-up strategies rather than scaling out.<br /><br /> How about disk i/o bound benchmarks? On disk bound workloads, network performance improvements do not have as much impact of course (since qps is much lower), but as shown below, 5.6 showed good results compared to 5.5. These performance improvements mainly come from reduced InnoDB internal kernel mutex contention in 5.6.
I'd like to investigate more time to understand internal differences.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyyNvIVHmZD96AiNf7r1J73NQ8k0bGoL50Ycewt8AD342wcGdPjZKrYaHKYHIFJT34ANFday4tXOoUsiAeN2Gu2P6aUwPEe776kr815br4sb1MdZ3esE68LPdG329HFefKDaG-qjwOhzA/s1600/5.5vs5.6_3.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyyNvIVHmZD96AiNf7r1J73NQ8k0bGoL50Ycewt8AD342wcGdPjZKrYaHKYHIFJT34ANFday4tXOoUsiAeN2Gu2P6aUwPEe776kr815br4sb1MdZ3esE68LPdG329HFefKDaG-qjwOhzA/s600/5.5vs5.6_3.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659677424888054498" /></a><br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrKT5OVaiiZw_vMTJa5h0XNgiCl645MlF0tOvK6c5ZoMJim0Vifzf0tryZFgjUIv_5gxB-79pzYVjdxXWFnM4qKZNN3MlFi3mT-yw9lMxcy2xiICGqKQrZ6T5PqoCcN9ZSMFqNKpU2ELI/s1600/5.5vs5.6_4.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 263px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrKT5OVaiiZw_vMTJa5h0XNgiCl645MlF0tOvK6c5ZoMJim0Vifzf0tryZFgjUIv_5gxB-79pzYVjdxXWFnM4qKZNN3MlFi3mT-yw9lMxcy2xiICGqKQrZ6T5PqoCcN9ZSMFqNKpU2ELI/s600/5.5vs5.6_4.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5659696865065418514" /></a><br /><br /> These benchmarks were SATA SSD bound (8 drives, RAID1+0) benchmarks. InnoDB buffer pool size was 1GB and active data size was 2.5GB. 5.6 shows better results at higher concurrency.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-3707800315205997022011-09-16T03:54:00.000-07:002011-09-16T03:54:29.896-07:00MHA for MySQL 0.52 released I'm happy to announce that MHA for MySQL (Master High Availability Manager and tools for MySQL) version 0.52 has been released. 
Downloads are available <a href="http://code.google.com/p/mysql-master-ha/downloads/list">here</a>.<br /> This is a stable release. I have been using MHA on our (DeNA's) cross-datacenter environments. We had tens of failover opportunities and MHA worked successfully in all cases.<br /><br /><br /> Here are the <a href="http://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes">release notes</a>.<br /><br />* multi-master configuration is supported<br /> See <a href="http://yoshinorimatsunobu.blogspot.com/2011/08/mysql-mha-support-for-multi-master.html">my previous blog entry</a> for details.<br /><br />* Package name changed<br /> To avoid potential trademark issues, I decided to change the MHA package names (downloadable file names and GitHub repository names) from "MySQL-MasterHA-Manager" to "mha4mysql-manager", and from "MySQL-MasterHA-Node" to "mha4mysql-node".<br /><br /> - URLs of the MHA development repositories on GitHub changed as below.<br /> Manager: <a href="https://github.com/yoshinorim/mha4mysql-manager">https://github.com/yoshinorim/mha4mysql-manager</a><br /> Node: <a href="https://github.com/yoshinorim/mha4mysql-node">https://github.com/yoshinorim/mha4mysql-node</a><br /> If you currently follow the previous branches, please modify .git/config to point at the new URLs to pull newer versions.<br /><br /> - If you installed the MHA Node rpm package (version 0.50) provided from our <a href="http://code.google.com/p/mysql-master-ha/downloads/list">download site</a>, when upgrading to a newer version, please explicitly uninstall the current rpm package (rpm -e MySQL-MasterHA-Node-0.50-...) and install the newer version.<br /><br /> All source code, test cases, command line programs (i.e. masterha_manager), parameters etc are unchanged, so the product quality of MHA should not be affected by this change.<br /><br />* rpm and deb packages are now provided<br /> Prior to 0.52, no rpm package was provided for MHA Manager, and no deb packages were provided at all.
Now both rpm and deb packages are provided, for both the Manager and the Node (running on the MySQL server).<br /> See the <a href="http://code.google.com/p/mysql-master-ha/wiki/Installation">how to install</a> page for details.<br /><br /><br /><h3>MHA for MySQL, project info</h3><br /> Project top page: <a href="http://code.google.com/p/mysql-master-ha/">http://code.google.com/p/mysql-master-ha/</a><br /><br /> Documentation: <a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6">http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6</a><br /><br /> Source tarball, deb and rpm packages (stable release): <a href="http://code.google.com/p/mysql-master-ha/downloads/list">http://code.google.com/p/mysql-master-ha/downloads/list</a><br /><br /> The latest source repository (development tree, url changed): <a href="https://github.com/yoshinorim/mha4mysql-manager">https://github.com/yoshinorim/mha4mysql-manager</a> (Manager source) and <a href="https://github.com/yoshinorim/mha4mysql-node">https://github.com/yoshinorim/mha4mysql-node</a> (per-MySQL-server source)<br /><br /> Commercial support for MHA is available from <a href="http://www.skysql.com/">SkySQL</a>.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-60576285641076444262011-09-08T01:52:00.001-07:002011-09-08T01:53:45.471-07:00Speaking at Oracle Open World and Percona Live LondonIn October, I'll speak at two big conferences - <a href="http://www.oracle.com/openworld/index.html">Oracle Open World</a> and <a href="http://www.percona.com/live/london-2011/">Percona Live London</a>. I appreciate Oracle and Percona giving me the opportunity to talk.
It's always fun for me to speak at overseas conferences and meet friends.<br /><p/><br /> At Oracle Open World, I have one session: "Scaling and Monitoring MySQL for Rapidly Growing Social Gaming" (here is the <a href="http://technocation.org/files/doc/2011_OOW_MySQL_Content.html">timetable</a>).<br /><p/><br /> At Percona Live London, I have one 3-hour tutorial and one 30-minute session: "<a href="http://www.percona.com/live/london-2011/tutorial/linux-and-hw-optimizations-for-mysql/">Linux and H/W optimizations for MySQL</a>" and "<a href="http://www.percona.com/live/london-2011/session/mha-introducing-automated-mysql-master-failover-solution/">MHA: Introducing automated MySQL master failover solution</a>".<br /><p/><br /> I'll make sure that all of these talks are practical. Basic theory and generic benchmarks will be covered since they are important to understand, but I'll focus more on how we use MySQL in production (on 1000+ MySQL servers).<br /><p/><br /> BTW, this is my first visit to London and I'm looking forward to watching the Premier League, though I've heard that it's very difficult to get tickets.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-15320241666613064972011-08-29T07:31:00.010-07:002011-08-29T17:31:10.232-07:00MySQL-MHA: Support for multi-master configuration After publishing <a href="http://code.google.com/p/mysql-master-ha/">MySQL MHA</a> at the end of July, I received a few requests for supporting multi-master configurations. So I spent time extending MHA to support multi-master, and it's now published in <a href="https://github.com/yoshinorim/MySQL-MasterHA-Manager/tree/multimaster">a separate GitHub branch</a> (a new development tree).
I'll give people some time to evaluate the new feature, and after verifying that it's stable enough, I'll merge it into the main branch and release it as a new version (<a href="http://code.google.com/p/mysql-master-ha/downloads/list">tarball release</a>).
<br />
<br />Below are the procedures to install the MHA Manager multi-master tree.
<br /><pre><span style="font-family: monospace; line-height: 1px">$ git clone git://github.com/yoshinorim/MySQL-MasterHA-Manager.git
<br />$ cd MySQL-MasterHA-Manager
<br />$ git checkout -b multimaster origin/multimaster
<br />$ perl Makefile.PL
<br />$ make
<br />$ sudo make install</span></pre> If you have not installed the Module::Install Perl module and have difficulty installing it, you can simply copy the module from the MHA Manager tarball package as below.
<br /><pre><span style="font-family: monospace; line-height: 1px">$ cd MySQL-MasterHA-Manager(from github)
<br />$ cp -rp /path/to/MHA-Manager-Tarball/inc ./
<br />$ perl Makefile.PL
<br />...</span></pre>
<br />Here are some notes to make MHA work with multi-master.<br/>
<br /><br/>
<br />* Only one primary master (writable) is allowed. The MySQL global variable "read_only=1" must be set on the other MySQL masters.<br/>
<br />* By default, all managed servers should be in a two-tier replication topology. If you want to manage a replication topology with three or more tiers, the "multi_tier_slave=1" parameter must be set in the MHA configuration file.<br/>
<br /><br/>
<br />
<br />Example replication settings are as follows.<br/>
<br /><br/>
<br />
<br />1) Simple two-tier multi-master configuration
<br /><pre><span style="font-family: monospace; line-height: 1px">       M1(host1,RW) <----> M2(host2,read-only)
<br />        |
<br />  +-----+--------+
<br />S1(host3,R)   S2(host4,R)
<br />
<br />=> After failover
<br />
<br />       M2(host2,RW)
<br />        |
<br />  +-----+--------+
<br />S1(host3,R)   S2(host4,R)
<br />
<br />Configuration example:
<br />
<br />[server1]
<br />hostname=host1
<br />candidate_master=1
<br />
<br />[server2]
<br />hostname=host2
<br />candidate_master=1
<br />
<br />[server3]
<br />hostname=host3
<br />
<br />[server4]
<br />hostname=host4</span></pre>On host2, SET GLOBAL read_only=1 must be set.<br/>
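<br />As a side note, MHA's configuration file above is a plain ini file, so it is easy to sanity-check from a script before starting the manager. Below is a minimal Python sketch (my own illustration, not part of MHA; the helper name is made up) that parses such a config and lists which hosts are flagged as failover candidates:<br />

```python
import configparser

# The same ini-style configuration shown above (copied here for the demo).
MHA_CONF = """
[server1]
hostname=host1
candidate_master=1

[server2]
hostname=host2
candidate_master=1

[server3]
hostname=host3

[server4]
hostname=host4
"""

def candidate_masters(conf_text):
    """Return hostnames flagged with candidate_master=1."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return [section["hostname"] for section in parser.values()
            if section.get("candidate_master") == "1"]

print(candidate_masters(MHA_CONF))  # ['host1', 'host2']
```

<br />The same kind of check can catch typos such as a missing hostname before MHA Manager is started.<br />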
<br /><br/>
<br /><br/>
<br />
<br />
<br />2) Three tier multi-master configuration
<br />
<br /><pre><span style="font-family: monospace; line-height: 1px">       M1(host1,RW) <-----------------> M2(host2,read-only)
<br />        |                                |
<br />  +-----+--------+                       +
<br />S1(host3,R)   S2(host4,R)            S3(host5,R)
<br />
<br />
<br />=> After failover
<br />
<br />       M2(host2,RW)
<br />        |
<br />  +-----+----------+---------------+
<br />S1(host3,R)   S2(host4,R)      S3(host5,R)
<br />
<br />
<br />Configuration example:
<br />[server default]
<br />multi_tier_slave=1
<br />
<br />[server1]
<br />hostname=host1
<br />candidate_master=1
<br />
<br />[server2]
<br />hostname=host2
<br />candidate_master=1
<br />
<br />[server3]
<br />hostname=host3
<br />
<br />[server4]
<br />hostname=host4
<br />
<br />[server5]
<br />hostname=host5</span></pre>In this case, host5 is a third-tier slave, so MHA does not manage it (MHA does not execute CHANGE MASTER on host5 when the primary master host1 fails). When the current master host1 goes down, host2 becomes the new master, and host5 can keep replicating from host2 without any changes.<br/>
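<br />The behavior just described (re-pointing only the second-tier slaves and leaving the third tier untouched) can be modeled with a toy Python sketch. This is purely illustrative and not MHA's actual logic, which is written in Perl and far more involved:<br />

```python
def failover(parent, failed, new_master):
    """Re-point only the direct (second-tier) slaves of the failed
    master to new_master; deeper tiers keep their current masters.
    `parent` maps each replicating host to its master."""
    updated = {}
    for host, master in parent.items():
        if host == new_master:
            continue                    # the new master replicates from nobody
        elif master == failed:
            updated[host] = new_master  # CHANGE MASTER would be issued here
        else:
            updated[host] = master      # untouched, e.g. a third-tier slave
    return updated

# Topology of example 2: host2 is the co-master, host3/host4 replicate
# from host1, and host5 is a third-tier slave under host2.
topology = {"host2": "host1", "host3": "host1",
            "host4": "host1", "host5": "host2"}
print(failover(topology, failed="host1", new_master="host2"))
# {'host3': 'host2', 'host4': 'host2', 'host5': 'host2'}
```

<br />Note how host5's entry is unchanged: it was already replicating from host2, which matches the behavior described above.<br />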
<br /><br/>
<br />If you want to try MHA, the <a href="http://code.google.com/p/mysql-master-ha/wiki/Tutorial">tutorials</a> are a good starting point.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-10349542432873886842011-07-23T07:49:00.001-07:002011-07-23T07:49:43.288-07:00Announcing MySQL-MHA: "MySQL Master High Availability manager and tools"<span style="font-weight:bold;">I have published "<a href="http://code.google.com/p/mysql-master-ha/">MySQL MHA</a>", which fully automates MySQL master failover. You can also get commercial support from <a href="http://www.skysql.com/en/index">SkySQL</a>. Try MHA today!</span><br /><br /><br /> Today I'm happy to announce that I have released <a href="http://code.google.com/p/mysql-master-ha/">MySQL-MHA: MySQL Master High Availability manager and tools</a> as open source software (GPL v2 license). The text below is part of the <a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6">MHA documentation</a>. I'd be glad if you are interested in MHA.<br /><br /> A primary objective of MHA is automating master failover and slave promotion with short (usually 10-30 seconds) downtime, without suffering from replication consistency problems, without spending money on lots of new servers, without performance penalties, without added complexity (it is easy to install), and without changing existing deployments. MHA also provides a way to do a scheduled online master switch: changing the currently running master to a new master safely, within a few seconds (normally 0.5-2 seconds) of downtime (blocking writes only).<br /><br /> The difficulty of master failover is one of the biggest issues in MySQL. Many people have been aware of this issue, but in most cases there were no practical solutions. I created MHA to make our (<a href="http://dena.jp/en/">DeNA</a>'s) existing 100+ 5.0/5.1/5.5 and future MySQL applications highly available. 
I think many people outside DeNA can also use MHA pretty easily.<br /><br /><h3>Project Info</h3><br /> Project top page: <a href="http://code.google.com/p/mysql-master-ha/">http://code.google.com/p/mysql-master-ha/</a><br /><br /> Documentation: <a href="http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6">http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6</a><br /><br /> Source tarball and rpm package (stable release): <a href="http://code.google.com/p/mysql-master-ha/downloads/list">http://code.google.com/p/mysql-master-ha/downloads/list</a><br /><br /> The latest source repository (dev release): <a href="https://github.com/yoshinorim/MySQL-MasterHA-Manager">https://github.com/yoshinorim/MySQL-MasterHA-Manager</a> (Manager source) and <a href="https://github.com/yoshinorim/MySQL-MasterHA-Node">https://github.com/yoshinorim/MySQL-MasterHA-Node</a> (per-MySQL-server source)<br /><br /><br /><h3>Overview</h3><br /> Master failover is not as trivial as you might think. Suppose you run a single master and multiple slaves, which is the most common MySQL deployment. If the master crashes, you need to pick one of the latest slaves, promote it to the new master, and let the other slaves start replication from the new master. This is actually not trivial. Even if you could identify the latest slave, the other slaves might not have received all binary log events. If you let those slaves connect to the new master and start replication, they lose some transactions, which causes consistency problems. To avoid consistency problems, you need to identify which binlog events have not been sent to each slave, and apply the lost events to each slave before starting replication. This is a very complex approach, and doing the recovery correctly by hand is very difficult. 
This is illustrated in the <a href="http://www.slideshare.net/matsunobu/automated-master-failover">slides</a> (especially <a href="http://www.slideshare.net/matsunobu/automated-master-failover/10">p.10</a>, shown below) that I presented at the MySQL Conference and Expo 2011.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmBpSm_929jWrtu_4Bb1yJzWWkXJ_WRerD6jwzy7ITrQX_hspvwbXRgMvnftj_72P3WpaRqOM6Etxv2ix0ebRKhWV3lekkB3_lyvzMKa4ApqsN2WlN52M3zC9JaZj0oPGx4zNYlI0ybT4/s1600/mha-problem.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 372px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmBpSm_929jWrtu_4Bb1yJzWWkXJ_WRerD6jwzy7ITrQX_hspvwbXRgMvnftj_72P3WpaRqOM6Etxv2ix0ebRKhWV3lekkB3_lyvzMKa4ApqsN2WlN52M3zC9JaZj0oPGx4zNYlI0ybT4/s600/mha-problem.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632139754372425378" /></a><br />Fig: Master Failover: What makes it difficult?<br /><br /> Currently most MySQL replication users are forced to do manual failover when a master crashes. But it is not uncommon for it to take more than one hour of downtime to complete failover. Each slave is unlikely to have received the same relay log events, so you may need to fix consistency problems later. Though a master crash does not happen often, it is really serious once it happens. <br /><br /> MHA was invented to fix these issues. MHA provides the following functionality, and can be useful in many deployments where requirements such as high availability, data integrity, and almost non-stop master maintenance are desired.<br /><br /><span style="font-weight:bold;">* Automated master monitoring and failover</span><br /><br /> MHA monitors a MySQL master in an existing replication environment, detects master failure, and performs master failover automatically. 
Even if some of the slaves have not received the latest relay log events, MHA automatically identifies the differential relay log events from the latest slave, and applies the differential events to the other slaves, so all slaves become consistent. MHA can normally do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, and a few seconds for applying the differential relay logs to the new master, so total downtime is normally 10-30 seconds). In addition, you can define a specific slave as a candidate master (setting priorities) in a configuration file. Since MHA fixes consistency between slaves, you can promote any slave to the new master, and consistency problems (which might cause sudden replication failure) will not happen.<br /><br /><span style="font-weight:bold;">* Interactive (manual) master failover</span><br /><br /> You can also use MHA for just failover, not for monitoring the master: MHA can perform master failover interactively.<br /><br /><span style="font-weight:bold;">* Non-interactive master failover</span><br /><br /> Non-interactive master failover (not monitoring the master, but doing failover automatically) is also supported. This feature is especially useful when you already use software that monitors the MySQL master. For example, you can use <a href="http://www.linux-ha.org/wiki/Pacemaker">Pacemaker (Heartbeat)</a> for detecting master failure and virtual IP address takeover, and use MHA for master failover and slave promotion.<br /><br /><span style="font-weight:bold;">* Online switching of the master to a different host</span><br /><br /> In many cases, it is necessary to migrate an existing master to a different machine (e.g. the current master has H/W problems on the RAID controller or RAM, or you want to replace it with a faster machine). This is not a master crash, but scheduled master maintenance is required to do it. 
Scheduled master maintenance causes downtime (at least you cannot write to the master), so it should be done as quickly as possible. On the other hand, you should block/kill currently running sessions very carefully, because otherwise consistency problems between the masters might happen (e.g. "updating master 1, updating master 2, committing master 1, getting an error on committing master 2" results in data inconsistency). Both a fast master switch and graceful blocking of writes are required.<br /> MHA provides a way to do that. You can switch the master gracefully within 0.5-2 seconds of blocked writes. In many cases 0.5-2 seconds of writer downtime is acceptable, so you can switch the master even without allocating a scheduled maintenance window. This means you can take actions such as upgrading to a higher version or a faster machine much more easily.<br /><br /><br /><br /><h3>Architecture</h3><br /> When a master crashes, MHA recovers the rest of the slaves as below.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr6KlXp6T8Ai-PHvpuHlcwQaxDw_F9SPbnFSgQD1stw0jA1FUb8FUnYDTmT3F9qL9NMN5DAJSdW6zbMwBQnn87wsxIX0Y43pFS_5wiSd1h6lIm0JRP27IZ_YKDGqzI1vvDEwsRYYlU60s/s1600/mha_recovery_procedure.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 399px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr6KlXp6T8Ai-PHvpuHlcwQaxDw_F9SPbnFSgQD1stw0jA1FUb8FUnYDTmT3F9qL9NMN5DAJSdW6zbMwBQnn87wsxIX0Y43pFS_5wiSd1h6lIm0JRP27IZ_YKDGqzI1vvDEwsRYYlU60s/s600/mha_recovery_procedure.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5629914668690099458" /></a><br />Fig: Steps for recovery<br /><br /> Basic algorithms are described in the <a href="http://www.slideshare.net/matsunobu/automated-master-failover">slides</a> presented at the MySQL Conference and Expo 2011, especially from page <a href="http://www.slideshare.net/matsunobu/automated-master-failover/13">no.13</a> to <a 
href="http://www.slideshare.net/matsunobu/automated-master-failover/34">no.34</a>.<br /><br /> In relay log files on slaves, the master's binary log positions are written in "end_log_pos" sections (<a href="http://www.slideshare.net/matsunobu/automated-master-failover/18">example</a>). By comparing the latest end_log_pos between slaves, we can identify which relay log events have not been sent to each slave. MHA internally recovers slaves (fixes consistency issues) by using this mechanism. In addition to the basic algorithms covered in the <a href="http://www.slideshare.net/matsunobu/automated-master-failover">slides</a> at the MySQL Conf 2011, MHA does some optimizations and enhancements, such as generating differential relay logs very quickly (independent of relay log file size) and making recovery work with row-based formats.<br /><br /><h4>MHA Components</h4><br /> MHA consists of MHA Manager and MHA Node, as below. <br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuvkC9nliwWH_-12hFqIj-zE7ungVpfk-VAMQIlazKiVtUI9HtsMdMfzfJ7pT9LddO__-sAgWePLWyF4F4MwHw0JRZP6zd_wnyKwSukQjSlXQSd64cSU5xd7Io1NAD_zeNC0QwBrgWYL4/s1600/components_of_mha.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 397px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuvkC9nliwWH_-12hFqIj-zE7ungVpfk-VAMQIlazKiVtUI9HtsMdMfzfJ7pT9LddO__-sAgWePLWyF4F4MwHw0JRZP6zd_wnyKwSukQjSlXQSd64cSU5xd7Io1NAD_zeNC0QwBrgWYL4/s600/components_of_mha.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5629913449029795746" /></a><br />Fig: MHA components<br /><br /> MHA Manager has manager programs for monitoring the MySQL master, controlling master failover, etc.<br /><br /> MHA Node has failover helper scripts for parsing MySQL binary/relay logs, identifying the relay log position from which relay logs should be applied to other slaves, applying events to the target slave, and so on. 
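<br />The end_log_pos comparison described above can be sketched as follows. This is only a toy model in Python (real relay logs are binary files that MHA Node parses, and the events and positions here are made up), but it shows the core idea: take the most advanced slave's events and replay to each other slave only what lies beyond its own position.<br />

```python
# Toy model of the end_log_pos comparison used during slave recovery.
# The latest slave has all events the crashed master managed to send.
latest_slave_events = [
    (120, "INSERT ... id=1"),
    (240, "INSERT ... id=2"),
    (360, "INSERT ... id=3"),
]

# Last end_log_pos each slave received from the crashed master.
slave_positions = {"slave1": 360, "slave2": 240, "slave3": 120}

def missing_events(events, pos):
    """Events a slave stopped at end_log_pos `pos` still needs."""
    return [stmt for end_pos, stmt in events if end_pos > pos]

for slave, pos in sorted(slave_positions.items()):
    print(slave, missing_events(latest_slave_events, pos))
```

<br />After each lagging slave applies its missing events, all slaves are at the same position and can safely start replicating from the promoted master.<br />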
MHA Node runs on each MySQL server.<br /><br /> When MHA Manager does failover, it connects to MHA Node via SSH and executes MHA Node commands when needed.<br /><br /><br /><h3>Advantages</h3><span style="font-weight:bold;">* Master failover and slave promotion can be done very quickly</span><br /><br /> MHA can normally do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, and a few seconds or more for applying differential relay logs to the new master, so total downtime is normally 10-30 seconds), unless all slaves are seriously lagging in replication. After recovering the new master, MHA recovers the remaining slaves in parallel. Even if you have tens of slaves, it does not affect master recovery time, and you can recover the slaves very quickly.<br /><br /><span style="font-weight:bold;">* Master crash does not result in data inconsistency</span><br /><br /> When the current master crashes, MHA automatically identifies differential relay log events between slaves, and applies them to each slave. So in the end all slaves can be in sync, as long as all slave servers are alive.<br /> By using MHA together with semi-synchronous replication, (almost) no data loss can also be guaranteed.<br /><br /><span style="font-weight:bold;">* No need to modify current MySQL settings (MHA works with regular MySQL (5.0 or later))</span><br /><br /> One of the most important design principles of MHA is to make MHA as easy to use as possible. MHA works with existing traditional MySQL 5.0+ master-slave replication environments. Though many other HA solutions require changing MySQL deployment settings, MHA does not force such tasks on DBAs. MHA works with the most common two-tier single-master, multiple-slave environments. MHA works with both asynchronous and semi-synchronous MySQL replication. 
Installing/uninstalling/starting/stopping/upgrading/downgrading MHA can be done without changing (including starting/stopping) MySQL replication. When you need to upgrade MHA to a newer version, you don't need to stop MySQL. Just replace it with the newer MHA version and restart MHA Manager.<br /><br /> MHA works with normal MySQL versions starting from MySQL 5.0. Some HA solutions require special MySQL versions (e.g. MySQL Cluster, MySQL with Global Transaction ID, etc), but you may not want to migrate applications just for master HA. In many cases people have already deployed many legacy MySQL applications and they don't want to spend too much time migrating to different storage engines or newer bleeding-edge distributions just for master HA. MHA works with normal MySQL versions including 5.0/5.1/5.5, so you don't need to migrate.<br /><br /><span style="font-weight:bold;">* No need to add lots of servers</span><br /><br /> MHA consists of MHA Manager and MHA Node. MHA Node runs on the MySQL server when failover/recovery happens, so it doesn't require an additional server. MHA Manager normally runs on a dedicated server, so you need to add one (or two for HA) server(s), but MHA Manager can monitor lots of (even 100+) masters from a single server, so the total number of servers is not increased much. Note that it is even possible to run MHA Manager on one of the slave servers. In that case the total number of servers is not increased at all.<br /><br /><span style="font-weight:bold;">* No performance penalty</span><br /><br /> MHA works with regular asynchronous or semi-synchronous MySQL replication. When monitoring the master server, MHA just sends ping packets to the master every N seconds (default 3) and does not send heavy queries. 
You can expect performance as fast as regular MySQL replication.<br /><br /><span style="font-weight:bold;">* Works with any storage engine</span><br /><br /> MHA works with any storage engine as long as MySQL replication works, not only InnoDB (the crash-safe, transactional storage engine). Even if you use legacy MyISAM environments that are not easy to migrate, you can use MHA.<br /><br /><br /><br /><h3>Production case study</h3><br /> I'm using MHA in our (DeNA's) production environments. We manage more than 100 MySQL applications (master/slave pairs) from a few old (32-bit, 3GB RAM) manager servers (one manager per datacenter), and so far it is working very well. MHA does not spend resources in the monitoring stage, so managing hundreds of MySQL applications from a single manager running on an old machine is totally possible (CPU util is 0-3% in total).<br /> We have been frequently using MHA for online master switches. Some popular social games grow more rapidly than we expect. In many cases scaling out (sharding) is chosen, but scaling up (increasing RAM, replacing HDD with SSD, etc) is sometimes better than scaling out. We switch the master from a slower machine to a faster machine (and vice versa) by using MHA (MHA has a separate online master switch command), and we have been able to switch more than 10 masters with only 0.5-1 second of downtime (not being able to connect to the master) each. 0.5-1 second of downtime is acceptable in our cases. Social game users (especially paying users) tend to be very strict about performance and availability, but we haven't received any inquiries/complaints when switching masters with MHA.<br /><br /><br /><h3>SkySQL provides commercial support for MHA</h3><br /> After I presented MHA at the MySQL Conference in April, many people told me that they were interested in trying it. I'm happy if many people use my software and are satisfied with it. 
On the other hand, I'm a full-time employee at DeNA, and DeNA is not in the software support/consulting business, so I can't provide 24x7 support/consulting by myself. What if you want such services? Fortunately, <a href="http://www.skysql.com/en/index">SkySQL</a> has decided to offer exactly that. You can get 24x7 support for MHA (and, of course, MySQL) from SkySQL! I have many ex-MySQL friends at SkySQL, and they have excellent expertise in providing MySQL-related support services. If you are interested, go to the SkySQL website and talk with their sales representatives. <br /><br /><br /> I'm attending OSCON and will introduce MHA in <a href="http://www.oscon.com/oscon2011/public/schedule/detail/18774">my session</a>, so if you are interested and staying at OSCON, I'd love to talk with you.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-88012890305437389992011-06-10T07:27:00.001-07:002011-06-10T22:36:29.201-07:00New Oracle ACE DirectorA few days ago Oracle offered me the position of <a href="http://www.oracle.com/technetwork/community/oracle-ace/index.html">Oracle ACE Director</a>, and I accepted it. Thanks, Oracle. I'm glad to see that Oracle has recognized my activities and that we can have a good relationship with each other. Oracle has been great for MySQL so far, and I'd be happy to keep contributing.<br /><br />Since I live in Tokyo and my native language is Japanese, I usually do community activities in Japanese (speaking at conferences in Tokyo, publishing books written in Japanese, etc). I'd like to do more in English, and my employer DeNA thankfully supports it. Right now it's confirmed that I'll speak at OSCON.Data in July, and at Oracle Open World in October. 
Personally, I haven't traveled to Europe for years, so it would be great to have an opportunity to speak at MySQL/Linux conferences in Europe.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0tag:blogger.com,1999:blog-3043493300793589377.post-1335028794484175572011-05-20T03:48:00.004-07:002011-06-06T17:00:50.214-07:00Proper handling of insert-mostly, select-recently datasetsSome kinds of large tables, such as chat messages, blog entries, etc, have the following characteristics.<br /><br />* huge number of records, huge data and index size<br />* insert and select mostly<br />* select from only recent data<br />* select by secondary index (i.e. user_id)<br />* secondary index entries are inserted in random order<br /><br />What are optimal ways to handle these tables? The single large table below does not perform well.<pre><span style="font-family: monospace; line-height: 1px">CREATE TABLE message (<br /> message_id BIGINT UNSIGNED PRIMARY KEY,<br /> user_id INT UNSIGNED,<br /> body VARCHAR(255),<br /> ...<br /> created DATETIME,<br /> INDEX(user_id)<br />) ENGINE=InnoDB;</span></pre><br />The cause of the poor performance is the secondary index on user_id. user_id values are inserted in random order. The index grows, and sooner or later it will exceed RAM size. Once the index on user_id exceeds RAM size, inserting into the message table causes massive random read disk i/o, which reduces throughput significantly.<br /> Below is a simple insert benchmark. Once random read disk i/o starts happening, throughput drops hugely. 
"Sequential order" means index entries are inserted sequentially; "Random order" means randomly.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB3rqXlwlZQqGXB94L2DZfrJBIzeGQeU9Dq3jVl3C-X4yktwhN1pbtFF8Ti0MX9sC-bvcLniDlFDpWGVHEQ3mfX3We1pc9P9nXAMYdUsMmhuhcQeKfGAPNXGszkHEyAlNhCPGqVZyoO4U/s1600/index1.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 385px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB3rqXlwlZQqGXB94L2DZfrJBIzeGQeU9Dq3jVl3C-X4yktwhN1pbtFF8Ti0MX9sC-bvcLniDlFDpWGVHEQ3mfX3We1pc9P9nXAMYdUsMmhuhcQeKfGAPNXGszkHEyAlNhCPGqVZyoO4U/s600/index1.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5608721809873299186" /></a><br /><br />This figure is from what <a href="http://www.mysqlconf.com/mysql2009/public/schedule/detail/6661">I presented at the MySQL Conference and Expo 2009</a>. It's pretty old, but the basic principles have not changed.<br /><br /><h3>Using Range Partitioning</h3><br />How can we make it faster? One of the best approaches in MySQL is using <a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html">range partitioning</a>, partitioned by date or primary key. This is one of my favorite features in MySQL. With range partitioning, only the latest partition is actively accessed. Data and indexes in the rest of the partitions are accessed much less, so they don't occupy the buffer pool. 
Each partition can be small enough to fit in memory, so insert performance does not drop.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk6s2LTo-tz7ownzvHYJnEb6zUsdIqBVNNJxtZ-mO_HZ6_56LzeAK5Ts2nMWeQkLYwT8FmW0bQeMX2L3W480azVC6cVUTb_FDbStVu0XysPkg1NFWN1RAYopDhzYHhwtR0wMBSFBPIHFA/s1600/5.1_partitioning.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 219px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk6s2LTo-tz7ownzvHYJnEb6zUsdIqBVNNJxtZ-mO_HZ6_56LzeAK5Ts2nMWeQkLYwT8FmW0bQeMX2L3W480azVC6cVUTb_FDbStVu0XysPkg1NFWN1RAYopDhzYHhwtR0wMBSFBPIHFA/s600/5.1_partitioning.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5608721963624773042" /></a><br /><br /><pre><span style="font-family: monospace; line-height: 1px">CREATE TABLE message (<br /> message_id BIGINT UNSIGNED,<br /> user_id INT UNSIGNED,<br /> body VARCHAR(255),<br /> ...<br /> created DATETIME,<br /> INDEX(message_id),<br /> INDEX(user_id)<br />) engine=InnoDB<br /> PARTITION BY RANGE(to_days(created)) (<br /> PARTITION p201103 VALUES LESS THAN (to_days('2011-03-01')),<br /> PARTITION p201104 VALUES LESS THAN (to_days('2011-04-01')),<br /> PARTITION p201105 VALUES LESS THAN (to_days('2011-05-01')),<br /> PARTITION p201106 VALUES LESS THAN (to_days('2011-06-01')),<br /> PARTITION p201107 VALUES LESS THAN (to_days('2011-07-01')),<br />...<br />);</span></pre><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7MEg5v7vPCZtDnJUilLOt9gCk8PJ7U2by_PuO9dlMJP9MhteXL3_SdbzTQU32coYA3XBj1-ClXJ_MakatdfdVyzRJ2Lo5QrQTub_KXWrkmx1mlL6RHsRMsWWbEDC7HeLvm_RX_qj2HoU/s1600/index2.png"><img style="cursor:pointer; cursor:hand;width: 600px; height: 265px;" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7MEg5v7vPCZtDnJUilLOt9gCk8PJ7U2by_PuO9dlMJP9MhteXL3_SdbzTQU32coYA3XBj1-ClXJ_MakatdfdVyzRJ2Lo5QrQTub_KXWrkmx1mlL6RHsRMsWWbEDC7HeLvm_RX_qj2HoU/s600/index2.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5608722191680843970" /></a><br /><br />As long as INSERT statements insert rows in partition key order and other SQL statements fetch only recent data, no random disk reads happen. Partitioning itself has some CPU overhead, but it's almost negligible in real workloads compared to disk i/o overhead.<br /><br />(update:) Index and data size of each partition can be measured from the information schema.<pre><span style="font-family: monospace; line-height: 1px">mysql> SELECT partition_name, index_length, data_length, table_rows FROM <br />information_schema.partitions WHERE table_name='message';<br />+----------------+--------------+-------------+------------+<br />| partition_name | index_length | data_length | table_rows |<br />+----------------+--------------+-------------+------------+<br />| p201103 | 15565062144 | 15527313408 | 145146231 |<br />| p201104 | 15522070528 | 15507390464 | 205873280 |<br />| p201105 | 9736028160 | 9945743360 | 88653190 |<br />| p201106 | 32768 | 16384 | 0 |<br />+----------------+--------------+-------------+------------+<br />4 rows in set (0.13 sec)</span></pre><br />In MySQL 5.0 or earlier versions, where range partitioning is not supported, creating daily/weekly/monthly tables is a good workaround, though applications have to be aware of the table name differences.<pre><span style="font-family: monospace; line-height: 1px"><br />CREATE TABLE message_201103 ..<br />CREATE TABLE message_201104 ..<br />CREATE TABLE message_201105 ..</span></pre><br /><h3>How about Database Sharding?</h3><br />As you know, database sharding is a very common approach for handling huge data. Is sharding good for handling these tables? Probably not. 
Database sharding is mainly used to reduce slow disk i/o by reducing the data size per server. In the above case, inserts can be done in memory regardless of data size (10,000+ inserts/second), so from a performance point of view splitting tables is not needed as long as applications can keep up with the in-memory insert speed. From a database size vs. storage size point of view (disk capacity point of view), you'll need to archive or purge older data.<br /><br />Actually, I have seen a couple of times that people use NoSQL databases supporting transparent sharding (auto-sharding), such as MongoDB and Cassandra, for these kinds of tables. Data size will sooner or later exceed disk size, so using a database that scales horizontally without limit sounds reasonable. But if the database product doesn't support range partitioning, sharding becomes much less optimal for handling these tables. Suppose you have a 3000GB dataset and only the most recent 30GB of data is frequently accessed. With MySQL 5.1+ range partitioning, you can simply manage one large (3000GB) table with weekly/daily partitions (as long as disk space is available). Only the latest partitions (30GB) are actively accessed, and the remaining partitions (2970GB) are much less likely to be accessed. A single commodity database server can probably handle the workload.<br /><br /> On the other hand, if you shard the 3000GB database without range partitioning support, you might need 300GB x 10 shards because the whole secondary index is accessed. This means you need 10 times more servers.<br /><br /> Transparent sharding is good for application developers, but I believe range partitioning is a mandatory feature for handling insert-mostly, select-recently huge datasets.Yoshinori Matsunobuhttp://www.blogger.com/profile/14180479977952026421noreply@blogger.com0