Monday, October 24, 2011

Making slave pre-fetching work better with SSD

Over the past few weeks I have spent some time creating yet another slave prefetching tool, "Replication Booster (for MySQL)", written in C/C++ and using the Binlog API. Now I'm happy to say that I have released an initial version at the GitHub repository.

The objective of Replication Booster is the same as mk-slave-prefetch: avoiding or reducing replication delay under disk-i/o-bound workloads. This is done by prefetching relay log events, converting them to SELECT statements, and executing the SELECTs before the SQL thread executes the corresponding events. The SQL thread can then run much faster because the target blocks are already cached.

In my benchmarking environment Replication Booster works pretty well.

On an HDD-bound benchmark (updates by primary key), the SQL thread's throughput was:
- Normal slave (without prefetching): 400 updates/s
- With prefetching: 1,500 updates/s

On an SSD-bound (SAS SSD) benchmark (updates by primary key), the SQL thread's throughput was:
- Normal slave: 1,780 updates/s
- With prefetching: 5,400 updates/s

It is great that the slave could handle many more updates per second without replication delay on disk-i/o-bound benchmarks, and it works very well on SSD, too. Below is a graph showing Seconds_Behind_Master gradually decreasing once Replication Booster is in use.



In this benchmark I executed 4,000 updates per second on the master. By default the slave lagged continuously because it could handle only 1,779 updates per second. Once Replication Booster was started on the slave, the slave could execute 5,418 updates per second. This was higher than the master's qps, so Seconds_Behind_Master gradually decreased. After the slave caught up with the master, it executed the same volume of updates as the master (4,000 updates/s), so no replication delay happened. This means that in this environment we can raise the maximum update traffic considerably (1,780 updates/s -> 4,000-5,400 updates/s) without investing in new H/W.

I also tested on some of our production slaves (ones not used for services) and saw good results there, too: 30-300% improvements, depending on the cache hit ratio. When the data was fully cached I of course got no benefit, but there were no negative impacts either.

I hope this tool is interesting to you.

In this blog post, I'll explain the background behind developing this tool and its basic design. I believe many more optimizations can be done in the future. Your feedback is welcome.


Good concurrency, bad single-threaded performance


I mentioned at Percona Live London that using SSDs on slaves is a good practice for reducing replication delay, and that SATA/SAS SSDs are practical enough because their unit price is much cheaper than PCI-Express SSDs, and they show decent concurrency with many drives behind a recent RAID controller (most applications do not actually need 30,000-50,000 read iops, even when running many MySQL instances on the same machine). It is certainly an advantage that many SATA/SAS drives (6-10) can be installed in a 1U box.

The thing I'm most concerned about with SATA/SAS SSDs is single-thread read iops. You can get only about 2,000 read iops from a SATA/SAS SSD behind a RAID controller. Without a RAID controller it is possible to get 3,000+ read iops, but this is still much lower than with a PCI-Express SSD, which can deliver 10,000 single-thread read iops.

When using SATA/SAS SSDs, it is easy to predict that the slave will start lagging much earlier than with PCI-E SSDs. Especially when running multiple MySQL instances per server, innodb_buffer_pool_size has to be small (i.e. 4GB-12GB), so lots of disk reads will happen. With 6-10 SATA/SAS drives, maximum throughput can be competitive with a PCI-Express SSD, but single-thread read iops is not improved. This is an issue.

"Slave prefetching" is a well known, great approach to make SQL Thread faster.

What is slave prefetching?


The concept of "slave prefetching" is (I think) well known, but I'll briefly describe it here in case you are not familiar with it.

The SQL thread is single threaded. When the SQL thread has to do lots of disk i/o by itself, replication is easily delayed. In almost all cases of slave lag, the I/O thread has received all binary log events (and saved them as relay logs), but the SQL thread falls behind due to massive random disk i/o. So there are many relay log events between the SQL thread's current position (Relay_Log_Pos) and the end of the relay logs.

Random disk reads happen when the target blocks (records/indexes) are not cached; if they are cached, random reads won't happen. So if you can cache all target entries before the SQL thread gets to them, the SQL thread does not have to do random disk reads and can be much faster.

How can you do that? Read relay logs before the SQL thread executes them, convert DML statements (especially UPDATE) to SELECT, then execute those SELECT statements on the slave in parallel.
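To illustrate the conversion, here is a rough sketch of an UPDATE-to-SELECT rewrite using boost::regex (the library Replication Booster itself uses for this job, as described below). The function name and the regex are mine, purely for illustration; a real implementation also has to handle DELETE, multi-table updates, quoting, and comments:

#include <boost/regex.hpp>
#include <string>

// Naive sketch: rewrite "UPDATE <table> SET ... WHERE <condition>"
// into "SELECT 1 FROM <table> WHERE <condition>". Executing the SELECT
// pulls the rows/index entries that the UPDATE will touch into the
// buffer pool, so the SQL thread later finds them cached.
std::string update_to_select(const std::string &query)
{
    static const boost::regex re(
        "^\\s*UPDATE\\s+(\\S+)\\s+SET\\s+.+?\\s+WHERE\\s+(.+)$",
        boost::regex::icase);
    boost::smatch m;
    if (boost::regex_match(query, m, re))
        return "SELECT 1 FROM " + m[1].str() + " WHERE " + m[2].str();
    return "";  // not convertible (INSERT, DDL, multi-table, etc.)
}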

I believe this concept was introduced to the MySQL community by Paul Tackfield at YouTube 4-5 years ago. mk-slave-prefetch is an open source implementation.

Desire for a C/C++ based, raw relay log event handling tool


At first I tested mk-slave-prefetch on my benchmarks, but as far as I tested, it unfortunately didn't work as I expected. I think the main reasons are the following:

* mk-slave-prefetch uses mysqlbinlog to parse relay logs. But mysqlbinlog is not as flexible and fast as reading raw relay log events. For example, mysqlbinlog output has to go through a file to reach the main prefetching program. mysqlbinlog is an external command-line tool, so the main prefetching program has to fork a new process to run it, which opens and closes the relay logs every time.

* mk-slave-prefetch is written in Perl. In general, a prefetching tool has to be fast enough to read, convert and execute SELECT statements before the SQL thread executes the corresponding events. The tool has to be multi-threaded. It probably has to run on the same machine as the MySQL slave, to minimize network overhead. And its resource consumption (CPU and memory) should be small enough that it doesn't hurt MySQL server performance.
I don't believe Perl is a good programming language for developing such a tool.


I believe C/C++ is the best programming language for this purpose, and that handling raw relay log events is much more efficient than going through mysqlbinlog.

For these reasons, I decided to develop a new slave prefetching tool by myself. I had some experience parsing binary/relay logs from developing MHA, so at first I planned to create a simple relay log parser program. But I immediately changed my mind and tried mysql-replication-listener (the Binlog API) instead. The Binlog API is a newly released utility library from the Oracle MySQL team. It has a "file driver" interface, which can parse a binary log or relay log file and handle events one by one. With the Binlog API, handling raw binlog events becomes much easier: for example, it's easy to parse binlog events, pick up updated entries, and feed them to external software such as Lucene or Hadoop.

Oracle says the Binlog API is pre-alpha, but as far as I have tested it for slave prefetching purposes, it works very well. It's fast enough, and I have not encountered any crashes or memory leaks. So I decided to build the new slave prefetching tool on top of the Binlog API.
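To give an idea of what using the file driver looks like, below is a minimal sketch modeled after the examples bundled with mysql-replication-listener. Since the library is pre-alpha, header names, namespaces, and signatures may differ in your copy, so treat this as pseudo-code; the relay log path is made up:

#include "binlog_api.h"  // from mysql-replication-listener

using mysql::Binary_log;
using mysql::system::create_transport;

int main()
{
    // The "file driver" parses a relay log file directly and never
    // connects to mysqld.
    Binary_log binlog(create_transport("file:///var/lib/mysql/relay-bin.000123"));
    binlog.connect();

    mysql::Binary_log_event *event;
    while (binlog.wait_for_next_event(&event) == ERR_OK)
    {
        if (event->get_event_type() == mysql::QUERY_EVENT)
        {
            mysql::Query_event *qev = static_cast<mysql::Query_event*>(event);
            // hand qev->query over to a prefetching worker queue here
        }
        delete event;
    }
    return 0;
}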

Introduction to Replication Booster for MySQL


I named the new slave prefetching tool "Replication Booster". The keywords "slave" and "prefetch" were already taken by mk-slave-prefetch, so I chose different words.

The figure below shows the basic architecture of Replication Booster.


Design notes

- Replication Booster is a separate tool (it runs as a MySQL client). It works with unmodified MySQL 5.0/5.1/5.5/5.6, and starting/stopping Replication Booster requires no changes on the MySQL server side.

- Replication Booster is written in C/C++ and uses boost::regex for converting UPDATE/DELETE statements to SELECT. The Binlog API also uses boost.

- Using the Binlog API to parse relay logs, not mysqlbinlog
  - Using the file driver, not the tcp driver. The file driver does not connect to the MySQL server; it just reads relay log files. So even if the file driver has bugs, the impact is rather limited (a memory leak would be serious, but I haven't encountered one yet).

- The main thread parses relay log events, picks up query log events, and passes them to internal FIFO queues
  - The Binlog API provides an interface to get the binlog event header (event type, timestamp, server-id, etc) and the event body, so it is easy to pick up only query log events.
  - Parsing row-based events is not supported yet. It should be worth implementing in the near future.

- Multiple worker threads pop query events from the queues and convert them to SELECT statements

- A dedicated monitoring thread keeps track of the SQL thread's current position (Relay_Log_Pos)

- Worker threads do not execute a SELECT statement if the query's position is behind the SQL thread's current position, because at that point prefetching is too late to help (see the sketch after this list)

- The main thread stops reading relay log events if the event's timestamp is N seconds (default 3) ahead of the SQL thread's timestamp
  - This is for cache efficiency. Reading far more events than needed causes negative impacts; in the worst case, the cache entries the SQL thread needs right now are wiped out by newly selected blocks.

- When the slave is not delayed, Replication Booster should not cause negative impacts. It shouldn't use noticeable CPU/disk/memory resources, and it shouldn't interfere with MySQL server activity by holding locks. Of course, it shouldn't execute converted SELECT statements either, because they are no longer useful. The last point is not easy to get right across various kinds of environments (i.e. HDD/SSD/etc), but it should be controllable with a few external configuration parameters

- Bugs in Replication Booster should not result in a MySQL server outage.

- Replication Booster works locally. It does not allow connections to remote MySQL servers. This is for performance reasons: executing tens of thousands of queries per second against a remote server would cause massive fcntl() contention and consume a lot of network resources (both bandwidth and CPU time). I don't like that.
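To make the worker-side rules above concrete (pop from the queue, skip events the SQL thread has already passed, convert, execute), the per-worker loop conceptually looks like the sketch below. All type names are made up for illustration, and update_to_select() is the kind of conversion shown earlier:

#include <cstdint>
#include <string>

// Made-up stand-in for a query log event taken from a FIFO queue.
struct QueryEvent { uint64_t relay_log_pos; std::string query; };

std::string update_to_select(const std::string &query);  // see the earlier sketch

template <class Queue, class Monitor, class Connection>
void worker_loop(Queue &queue, Monitor &monitor, Connection &conn)
{
    QueryEvent ev;
    while (queue.pop(ev))  // blocks until the main thread pushes an event
    {
        // The monitoring thread tracks the SQL thread's Relay_Log_Pos.
        // Prefetching an event the SQL thread has already executed is
        // wasted work, so discard it.
        if (ev.relay_log_pos <= monitor.current_sql_thread_pos())
            continue;

        std::string select = update_to_select(ev.query);
        if (!select.empty())
            conn.execute(select);  // result is discarded; caching is the point
    }
}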

Configuration Parameters

--threads: Number of worker threads. Each worker thread converts binlog events and executes SELECT statements. Default is 10 (threads).

--offset-events: Number of binlog events that the main thread (the relay log reader thread) initially skips when reading relay logs. This number should be higher when you have faster storage devices such as SSDs. Default is 500 (events).

--seconds-prefetch: The main thread stops reading relay log events when the event's timestamp is --seconds-prefetch seconds ahead of the SQL thread's current timestamp. After that, the main thread starts reading relay logs from the SQL thread's position again. If this value is too high, worker threads will execute many more SELECT statements than necessary. Default is 3 (seconds).

--millis-sleep: If the --seconds-prefetch condition is met, the main thread sleeps --millis-sleep milliseconds before it resumes reading relay logs. Default is 10 (milliseconds).

- MySQL connection parameters: MySQL slave user, password, socket file or local ip/port
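Putting these together, a typical invocation looks roughly like this (the binary name replication_booster and the exact connection option names are my assumptions here; check the bundled documentation for the authoritative list):

$ replication_booster --threads=20 --seconds-prefetch=3 --millis-sleep=10 \
    --user=booster --password=xxxx --socket=/var/lib/mysql/mysql.sock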

How to verify that Replication Booster works in your environment

You may want to try Replication Booster on a slave where Seconds_Behind_Master sometimes grows. If Replication Booster works as expected, you will see the following benefits.

  - Seconds_Behind_Master decreases, or its growth rate drops
  - Update speed improves (i.e. Com_update per second increases)

Replication Booster keeps some statistics and prints them when the program is terminated (Ctrl+C), like below. If the slave lags but "Executed SELECT queries" stays almost zero, something is wrong.
Running duration:    847.846 seconds
Statistics:
Parsed binlog events: 60851473
Skipped binlog events by offset: 8542280
Unrelated binlog events: 17444340
Queries discarded in front: 17431937
Queries pushed to workers: 17431572
Queries popped by workers: 5851025
Old queries popped by workers: 3076
Queries discarded by workers: 0
Queries converted to select: 5847949
Executed SELECT queries: 5847949
Error SELECT queries: 0
Number of times to read relay log limit: 1344
Number of times to reach end of relay log: 261838

I haven't spent much time on this project yet (I started it just a few weeks ago), and the current algorithm is simple. I believe many more optimizations are possible in the future, but even so the current benchmark numbers are pretty good. I hope this tool can be used in many places where people want to avoid replication delay without spending too much money on faster storage devices.

Tuesday, October 4, 2011

Testing MySQL 5.6.3 network performance improvements

I'm excited to see the new features in MySQL 5.6.3. Replication enhancements such as parallel SQL threads, crash-safe slaves, and binlog checksums have been desired for years. I really appreciate that the MySQL development team has released 5.6.3 in a timely manner.

In this blog entry, I'd like to pick up one of my favorite performance enhancements in MySQL 5.6.3: the "network performance improvements". This was reported by Mark Callaghan as bug#54790 over one year ago, and it is finally fixed (mainly, I assume, by Davi Arnaut of the MySQL dev team) in 5.6.3. The root cause of the problem was that, prior to 5.6.3, under high load MySQL internally had to call fcntl() many times to change non-blocking sockets to blocking (with timeout and alarm) and vice versa. fcntl() is serialized on Linux, so it doesn't scale with many CPU cores. You may suffer from this performance issue when clients are not local (= in most cases) and queries per second are quite high (hundreds of thousands or more). In many real-world environments qps is not that high yet, but with faster H/W components such numbers will not be unrealistic in the near future.
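To see why this is expensive, here is a simplified sketch of the pattern (not MySQL's actual code): every socket operation that needed a timeout was surrounded by a pair of calls like this, and both fcntl() calls are serialized inside the Linux kernel.

#include <fcntl.h>

/* Toggle O_NONBLOCK on a connection socket. Pre-5.6.3 MySQL performed
   fcntl() pairs like this around socket operations that needed a
   timeout; with hundreds of thousands of qps across many cores, these
   serialized calls become a bottleneck. */
static int set_socket_blocking(int fd, int blocking)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    if (blocking)
        flags &= ~O_NONBLOCK;
    else
        flags |= O_NONBLOCK;
    return fcntl(fd, F_SETFL, flags);
}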

I have run some simple read-only benchmarks on 16-core and 24-core boxes, and the results in 5.6.3 are great.





These benchmarks are in-memory PK lookups from a remote client, over persistent connections. The client machine never became the bottleneck. Apparently the maximum throughput of 5.6.3 is much higher than 5.5 (75% higher on 24 cores, 34% higher on 16 cores). It's interesting that the 16-core vs 24-core difference is small in 5.5 (only 10%) but huge in 5.6 (45%). A 45% improvement is close to the theoretical maximum (16 cores to 24 cores: 50%), so for this kind of benchmark, 5.6 really is scalable, at least up to 24 cores.

Though many MySQL scalability issues can be mitigated by running multiple instances on a single server, this particular problem could not be avoided that way, because fcntl() is serialized on the Linux kernel side. So this performance fix is really important. With MySQL 5.6, I think we can adopt scale-up strategies rather than scale-out more often.

How about disk-i/o-bound benchmarks? On disk-bound workloads, the network performance improvements naturally have much less impact (since qps is much lower), but as shown below, 5.6 still produced good results compared to 5.5. These improvements mainly come from reduced InnoDB kernel mutex contention in 5.6. I'd like to spend more time investigating the internal differences.





These were SATA-SSD-bound benchmarks (8 drives, RAID1+0). The InnoDB buffer pool size was 1GB and the active data size was 2.5GB. 5.6 shows better results at higher concurrency.

Friday, September 16, 2011

MHA for MySQL 0.52 released

 I'm happy to announce that MHA for MySQL (Master High Availability Manager and tools for MySQL) version 0.52 has been released. Downloads are available here.
 This is a stable release. I have been using MHA in our (DeNA's) cross-datacenter environments. We have had tens of failovers so far, and MHA handled all of them successfully.


 Here are the release notes.

* multi-master configuration is supported
 See my previous blog entry for details.

* Package name changed
 To avoid potential trademark issues, I decided to change the MHA package names (downloadable file names and GitHub repository names) from "MySQL-MasterHA-Manager" to "mha4mysql-manager", and from "MySQL-MasterHA-Node" to "mha4mysql-node".

 - URLs of MHA development repositories on GitHub changed as below.
  Manager: https://github.com/yoshinorim/mha4mysql-manager
  Node: https://github.com/yoshinorim/mha4mysql-node
 If you currently follow the previous branches, please modify .git/config (or run "git remote set-url origin <new URL>") so that it points to the new URLs, and pull newer versions from there.

 - If you installed the MHA Node rpm package (version 0.50) provided from our download site, please explicitly uninstall the current rpm package (rpm -e MySQL-MasterHA-Node-0.50-...) before installing a newer version.

 No source code, test cases, command line programs (i.e. masterha_manager), or parameters have changed, so the product quality of MHA should not be affected by this renaming.

* rpm and deb packages are now provided
 Prior to 0.52, no rpm package was provided for MHA Manager, and no deb packages were provided at all. Now both rpm and deb packages are provided for both the Manager and the Node (which runs on each MySQL server).
 See the how-to-install page for details.


MHA for MySQL: Project Info


Project top page: http://code.google.com/p/mysql-master-ha/

Documentation: http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6

Source tarball, deb and rpm packages (stable release): http://code.google.com/p/mysql-master-ha/downloads/list

The latest source repository (development tree, url changed): https://github.com/yoshinorim/mha4mysql-manager (Manager source) and https://github.com/yoshinorim/mha4mysql-node (Per-MySQL server source)

Commercial support for MHA is available from SkySQL.

Thursday, September 8, 2011

Speaking at Oracle Open World and Percona Live London

In October, I'll speak at two big conferences: Oracle Open World and Percona Live London. I appreciate Oracle and Percona giving me opportunities to talk. It's always fun for me to speak at overseas conferences and meet friends.


At Oracle Open World, I have one session: "Scaling and Monitoring MySQL for Rapidly Growing Social Gaming" (here is a timetable).


At Percona Live London, I have one 3-hour tutorial and one 30-minute session: "Linux and H/W optimizations for MySQL" and "MHA: Introducing automated MySQL master failover solution".


I'll make sure that all of these talks are practical. Basic theory and generic benchmarks will be covered since they are important to understand, but I'll focus more on how we use these techniques in production (on 1000+ MySQL servers).


BTW, this is my first visit to London, and I'm looking forward to watching the Premier League, though I've heard that it's very difficult to get tickets.

Monday, August 29, 2011

MySQL-MHA: Support for multi-master configuration

 After publishing MySQL MHA at the end of July, I received a few requests for supporting multi-master configurations. So I spent some time extending MHA to support multi-master, and it is now published on a separate GitHub branch (a new development tree). I'll allow some time for people to evaluate the new feature, and after verifying that it's stable enough, I'll merge it into the main branch and release it as a new version (tarball release).

Below are the procedures to install the MHA Manager multi-master tree.
$ git clone git://github.com/yoshinorim/MySQL-MasterHA-Manager.git

$ cd MySQL-MasterHA-Manager
$ git checkout -b multimaster origin/multimaster
$ perl Makefile.PL
$ make
$ sudo make install
If you have not installed the Module::Install Perl module and have difficulties installing it, you can simply copy the module from the MHA Manager tarball package, like below.
$ cd MySQL-MasterHA-Manager(from github)

$ cp -rp /path/to/MHA-Manager-Tarball/inc ./
$ perl Makefile.PL
...

Here are some notes on making MHA work with multi-master configurations.



* Only one primary (writable) master is allowed. The MySQL global variable read_only=1 must be set on the other MySQL masters.

* By default, all managed servers must form a two-tier replication topology. If you want to manage a topology with three or more tiers, the "multi_tier_slave=1" parameter must be set in the MHA configuration file.




Example replication settings are as follows.




1) Simple two-tier multi-master configuration

  M1(host1,RW) <----> M2(host2,read-only)
   |
   +---------------+
   |               |
  S1(host3,R)     S2(host4,R)

=> After failover

  M2(host2,RW)
   |
   +---------------+
   |               |
  S1(host3,R)     S2(host4,R)

Configuration example:

[server1]
hostname=host1
candidate_master=1

[server2]
hostname=host2
candidate_master=1

[server3]
hostname=host3

[server4]
hostname=host4
On host2, SET GLOBAL read_only=1 must be set.







2) Three-tier multi-master configuration

  M1(host1,RW) <----------------> M2(host2,read-only)
   |                               |
   +---------------+               |
   |               |               |
  S1(host3,R)     S2(host4,R)     S3(host5,R)

=> After failover

  M2(host2,RW)
   |
   +---------------+---------------+
   |               |               |
  S1(host3,R)     S2(host4,R)     S3(host5,R)


Configuration example:
[server default]
multi_tier_slave=1

[server1]
hostname=host1
candidate_master=1

[server2]
hostname=host2
candidate_master=1

[server3]
hostname=host3

[server4]
hostname=host4

[server5]
hostname=host5
In this case, host5 is a third-tier slave, so MHA does not manage it (MHA does not execute CHANGE MASTER on host5 when the primary master host1 fails). When the current master host1 goes down, host2 becomes the new master, so host5 can keep replicating from host2 without any changes.



If you want to try MHA, the tutorials are a good place to start.

Saturday, July 23, 2011

Announcing MySQL-MHA: "MySQL Master High Availability manager and tools"

I have published "MySQL MHA", which fully automates MySQL master failover. You can also get commercial support from SkySQL. Try MHA today!


Today I'm happy to announce that I have released MySQL-MHA: MySQL Master High Availability manager and tools, as open source software (under the GPL v2 license). Below is a part of the MHA documentation. I'd be glad if you are interested in MHA.

A primary objective of MHA is automating master failover and slave promotion with short (usually 10-30 seconds) downtime, without suffering from replication consistency problems, without spending money on lots of new servers, without performance penalty, without complexity (it is easy to install), and without changing existing deployments. MHA also provides a way to do scheduled online master switches: changing the currently running master to a new master safely, within a few seconds (normally 0.5-2 seconds) of downtime (blocking writes only).

The difficulty of master failover is one of the biggest issues in MySQL. Many people have been aware of this issue, but in most cases there were no practical solutions. I created MHA to make our (DeNA's) existing 100+ 5.0/5.1/5.5 and future MySQL applications highly available. I think many people outside DeNA can use MHA pretty easily, too.

Project Info


Project top page: http://code.google.com/p/mysql-master-ha/

Documentation: http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6

Source tarball and rpm package (stable release): http://code.google.com/p/mysql-master-ha/downloads/list

The latest source repository (dev release): https://github.com/yoshinorim/MySQL-MasterHA-Manager (Manager source) and https://github.com/yoshinorim/MySQL-MasterHA-Node (Per-MySQL server source)


Overview


Master failover is not as trivial as you might think. Suppose you run a single master and multiple slaves, which is the most common MySQL deployment. If the master crashes, you need to pick one of the most up-to-date slaves, promote it to the new master, and let the other slaves start replicating from it. This is actually not trivial. Even if you can identify the latest slave, the other slaves might not have received all binary log events. If you let them connect to the new master and start replication anyway, they will have lost transactions, which causes consistency problems. To avoid this, you need to identify which binlog events have not been sent to each slave, and apply those missing events to each slave before starting replication. This is a very complex procedure, and doing the recovery correctly by hand is very difficult. This is illustrated in the slides (especially p.10, as below) that I presented at the MySQL Conference and Expo 2011.


Fig: Master Failover: What makes it difficult?

Currently most MySQL replication users are forced to do manual failover on master crashes, and it is not uncommon for completing failover to take more than one hour. The slaves are unlikely to have all received the same relay log events, so you may also need to fix consistency problems later. A master crash does not happen often, but it is really serious once it happens.

MHA was created to fix these issues. MHA provides the following functionality, and can be useful in many deployments where high availability, data integrity, and almost non-stop master maintenance are required.

* Automated master monitoring and failover

MHA can monitor the MySQL master in an existing replication environment, detect master failure, and do master failover automatically. Even if some slaves have not received the latest relay log events, MHA automatically identifies the differential relay log events from the latest slave and applies them to the other slaves, so all slaves end up consistent. MHA can normally do failover in seconds: 9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, and a few seconds to apply differential relay logs to the new master, so total downtime is normally 10-30 seconds. In addition, you can designate a specific slave as a candidate master (setting priorities) in a configuration file. Since MHA fixes consistency between slaves, you can promote any slave to the new master, and consistency problems (which might cause sudden replication failure) will not happen.

* Interactive (manual) Master Failover

You can also use MHA just for failover, without master monitoring: MHA can perform master failover interactively.

* Non-interactive master failover

Non-interactive master failover (not monitoring the master, but doing failover automatically) is also supported. This feature is especially useful when you already use software that monitors the MySQL master. For example, you can use Pacemaker (Heartbeat) for detecting master failure and doing virtual IP address takeover, and use MHA for master failover and slave promotion.

* Online switching master to a different host

In many cases, it is necessary to migrate an existing master to a different machine (i.e. the current master has H/W problems on the RAID controller or RAM, you want to replace it with a faster machine, etc). This is not a master crash, but scheduled master maintenance is needed. Scheduled master maintenance causes downtime (at least for writes to the master), so it should be done as quickly as possible. On the other hand, you have to block/kill the currently running sessions very carefully, because consistency problems between the different masters can occur (i.e. "updating master1, updating master2, committing on master1, getting an error on committing on master2" results in data inconsistency). Both a fast master switch and graceful write blocking are required.
MHA provides a way to do both. You can switch the master gracefully within 0.5-2 seconds of blocked writes. In many cases 0.5-2 seconds of writer downtime is acceptable, so you can switch the master even without allocating a scheduled maintenance window. This means you can much more easily take actions such as upgrading to a higher version or moving to a faster machine.



Architecture


When a master crashes, MHA recovers the remaining slaves as below.


Fig: Steps for recovery

The basic algorithms are described in the slides presented at the MySQL Conference and Expo 2011, especially on pages 13 to 34.

In the relay log files on slaves, the master's binary log positions are written in "end_log_pos" sections (example). By comparing the latest end_log_pos between slaves, we can identify which relay log events have not been delivered to each slave. MHA internally recovers slaves (fixes consistency issues) using this mechanism. In addition to the basic algorithms covered in the slides at the MySQL Conf 2011, MHA implements some optimizations and enhancements, such as generating differential relay logs very quickly (independent of relay log file size) and making recovery work with row-based formats.
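For reference, here is roughly what those sections look like when you dump a relay log with mysqlbinlog (the positions, timestamps, and statement below are made up for illustration):

# at 235
#111007 12:34:56 server id 1  end_log_pos 379  Query  thread_id=10  exec_time=0  error_code=0
SET TIMESTAMP=1317958496/*!*/;
UPDATE t1 SET value=100 WHERE id=2
/*!*/;

If the latest slave has events up to end_log_pos 379 while another slave stops at 235, the events in between are exactly the differential that MHA generates and applies to the lagging slave.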

MHA Components


MHA consists of MHA Manager and MHA Node as below.


Fig: MHA components

MHA Manager contains the manager programs: monitoring the MySQL master, controlling master failover, and so on.

MHA Node contains the failover helper scripts: parsing MySQL binary/relay logs, identifying the relay log position from which relay logs should be applied to other slaves, applying events to the target slave, and so on. MHA Node runs on each MySQL server.

When MHA Manager does failover, it connects to the MHA Nodes via SSH and executes MHA Node commands as needed.


Advantages

* Master failover and slave promotion can be done very quickly

MHA can normally do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, and a few seconds or more for applying differential relay logs to the new master, so total downtime is normally 10-30 seconds), unless all slaves are seriously lagging. After recovering the new master, MHA recovers the remaining slaves in parallel. Even if you have tens of slaves, this does not affect master recovery time, and you can recover the slaves very quickly.

* Master crash does not result in data inconsistency

When the current master crashes, MHA automatically identifies the differential relay log events between slaves and applies them to each slave, so in the end all slaves are in sync, as long as all slave servers are alive.
When used together with semi-synchronous replication, (almost) no data loss can be guaranteed.

* No need to modify current MySQL settings (MHA works with regular MySQL (5.0 or later))

One of the most important design principles of MHA is to make MHA as easy to use as possible. MHA works with existing, traditional MySQL 5.0+ master-slave replication environments. Though many other HA solutions require changes to the MySQL deployment, MHA does not force such tasks on DBAs. MHA works with the most common two-tier, single-master multiple-slave environments, and with both asynchronous and semi-synchronous MySQL replication. Installing/uninstalling/starting/stopping/upgrading/downgrading MHA can be done without touching (including starting/stopping) MySQL replication. When you need to upgrade MHA to a newer version, you don't need to stop MySQL: just replace MHA with the newer version and restart MHA Manager.

MHA works with normal MySQL versions starting from MySQL 5.0. Some HA solutions require special MySQL versions (i.e. MySQL Cluster, MySQL with Global Transaction ID, etc), but you may not want to migrate applications just for master HA. In many cases people have already deployed many legacy MySQL applications and don't want to spend too much time migrating to different storage engines or newer bleeding-edge distributions just for master HA. MHA works with normal MySQL versions including 5.0/5.1/5.5, so you don't need to migrate.

* No need for lots of additional servers

MHA consists of MHA Manager and MHA Node. MHA Node runs on the existing MySQL servers (its programs are invoked when failover/recovery happens), so it doesn't require any additional server. MHA Manager normally runs on a dedicated server, so you need to add one (or two for HA) server(s), but MHA Manager can monitor lots of (even 100+) masters from a single server, so the total number of servers does not increase much. Note that it is even possible to run MHA Manager on one of the slave servers, in which case the total number of servers does not increase at all.

* No performance penalty

MHA works with regular asynchronous or semi-synchronous MySQL replication. When monitoring the master server, MHA just sends ping packets to the master every N seconds (default 3); it does not send heavy queries. You can expect performance as fast as regular MySQL replication.

* Works with any storage engine

MHA works with any storage engine as long as MySQL replication works; it is not limited to InnoDB (a crash-safe, transactional storage engine). Even if you run legacy MyISAM environments that are not easy to migrate, you can use MHA.



Production case study


I'm using MHA in our (DeNA's) production environments. We manage more than 100 MySQL applications (master/slave pairs) from a few old (32-bit, 3GB RAM) manager servers (one manager per datacenter), and so far it is working very well. MHA consumes almost no resources at the monitoring stage, so managing hundreds of MySQL applications from a single manager running on an old machine is entirely possible (total CPU utilization is 0-3%).
We have also been using MHA frequently for online master switches. Some popular social games grow more rapidly than we expect. In many cases scaling out (sharding) is chosen, but scaling up (increasing RAM, replacing HDDs with SSDs, etc) is sometimes better than scaling out. We switch the master from a slower machine to a faster machine (and vice versa) using MHA (MHA has a separate online master switch command), and we have been able to switch more than 10 masters with only 0.5-1 second of downtime (not being able to connect to the master) each. 0.5-1 second of downtime is acceptable in our cases. Social game users (especially paying users) tend to be very strict about performance and availability, but we haven't received any inquiries or complaints when switching masters with MHA.


SkySQL provides commercial support for MHA


After I presented MHA at the MySQL Conference in April, many people told me that they were interested in trying it. I'm happy if many people use my software and are satisfied with it. On the other hand, I'm a full-time employee at DeNA, and DeNA is not in the software support/consulting business, so I can't provide 24x7 support/consulting by myself. What if you want such services? Happily, SkySQL has decided to offer them. You can get 24x7 support for MHA (and of course, MySQL) from SkySQL! I have many ex-MySQL friends at SkySQL, and they have excellent expertise in providing MySQL-related support services. If you are interested, go to the SkySQL website and talk to their sales representatives.


I'm attending OSCON and will introduce MHA in my session, so if you are interested and are staying at OSCON, I'd like to talk with you.

Friday, June 10, 2011

New Oracle ACE Director

A few days ago Oracle offered me a position as an Oracle ACE Director, and I accepted it. Thanks, Oracle. I'm glad that Oracle has recognized my activities and that we can have a good relationship with each other. Oracle has been great for MySQL so far, and I'd be happy to keep contributing.

Since I live in Tokyo and my native language is Japanese, I usually do community activities in Japanese (speaking at conferences in Tokyo, publishing books written in Japanese, etc). I'd like to do more in English, and my employer DeNA thankfully supports this. Right now it's confirmed that I'll speak at OSCON.Data in July and at Oracle Open World in October. Personally, I haven't traveled to Europe for years, so it would be great to have an opportunity to speak at MySQL/Linux conferences in Europe.

Friday, May 20, 2011

Proper handling of insert-mostly, select-recently datasets

Some kinds of large tables, such as chat messages and blog entries, have the following characteristics.

* huge number of records, huge data and index size
* insert and select mostly
* select from only recent data
* select by secondary index (i.e. user_id)
* secondary index entries are inserted in random order

What are optimal ways to handle these tables? The single large table below does not perform well.
CREATE TABLE message (
  message_id BIGINT UNSIGNED PRIMARY KEY,
  user_id INT UNSIGNED,
  body VARCHAR(255),
  ...
  created DATETIME,
  INDEX(user_id)
) ENGINE=InnoDB;

The cause of the poor performance is the secondary index on user_id. user_id values are inserted in random order, so the index grows and sooner or later exceeds RAM size. Once the user_id index exceeds RAM size, inserting into the message table causes massive random read disk i/o, which reduces throughput significantly.
Below is a simple insert benchmark. Once random read disk i/o starts happening, throughput drops hugely. "Sequential order" means index entries are inserted sequentially; "Random order" means they are inserted randomly.



This figure is from my presentation at the MySQL Conference and Expo 2009. It's pretty old, but the basic principles have not changed.

Using Range Partitioning


How can we make this faster? One of the best approaches in MySQL is range partitioning, partitioned by date or by primary key. This is one of my favorite features in MySQL. With range partitioning, only the latest partition is actively accessed. Data and indexes in the other partitions are accessed much less, so they don't occupy the buffer pool. Each partition can be small enough to fit in memory, so insert performance does not drop.



CREATE TABLE message (
  message_id BIGINT UNSIGNED,
  user_id INT UNSIGNED,
  body VARCHAR(255),
  ...
  created DATETIME,
  INDEX(message_id),
  INDEX(user_id)
) ENGINE=InnoDB
PARTITION BY RANGE(to_days(created)) (
  PARTITION p201103 VALUES LESS THAN (to_days('2011-03-01')),
  PARTITION p201104 VALUES LESS THAN (to_days('2011-04-01')),
  PARTITION p201105 VALUES LESS THAN (to_days('2011-05-01')),
  PARTITION p201106 VALUES LESS THAN (to_days('2011-06-01')),
  PARTITION p201107 VALUES LESS THAN (to_days('2011-07-01')),
  ...
);


As long as INSERT statements add rows in partition key order and other SQL statements fetch only recent data, no random disk reads will happen (you can confirm that a query touches only recent partitions with EXPLAIN PARTITIONS). Partitioning itself has some CPU overhead, but compared to disk i/o overhead it is almost negligible in real workloads.

(Update:) The index and data size of each partition can be measured from information_schema.
mysql> SELECT partition_name, index_length, data_length, table_rows
    -> FROM information_schema.partitions WHERE table_name='message';
+----------------+--------------+-------------+------------+
| partition_name | index_length | data_length | table_rows |
+----------------+--------------+-------------+------------+
| p201103        |  15565062144 | 15527313408 |  145146231 |
| p201104        |  15522070528 | 15507390464 |  205873280 |
| p201105        |   9736028160 |  9945743360 |   88653190 |
| p201106        |        32768 |       16384 |          0 |
+----------------+--------------+-------------+------------+
6 rows in set (0.13 sec)

In MySQL 5.0 or earlier, where range partitioning is not supported, creating daily/weekly/monthly tables is a good workaround, though applications have to be aware of the table name differences.

CREATE TABLE message_201103 ..
CREATE TABLE message_201104 ..
CREATE TABLE message_201105 ..

How about Database Sharding?


As you know, database sharding is a very common approach for handling huge data. Is sharding good for handling these tables? Probably not. Database sharding is mainly used to reduce slow disk i/o by reducing the data size per server. In the above case, inserts can be done in memory regardless of data size (10,000+ inserts/second), so from a performance point of view splitting the table is not needed, as long as applications can keep up with the in-memory insert speed. From a disk capacity point of view, you will eventually need to archive or purge older data; with range partitioning this is a quick operation (e.g. ALTER TABLE message DROP PARTITION p201103).

Actually, I have seen a couple of cases where people used NoSQL products that support transparent sharding (auto-sharding), such as MongoDB and Cassandra, for these kinds of tables. Data size will sooner or later exceed disk size, so an unlimited, horizontally scaling database sounds reasonable. But if the database product does not support range partitioning, sharding becomes much less optimal for handling these tables. Suppose you have a 3000GB dataset of which only the most recent 30GB is frequently accessed. With MySQL 5.1+ range partitioning, you can simply manage one large (3000GB) table with weekly/daily partitions (as long as disk space is available). Only the latest partitions (30GB) are actively accessed, while the remaining partitions (2970GB) are rarely touched. A single commodity database server can probably handle the workload.

On the other hand, if you shard the 3000GB database without range partitioning support, you might need 10 shards of 300GB each, because the whole secondary indexes are accessed. This means you need 10 times more servers.

Transparent sharding is good for application developers, but I believe range partitioning is a mandatory feature for handling insert-mostly, select-recently huge datasets.

Wednesday, April 13, 2011

Slides: Automated, Non-Stop MySQL operations and failover

 On Tuesday I presented "Automated, Non-Stop MySQL Operations and Failover" at the MySQL Conference and Expo 2011, and published the slides on SlideShare. This talk was quite complicated and not easy to digest in a 45-minute session. Now that the slides are online, I assume attendees will find it easier to understand what steps are needed for master failover and slave promotion.
 As I mentioned during the talk, I'm planning to release the tool (which monitors master failures, promotes a slave automatically or manually, and optionally switches a live master manually) as open source software soon. The tool performs all the steps I covered in the talk, so you don't actually need to do them manually. If you find any missing steps that need to be covered, I'd appreciate it if you pointed them out.
 I'm also writing English documentation (a user's guide), so please stay tuned if you're interested.

Slides: Linux and H/W optimizations for MySQL

On Monday I presented the 3-hour tutorial "Linux and H/W optimizations for MySQL" at the MySQL Conference and Expo 2011, and published the slides on SlideShare.

Wednesday, April 6, 2011

Tracking long running transactions in MySQL

 Sometimes I want to trace transactions that take a long time to execute. If a transaction holds row locks for 30 seconds, all other clients requesting locks on the same rows will be blocked for up to 30 seconds (or will get "Lock wait timeout" errors if you set innodb_lock_wait_timeout lower). This is serious.

 An important step in solving such issues is identifying what kinds of queries the problematic transactions execute. But tracking long-running transactions is not as easy as tracking slow queries. Suppose you execute the following transaction.
[client1]$ mysql --host=mysql_server db1
mysql> BEGIN;
mysql> SELECT * FROM t1 WHERE id=2 FOR UPDATE;
... (do nothing for 10 seconds)
mysql> UPDATE t1 SET value=20 WHERE id=2;
mysql> COMMIT;
 This transaction takes more than 10 seconds to complete, but each query finishes very quickly, so nothing is written to the slow query log. Analyzing slow query logs is not helpful in this case.

 If you enable the general query log, all queries, including the above, are written. But this hurts performance (writing to the general query log is serialized inside MySQL), and logging all queries makes the log files huge. Analyzing very large log files is not much fun. In addition, neither query execution time nor error code is written to the general query log, so analyzing it alone is not enough to verify whether the above SELECT .. FOR UPDATE took 10 seconds or finished quickly.

 How about analyzing statement-based binary logs? Unfortunately this is not always helpful either. First, uncommitted transactions are not written to the binary log, and there are many cases where transactions take a long time and finally roll back. Second, SELECT statements (including SELECT ... FOR UPDATE / LOCK IN SHARE MODE) are not written to the binary log. When you use locking reads, many transactions are likely to be blocked, and you probably want to identify which SELECT statement caused the problem.

 Using SHOW ENGINE INNODB STATUS? SHOW ENGINE INNODB STATUS prints active transactions like below.
---TRANSACTION 1B43B50, ACTIVE 24 sec, process no 7388, OS thread id 1235609920 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 23, query id 140396660 client2.example.com 192.168.0.2 root Updating
update t1 set value=100 where id=2
------- TRX HAS BEEN WAITING 24 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 213041 n bits 80 index `PRIMARY` of table `test`.`t1`
trx id 1B43B50 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000002; asc ;;
1: len 6; hex 000001b43b4e; asc ;N;;
2: len 7; hex 810000012d011c; asc - ;;
3: len 4; hex 80000002; asc ;;
------------------

---TRANSACTION 1B43B4F, ACTIVE 34 sec, process no 7388, OS thread id 1096673600
2 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 22, query id 140396658 client1.example.com 192.168.0.1 root
 You can identify that a client from 192.168.0.2 was executing "update t1 set value=100 where id=2" and had been waiting 24 seconds for a lock. But there is no information about which queries were holding the lock. A client from 192.168.0.1 (1B43B4F) had been running a transaction for 34 seconds, so it was probably the one holding the lock, but there is no information about why it had been running for 34 seconds. In this case I want to see all the queries that 1B43B4F executed so far.

 If you can modify the application programs, tracing long-running transactions is not so difficult. Adding the following logic will help (see the sketch after the list).

1. Check the transaction start time (t1)
2. Push each query the transaction executes (and the current timestamp) into an in-memory queue
3. Check the transaction end time (t2)
4. If (t2 - t1) exceeds N seconds, print all queries that the transaction executed
5. Delete the in-memory queue
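For example, a thin wrapper around the application's connection object could implement steps 1-5 like the sketch below (class and method names are made up):

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Made-up wrapper class illustrating steps 1-5 above.
class TracingConnection {
    std::vector<std::string> queries_;                // step 2's in-memory queue
    std::chrono::steady_clock::time_point began_;
public:
    void begin() {
        began_ = std::chrono::steady_clock::now();    // step 1: remember t1
        queries_.clear();
        // ... actually send BEGIN to MySQL here ...
    }
    void execute(const std::string &sql) {
        queries_.push_back(sql);                      // step 2: record the query
        // ... actually send the query to MySQL here ...
    }
    void commit(double threshold_sec = 10.0) {
        double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - began_).count();  // step 3: t2 - t1
        if (elapsed > threshold_sec)                  // step 4: dump if too slow
            for (size_t i = 0; i < queries_.size(); i++)
                std::fprintf(stderr, "slow-trx (%.1fs): %s\n",
                             elapsed, queries_[i].c_str());
        queries_.clear();                             // step 5: drop the queue
        // ... actually send COMMIT to MySQL here ...
    }
};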

 The problem is that this approach is not feasible in many cases. Because long-running transactions may come from any client, you would have to implement the tracing logic on all application servers. Modifying core database access code and re-deploying it on all application servers is not fun. If you use O/R mappers, things become even more difficult. Modifying database driver programs on all application servers? I don't want to do that.


 For these reasons, I thought the most practical approach was tracing slow transactions on the MySQL servers themselves, without modifying any existing programs (either client programs or MySQL servers). Fortunately there is a way to do that: capturing MySQL packets and tracking transactions and queries from them. The 1-5 tracing algorithm above still works.
I implemented such a tool, "MySlowTranCapture", and published it on GitHub. Here is an example output.
[mysql-server]# myslowtrancapture -i eth0
Monitoring eth0 interface..
Listening port 3306..
Logging transactions that take more than 4000 milliseconds..

From 192.168.0.1:24441
2011/02/23 09:12:17.258307 ->
begin
2011/02/23 09:12:17.258354 <-
GOT_OK
2011/02/23 09:12:17.264797 ->
select * from diary where diary_id=100 for update
2011/02/23 09:12:17.265087 <-
GOT_RES
2011/02/23 09:12:17.277622 ->
select 1
2011/02/23 09:12:17.277713 <-
GOT_RES
2011/02/23 09:13:01.232620 ->
update diary set diary_date=now() where diary_id=100
2011/02/23 09:13:01.232960 <-
GOT_OK
2011/02/23 09:13:17.360993 ->
commit

From 192.168.0.2:24442
2011/02/23 09:12:20.969288 ->
begin
2011/02/23 09:12:20.969483 <-
GOT_OK
2011/02/23 09:12:20.977699 ->
update diary set diary_date=now() where diary_id=100
2011/02/23 09:13:11.300935 <-
GOT_ERR:Lock wait timeout exceeded; try restarting transaction
2011/02/23 09:13:13.136967 ->
rollback
--------------------
 It is easy to identify that the first transaction caused the problem. It acquired an exclusive lock on diary_id=100 at 09:12:17.264797 and didn't release it until 09:13:17.360993. All of its queries completed within a millisecond, so the first transaction was probably sleeping, or spent a long time on other logic (i.e. accessing remote servers), before closing the transaction.

 MySlowTranCapture uses libpcap to capture network packets, like tcpdump. Since libpcap sometimes loses packets, it is not guaranteed to capture 100% of transactions.
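For the curious, the capture setup with libpcap boils down to something like the sketch below; the real tool additionally reassembles MySQL protocol packets and keeps per-connection transaction state, and the function name here is made up:

#include <pcap.h>
#include <stddef.h>

/* Open a live capture on the given interface, filtered to MySQL
   traffic on port 3306. The caller then runs pcap_loop() with a
   handler that parses the captured TCP payloads. */
pcap_t *open_mysql_capture(const char *dev)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *pc = pcap_open_live(dev, 65535, 0, 1000, errbuf);
    if (pc == NULL)
        return NULL;

    struct bpf_program prog;
    if (pcap_compile(pc, &prog, "tcp port 3306", 1, 0) != 0 ||
        pcap_setfilter(pc, &prog) != 0) {
        pcap_close(pc);
        return NULL;
    }
    return pc;
}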
 MySlowTranCapture uses approximately 10-30% of the CPU resources of the mysqld process. This is not bad for short-term analysis (when I tested tcpdump, it used 140% of mysqld's CPU resources, and most of the CPU time seemed to be spent writing network packets to local files).
 I sometimes use this tool for debugging, too. Developers sometimes execute unnecessarily many SQL statements within a single transaction. MySlowTranCapture is useful for tracing such transactions on development servers, by setting -t lower (i.e. 50ms).

Monday, April 4, 2011

Speaking at MySQL Conference and Expo 2011

Next week I'll stay at the Hyatt Santa Clara to attend the MySQL Conference and Expo 2011. This year I'll present one tutorial and two sessions.

- Linux and H/W optimizations for MySQL (Apr 11, 9:00-12:30)

- Automated, Non-Stop MySQL Operations and Failover (Apr 12, 15:05-15:50)

- Using MySQL As NoSQL - Introduction to HandlerSocket Plugin (Apr 13, 17:15-18:00) (co-presented with Kazuho Oku, a former employee of Cybozu Labs who recently joined DeNA, and the author of Q4M and MyCached (the origin of HandlerSocket))

These talks will be based on my experiences as a former MySQL consultant and on our experience running real-world social gaming platforms (DeNA is not well known in the US yet, but our services get 2-3 billion page views per day, so it's not small). Speaking for 4.5 hours is not easy for a non-native English speaker, but I'll do my best to make the sessions interesting and beneficial.
Due to the recent disasters and ongoing rolling blackouts in Japan, most of my colleagues had to cancel their flights, and I also have to go back to Tokyo just after the UC (I booked a flight leaving at 6:55pm on Apr 14), so I can't attend some of the interesting events scheduled for Apr 15 or later. But it's confirmed that I can stay from Apr 10 to 14, so I look forward to seeing ex-MySQL colleagues and friends!