Monday, October 24, 2011

Making slave pre-fetching work better with SSD

In the recent few weeks I have spent some time for creating yet another slave prefetching tool named "Replication Booster (for MySQL)", written in C/C++ and using Binlog API. Now I'm happy to say that I have released an initial version at GitHub repository.

The objective of Replication Booster is same as mk-slave-prefetch: avoiding or reducing replication delay under disk i/o bound workloads. This is done by prefetching relay logs events, converting to SELECT, and executing SELECT before SQL Thread executes the events. Then SQL thread can be much faster because target blocks are already cached.

On my benchmarking environment Replication Booster works pretty well.

On HDD bound benchmarks(update by pk), SQL thread's qps was:
- Normal Slave (without prefetch): 400 update/s
- With prefetch: 1,500 update/s

On SSD (SAS SSD) bound benchmarks(update by pk), SQL thread's qps was:
- Normal Slave: 1,780 update/s
- With prefetch: 5,400 update/s

It is great that slave could handle many more updates per second without replication delay on disk i/o bound benchmarks. It works very well on SSD, too. The below is a graph that shows Seconds_Behind_Master is gradually decreasing by using Replication Booster.

In this benchmark I executed 4,000 updates per second on master. On the slave server, by default slave delayed continuously because the slave could handle only 1,779 updates per second. When starting using Replication Booster on the slave, the slave could execute 5,418 updates per second. This was higher than the master's qps so Seconds_Behind_Master gradually decreased. After the slave caught up with the master, the slave could execute as same volume of updates as the master (4,000 update/s), so no replication delay happened. This means on this environment we can raise maximum update traffics many more (1,780 update/s -> 4,000-5,400 update/s) without investing for new H/W.

I also tested on some of our production slaves (not used for services) and it showed good results, too. I could get 30-300% improvements, depending on cache hit ratio. If data was fully cached, of course I didn't get any benefit, but it didn't cause negative impacts either.

I hope this tool is interesting to you.

In this blog post, I'll explain backgrounds for developing this tool and basic design. I believe many more optimizations can be done in future. Your feedbacks are welcome.

Good concurrency, bad single threaded performance

I mentioned at Percona Live London that using SSD on slaves is a good practice to reduce replication delay, and SATA/SAS SSD is practical enough because unit price is much cheaper than PCI-E SSD, and SATA/SAS SSD shows not bad concurrency with many drives when using recent RAID controller(most applications actually do not need 30,000-50,000 read iops, even though running many MySQL instances on the same machine). It is certainly an advantage that many SATA/SAS drives (6-10) can be installed on 1U box.

The biggest thing I'm concerned about using SATA/SAS SSD is single thread read iops. You can get only 2,000 read iops from SATA/SAS SSD with RAID controller. If you do not use RAID controller, it is not impossible to get 3,000+ read iops, but this is still much lower than using PCI-Express SSD. You can get 10,000 signle thread read iops from PCI-Express SSD.

When using SATA/SAS SSD, it is easy to predict that slave delays much earlier than using PCI-E SSD. Especially if running multiple MySQL instances per single server, innodb_buffer_pool_size has to be small (i.e. 4GB-12GB), so lots of disk reads will happen. By using 6-10 SATA/SAS drives, maximum throughput can be competitive enough against PCI-Express SSD, but single thread read iops is not improved. This is an issue.

"Slave prefetching" is a well known, great approach to make SQL Thread faster.

What is slave prefetching?

The concept of "slave prefetching" is (I think) well known, but I briefly describe here in case you don't know..

SQL Thread is single threaded. When SQL thread has to do lots of disk i/o by itself, replication is easily delayed. In almost all cases of slave lagging, I/O thread has received all binary log events (and saved as relay logs), but SQL thread delays execution due to massive random disk i/o. So there are many relay log events from SQL thread's current position (Relay_Log_Pos) to the end of relay log (EOF of relay logs).

Random disk reads happen when target blocks(records/indexes) are not cached. If they are cached, random reads won't happen. If you can cache all entries before SQL Thread executes, SQL thread does not have to do random disk reads. Then SQL thread can be much faster.

How can you do that? Read relay logs before SQL Thread executes, covert DML statements (especially UPDATE) to SELECT, then execute SELECT on the slave in parallel.

I believe this concept was introduced to MySQL community by Paul Tackfield at YouTube 4-5 years ago. mk-slave-prefetch is an open source implementation.

Desire for C/C++ based, raw relay log event hanlding tool

At first I tested mk-slave-prefetch on my benchmarks. But as far as I tested, unfortunately it didn't work as I expected. I think the main reasons are as below:

* mk-slave-prefetch uses mysqlbinlog to parse relay logs. But mysqlbinlog is not as flexible and fast as reading raw relay log events. For example, mysqlbinlog output events have to go through the file to the main prefetching program. mysqlbinlog is an external command line tool, so the main prefetching program has to fork a new process to run mysqlbinlog, which opens and closes relay logs every time.

* mk-slave-prefetch is written in Perl. In general, a prefetching tool has to be fast enough to read, convert and execute SELECT statements before SQL thread executes. The tool has to be multi-threaded. The tool probably has to run on the same machine as MySQL slave, in order to minimize network overheads. The resource consumption (CPU and memory) should be small enough so that it doesn't hurt MySQL server performance.
I don't believe Perl is a good programming language for developing such a tool.

I believe C/C++ is the best for programming language for this purpose. And I believe handling raw relay log events is much more efficient than using mysqlbinlog.

Based on the above reasons, I decided to develop a new slave prefeching tool by myself. I had some experiences for parsing binary/relay logs when developing MHA, so at first I planned to create a simple relay log parser program. But immediately I changed my mind, and tried mysql-replication-listener (Binlog API). Binlog API is a newly released utility tool from Oracle MySQL team. Binlog API has a "file driver" interface, which enables to parse binary log or relay log file and handle events one by one. By using Binlog API, handling raw binlog events becomes much easier. For example, it's easy to parse binlog events, get updated entries, store to external software such as Lucene/Hadoop, etc.

Oracle says Binlog API is pre-alpha. But as far as I have tested for slave prefetching purpose, it works very well. It's fast enough, and I didn't encounter any crashing or memory leak issues. So I decided to develop a new slave prefetching tool using Binlog API.

Introduction to Replication Booster for MySQL

I named the new slave prefetching tool as "Replication Booster". Keywords "slave" and "prefetch" were already used by mk-slave-prefetch, so I used different words.

The below figure is a basic architecture of Replication Booster.

Design notes

- Replication Booster is a separated tool (runs as a MySQL client). It works with normal MySQL 5.0/5.1/5.5/5.6. Starting/stopping Replication Booster is possible without doing anything on MySQL server side.

- Replication Booster is written in C/C++, and using boost::regex for converting UPDATE/DELETE to SELECT. Binlog API also uses boost.

- Using Binlog API to parse relay logs, not using mysqlbinlog
  - Using file driver, not tcp driver. file driver does not connect to MySQL server, and just reading relay log files. So even if file driver has bugs, impacts are rather limited (If it has memory leak, that's serious, but I haven't encountered yet).

- Main thread parses relay log events, picking up query log events, passing to internal FIFO queues
  - Binlog API has an interface to get a binlog event header (event type, timestamp, server-id, etc) and an event body. So it is easy to pick up only query log events.
  - Parsing row based events is not supported yet. It should be worth implementing in the near future.

- Multiple worker threads pop query events from queues, and convert query events to SELECT statements

- A dedicated thread (monitoring thread) keeps track of current SQL Thread's position (Relay_Log_Pos)

- Worker threads do not execute a SELECT statement if the query's position is behind current SQL Thread's position. This is because it's not needed (too late).

- Main thread stops reading relay log events if the event's timestamp is N seconds (default 3) ahead of SQL Thread's timestamp
  - This is for cache efficiency. If reading too many events than needed, it causes negative impacts. In the worst case cache entries that SQL thread needs now are wiped out by newly selected blocks.

- When slave is not delayed, Replication Booster should not cause negative impacts. It shouldn't use noticeable CPU/Disk/Memory resources. It shouldn't prevent MySQL server activities by holding locks, either. Of course, it shouldn't execute converted SELECT statements because they are not useful anymore. The last one is not easy to work on various kinds of environments (i.e. HDD/SSD/etc), but should be controllable by some external configuration parameters

- Bugs on Replication Booster should not result in MySQL server outage.

- Replication Booster works locally. It doesn't allow to connect to remote MySQL servers. This is for performance reasons. Executing tens of thousands of queries per second from this tool remotely will cause massive fcntl() contentions and use high network resources (both bandwidth and CPU time). I don't like that.

Configuration Parameters

--threads: Number of worker threads. Each worker thread converts binlog events and executes SELECT statements. Default is 10 (threads).

--offset-events: Number of binlog events that main thread (relay log reader thread) skips initially when reading relay logs. This number should be high when you have faster storage devices such as SSD. Default is 500 (events).

--seconds-prefetch: Main thread stops reading relay log events when the event's timestamp is --seconds-prefetch seconds ahead of current SQL thread's timestamp. After that the main thread starts reading relay logs from SQL threads's position again. If this value is too high, worker threads will execute many more SELECT statements than necessary. Default value is 3 (seconds).

--millis-sleep: If --seconds-prefetch condition is met, main thread sleeps --millis-sleep milliseconds before starting reading relay log. Default is 10 milliseconds.

- MySQL connection parameters: MySQL slave user, password, socket file or local ip/port

How to verify Replication Booster works on your environments

You may want to run Replication Booster where Seconds_Behind_Master is sometimes growing. If Replication Booster works as expected, you can get the following benefits.

  - Seconds_Behind_Master gets decreased, or growth rate of Seconds_Behind_Master has decreased
  - Update speed has improved (i.e. Com_update per second has increased) by this tool

Replication Booster has some statistics variables, and prints these statistics when terminating the script (Ctrl+C) like below. If slave delays but "Executed SELECT queries" is almost zero, something is wrong.
Running duration:    847.846 seconds
Parsed binlog events: 60851473
Skipped binlog events by offset: 8542280
Unrelated binlog events: 17444340
Queries discarded in front: 17431937
Queries pushed to workers: 17431572
Queries popped by workers: 5851025
Old queries popped by workers: 3076
Queries discarded by workers: 0
Queries converted to select: 5847949
Executed SELECT queries: 5847949
Error SELECT queries: 0
Number of times to read relay log limit: 1344
Number of times to reach end of relay log: 261838

I haven't spent so much time on this project yet (just started a few weeks ago). Current algorithm is simple. I believe many more optimizations can be done in future, but even so current benchmark numbers are pretty good. I hope we can use this tool on many places where we want to avoid replication delay but don't want to spend too much money for faster storage devices.

No comments:

Post a Comment