Tuesday, April 1, 2014

Semi-Synchronous Replication at Facebook

 After intensive testing and hacking, we started using Semi-Synchronous MySQL Replication in Facebook production environments. Semi-Synchronous Replication itself has been available since MySQL 5.5 (GA was released 3.5 years ago!), but I'm pretty sure not many people have used it in production so far. Here is a summary of our objective, enhancements and usage patterns. If you want to hear more in depth, please feel free to ask me at Percona Live this week.

Objective / Why Semisync?

  The objective of Semi-Synchronous Replication is simple -- master failover without data loss, and without requiring full durability.

 First, let me describe why the objective is difficult without semisync.

 There are a couple of fast slave promotion (master failover) solutions. My own MHA covers both fully automated and semi-automated MySQL failover. Fully automated means both failure detection and slave promotion are done automatically. Semi-automated means failure detection is not automated, but slave promotion is done with one command. Time to detect failure is approximately 10 seconds, and the actual failover takes around 5 to 20 seconds, depending on what you do during failover (e.g. forcing power off of the crashed master will take at least a few seconds). Total downtime can be less than 30 seconds, if failover works correctly. I use the term "Fast Failover" in this post, which includes both automated and semi-automated master failover.
 In MySQL 5.6, GTID based failover is also possible. Oracle's official tool mysqlfailover automates MySQL master failover using GTID. The latest version of MHA also supports GTID.

 Both mysqlfailover and MHA rely on MySQL replication. MySQL replication is asynchronous, so there is a very serious disadvantage -- the risk of data loss on master failover. If you use normal MySQL replication and do automated master failover with MHA/mysqlfailover, you can fail over quickly (a few seconds with MHA), but you always risk losing recently committed data.

 If you don't want to take any risk of losing data, you can't do fast master failover with normal MySQL replication. You have to do the following steps in case of master failure.

- Always set fully durable settings on master. By fully durable I mean setting innodb_flush_log_at_trx_commit=1 and sync_binlog=1.
- On master crash, wait for a while (10~30 minutes) until the crashed master recovers. Recovery takes a long time because it involves OS reboot, storage and filesystem recovery, and InnoDB crash recovery.
- If the crashed master recovers, you can continue services without losing any data. Since all data exists on the master, slaves can continue replication. BTW official 5.6 had a bug that broke all slaves in this scenario, but this bug was fixed in 5.6.17.
- If the crashed master doesn't recover (H/W failure etc), you need to promote one of the slaves to a new master. There is a risk of losing some data but you don't have any other choice.

 This "safer" approach has two issues.
- Longer downtime. This is because you have to wait for master's recovery.
- You can't eliminate risks of losing data. If master is dead and never recovers, your risk of losing data is the same as doing fast failover.

 So, in bad cases, you suffer from both longer downtime and data loss.

 Semi-Synchronous Replication helps prevent data loss.

 If you do not care about data loss risk, there is no reason to use Semi-Synchronous replication. You can use normal MySQL replication and do fast failover with mysqlfailover or MHA. Facebook is one of the companies that care about data loss risk with MySQL, which is why we were very interested in Semi-Synchronous replication.

 Semisync replication originated at Google in 2007. Official MySQL has supported it since 5.5, although the actual implementation algorithm is substantially different from Google's.

 MySQL Cluster and Galera offer synchronous replication protocols in different ways. I do not cover them in this blog post.

 Semi-Synchronous Replication currently has two different algorithms -- Normal Semisync and Loss-Less Semisync. Let me explain the differences.

Differences between Normal Semisync and Loss-Less Semisync

 Loss-Less Semisync is a new Semisync feature supported in official MySQL 5.7. The original implementation was done by Zhou Zhenxing as the "Enhanced Semisync" project, and was also filed as a bug report. Oracle implemented the feature based on his idea and named it Loss-Less Semisync. So Enhanced Semisync and Loss-Less Semisync mean the same thing. I say Loss-Less Semisync in this post.

 Normal semisync and loss-less semisync work as below.

1. binlog prepare (doing nothing)
2. innodb prepare (fsync)
3. binlog commit (writing to fscache)
4. binlog commit (fsync)
5. loss-less semisync wait (AFTER_SYNC)
6. innodb commit (releasing row locks, changes are visible to other users)
7. normal semisync wait (AFTER_COMMIT)

 With normal semisync (AFTER_COMMIT), the InnoDB commit is done before waiting for the ack from a semisync slave, so the committed rows are visible to applications even though semisync slaves may not have received the data. If the master crashes and none of the slaves has received the data, the data is lost but applications may already have seen it. This is called a phantom read, and in many cases it's problematic.

 Loss-less semisync (AFTER_SYNC) avoids the problem. Loss-less semisync commits to InnoDB after getting an ack from one of the semisync slaves. So when committed data is visible to applications, one of the semisync slaves has already received it. The risk of phantom reads is much smaller: data is lost only if both the master and the latest semisync slave go down at the same time, which is much less likely than the failure case with normal semisync.

 Normal Semisync can't meet your expectations if you want to avoid data loss and phantom reads. Loss-Less Semisync is needed.
 With Loss-Less Semi-Synchronous replication, committed data should be on at least one of the slaves, so you can recover from the latest slave. You can always do fast failover here.
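
 The difference between the two wait points can be illustrated with a toy model. This is only a sketch and not MySQL code: simulate_commit and its parameters are hypothetical names introduced for illustration. It walks through the commit steps listed above and reports whether the row was visible to other clients and whether a semisync slave had acknowledged it at the moment the run stops.

```python
# Toy model of the commit steps above; not MySQL code.
# simulate_commit, wait_point and crash_before_ack are hypothetical names.

STEPS = [
    "binlog prepare",
    "innodb prepare (fsync)",
    "binlog commit (write)",
    "binlog commit (fsync)",
    "innodb commit (visible)",
]


def simulate_commit(wait_point, crash_before_ack):
    """Return (visible_to_clients, acked_by_slave) when the run stops."""
    if wait_point not in ("AFTER_SYNC", "AFTER_COMMIT"):
        raise ValueError("unknown wait point: " + wait_point)
    visible = False
    acked = False
    for step in STEPS:
        if step == "innodb commit (visible)" and wait_point == "AFTER_SYNC":
            # Loss-less semisync: wait for the ACK before the InnoDB commit.
            if crash_before_ack:
                return visible, acked
            acked = True
        if step == "innodb commit (visible)":
            visible = True
    if wait_point == "AFTER_COMMIT":
        # Normal semisync: wait for the ACK after the InnoDB commit.
        if crash_before_ack:
            return visible, acked
        acked = True
    return visible, acked


if __name__ == "__main__":
    for wp in ("AFTER_COMMIT", "AFTER_SYNC"):
        visible, acked = simulate_commit(wp, crash_before_ack=True)
        print(wp, "visible:", visible, "acked:", acked)
```

 In this model, a crash before the ack under AFTER_COMMIT leaves a row that was visible but never acknowledged (the phantom read case), while under AFTER_SYNC a row is never visible unless it was acknowledged first.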

Reduced Durability

 When you do fast failover, you can use reduced durability settings on the master as well as on slaves. Reduced durability means innodb_flush_log_at_trx_commit != 1 and sync_binlog != 1. With Semi-Synchronous replication, you can start failover immediately when the master is down. When promoting a slave to the new master, identify the latest slave (highly likely one of the Semi-Synchronous slaves, but not guaranteed) and apply the differential logs to the new master. The master's durability does not matter here, because there is no way to access the master's data during failover. So you can safely reduce durability. Reducing durability has a lot of benefits.
- Reducing latency on (group) commit because it doesn't wait for fsync().
- Reducing IOPS because the number of fsync() calls is significantly reduced: from every commit to every second. Overall disk workloads can be reduced. This is especially helpful if you can't rely on battery/flash backed write cache.
- Reducing write amplification. Write volume can be reduced a lot, even to less than half in some cases. This is important especially when using flash devices, because less write volume increases flash life expectancy.

Requirements for Semisync Deployment

 To make Semisync work, you need at least one semisync reader (a slave with semisync enabled) within the same (or a very close) datacenter as the master. This is for latency. When semisync is enabled, the round-trip time (RTT) between the master and one of the semisync slaves is added to transaction commit latency. If none of the semisync slaves is located within a close datacenter, RTT may take tens or hundreds of milliseconds, which means a single client can commit only around 10~100 times per second. For most environments, this will not work. You need a slave within a close datacenter.
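
 The arithmetic behind this limit is simple enough to sketch. The helper below is hypothetical (max_commits_per_second and local_commit_ms are names made up for this example) and assumes a single client that issues commits one after another, so each commit has to pay at least one full round trip.

```python
# Rough upper bound on commit rate for one client committing serially.
# max_commits_per_second and local_commit_ms are hypothetical names.

def max_commits_per_second(rtt_ms, local_commit_ms=0.0):
    """Upper bound on commits/sec when every commit waits rtt_ms for an ACK."""
    per_commit_ms = rtt_ms + local_commit_ms
    if per_commit_ms <= 0:
        raise ValueError("per-commit latency must be positive")
    return 1000.0 / per_commit_ms


if __name__ == "__main__":
    for rtt in (0.1, 10, 100):
        rate = max_commits_per_second(rtt)
        print(rtt, "ms RTT ->", round(rate), "commits/sec at most")
```

 The bound applies per client connection. Many concurrent clients can still reach higher total throughput, but each individual client waits for the round trip on every commit.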

 To make fast failover work without data loss, you need to make sure Semi-Synchronous Replication is always enabled. MySQL Semisync has a few cases where semisync is disabled:
- Exceeding the timeout (no ACK arrives from any semisync slave within rpl_semi_sync_master_timeout milliseconds)
- No semisync slave (can be controlled via rpl_semi_sync_master_wait_no_slave)
- Executing SET GLOBAL rpl_semi_sync_master_enabled=0

 If you want semisync to be always enabled, you need to make sure these scenarios won't happen. Set an infinite or very long timeout, and have at least two semisync readers.
 

Facebook Enhancements to Semi-Synchronous Replication


 We spent a lot of time testing Semi-Synchronous replication in 2013. We found some S1 bugs, serious performance problems, and some administration issues. Our MySQL Engineering team and Performance team worked on fixing the issues, and finally our Operations team deployed Semisync in production.

 Here are our major enhancements.

Backporting Loss-Less Semisync from 5.7

 As described above, Loss-Less Semisync is needed to prevent data loss and phantom reads, so we backported the Loss-Less Semisync patch from official MySQL 5.7 to our Facebook MySQL 5.6 branch. It will be merged into the WebScaleSQL branch soon.

 Interestingly, when we tested semisync performance, loss-less semisync gave better throughput than normal semisync, especially when the number of clients was large. Normal semisync caused more mutex contention, which was alleviated with loss-less semisync. Since loss-less semisync also has a better data protection mechanism, we concluded there is no reason to use normal semisync here.

Semisync mysqlbinlog

 Starting from MySQL 5.6, mysqlbinlog supports remote binlog backups by using --raw and --read-from-remote-server. For remote binlog backups, mysqlbinlog works like a MySQL slave: it connects to a master, executes the BINLOG DUMP command, and then receives binlog events via the MySQL replication protocol. This is useful when you want to take backups of the master's binary logs. A slave's relay logs and binary logs are not identical to the master's binary logs, so they can't be used directly as backups of the master's binary logs.

 We extended mysqlbinlog to speak the Semisync protocol. The reason for the enhancement is that we wanted to use "semisync mysqlbinlog" as a replacement for local semisync slaves. We usually run slaves in remote datacenters, and we don't always need local slaves to serve read requests or for redundancy. On the other hand, as described in the "Requirements for Semisync Deployment" section above, in practice at least two local semisync readers are needed to make semisync work. We didn't want to run two additional dedicated slaves per master just for semisync. So we invented semisync mysqlbinlog and use it instead of semisync slaves, as shown in the figure below.




 Compared to semisync slave, semisync mysqlbinlog has a lot of efficiency wins.

- A semisync slave has a lot of CPU overhead, such as query parsing and making optimizer plans. semisync mysqlbinlog does not have such overhead.
- A semisync slave writes 2x (relay log and binary log). semisync mysqlbinlog writes the binary log only.
- For a semisync slave, the way it writes to the relay log is not efficient. The IO thread writes to the kernel buffer for each binlog event. A regular auto-committed transaction consists of three binlog events (query BEGIN, query body, and commit XID). When using InnoDB only, writing to the kernel buffer once per XID event is enough (though this does not cover DDL). Writing once per XID event reduces the frequency of kernel buffer writes to 1/3 or less. semisync mysqlbinlog could easily do such optimizations. We have not done it yet, but it is even possible to make mysqlbinlog send back the ACK before writing to a file, and the extension is very easy.
- A slave has contention between the SQL thread and the I/O thread, so the I/O thread itself slows down, which slows down semisync master throughput too. semisync mysqlbinlog does not have such overhead because there is no SQL thread.
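
 The "1/3" figure in the third point above can be checked with a toy count. The sketch below is not MySQL or mysqlbinlog code; auto_commit_events and count_writes are hypothetical helpers that only count how many times a reader would write to the kernel buffer for a stream of auto-committed transactions.

```python
# Toy count of kernel-buffer writes; hypothetical helpers, not MySQL code.

def auto_commit_events(n_transactions):
    """Each auto-committed transaction yields BEGIN, query body and XID events."""
    events = []
    for _ in range(n_transactions):
        events.extend(["BEGIN", "QUERY", "XID"])
    return events


def count_writes(events, write_on_every_event):
    """Count writes to the kernel buffer for a stream of binlog events."""
    writes = 0
    pending = 0
    for event in events:
        pending += 1
        if write_on_every_event or event == "XID":
            writes += 1
            pending = 0
    if pending:
        writes += 1  # write out whatever remains at the end of the stream
    return writes


if __name__ == "__main__":
    events = auto_commit_events(1000)
    print("per event:", count_writes(events, True))
    print("per XID:  ", count_writes(events, False))
```

 Transactions with more than one statement carry more events per XID, so the ratio only improves from there. As noted above, this batching does not cover DDL, which has no XID event.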

 With a mysqlbinlog reader, the master failover step becomes a bit tricky. This is because mysqlbinlog is not a mysqld process, so it doesn't accept any MySQL commands such as CHANGE MASTER. When doing master failover, it is highly likely that one of the local mysqlbinlog processes has the latest binary log events, and these events should be applied to the new master. The new MHA version (0.56) supports this feature.

 In this configuration, mysqlbinlog processes need to be highly available. If all semisync mysqlbinlog processes are down, semisync is stopped or suffers from long wait times.


Reducing plugin_lock mutex contention

 Prior to MySQL 5.6.17, there was a performance bug where transaction commit throughput dropped significantly when there were many non-semisync slaves or binlog readers, even if there were only a few semisync readers. In typical deployments, there are two or three semisync readers and multiple non-semisync readers, so the performance drop with many non-semisync readers was annoying.
 The performance drop was caused by "plugin_lock", a MySQL internal mutex on the master. For those who don't know, semisync is a plugin in MySQL, and it's not installed by default. The plugin_lock mutex was needed by semisync binlog dump threads only, but in fact the mutex was held by all binlog dump threads. We looked into the problem further.
 First we tried replacing the plugin_lock mutex with a read/write lock. It did not help much: Linux profiling tools showed that plugin_lock still caused contention. During profiling, we learned that most/all glibc rw-locks had an internal lock (a mutex-like thing) on which threads could stall. The pattern was: get the lock, get exclusive access to the cache line to modify data, release the lock. This was relatively expensive for plugin_lock, since its critical section doesn't do any expensive I/O.

 So switching plugin_lock to a read/write lock was actually a bad idea. We needed to remove the plugin-related locks below as much as possible. There are four major plugin-related mutexes in MySQL.
- plugin_lock
- plugin_lock_list
- plugin_unlock
- plugin_unlock_list

 We also noticed that the Delegate classes had read/write locks and they caused very hot contention (especially Binlog_transmit_delegate::lock). The read/write lock protects a list, so switching to a lock-free list was probably possible. BTW we noticed that performance schema did not collect mutex statistics for the mutexes in the Delegate classes (bug#70577).

 The real problem was that all of the above locks were held not only by semisync binlog readers, but also by non-semisync binlog readers.

 Based on the above factors, we concluded that removing all plugin mutexes was not easy, so we decided to optimize so that these locks are held by semisync binlog readers only, and not by non-semisync binlog readers. Below is a benchmark result.



 The x-axis was the number of non-semisync binlog readers, and the y-axis was concurrent INSERT throughput from 100 clients. The number of semisync binlog readers was always 1 to 3. Detailed benchmark conditions were described in a bug report.
 Fortunately, our patches were finally merged into 5.6.17 and 5.7, so everybody can benefit easily.


 With all of the enhancements, we could get pretty good benchmark results with semisync.


 This is a mysqlslap insert benchmark on the master, with one semisync slave/mysqlbinlog running. x-axis is the number of clients, y-axis is the number of inserts on the master. Enhanced means loss-less semisync.
 Normal slave is a traditional (non-semisync) slave. Enhanced mysqlbinlog is our semisync usage pattern. As you can see, loss-less semisync beats normal semisync due to reduced internal mutex contention. semisync mysqlbinlog also beats semisync slave because of much lower overhead. This shows that loss-less semisync scales pretty well.
 

Conclusion and Future Plans

 After several performance improvements, Semi-Synchronous replication became good enough for us. From a performance point of view, I expect that single-threaded application performance will be the next low-hanging fruit. In our benchmarks, we got around 2500 transaction commits per second with semisync (0.4ms per commit). Without semisync, it was easy to get around 10000 transaction commits per second (0.1ms per commit). Of course semisync adds RTT overhead, but on a local datacenter network, RTT is much lower than 0.3ms. I think there is another source of semisync overhead here, so I will revisit this issue and work with the Oracle Replication team and outside experts.
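
 The per-commit numbers above follow directly from the throughput figures. Here is a small back-of-the-envelope sketch, where ms_per_commit and semisync_overhead_ms are hypothetical helper names and assumed_rtt_ms is an illustrative value, not a measurement from these benchmarks.

```python
# Back-of-the-envelope check of the numbers quoted above.
# assumed_rtt_ms is a hypothetical value, not a measurement from this post.

def ms_per_commit(commits_per_second):
    """Average latency of one commit for a client committing serially."""
    return 1000.0 / commits_per_second


def semisync_overhead_ms(semisync_cps, async_cps):
    """Extra per-commit latency observed once semisync is enabled."""
    return ms_per_commit(semisync_cps) - ms_per_commit(async_cps)


if __name__ == "__main__":
    overhead = semisync_overhead_ms(2500, 10000)
    assumed_rtt_ms = 0.1  # hypothetical local-datacenter RTT
    print("semisync overhead: %.2f ms" % overhead)
    print("not explained by RTT: %.2f ms" % (overhead - assumed_rtt_ms))
```

 Whatever the real RTT is, any part of the 0.3ms gap that it does not account for is the extra semisync overhead still to be investigated.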


20 comments:

zedware said...

Thanks for sharing. Expecting more SemiSync deployments in production.

Robert Hodges said...

Yoshinori, this is excellent work. Thank you for posting. Do you know if your patches will be picked up by MariaDB, Percona, or Oracle MySQL any time soon?

Yoshinori Matsunobu said...

Robert, semisync mysqlbinlog is a relatively tiny patch (https://github.com/facebook/mysql-5.6/commit/9497645c39bb8d340b9e8a893bc83a0e01632cf1) and of course it doesn't affect the server side codebase, so I hope it can be merged into Oracle MySQL (or any other distribution) pretty easily.
Loss-less semisync is supported in MySQL 5.7, and plugin_lock mutex contention is fixed in 5.6.17.

Robert Hodges said...

Thanks Yoshinori! This looks like a good motivation to try MySQL 5.7.

Chang said...

Thanks for sharing! One question: how does Facebook deploy semi-sync in a different data center that is very close (say less than 200 kilometers)? Or if we deploy a semi-sync slave in a close but different data center, what is the latency requirement for the network?

Thanks!

Yoshinori Matsunobu said...

@Chang

We deploy semisync mysqlbinlog within the same datacenter to avoid latency issues. If you deploy semisync in a different datacenter, you need to carefully measure round-trip time and semisync throughput. If RTT takes 3 milliseconds, single threaded semisync throughput cannot exceed 333 commits per second, which may not be acceptable.

Francisco Bordenave said...

Yoshinori, are you planning to publish the code changes to simplify GTID deployment? I've seen PLCME presentation and it looks to be a very wise change.

Thanks.

Vojtech Kurka said...

Yoshinori, thank you, excellent work and article!

I just wonder: what about a power failure (or other disaster) in the whole datacenter? If you omit the durable settings, you might lose a lot of data in that case, don't you? I mean the transactions that are not yet asynchronously replicated to a remote DC. Or are you using durable settings for the local semi-sync binlog readers? Although that might not help if you always need a fast failover.

Vojtech

sham Zee said...

Amazing replication at facebook. Thanks for sharing....

Ruichao Lin said...

Hi Yoshinori Matsunobu, I am ruiaylin, a MySQL DBA in China. When sync_binlog=1 and innodb_flush_log_at_trx_commit=1 are set and the master machine crashes and is started again, in my opinion one transaction's data may still be lost (for some reason such as the RAID cache write policy being write-back without a BBU, etc.). Am I right?
Looking forward to your reply.

ktaka said...

Hi Matsunobu-san,

I have a question about the loss-less semisync scenario.

What could happen if the master crashes just after the slave sends the loss-less semisync ack, but just before the master does the InnoDB commit? Could the slave have uncommitted data in its binlog? Could this cause a situation where the new master promoted from the slave has data that was never committed?
