A few people have asked for my opinion on GitHub's recent service outage. As the creator of MHA, I have a lot of experience with MySQL failover.
Here are my points about failover design. Most of them overlap with Robert's points.
- "Too Many Connections" is not a reason to start automated failover
- Do not repeat failover
I know of several unsuccessful failover stories that went like this: 1. failover happens because the master is unreachable (returning "too many connections" errors) under heavy load, 2. failover happens again because the new master is unreachable under heavy load, 3. failover happens again, and so on. On database servers, a newly promoted master is slower because of its poor cache hit rate. In a traditional active/standby environment, the database cache on the new master is empty, so you will suffer 10x or even worse performance for the time being. In a master/slave environment, the slave already has data in its cache, so performance is much better than on a standby server, but you can't expect better performance than on the original master.
It does not make sense to repeat failover within a short time, and automated failover should not happen just because the master is overloaded. If the master is overloaded due to H/W problems (e.g. RAID battery failure, disk block failure, etc.), failover will need to be performed, but I think this can be done manually.
MHA does not start failover if specific error codes are returned (e.g. 1203: ER_TOO_MANY_USER_CONNECTIONS). MHA also does not repeat failover if 1. the last failover failed with errors, or 2. the last failover happened within the last N minutes (480 minutes by default).
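The two rules above can be sketched as a small guard function. This is only an illustration, not MHA's actual implementation: the function name `should_start_failover`, its parameters, and the inclusion of error 1040 (ER_CON_COUNT_ERROR, "Too many connections") next to 1203 are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Server error codes that prove mysqld answered, so the master is alive.
# 1203 is the code named in the text; 1040 is an assumption added here.
ALIVE_ERROR_CODES = {1040, 1203}


def should_start_failover(
    error_code: Optional[int],
    last_failover_at: Optional[datetime],
    last_failover_ok: bool,
    now: datetime,
    min_interval: timedelta = timedelta(minutes=480),
) -> bool:
    """Return True only if automated failover is allowed to start.

    error_code is the MySQL server error returned by the health check,
    or None if the server did not respond at all.
    """
    # An overloaded master still responds with an error: do not fail over.
    if error_code in ALIVE_ERROR_CODES:
        return False
    # No earlier failover on record: nothing else blocks this one.
    if last_failover_at is None:
        return True
    # The last failover ended with errors: a human should look first.
    if not last_failover_ok:
        return False
    # The last failover was too recent: do not repeat it.
    if now - last_failover_at < min_interval:
        return False
    return True
```

With these rules, a master that returns 1203 never triggers failover, and a second outage two hours after a successful failover is left to the DBAs instead of being handled automatically.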
- Do not fail over if it is unclear whether the master is dead
This is very important for avoiding split brain. In many cases data inconsistency is more problematic than longer downtime. You need to make sure that no mysqld process is running, or will start running, on the master. Even if the master is not reachable via TCP/IP connection attempts, mysqld may just be in the middle of crash recovery. Forcing a shutdown of the master (power off) is my favorite approach, but it may take a long time depending on the H/W.
MHA has a helper script to kill (i.e. power off) the master. When I developed MHA, I spent a long time investigating how to speed up shutting down machines.
- Prepare tools for manual failover
There are some cases where automating failover is really scary. A typical example is a datacenter failure. If the whole datacenter is unreachable, it is not easy to check the master's status automatically, and remotely shutting down the master is probably not possible. It would also be unclear when the datacenter will recover. In such cases I think automated failover should not be performed, and manual failover should be done instead. Proper alerts should be sent immediately, so that DBAs can start analyzing the problem and start manual failover quickly. In master/slave environments, the slaves' relay log positions might differ from each other. Checking the status of all slaves and, if needed, fixing consistency by parsing relay logs is painful. MHA is helpful in such situations, and I have actually used MHA many more times for manual failover than for automated failover.
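As a rough illustration of the first step of that check, the sketch below compares how far each slave has read from the master's binary log and reports which slaves are behind. It is not how MHA does it and does not parse relay logs. It assumes each slave's status is a dict holding the `Master_Log_File` and `Read_Master_Log_Pos` fields from `SHOW SLAVE STATUS`, that binary log names end in a numeric sequence (e.g. `mysql-bin.000012`), and the host names are hypothetical.

```python
from typing import Dict, List, Tuple

SlaveStatus = Dict[str, object]


def read_position(status: SlaveStatus) -> Tuple[int, int]:
    """Turn a slave's status into a comparable (file sequence, offset) pair."""
    file_name = str(status["Master_Log_File"])
    # "mysql-bin.000012" -> 12; assumes the usual numeric suffix.
    sequence = int(file_name.rsplit(".", 1)[1])
    return (sequence, int(status["Read_Master_Log_Pos"]))


def find_latest_and_lagging(
    slaves: Dict[str, SlaveStatus],
) -> Tuple[str, List[str]]:
    """Return the most advanced slave and the sorted list of slaves behind it."""
    positions = {host: read_position(status) for host, status in slaves.items()}
    latest = max(positions, key=lambda host: positions[host])
    lagging = sorted(
        host for host, pos in positions.items() if pos < positions[latest]
    )
    return latest, lagging


# Hypothetical statuses collected from three slaves after the master died.
slaves = {
    "slave1": {"Master_Log_File": "mysql-bin.000012", "Read_Master_Log_Pos": 500},
    "slave2": {"Master_Log_File": "mysql-bin.000012", "Read_Master_Log_Pos": 900},
    "slave3": {"Master_Log_File": "mysql-bin.000011", "Read_Master_Log_Pos": 99999},
}
latest, lagging = find_latest_and_lagging(slaves)
```

The slaves in `lagging` are the ones that would need the missing events applied before they can replicate consistently from the new master.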