Monday, April 4, 2011
Speaking at MySQL Conference and Expo 2011
Next week I'll stay at the Hyatt Santa Clara to attend the MySQL Conference and Expo 2011. This year I'll present one tutorial and two sessions.
- Linux and H/W optimizations for MySQL (Apr 11, 9:00-12:30)
- Automated, Non-Stop MySQL Operations and Failover (Apr 12, 15:05-15:50)
- Using MySQL As NoSQL - Introduction to HandlerSocket Plugin (Apr 13, 17:15-18:00), co-presented with Kazuho Oku. He is a former employee of Cybozu Labs who recently joined DeNA, and he is the author of Q4M and MyCached (the origin of HandlerSocket).
These talks will be based on my experiences as a former MySQL consultant and on our experience running real-world social gaming platforms (DeNA is not well known in the US yet, but our service handles 2-3 billion page views per day, so it's not small). Speaking for 4.5 hours is not easy for a non-native English speaker, but I'll do my best to make the sessions interesting and beneficial.
Due to the recent disasters and ongoing rolling blackouts in Japan, most of my colleagues had to cancel their flights, and I also have to go back to Tokyo right after the UC (I booked a flight leaving at 6:55pm on Apr 14), so I can't attend some of the interesting events scheduled for Apr 15 or later. But it's confirmed that I can stay from Apr 10 to 14, so I look forward to seeing ex-MySQL colleagues and friends!
Thursday, December 16, 2010
Thanks for releasing MySQL 5.5 GA!
I am excited to see that MySQL 5.5 GA (5.5.8) has finally been released. All of my ex-colleagues at MySQL/Oracle have done an amazing job. I was also pleasantly surprised that 5.5 was released on schedule :). So far I feel that Oracle is leading MySQL development and product management very well.
5.5 has lots of practical features, not only for web service providers but also for enterprise users. I especially like the following features.
Improved concurrency
Very often lots of concurrent sessions access the same table. In such cases, one of MySQL's internal global mutexes (LOCK_open) becomes very hot and serious concurrency problems occur. Increasing table_cache could even make things worse (because MySQL has to do a linear search there; see bug#33948 for details). In 5.5, the problem has been fixed. Here is a very simple concurrent PK lookup benchmark result (using mysqlslap). It's good to see that the worst-case line has been significantly improved in 5.5.
Other global mutexes such as the InnoDB kernel mutex are still hot, but overall performance is much better.
It's also great that the limit of 1023 concurrent data-modifying transactions has been raised in 5.5.
4-byte UTF-8
I'm not sure how many people are aware of this issue, but it is important for users in APAC regions such as China/Japan/Korea. Most multi-byte characters consume 2 or 3 bytes per character in UTF-8, but some characters (a few hundred Japanese characters, for example) consume 4 bytes. Prior to MySQL 5.5, 4-byte UTF-8 characters were not supported, so people who needed to handle these characters had to use local character sets such as cp932 or gbk. Now we can use UTF-8 safely (you need to use the "utf8mb4" character set).
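As a minimal sketch (the table name here is made up for illustration), using 4-byte UTF-8 in 5.5 is just a matter of declaring the utf8mb4 character set:
mysql> CREATE TABLE comment (
    ->   comment_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    ->   body TEXT
    -> ) CHARSET utf8mb4 ENGINE=InnoDB;
One thing to keep in mind when converting existing tables: utf8mb4 needs up to 4 bytes per character, so indexed VARCHAR columns hit InnoDB's 767-byte index prefix limit sooner than with utf8.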
Audit Plugin Interface
Lots of users in the financial industry have wanted this feature for a long time. They frequently want to audit the following information:
- Login/Access Timestamp
- Failed login/access information
- Database username
- Client IP/Hostname or terminal name
- Accessed table/view/other objects name and action
- Full executed query string
- The number of affected/retrieved rows
The MySQL general query log has most of the above information, but it does not have any filtering feature, so you have to parse huge general logs, which is not efficient. By writing audit plugins, you can log only specific events.
Semi-Synchronous Replication
By using semi-synchronous replication, the likelihood of losing data on slaves when the master crashes is greatly reduced. Since this is not a fully synchronous approach, you still risk losing some data on a crash, but in practice it can be very helpful for less strict users (i.e. web service providers).
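For reference, here is a rough sketch of enabling the semisync plugins bundled with 5.5 (the timeout value below is just an example, not a recommendation):
mysql> INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';  # on the master
mysql> SET GLOBAL rpl_semi_sync_master_enabled = 1;
mysql> SET GLOBAL rpl_semi_sync_master_timeout = 1000;                   # ms before falling back to async
mysql> INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';    # on each slave
mysql> SET GLOBAL rpl_semi_sync_slave_enabled = 1;
(The slave I/O thread has to be restarted for the slave-side setting to take effect.)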
"Change Buffering" in InnoDB (speeding up DELETEs)
InnoDB has supported the "Insert Buffering" feature for a long time. Insert buffering can significantly improve insert performance if you have lots of non-unique secondary indexes. On the other hand, insert buffering does not have any effect on delete-marking or purging index entries. In general, modifying indexes causes random disk reads when the target index leaf blocks are not cached, which significantly slows down response time. That's one of the reasons why bulk deletes (and updates that modify lots of secondary indexes) were not fast in InnoDB. In 5.5, the "Insert Buffering" feature was extended to "Change Buffering", which is effective not only for inserts but also for deletes. Here is a slide (p.18) from the MySQL team that shows the effect of Change Buffering (160x faster response time for bulk-deleting 100k rows).
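Change buffering is controlled by the innodb_change_buffering parameter. A quick sketch ('all' should already be the default in 5.5, so this is mostly for illustration):
mysql> SET GLOBAL innodb_change_buffering = 'all';  # buffer inserts, delete-marking and purges
mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffering';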
I am also interested in Performance Schema. I like the FILE_SUMMARY_BY_INSTANCE table, which makes it possible to fetch per-file disk i/o statistics. If Performance Schema could also be used to fetch per-index statistics (like the userstats patch) and could be enabled/disabled dynamically, that would be great.
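For example, per-file i/o statistics can be pulled with a query like the one below (a sketch; I'm writing the column names from memory, so check them against your 5.5 performance_schema):
mysql> SELECT FILE_NAME, COUNT_READ, COUNT_WRITE,
    ->        SUM_NUMBER_OF_BYTES_READ, SUM_NUMBER_OF_BYTES_WRITE
    -> FROM performance_schema.FILE_SUMMARY_BY_INSTANCE
    -> ORDER BY SUM_NUMBER_OF_BYTES_READ + SUM_NUMBER_OF_BYTES_WRITE DESC LIMIT 10;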
Monday, November 8, 2010
Handling long texts/blobs in InnoDB - 1 to 1 relationship, covering index
I have seen a 1 to 1 relationship sometimes used in MySQL (InnoDB) to avoid a significant performance slowdown. To understand the performance difference, it is necessary to understand how InnoDB stores column values in data blocks. Suppose you have the following "diary" table (for storing blog entries).
CREATE TABLE diary (
diary_id INT UNSIGNED AUTO_INCREMENT,
user_id INT UNSIGNED NOT NULL,
post_date TIMESTAMP NOT NULL,
status TINYINT UNSIGNED NOT NULL,
rating FLOAT NOT NULL,
title VARCHAR(100) NOT NULL,
body TEXT,
PRIMARY KEY (diary_id),
INDEX user_date(user_id, post_date),
INDEX user_rating(user_id, rating)
) CHARSET utf8 ENGINE=InnoDB;
"body" column contains diary text, which is pretty large (1KB/row). On the other hand, the rest columns are small enough (less than 50 bytes per row in total). This table is mostly used from applications for fetching titles, not fetching body. 90% SELECT queries will be like this:1) SELECT user_id, post_date, title FROM diary WHERE diary_id=?
And the rest 10% SELECT queries will be:
2) SELECT body FROM diary WHERE diary_id=?
You probably guess that 1) is much faster than 2) because 1) fetches less than 50 bytes while 2) fetches 1KB+ of data. But unfortunately, in most cases 1) is as slow as 2). Even if 100% of the SELECT queries are 1), overall throughput will be poor if the diary table is much larger than the InnoDB buffer pool. The reason lies in how InnoDB stores column values. Below is a rough image.

InnoDB stores large texts/blobs separately from other columns only when no more space is available within the same block. The separated page is called an "Overflow Page". But this doesn't help the diary table, mainly because the diary body (1KB) is much smaller than the InnoDB block size (16KB). When inserting diary entries, there is normally enough space in the InnoDB block to store all columns including body. As a result, body is stored next to the other columns, and the InnoDB blocks are occupied mostly by body.

InnoDB's I/O unit is the block. Even if you do not fetch body, InnoDB internally has to read the blocks that contain it. In other words, a single InnoDB block cannot hold many diary entries, so a lot of random disk reads happen here.
I talked about this topic at the MySQL Conference & Expo this year. I explained two solutions: one is using a 1 to 1 relationship, the other is using a covering index.
The 1 to 1 relationship solution is to create the two tables below:
CREATE TABLE diary_head (
diary_id INT UNSIGNED AUTO_INCREMENT,
user_id INT UNSIGNED NOT NULL,
post_date TIMESTAMP NOT NULL,
status TINYINT UNSIGNED NOT NULL,
rating FLOAT NOT NULL,
title VARCHAR(100) NOT NULL,
PRIMARY KEY (diary_id),
INDEX user_date(user_id, post_date),
INDEX user_rating(user_id, rating)
) CHARSET utf8 ENGINE=InnoDB;
CREATE TABLE diary_body (
diary_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
body TEXT
) CHARSET utf8 ENGINE=InnoDB;
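As mentioned below, with this split applications need a join to fetch a complete entry. Against the two tables above it would look like this (just an illustration):
mysql> SELECT h.user_id, h.post_date, h.title, b.body
    -> FROM diary_head h INNER JOIN diary_body b USING (diary_id)
    -> WHERE h.diary_id=?;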
The diary_head table does not have the largest column (body), so it can be much smaller than the original diary table. The InnoDB blocks that hold diary_head's column values are frequently accessed, but there are far fewer of them, so they are cached very well in the InnoDB buffer pool. Most SELECT queries then do not require random disk i/o, so total throughput can be increased.
Another approach is creating a covering index that covers all columns except body. You do not need to create 1 to 1 tables in this case.
mysql> ALTER TABLE diary ADD INDEX diary_covering (diary_id, user_id, post_date,
-> status, rating, title);
By fetching these columns through the diary_covering index instead of the PRIMARY key, InnoDB does not read the blocks that contain body. The diary_covering index is about as small as the diary_head table above, so it will be cached very well.
Both the 1 to 1 relationship and the additional covering index perform pretty well (the benchmark results are in the conference slides). But neither is straightforward for developers. Using a 1 to 1 relationship forces developers to use joins to fetch all diary columns, and forces them to maintain consistency (or use foreign key constraints). Using an additional covering index sometimes increases data size significantly. What's more, you need to add FORCE INDEX to the SELECT statement, as below, so that MySQL uses the covering index instead of the PRIMARY key.
mysql> SELECT user_id, post_date, title FROM diary
-> FORCE INDEX (diary_covering) WHERE diary_id=?
This is needed because the MySQL optimizer chooses the PRIMARY KEY for PK lookups: it assumes a PK (unique) lookup is faster than any other type of index lookup, even though that is not true in this case. Using SQL hints is sometimes not easy, especially when you use O/R mapping tools.
Personally, I hope the InnoDB team or outside experts will add a new InnoDB data format that is optimal for handling such large data types. PBXT handles these kinds of columns pretty well: it stores large texts separately from other, smaller columns, and it won't read the large text data when an SQL statement doesn't touch the large columns. If such a data format (storing columns specified in DDL into separate blocks) were supported in InnoDB, that would be nice.
Labels:
mysql
Tuesday, October 19, 2010
Using MySQL as a NoSQL - A story for exceeding 750,000 qps on a commodity server
UPDATE: Oracle has officially released a memcached daemon plugin that talks with InnoDB. I'm glad to see that NoSQL+MySQL has become an official solution. It's still a preview release, but it is very promising. Let's try it and make it better!
Most high scale web applications use MySQL + memcached. Many of them also use NoSQL products like TokyoCabinet/Tyrant. In some cases people have dropped MySQL and shifted to NoSQL. One of the biggest reasons for such a move is that NoSQL is said to perform better than MySQL for simple access patterns such as primary key lookups, and most queries from web applications are simple, so this seems like a reasonable decision.
Like many other high scale web sites, we at DeNA(*) had similar issues for years. But we reached a different conclusion. We are using "only MySQL". We still use memcached for front-end caching (i.e. preprocessed HTML, count/summary info), but we do not use memcached for caching rows, and we do not use NoSQL either. Why? Because we could get much better performance from MySQL than from other NoSQL products. In our benchmarks, we got 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients, and we have also gotten excellent performance in production environments.
Maybe you can't believe the numbers, but this is a real story. In this long blog post, I'd like to share our experiences.
(*) For those who do not know.. I left Oracle in August 2010. Now I work at DeNA, one of the largest social game platform providers in Japan.
Is SQL really good for fast PK lookups?
How many PK lookups per second do you need to run? Our applications at DeNA need to execute lots of PK lookups, such as fetching user info by user id or fetching diary info by diary id. memcached and NoSQL certainly fit such requirements very well. When you run simple multi-threaded "memcached get" benchmarks, you can probably execute 400,000+ get operations per second, even if the memcached clients are located on remote servers. When I tested with the latest libmemcached and memcached, I got 420,000 gets per second on a 2.5GHz x 8 core Nehalem box with a quad-port Broadcom Gigabit Ethernet card.
How frequently can MySQL execute PK lookups? Benchmarking is easy. Just run concurrent queries with sysbench, super-smack, mysqlslap, etc.
[matsunobu@host ~]$ mysqlslap --query="select user_name,..
from test.user where user_id=1" \
--number-of-queries=10000000 --concurrency=30 --host=xxx -uroot
You can easily check how many SELECT statements are executed per second:
[matsunobu@host ~]$ mysqladmin extended-status -i 1 -r -uroot \
| grep -e "Com_select"
...
| Com_select | 107069 |
| Com_select | 108873 |
| Com_select | 108921 |
| Com_select | 109511 |
| Com_select | 108084 |
| Com_select | 108483 |
| Com_select | 108115 |
...
100,000+ queries per second is not bad, but it's much slower than memcached. What is MySQL actually doing? The vmstat output showed that both %user and %system were high.
[matsunobu@host ~]$ vmstat 1
r b swpd free buff cache in cs us sy id wa st
23 0 0 963004 224216 29937708 58242 163470 59 28 12 0 0
24 0 0 963312 224216 29937708 57725 164855 59 28 13 0 0
19 0 0 963232 224216 29937708 58127 164196 60 28 12 0 0
16 0 0 963260 224216 29937708 58021 165275 60 28 12 0 0
20 0 0 963308 224216 29937708 57865 165041 60 28 12 0 0
Oprofile output told more about where CPU resources were spent.
samples % app name symbol name
259130 4.5199 mysqld MYSQLparse(void*)
196841 3.4334 mysqld my_pthread_fastmutex_lock
106439 1.8566 libc-2.5.so _int_malloc
94583 1.6498 bnx2 /bnx2
84550 1.4748 ha_innodb_plugin.so.0.0.0 ut_delay
67945 1.1851 mysqld _ZL20make_join_statistics
P4JOINP10TABLE_LISTP4ItemP16st_dynamic_array
63435 1.1065 mysqld JOIN::optimize()
55825 0.9737 vmlinux wakeup_stack_begin
55054 0.9603 mysqld MYSQLlex(void*, void*)
50833 0.8867 libpthread-2.5.so pthread_mutex_trylock
49602 0.8652 ha_innodb_plugin.so.0.0.0 row_search_for_mysql
47518 0.8288 libc-2.5.so memcpy
46957 0.8190 vmlinux .text.elf_core_dump
46499 0.8111 libc-2.5.so malloc
MYSQLparse() and MYSQLlex() are called during the SQL parsing phase. make_join_statistics() and JOIN::optimize() are called during the query optimization phase. These are "SQL" overhead. It's obvious that the performance drop was caused mostly by the SQL layer, not by the "InnoDB (storage)" layer. MySQL has to do a lot of things that memcached/NoSQL do not need to do:
* Parsing SQL statements
* Opening, locking tables
* Making SQL execution plans
* Unlocking, closing tables
MySQL also has to do lots of concurrency control. For example, fcntl() is called many times while sending/receiving network packets, and global mutexes such as LOCK_open and LOCK_thread_count are taken/released very frequently. That's why my_pthread_fastmutex_lock() ranked #2 in the oprofile output and %system was not small.
Both the MySQL development team and the external community are aware of the concurrency issues. Some of them have already been solved in 5.5. I'm glad to see that lots of fixes have been done so far.
But it is also important that %user reached 60%. Mutex contention shows up as increased %system, not %user. Even if all of the mutex issues inside MySQL were fixed, we could not expect 300,000 queries per second.
You may have heard about the HANDLER statement. Unfortunately, HANDLER did not help throughput much because query parsing and opening/closing tables are still needed.
CPU efficiency is important for in-memory workloads
If only a small part of the active data fits in memory, SQL overhead becomes relatively negligible, simply because disk i/o costs are much, much higher. We do not need to care so much about SQL costs in that case.
But on some of our hot MySQL servers, almost all data fits in memory and they become completely CPU bound. The profiling results were similar to what I described above: the SQL layer consumed most of the resources. We needed to execute lots of primary key lookups (i.e. SELECT x FROM t WHERE id=?) or limited range scans. Even though 70-80% of the queries were simple PK lookups against the same table (the only difference being the values in the WHERE clause), MySQL had to parse/open/lock/unlock/close every time, which did not seem efficient to us.
Have you heard about NDBAPI?
Is there any good way to reduce the CPU resources/contention spent in the SQL layer of MySQL? If you are using MySQL Cluster, NDBAPI would be the best solution. When I worked as a consultant at MySQL/Sun/Oracle, I saw lots of customers who were disappointed with SQL Node + NDB performance, then became happy after they got N times better performance by writing NDBAPI clients. You can use both NDBAPI and SQL in MySQL Cluster: it is recommended to use NDBAPI for frequent access patterns, and SQL + MySQL + NDB for ad-hoc or infrequent query patterns.
This was what we wanted. We wanted faster access APIs, but we also wanted to use SQL for ad-hoc or complex queries. But DeNA uses InnoDB, like many other web services, and switching to NDB is not trivial. Embedded InnoDB supports neither SQL nor a network interface, so it was not an option for us.
Developing "HandlerSocket Plugin" - a MySQL plugin that speaks NoSQL network protocols
We thought the best approach was to implement a NoSQL network server inside MySQL: writing a network server as a MySQL plugin (daemon plugin) that listens on specific ports, accepts NoSQL protocols/APIs, and then accesses InnoDB directly through MySQL's internal storage engine API. This approach is similar to NDBAPI, but it can talk with InnoDB.
This concept was initially invented and prototyped by Kazuho Oku at Cybozu Labs last year. He wrote the MyCached UDF, which speaks the memcached protocol. My colleague Akira Higuchi implemented another plugin: HandlerSocket. The picture below shows what HandlerSocket can do.

< Fig 1 > What is HandlerSocket?
HandlerSocket is a MySQL daemon plugin that lets applications use MySQL like a NoSQL store. The main point of HandlerSocket is that it talks with storage engines like InnoDB without the SQL-related overhead. To access MySQL tables it of course needs to open/close tables, but it does not do so for every request: it keeps tables open for reuse. Since opening/closing tables is very costly and causes serious mutex contention, this helps performance a lot. HandlerSocket closes tables when traffic becomes small, etc., so that it never blocks administrative commands (DDL) forever.
What is different from using MySQL + memcached? Comparing Fig 1 with Fig 2, I think you'll notice lots of differences. Fig 2 shows typical memcached and MySQL usage: memcached is aggressively used for caching database records, mainly because a memcached get operation is much faster than an in-memory / on-disk PK lookup in MySQL. If HandlerSocket can fetch records as fast as memcached, we don't need memcached for caching records.

< Fig 2 > Common architecture pattern for MySQL + memcached
Using HandlerSocket
As an example, here is a "user" table. Suppose we need to fetch user information by user_id.
CREATE TABLE user (
user_id INT UNSIGNED PRIMARY KEY,
user_name VARCHAR(50),
user_email VARCHAR(255),
created DATETIME
) ENGINE=InnoDB;
In MySQL, fetching the user info can of course be done with a SELECT statement.
mysql> SELECT user_name, user_email, created FROM user WHERE user_id=101;
+---------------+-----------------------+---------------------+
| user_name | user_email | created |
+---------------+-----------------------+---------------------+
| Yukari Takeba | yukari.takeba@dena.jp | 2010-02-03 11:22:33 |
+---------------+-----------------------+---------------------+
1 row in set (0.00 sec)
Let me show how we can do the same thing with HandlerSocket.
* Installing HandlerSocket
Installation steps are described here. The basic steps are as follows:
1. Download HandlerSocket here
2. Building HandlerSocket (both client and server codes)
./configure --with-mysql-source=... --with-mysql-bindir=... ; make; make install
3. Installing HandlerSocket into MySQL
mysql> INSTALL PLUGIN handlersocket soname 'handlersocket.so';
Since HandlerSocket is a MySQL plugin, you can use it like other plugins such as the InnoDB Plugin, Q4M, Spider, etc. That is, you do not need to modify the MySQL source code itself. The MySQL version has to be 5.1 or later. You need both the MySQL source code and the MySQL binaries to build HandlerSocket.
* Writing HandlerSocket client code
We provide C++ and Perl client libraries. Here is sample Perl code that fetches a row by PK lookup.
The code below fetches the user_name, user_email and created columns from the user table, looking up by user_id=101, so you'll get the same results as the SELECT statement above.
#!/usr/bin/perl
use strict;
use warnings;
use Net::HandlerSocket;
#1. establishing a connection
my $args = { host => 'ip_to_remote_host', port => 9998 };
my $hs = new Net::HandlerSocket($args);
#2. initializing an index so that we can use it in the main logic.
# MySQL tables will be opened here (if not opened yet)
my $res = $hs->open_index(0, 'test', 'user', 'PRIMARY',
'user_name,user_email,created');
die $hs->get_error() if $res != 0;
#3. main logic
#fetching rows by id
#execute_single (index id, cond, cond value, max rows, offset)
$res = $hs->execute_single(0, '=', [ '101' ], 1, 0);
die $hs->get_error() if $res->[0] != 0;
shift(@$res);
for (my $row = 0; $row < 1; ++$row) {
my $user_name= $res->[$row + 0];
my $user_email= $res->[$row + 1];
my $created= $res->[$row + 2];
print "$user_name\t$user_email\t$created\n";
}
#4. closing the connection
$hs->close();
[matsunobu@host ~]$ perl sample.pl
Yukari Takeba yukari.takeba@dena.jp 2010-02-03 11:22:33
For most web applications, it's good practice to keep lightweight HandlerSocket connections established (persistent connections), so that most requests can go straight to the main logic (step #3 in the above code).
The HandlerSocket protocol is a small, text-based protocol. Like the memcached text protocol, you can use telnet to fetch rows through HandlerSocket.
[matsunobu@host ~]$ telnet 192.168.1.2 9998
Trying 192.168.1.2...
Connected to xxx.dena.jp (192.168.1.2).
Escape character is '^]'.
P 0 test user PRIMARY user_name,user_email,created
0 1
0 = 1 101
0 3 Yukari Takeba yukari.takeba@dena.jp 2010-02-03 11:22:33
(Green lines are request packets; fields must be separated by TAB.)
Benchmarking
Now it's a good time to show our benchmarking results. I used the user table above and tested how many PK lookup operations could be executed from multi-threaded remote clients. All user data fit in memory (I tested 1,000,000 rows). I also tested memcached with similar data (using libmemcached and memcached_get() to fetch user data). In the MySQL via SQL tests, I used a traditional SELECT statement: "SELECT user_name, user_email, created FROM user WHERE user_id=?". Both the memcached and HandlerSocket client programs were written in C/C++. All client programs were located on remote hosts, connecting to MySQL/memcached via TCP/IP.
The highest throughput was as follows:
approx qps server CPU util
MySQL via SQL 105,000 %us 60% %sy 28%
memcached 420,000 %us 8% %sy 88%
MySQL via HandlerSocket 750,000 %us 45% %sy 53%
MySQL via HandlerSocket achieved over 7.5 times the throughput of traditional MySQL via SQL statements, even though %us was only three quarters as high. This shows that the SQL layer in MySQL is very costly, and that skipping it improves performance dramatically. It is also interesting that MySQL via HandlerSocket was about 1.78 times as fast as memcached, and that memcached spent too much %system. Though memcached is an excellent product, there is still room for optimization.
Below is the oprofile output gathered during the MySQL via HandlerSocket tests. CPU resources were spent on core operations such as network packet handling, fetching rows, etc. (bnx2 is a network device driver).
samples % app name symbol name
984785 5.9118 bnx2 /bnx2
847486 5.0876 ha_innodb_plugin.so.0.0.0 ut_delay
545303 3.2735 ha_innodb_plugin.so.0.0.0 btr_search_guess_on_hash
317570 1.9064 ha_innodb_plugin.so.0.0.0 row_search_for_mysql
298271 1.7906 vmlinux tcp_ack
291739 1.7513 libc-2.5.so vfprintf
264704 1.5891 vmlinux .text.super_90_sync
248546 1.4921 vmlinux blk_recount_segments
244474 1.4676 libc-2.5.so _int_malloc
226738 1.3611 ha_innodb_plugin.so.0.0.0 _ZL14build_template
P19row_prebuilt_structP3THDP8st_tablej
206057 1.2370 HandlerSocket.so dena::hstcpsvr_worker::run_one_ep()
183330 1.1006 ha_innodb_plugin.so.0.0.0 mutex_spin_wait
175738 1.0550 HandlerSocket.so dena::dbcontext::
cmd_find_internal(dena::dbcallback_i&, dena::prep_stmt const&,
ha_rkey_function, dena::cmd_exec_args const&)
169967 1.0203 ha_innodb_plugin.so.0.0.0 buf_page_get_known_nowait
165337 0.9925 libc-2.5.so memcpy
149611 0.8981 ha_innodb_plugin.so.0.0.0 row_sel_store_mysql_rec
148967 0.8943 vmlinux generic_make_request
Since MySQL via HandlerSocket runs inside MySQL and goes through InnoDB, you can get statistics from regular MySQL commands such as SHOW GLOBAL STATUS. It's nice to see 750,000+ Innodb_rows_read.
$ mysqladmin extended-status -uroot -i 1 -r | grep "Innodb_rows_read"
...
| Innodb_rows_read | 750192 |
| Innodb_rows_read | 751510 |
| Innodb_rows_read | 757558 |
| Innodb_rows_read | 747060 |
| Innodb_rows_read | 748474 |
| Innodb_rows_read | 759344 |
| Innodb_rows_read | 753081 |
| Innodb_rows_read | 754375 |
...
Detailed specs were as follows.
Model: Dell PowerEdge R710
CPU: Nehalem 8 cores, E5540 @ 2.53GHz
RAM: 32GB (all data fit in the buffer pool)
MySQL Version: 5.1.50 with InnoDB Plugin
memcached/libmemcached version: 1.4.5(memcached), 0.44(libmemcached)
Network: Broadcom NetXtreme II BCM5709 1000Base-T (Onboard, quad-port, using three ports)
* Both memcached and HandlerSocket were network i/o bound. When I tested with a single port, I got around 260,000 qps on MySQL via HandlerSocket, 220,000 qps on memcached.
Features and Advantages of HandlerSocket
HandlerSocket has lots of features and advantages, listed below. Some of them are really beneficial for us.
* Supporting lots of query patterns
HandlerSocket supports PK/unique lookups, non-unique index lookups, range scans, LIMIT, and INSERT/UPDATE/DELETE. Operations that do not use any index are not supported. multi_get operations (similar to IN(1,2,3..): fetching multiple rows in a single network round-trip) are also supported.
See documentation for details.
* Can handle lots of concurrent connections
A HandlerSocket connection is lightweight. Since HandlerSocket employs epoll() and a worker-thread/thread-pooling architecture, the number of MySQL internal threads is limited (it can be controlled by the handlersocket_threads parameter in my.cnf). So you can establish thousands or tens of thousands of network connections to HandlerSocket without losing stability (consuming too much memory, causing massive mutex contention, etc.: see bug#26590, bug#33948, bug#49169).
* Extremely high performance
HandlerSocket can achieve performance competitive with other NoSQL lineups, as already described. Actually, I have not seen any other NoSQL product that can execute 750,000+ queries per second on a commodity server from remote clients via TCP/IP.
HandlerSocket not only eliminates the SQL-related function calls, it also optimizes around network/concurrency issues.
** Smaller network packets
HandlerSocket protocol is much simpler and smaller than normal MySQL protocols. So overall network transfer size can be much smaller.
** Running limited number of MySQL internal threads
See above.
** Grouping client requests
When lots of concurrent requests come to HandlerSocket, each worker thread gathers as many requests as possible, executes the gathered requests at one time, and then sends back the results. This can improve performance greatly at the cost of slightly higher response time. For example, you can gain the following benefits; I'll explain them in depth in later posts if anybody is interested.
*** Can reduce the number of fsync() calls
*** Can reduce replication delay
* No duplicate cache
When you use memcached to cache MySQL/InnoDB records, the records are cached in both memcached and the InnoDB buffer pool. That duplication is less efficient (memory is still expensive!). Since the HandlerSocket plugin goes through the InnoDB storage engine, records are cached inside the InnoDB buffer pool, where they can be reused by other SQL statements.
* No data inconsistency
Since data is stored at one place (inside InnoDB), data consistency check between memcached and MySQL is not needed.
* Crash-safe
The backend storage is InnoDB. It's transactional and crash-safe. Even if you use innodb-flush-log-at-trx-commit != 1, you lose only < 1s of data on a server crash.
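That trade-off is the usual innodb_flush_log_at_trx_commit setting; for example (the value 2 is shown only as an illustration, not a recommendation):
mysql> SET GLOBAL innodb_flush_log_at_trx_commit = 2;  # write the log at commit, fsync it about once per second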
* SQL can be used from mysql clients
In many cases people still want to use SQL (e.g. to generate summary reports). This is why we can't use Embedded InnoDB. Most NoSQL products don't support an SQL interface, either.
HandlerSocket is just a plugin for MySQL. You can send normal SQL statements from MySQL clients as usual, and use the HandlerSocket protocol when you need high throughput.
* All operational benefits from MySQL
Again, HandlerSocket runs inside MySQL, so all MySQL operations such as SQL, online backups, replication, and monitoring by Nagios / MySQL Enterprise Monitor are supported. HandlerSocket activities can be monitored by regular MySQL commands such as SHOW GLOBAL STATUS, SHOW ENGINE INNODB STATUS, and SHOW PROCESSLIST.
* No need to modify/rebuild MySQL
Since it's a plugin, it runs on both MySQL Community and MySQL Enterprise Servers.
* Independent from storage engines
HandlerSocket is developed so that it can talk with any storage engine, though we have tested and used it only with the 5.1 and 5.5 InnoDB Plugin.
Notes and Limitations
* Need to learn HandlerSocket APIs
You need to write a program to talk with HandlerSocket, though it's pretty easy to use. We provide C++ API and Perl bindings.
* No security
Like other NoSQL databases, HandlerSocket does not provide any security features. HandlerSocket's worker threads run with system user privileges, so applications can access all tables through the HandlerSocket protocol. Of course you can use firewalls to filter packets, as with other NoSQL products.
* No benefit for HDD bound workloads
For HDD i/o bound workloads, a database instance cannot execute thousands of queries per second, which normally results in only 1-10% CPU usage. In such cases the SQL execution layer is not the bottleneck, so there is no benefit to using HandlerSocket. We use HandlerSocket on servers where almost all data fits in memory.
DeNA is using HandlerSocket in production
We already use the HandlerSocket plugin in our production environments, and the results are great. We have been able to get rid of many memcached and MySQL slave servers, and overall network traffic has been reduced, too. We haven't seen any performance problems (slow response times, stalls, etc.) so far. We are very satisfied with the results.
I think MySQL has been underrated by the NoSQL/database communities. MySQL actually has a much longer history than most other products, and lots of unique and great enhancements have been made by my excellent ex-colleagues. I know from NDBAPI that MySQL has very strong potential as a NoSQL store. The storage engine API and the daemon plugin interface are completely unique, and they are what made it possible for Akira and DeNA to develop HandlerSocket. As an ex-employee of MySQL and a long-time fan of MySQL, I'd like to see MySQL become better and more popular, not only as an RDBMS but also as yet another NoSQL lineup.
Since the HandlerSocket plugin is open source, feel free to try it. We would appreciate any feedback.
Labels:
handlersocket,
mysql,
nosql
Wednesday, August 26, 2009
Accessing MySQL tables from UDF + storage engine API
A couple of weeks ago, at a MySQL study group in Tokyo, I presented various MySQL hacking techniques, such as debugging with MS Visual Studio or gdb, tracing with DTrace, writing information schema plugins, accessing tables from UDFs and the storage engine API, and extending the MySQL server code itself. This 90-minute session went well: nearly 100 people attended, including famous MySQL developers such as Kazuho Oku-san and Kentoku Shiba-san, there were productive discussions, and quite a lot of people seemed interested in MySQL hacking. The slides are written in Japanese, but the sample code is understandable and can be downloaded here.
What the audience was most interested in was accessing MySQL tables directly from plugins (currently UDFs) through the storage engine API. I showed a sample UDF, based on code that Kazuho-san released over a year ago, which is more than 10 times faster than SQL or stored procedures for complex operations. An excerpt follows (error handling skipped).
#define MYSQL_SERVER 1
#include            /* server-internal headers (e.g. mysql_priv.h in 5.1) provide THD, TABLE_LIST, handler */
...
/* initializing tables, fields and indexes */
THD *thd = current_thd;
TABLE_LIST tables;
TABLE *table;
tables.init_one_table("test","message", TL_READ);
simple_open_n_lock_tables(thd, &tables);
table= tables.table;
message_id_fld = get_field(table, "id");
message_user_id_fld = get_field(table, "user_id");
message_key = index_init(table, "user_id", true);
/* accessing table. Equivalent query is
SELECT id FROM message WHERE user_id=? ORDER BY id DESC LIMIT 1; */
uchar* key_buff= (uchar*) thd->alloc(message_key->key_length);
bzero(key_buff, message_key->key_length);
int null_offset= message_user_id_fld->null_bit;
int4store(key_buff + null_offset, user_id);
err= table->file->index_read_last_map(table->record[0], key_buff, 1);
fprintf(stderr, "id=%lld\n", message_id_fld->val_int());
/* closing objects */
table->file->ha_index_end();
close_thread_tables(thd);
By defining MYSQL_SERVER as 1, you can access MySQL internal structures.
You can build the sample UDF in a development environment as follows. The MySQL source tree is required when accessing internal structures.
g++ -g -DSAFE_MUTEX -DSAFEMALLOC \
-Wall -fno-rtti -fno-exceptions -fPIC -shared \
-I/debug-built-mysql-src/include \
-I/debug-built-mysql-src/regex \
-I/debug-built-mysql-src/sql \
-o udf_sample.so udf_sample.cc
In a production environment, drop -DSAFE_MUTEX and -DSAFEMALLOC and include a MySQL source tree built with the production configuration. When using macros (e.g. current_thd) that behave differently between debug and production builds, you need to include the matching source tree directly.
g++ -g \
-Wall -fno-rtti -fno-exceptions -fPIC -shared \
-I/prod-built-mysql-src/include \
-I/prod-built-mysql-src/regex \
-I/prod-built-mysql-src/sql \
-o udf_sample.so udf_sample.cc
This snippet skips error handling for legibility, but you need to handle errors carefully, otherwise mysqld will crash. Please check bench_udf.cc, build_udf_debug.sh, and build_udf_prod.sh in my sample code.
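As a rough illustration only (not how the published sample does it), here is a minimal sketch of the error-checking pattern around the same excerpt. It assumes the sample's helper functions get_field() and index_init() return NULL on failure, and that the code runs inside the usual long long UDF entry point with its is_null/error output arguments.
/* Sketch: same calls as the excerpt above, with error checks added */
if (simple_open_n_lock_tables(thd, &tables))
{
  *error= 1;                          /* could not open/lock test.message */
  return 0;
}
table= tables.table;
message_id_fld= get_field(table, "id");
message_user_id_fld= get_field(table, "user_id");
message_key= index_init(table, "user_id", true);
if (!message_id_fld || !message_user_id_fld || !message_key)
{
  close_thread_tables(thd);           /* release whatever was opened */
  *error= 1;
  return 0;
}
/* key_buff set up as in the excerpt above */
err= table->file->index_read_last_map(table->record[0], key_buff, 1);
if (err == HA_ERR_KEY_NOT_FOUND || err == HA_ERR_END_OF_FILE)
  *is_null= 1;                        /* no message for this user_id */
else if (err)
  *error= 1;                          /* unexpected handler error */
table->file->ha_index_end();
close_thread_tables(thd);
The key point is to release whatever was opened (ha_index_end(), close_thread_tables()) on every path instead of letting an error propagate and crash mysqld.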
MySQL is fast, but it still has many CPU-expensive operations, such as parsing SQL statements, building an execution plan every time (MySQL doesn't have an optimizer plan cache), and a lot of memcpy/malloc calls. If you write a UDF that accesses tables directly through the storage engine API, these expensive operations are skipped, so it can be much faster than SQL statements in many cases.
The above code looks like using key-value store APIs (e.g. TokyoCabinet APIs). Most KVSs are faster than MySQL, and one of the reasons is that they don't need to do extra work such as building execution plans. But as shown above, it is possible to use MySQL like a fast KVS. You not only get better performance, but can also take advantage of traditional RDBMS features such as B+Tree indexes (range scans), full durability, and schema awareness. Data is stored in one place, and you get the same data regardless of the access method. For example, you can use a UDF for some special parts (e.g. picking up all friends' latest messages, which is difficult for MySQL to fetch quickly with plain SQL) and normal SQL statements for everything else. I assume such a two-way client interface (a fast API plus traditional SQL) can become a common implementation for next-generation RDBMSs.
Of course there are also disadvantages to using UDFs and the storage engine API directly. First, it depends heavily on the MySQL version. The storage engine API is standardized, but caller-side operations such as the table open/lock functions differ between MySQL versions. Second, mysqld easily crashes if the UDF has bugs. Third, this is not a common hack yet, so there is almost no documentation. For example, handling two or more tables requires additional code, but this is not documented (I still don't fully understand how to handle them safely).
But there are a lot of possibilities here, and this certainly attracts developers. For example, Kazuho-san published a very interesting UDF yesterday: mycached (a memcached server UDF). It listens on the memcached protocol, then accesses a table directly through the storage engine API (primary key lookups). According to his benchmark it is twice as fast as SELECT by primary key (when records are cached in the buffer pool). I assume this approach will work very well as a durable key-value store, for example for storing session data, last access timestamps, etc.
Wednesday, August 12, 2009
Great performance effect of fixing broken group commit
Yesterday InnoDB Plugin 1.0.4 was released by Innobase. This version contains one of the most important performance fixes: the fix for broken group commit. Since MySQL 5.0, InnoDB has broken group commit when used with the binary log (or with other transactional storage engines), even with innodb_support_xa=0. This was really serious because fsync() (called at transaction commit when innodb_flush_log_at_trx_commit=1) is very expensive. The initial bug report about this was submitted by Peter four years ago. Last year David Lutz submitted a bug report with a prototype patch. It is great news that this fix has finally been implemented in an official InnoDB release.
I ran a simple benchmark with mysqlslap, which can run concurrent inserts from multiple connections. The results and the exact command are as follows.

mysqlslap --concurrency=1,5,10,20,30,50 --iterations=1 --engine=innodb \
--auto-generate-sql --auto-generate-sql-load-type=write \
--number-of-queries=100000
H/W is a Sun Fire X4450, Intel Xeon X5560 Nehalem 2.80GHz * 16 cores, 12GB RAM, SAS HDD with write cache. The my.cnf configuration is as follows; log-bin is enabled.
[mysqld]
basedir=/usr/mysql5137
datadir=/data/mysql5137/data
ignore_builtin_innodb
plugin-load=innodb=ha_innodb.so;innodb_trx=ha_innodb.so;
innodb_locks=ha_innodb.so;innodb_lock_waits=ha_innodb.so;
innodb_cmp=ha_innodb.so;innodb_cmp_reset=ha_innodb.so;
innodb_cmpmem=ha_innodb.so;innodb_cmpmem_reset=ha_innodb.so
innodb_log_files_in_group=2
innodb_buffer_pool_size=2G
innodb_flush_method=O_DIRECT
innodb_log_file_size=512M
innodb_data_file_path=ibdata1:500M:autoextend
innodb_file_per_table
log-bin
table_cache=8192
Clearly, InnoDB Plugin 1.0.4 outperforms the built-in InnoDB (6.1 times faster at 30 connections with innodb_support_xa=1). The built-in InnoDB doesn't scale with the number of connections, but InnoDB Plugin 1.0.4 does. What makes the difference? The built-in InnoDB does the following at transaction commit.
pthread_mutex_lock(&prepare_commit_mutex)
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
writing into binlog
writing into InnoDB logfile for commit, then fsync
pthread_mutex_unlock(&prepare_commit_mutex)
Within the critical section protected by prepare_commit_mutex, only one thread can operate at a time. So when 100 threads commit at the same time, fsync() is called 100 times for prepare and 100 times for commit (200 in total). Group commit is totally broken. As you can see in the graph above, innodb_support_xa=0 helps (though it still breaks group commit), but in general innodb_support_xa=0 is not recommended because it can break consistency between the binlog and InnoDB in case of a crash.
In InnoDB Plugin 1.0.4, the behavior has changed as follows.
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
pthread_mutex_lock(&prepare_commit_mutex)
writing into binlog
writing into InnoDB logfile for commit
pthread_mutex_unlock(&prepare_commit_mutex)
fsync to the InnoDB logfile
fsync(), the most expensive operation, is called outside the critical section, so group commit is possible and concurrency is much improved. The following graph shows how much Innodb_data_fsyncs increased while executing mysqlslap (committing 100,000 transactions).
[Graph: increase in Innodb_data_fsyncs during the mysqlslap run (100,000 transactions), by number of connections]
With 5.1.37 + built-in InnoDB (innodb_support_xa=1), 2 fsyncs happen per transaction, regardless of the number of concurrent connections. With 5.1.37 + built-in InnoDB (innodb_support_xa=0), 1 fsync happens per transaction, again regardless of the number of connections. Both cases mean group commit is broken. In both cases about 10,000 fsyncs were executed per second, which seems to be the upper limit for a regular HDD with a battery-backed write cache. On the other hand, InnoDB Plugin 1.0.4 greatly reduces the number of fsyncs (e.g. from 200,251 to 26,092 at 30 connections with innodb_support_xa=1, an 87% decrease). This shows group commit works well.
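To put those numbers in perspective, the arithmetic works out roughly as follows (derived from the measurements above):
100,000 transactions * 2 fsyncs (prepare + commit) = ~200,000 fsyncs   (200,251 observed with the built-in InnoDB, xa=1)
~10,000 fsyncs/sec ceiling / 2 fsyncs per commit   = ~5,000 commits/sec, no matter how many connections
200,251 / 26,092 = ~7.7, i.e. with group commit each fsync covers the prepare/commit records of roughly 8 transactions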
Write ordering between the binlog and the InnoDB log file is still guaranteed. The write ordering of InnoDB prepares is not the same as the binlog ordering, but this is fine: prepared entries are used only for recovery and are not visible to applications. During crash recovery, mysqld first reads the binlog (picking up xids), then checks the prepared-but-not-committed entries (xids) in the InnoDB log file, and then applies those entries in binlog xid order. So in the end write ordering is guaranteed.
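To make that ordering concrete, here is a conceptual sketch of the recovery flow written as standalone C++. This is my own illustration, not actual server code; in the real server the commit/rollback steps go through each storage engine's XA callbacks (commit_by_xid / rollback_by_xid), and the stub functions below only stand in for them.
#include <cstdio>
#include <vector>
#include <set>

typedef unsigned long long xid_t;

/* Illustrative stand-ins for the engine's XA commit/rollback callbacks */
static void innodb_commit_by_xid(xid_t xid)   { std::printf("commit   xid=%llu\n", xid); }
static void innodb_rollback_by_xid(xid_t xid) { std::printf("rollback xid=%llu\n", xid); }

/* binlog_xids:   xids read from the last binlog, in binlog write order
   prepared_xids: xids found prepared (but not committed) in the InnoDB log */
static void recover(const std::vector<xid_t> &binlog_xids,
                    std::set<xid_t> prepared_xids)
{
  /* Commit, in binlog order, every prepared transaction that reached the binlog */
  for (size_t i= 0; i < binlog_xids.size(); i++)
  {
    std::set<xid_t>::iterator it= prepared_xids.find(binlog_xids[i]);
    if (it != prepared_xids.end())
    {
      innodb_commit_by_xid(*it);
      prepared_xids.erase(it);
    }
  }
  /* Anything still prepared never made it into the binlog, so roll it back */
  for (std::set<xid_t>::iterator it= prepared_xids.begin();
       it != prepared_xids.end(); ++it)
    innodb_rollback_by_xid(*it);
}

int main()
{
  xid_t binlog[]= { 11, 12, 13 };
  std::vector<xid_t> binlog_xids(binlog, binlog + 3);
  std::set<xid_t> prepared;
  prepared.insert(12); prepared.insert(13); prepared.insert(14);  /* 14 never reached the binlog */
  recover(binlog_xids, prepared);       /* commits 12 and 13, rolls back 14 */
  return 0;
}
Prepared transactions that never reached the binlog are rolled back, which is why the ordering of the prepare fsyncs relative to the binlog does not matter.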
Note that if you set sync_binlog=1, commits are still very slow, because writing to the binlog is serialized by mutexes (prepare_commit_mutex and LOCK_log).
This is my understanding. InnoDB Plugin 1.0.4 also has other awesome features (e.g. multiple i/o threads), so it's worth trying.
Tuesday, July 28, 2009
iostat: (r/s + w/s) * svctm = %util on Linux
iostat -x is very useful for checking disk i/o activity. It is sometimes said that you should "check that %util is less than 100%" or "check that svctm is less than 50ms", but please do not fully trust these numbers. For example, the following two cases (DBT-2 load on MySQL) used the same disks (two HDDs, RAID1) and both reached almost 100% util, but the performance numbers were very different (no.2 was about twice as fast as no.1).
no.1:
# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
21.16 0.00 6.14 29.77 0.00 42.93
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
sdb 2.60 389.01 283.12 47.35 4.86 2.19
avgrq-sz avgqu-sz await svctm %util
43.67 4.89 14.76 3.02 99.83
no.2:
# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
40.03 0.00 16.51 16.52 0.00 26.94
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
sdb 6.39 368.53 543.06 490.41 6.71 3.90
avgrq-sz avgqu-sz await svctm %util
21.02 3.29 3.20 0.90 92.66
100% util does not mean the disks cannot go any faster. For example, command queuing (TCQ/NCQ) or a battery-backed write cache can often boost performance significantly. For random-i/o-oriented applications (most cases), I pay attention to r/s and w/s: r/s is the number of read requests issued to the device per second, and w/s is the number of write requests issued to the device per second (copied from the man page). r/s + w/s is the total number of i/o requests per second (IOPS), so it is easier to check whether the disks work as expected. For example, a few thousand IOPS can be expected from a single Intel SSD. For sequential i/o, r/s and w/s can be significantly affected by Linux parameters such as max_sectors_kb even when throughput is unchanged, so in that case I check other iostat columns such as rrqm/s and rMB/s.
What about svctm? Actually, Linux's iostat calculates svctm automatically from r/s, w/s, and %util. Here is an excerpt from iostat.c.
...
nr_ios = sdev.rd_ios + sdev.wr_ios;
tput = ((double) nr_ios) * HZ / itv;
util = ((double) sdev.tot_ticks) / itv * HZ;
svctm = tput ? util / tput : 0.0;
...
/* rrq/s wrq/s r/s w/s rsec wsec rkB wkB rqsz qusz await svctm %util */
printf(" %6.2f %6.2f %5.2f %5.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %6.2f %6.2f\n",
((double) sdev.rd_merges) / itv * HZ,
((double) sdev.wr_merges) / itv * HZ,
((double) sdev.rd_ios) / itv * HZ,
((double) sdev.wr_ios) / itv * HZ,
...
The latter (the printf arguments for r/s and w/s) means the following:
r/s = sdev.rd_ios / itv * HZ
w/s = sdev.wr_ios / itv * HZ
The former (the svctm calculation) therefore means the following:
svctm = util / ((sdev.rd_ios + sdev.wr_ios) * HZ / itv)
If %util is 100%, svctm is just 1 / (r/s + w/s) seconds, or 1000 / (r/s + w/s) milliseconds. This is the inverse of IOPS. In other words, svctm * (r/s + w/s) is always 1000 when %util is 100%. So checking svctm is practically the same as checking r/s and w/s (as long as %util is close to 100%). The latter (IOPS) is much easier, isn't it?
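As a quick sanity check with the numbers from the two cases above, the same relationship (svctm = 10 * %util / (r/s + w/s) milliseconds) also holds below 100% util:
no.1: 10 * 99.83 / (283.12 + 47.35)  = 3.02 ms   (matches svctm 3.02)
no.2: 10 * 92.66 / (543.06 + 490.41) = 0.90 ms   (matches svctm 0.90)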