Wednesday, August 26, 2009

Accessing MySQL tables from UDF + storage engine API

A couple of weeks ago, at a MySQL study group in Tokyo, I presented various MySQL hacking techniques, such as debugging with MS Visual Studio or gdb, tracing with DTrace, writing information schema plugins, accessing tables from UDFs and the storage engine API, and extending the MySQL server code itself. The 90-minute session went well. Nearly 100 people attended, including well-known MySQL developers such as Kazuho Oku-san and Kentoku Shiba-san; we had productive discussions, and quite a few people seemed interested in MySQL hacking. The slides are written in Japanese, but the sample code is understandable on its own and can be downloaded here.
What attendees were most interested in was accessing MySQL tables directly from plugins (currently UDFs) and the storage engine API. I showed a sample UDF, based on code Kazuho-san released over a year ago, which is over 10 times faster than SQL or stored procedures for complex operations. Excerpts are as follows (error handling skipped).

#define MYSQL_SERVER 1  /* expose server-internal structures */
#include "mysql_priv.h"

/* initializing tables, fields and indexes */
THD *thd = current_thd;
TABLE_LIST tables;
TABLE *table;
tables.init_one_table("test", "message", TL_READ);
simple_open_n_lock_tables(thd, &tables);
table= tables.table;
/* get_field() and index_init() are helper functions from the sample code */
Field *message_id_fld = get_field(table, "id");
Field *message_user_id_fld = get_field(table, "user_id");
KEY *message_key = index_init(table, "user_id", true);

/* accessing the table; the equivalent query is:
   SELECT id FROM message WHERE user_id=? ORDER BY id DESC LIMIT 1 */
uchar *key_buff= (uchar*) thd->alloc(message_key->key_length);
bzero(key_buff, message_key->key_length);
int null_offset= message_user_id_fld->null_bit; /* skip the null-indicator byte */
int4store(key_buff + null_offset, user_id);
err= table->file->index_read_last_map(table->record[0], key_buff, 1); /* keypart_map=1: use the first key part */
fprintf(stderr, "id=%lld\n", message_id_fld->val_int());

/* closing objects */

By defining MYSQL_SERVER as 1, you can access MySQL internal structures.
You can build the sample UDF in a development environment as follows. The MySQL source code (built with the debug configuration) is required when accessing internal structures.

g++ -g -DSAFE_MUTEX -DSAFEMALLOC \
-Wall -fno-rtti -fno-exceptions -fPIC -shared \
-I/debug-built-mysql-src/include \
-I/debug-built-mysql-src/regex \
-I/debug-built-mysql-src/sql \
-o udf_sample.so udf_sample.cc

In a production environment, skip -DSAFE_MUTEX and -DSAFEMALLOC, and include headers from a MySQL source tree built with the production configuration. When using macros (e.g. current_thd) that behave differently between debug and production configurations, you need to include the matching source tree.

g++ -g \
-Wall -fno-rtti -fno-exceptions -fPIC -shared \
-I/prod-built-mysql-src/include \
-I/prod-built-mysql-src/regex \
-I/prod-built-mysql-src/sql \
-o udf_sample.so udf_sample.cc

This snippet skips error handling for legibility, but you need to handle errors carefully, otherwise mysqld crashes. Please check bench_udf.cc, build_udf_debug.sh, and build_udf_prod.sh in my sample code.

MySQL is fast, but it still performs many expensive CPU operations, such as parsing SQL statements, building an execution plan every time (MySQL has no optimizer-plan cache), and a lot of memcpy/malloc calls. If you write a UDF that accesses tables directly through the storage engine API, these expensive operations can be skipped, so it can be much faster than SQL statements in many cases.
The above code looks like a key-value store API (e.g. the Tokyo Cabinet API). Most KVSs are faster than MySQL, and one reason is that they skip extra work such as building execution plans. But as shown above, it is possible to use MySQL like a fast KVS. You not only get better performance, but also keep traditional RDBMS strengths such as B+Tree indexes (range scans), full durability, and schema awareness. Data is stored in one place, and you get the same data regardless of the access method. For example, you can use a UDF for special cases (e.g. picking up all friends' latest messages, which is difficult for MySQL to fetch quickly) and normal SQL statements for everything else. I expect such two-way client interfaces (a fast API plus traditional SQL) to become a common design for next-generation RDBMSs.
Of course there are disadvantages to using UDFs and the storage engine API directly. First, the code depends heavily on the MySQL version. The storage engine API is standardized, but client-side operations such as table open/lock calls differ between MySQL versions. Second, mysqld easily crashes if the UDF has bugs. Third, these are not yet common hacks, so there is almost no documentation. For example, handling two or more tables requires additional code, but this is not documented (I do not yet fully understand how to handle them safely).
But there are a lot of possibilities here, and this certainly attracts developers. For example, Kazuho-san yesterday published a very interesting UDF: mycached (a memcached server UDF). It speaks the memcached protocol and accesses a table directly through the storage engine API (primary key lookups). According to his benchmark it is twice as fast as a SELECT with a primary key lookup (when records are cached in the buffer pool). I expect this approach to work very well as a durable key-value store, for example for storing session data or last-access timestamps.

Wednesday, August 12, 2009

Great performance effect of fixing broken group commit

Yesterday InnoDB Plugin 1.0.4 was released by Innobase. This version contains one of the most important performance fixes: the fix for broken group commit. Since MySQL 5.0, InnoDB has broken group commit when used with the binary log (or with other transactional storage engines), even when innodb_support_xa=0 is set. This was really serious because fsync() (called at transaction commit when innodb_flush_log_at_trx_commit=1) is very expensive. The initial bug report was submitted by Peter four years ago. Last year David Lutz submitted a bug report with a prototype patch. It is great news that this fix has finally been implemented in the official InnoDB release.
I ran a simple benchmark with mysqlslap, which can run concurrent inserts from multiple connections. The result is as follows.
mysqlslap --concurrency=1,5,10,20,30,50 --iterations=1 --engine=innodb \
--auto-generate-sql --auto-generate-sql-load-type=write \

H/W was a Sun Fire X4450: Intel Xeon X5560 (Nehalem) 2.80GHz, 16 cores, 12GB RAM, and SAS HDDs with a write cache. The my.cnf configuration is as follows; log-bin is enabled.


Clearly InnoDB Plugin 1.0.4 outperforms the built-in InnoDB (6.1 times faster at 30 connections with innodb_support_xa=1). The built-in InnoDB does not scale with connections, but InnoDB Plugin 1.0.4 does. What is the difference? The built-in InnoDB does the following at transaction commit.
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
writing into binlog
writing into InnoDB logfile for commit, then fsync

Inside the critical section protected by prepare_commit_mutex, only one thread can proceed at a time. So when 100 threads commit at the same time, fsync() is called 100 times for prepare and 100 times for commit (200 in total). Group commit is totally broken. As the graph above shows, innodb_support_xa=0 helps (though group commit is still broken), but in general innodb_support_xa=0 is not recommended because it can break consistency between the binlog and InnoDB after a crash.
In InnoDB Plugin 1.0.4, the behavior has changed as follows.
writing into InnoDB logfile for prepare, then fsync
(skipped if innodb_support_xa=0)
writing into binlog
writing into InnoDB logfile for commit
fsync to the InnoDB logfile

fsync(), the most expensive operation, is called outside the critical section, so group commit is possible and concurrency is much improved. The following graph shows how much Innodb_data_fsyncs increased after running mysqlslap (committing 100,000 transactions).

With 5.1.37 + built-in InnoDB (innodb_support_xa=1), 2 fsyncs happen per transaction, regardless of the number of concurrent connections. With innodb_support_xa=0, 1 fsync happens per transaction, again regardless of concurrency. Both mean group commit is broken. In both cases about 10,000 fsyncs were executed per second, which seems to be the upper limit for a regular HDD with a BBU write cache. On the other hand, InnoDB Plugin 1.0.4 greatly reduces the number of fsyncs (e.g. from 200,251 to 26,092 at 30 connections with innodb_support_xa=1: an 87% decrease). This shows group commit works well.

Write ordering between the binlog and the InnoDB logfile is still guaranteed. The write ordering of InnoDB prepares is not the same as the binlog ordering, but this is fine: prepared entries are used only for recovery and are not visible to applications. During crash recovery, mysqld first reads the binlog (picking up xids), then checks for prepared-but-not-committed entries (xids) in the InnoDB logfile, then applies those entries in binlog xid order. So in the end write ordering is guaranteed.
Note that if you set sync_binlog=1, commits are still very slow because writing to the binlog is protected by mutexes (prepare_commit_mutex and LOCK_log).
This is my understanding. InnoDB Plugin 1.0.4 also has other great features (e.g. multiple I/O threads), so it is worth trying.