Friday, May 20, 2011

Proper handling of insert-mostly, select-recently datasets

Some kinds of large tables, such as chat messages and blog entries, have the following characteristics.

* huge number of records, huge data and index size
* insert and select mostly
* select from only recent data
* select by secondary index (i.e. user_id)
* secondary index entries are inserted in random order

What are optimal ways to handle these tables? The single large table below does not perform well.
CREATE TABLE message (
message_id BIGINT UNSIGNED PRIMARY KEY,
user_id INT UNSIGNED,
body VARCHAR(255),
...
created DATETIME,
INDEX(user_id)
) ENGINE=InnoDB;

The cause of poor performance is the secondary index on user_id. user_id values are inserted in random order, so the index grows and sooner or later exceeds RAM size. Once the user_id index no longer fits in RAM, inserting into the message table causes massive random read disk i/o, which reduces throughput significantly.
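
For reference, a rough way to check whether the user_id index still fits in memory is to compare the table's secondary index size with the InnoDB buffer pool size. This is just a quick sketch and assumes the message table is the main consumer of the buffer pool:

SELECT data_length, index_length
  FROM information_schema.tables
 WHERE table_schema = DATABASE() AND table_name = 'message';
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';

Once index_length approaches innodb_buffer_pool_size, insert throughput is likely to start dropping.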
Below is a simple insert benchmark. Once random read disk i/o starts happening, throughput drops sharply. "Sequential order" means index entries are inserted sequentially; "Random order" means they are inserted randomly.



This figure is from my presentation at the MySQL Conference and Expo 2009. It's pretty old, but the basic principles have not changed.

Using Range Partitioning


How can we make it faster? One of the best approaches in MySQL is range partitioning, partitioned by date or by primary key. This is one of my favorite features in MySQL. With range partitioning, only the latest partition is actively accessed. Data and indexes in the other partitions are accessed much less, so they don't occupy the buffer pool. Each partition can be small enough to fit in memory, so insert performance does not drop.



CREATE TABLE message (
message_id BIGINT UNSIGNED,
user_id INT UNSIGNED,
body VARCHAR(255),
...
created DATETIME,
INDEX(message_id),
INDEX(user_id)
) ENGINE=InnoDB
PARTITION BY RANGE(to_days(created)) (
PARTITION p201103 VALUES LESS THAN (to_days('2011-03-01')),
PARTITION p201104 VALUES LESS THAN (to_days('2011-04-01')),
PARTITION p201105 VALUES LESS THAN (to_days('2011-05-01')),
PARTITION p201106 VALUES LESS THAN (to_days('2011-06-01')),
PARTITION p201107 VALUES LESS THAN (to_days('2011-07-01')),
...
);


As long as INSERT statements insert rows in partition-key order and other SQL statements fetch only recent data, no random disk reads happen. Partitioning itself has some CPU overhead, but it is almost negligible in real workloads compared to disk i/o overhead.
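
For example, a query that fetches a user's recent messages and filters on the created column touches only the latest partition(s). EXPLAIN PARTITIONS (available in MySQL 5.1+) shows which partitions would be read; the user_id value below is just a placeholder:

EXPLAIN PARTITIONS
SELECT * FROM message
 WHERE user_id = 12345
   AND created >= '2011-05-01'
 ORDER BY created DESC LIMIT 20;

The partitions column should list only p201105 and the still-empty later partitions, so the older partitions are never read.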

(update:) Index and data size of each partition can be measured from the information schema.
mysql> SELECT partition_name, index_length, data_length, table_rows FROM 
information_schema.partitions WHERE table_name='message';
+----------------+--------------+-------------+------------+
| partition_name | index_length | data_length | table_rows |
+----------------+--------------+-------------+------------+
| p201103        |  15565062144 | 15527313408 |  145146231 |
| p201104        |  15522070528 | 15507390464 |  205873280 |
| p201105        |   9736028160 |  9945743360 |   88653190 |
| p201106        |        32768 |       16384 |          0 |
+----------------+--------------+-------------+------------+
6 rows in set (0.13 sec)

In MySQL 5.0 or earlier versions, where range partitioning is not supported, creating daily/weekly/monthly tables is a good workaround, though applications have to be aware of the table name differences.

CREATE TABLE message_201103 ..
CREATE TABLE message_201104 ..
CREATE TABLE message_201105 ..
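
For illustration, a sketch of how the application would route statements to the month-specific tables (the values are placeholders):

-- Inserts go to the table for the message's creation month
INSERT INTO message_201105 (message_id, user_id, body, created)
VALUES (1234567, 42, 'hello', NOW());

-- A query spanning months has to UNION the relevant tables explicitly
(SELECT * FROM message_201105 WHERE user_id = 42)
UNION ALL
(SELECT * FROM message_201104 WHERE user_id = 42)
ORDER BY created DESC LIMIT 20;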

How about Database Sharding?


As you know, database sharding is a very common approach for handling huge data. Is sharding good for handling these tables? Probably not. Database sharding is mainly used to reduce slow disk i/o by reducing data size per server. In the above case, inserts can be done in memory regardless of total data size (10,000+ inserts/second), so from a performance point of view splitting tables is not needed as long as applications can keep up with the in-memory insert speed. From a disk capacity point of view, though, you will sooner or later need to archive or purge older data.
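
For purging, dropping an old partition is far cheaper than a DELETE that scans millions of rows, and adding next month's partition ahead of time keeps inserts flowing. A sketch with placeholder partition names:

-- Drop the oldest month (near-instant; no row-by-row deletes)
ALTER TABLE message DROP PARTITION p201103;

-- Add a partition for the next month before it starts
ALTER TABLE message ADD PARTITION
(PARTITION p201108 VALUES LESS THAN (to_days('2011-08-01')));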

Actually, I have seen a couple of cases where people use NoSQL databases that support transparent sharding (auto-sharding), such as MongoDB or Cassandra, for these kinds of tables. Data size will sooner or later exceed the disk size of a single server, so using a horizontally scalable database sounds reasonable. But if the database product doesn't support range partitioning, sharding becomes much less optimal for handling these tables. Suppose you have a 3000GB dataset and only the most recent 30GB is frequently accessed. With MySQL 5.1+ range partitioning, you can simply manage one large (3000GB) table with weekly or daily partitions (as long as disk space is available). Only the latest partitions (30GB) are actively accessed; the remaining partitions (2970GB) are rarely touched. A single commodity database server can probably handle the workload.

On the other hand, if you shard the 3000GB database without range partitioning support, you might need 10 shards of 300GB each, because the entire secondary indexes are accessed and each shard's index has to fit in memory. This means you need 10 times more servers.

Transparent sharding is good for application developers, but I believe range partitioning is a mandatory feature for handling insert-mostly, select-recently huge datasets.
