Friday, May 20, 2011

Proper handling of insert-mostly, select-recently datasets

Some kinds of large tables such as chat messages, blog entries, etc have the following characteristics.

* huge number of records, huge data and index size
* insert and select mostly
* select from only recent data
* select by secondary index (i.e. user_id)
* secondary index entries are inserted in random order

What are optimal ways to handle these tables? The below single large table does not perform well.
CREATE TABLE message (
message_id BINGINT UNSIGNED PRIMARY KEY,
user_id INT UNSIGNED,
body VARCHAR(255),
...
created DATETIME,
INDEX(user_id)
) ENGINE=InnoDB;

The cause of poor performance is a secondary index on user_id. user_id is inserted in random order. Index size grows, and sooner or later it will exceed RAM size. Once index size on user_id exceeds RAM size, inserting into message table causes massive random read disk i/o, which reduces throughput significantly.
The below is a simple insert benchmark. Once random read disk i/o starts happening, throughput drops hugely. "Sequential order" means index entries are inserted sequentially, "Random order" means randomly.



This figure was what I presented at the MySQL Conference and Expo 2009. It's pretty old, but basic principles have not changed.

Using Range Partitioning


How can we make it faster? One of the best approaches in MySQL is using range partitioning, partitioned by date or primary key. This is one of my favorite features in MySQL. By using range partitioning, only the latest partition is actively accessed. Data/indexes in the rest partitions are much less accessed so they don't occupy buffer pool. Each partition size can be small enough to fit in memory, so insert performance does not drop.



CREATE TABLE message (
message_id BIGINT UNSIGNED,
user_id INT UNSIGNED,
body VARCHAR(255),
...
created DATETIME,
INDEX(message_id)
INDEX(user_id)
) engine=InnoDB
PARTITION BY RANGE(to_days(d1)) (
PARTITION p201103 VALUES LESS THAN (to_days('2011-03-01')),
PARTITION p201104 VALUES LESS THAN (to_days('2011-04-01')),
PARTITION p201105 VALUES LESS THAN (to_days('2011-05-01')),
PARTITION p201106 VALUES LESS THAN (to_days('2011-06-01')),
PARTITION p201107 VALUES LESS THAN (to_days('2011-07-01')),
...
);


As long as INSERT statements do inserts order by partition key and other SQL statements fetch only the recent data, no random disk reads will happen. Partitioning itself has some CPU overheads, but it's almost negligible in the real workloads, compared to disk i/o overheads.

(update:) Index and data size on each partition can be measured from information schema.
mysql> SELECT partition_name, index_length, data_length, table_rows FROM 
information_schema.partitions WHERE table_name='message';
+----------------+--------------+-------------+------------+
| partition_name | index_length | data_length | table_rows |
+----------------+--------------+-------------+------------+
| p201103 | 15565062144 | 15527313408 | 145146231 |
| p201104 | 15522070528 | 15507390464 | 205873280 |
| p201105 | 9736028160 | 9945743360 | 88653190 |
| p201106 | 32768 | 16384 | 0 |
+----------------+--------------+-------------+------------+
6 rows in set (0.13 sec)

In MySQL 5.0 or earlier versions where range partitioning is not supported, creating daily/weekly/monthly tables is a good way as a workaround, though applications have to aware of table name differences.

CREATE TABLE message_201103 ..
CREATE TABLE message_201104 ..
CREATE TABLE message_201105 ..

How about Database Sharding?


As you know, database sharding is very common approach for handling huge data. Is sharding good for handling these tables? Probably not. Database sharding is mainly used to reduce slow disk i/o by reducing data size per server. In the above case, inserts can be done in memory regardless of data size(10,000+ inserts/second), so from performance point of view splitting tables is not needed as long as applications can keep up with in-memory insert speed. From database size vs storage size point of view (disk capacity point of view), you'll need to archive or purge older data.

Actually I have seen a couple of times that people use NoSQLs supporting transparent sharding(Auto-Sharding) for these kinds of tables: such as MongoDB, Cassandra. Data size will sooner or later exceed disk size, so using unlimited horizontally scaling database sounds reasonable. But if the database products don't support range partitioning, sharding becomes much less optimal for handling these tables. Suppose you have 3000GB datasets and only recent 30GB data are mostly accessed. With MySQL 5.1+ range partitioning, you can simply manage one large(3000GB) table with weekly/daily partitions (as long as disk space is available). Only the latest partitions (30GB) are actively accessed and the rest partitions (2970GB) are less likely accessed. Single commodity database server can probably handle enough workloads.

On the other hand, if you shard 3000GB database without range partitioning support, you might need 300GB x 10 shards because the whole secondary indexes are accessed. This means you need 10 times more servers.

Transparent sharding is good for application developers, but I believe range partitioning is a mandatory feature for handling insert-mostly, select-recently huge datasets.

39 comments:

Anonymous said...

>> PARTITION p201103 VALUES LESS THAN (to_days('2011-03-01')),

If your data is growing, so you need alter the table from time to time?

Yoshinori Matsunobu said...

You can prepare many partitions beforehand. And ALTER TABLE .. ADD PARTITION does not rebuild existing partitions so can finish quickly.

Casey said...

I have a large table of TEXT (20 million rows) that is accessed by primary key but 90% of reads are recent data (say the last 500K rows).

Is there an advantage to partitioning here? It seems like partitioning by ID range would might conserve buffer pool space although I'm not totally sure how to do that.

Martin Farach-Colton said...

@Yoshinori,

Fortunately the problem you describe has a direct solution: the TokuDB storage engine whose indexing performance stays strong at multi-terabyte+. (TokuDB is a fully MySQL compatible storage engine like InnoDB and MyISAM.)

Your post does a great job explaining the Achilles heel of OldSQL solutions that use B-tree indexing: They can't handle larger-than-memory indexing of non-sequential keys. Just to be clear, larger-than-memory performance collapse is not inherent in SQL databases but in B-tree databases. With TokuDB, you have full SQL support, with transactions and MVCC and the works, and still have great performance on big data.

Once you switch over your table to TokuDB - assuming you want to take it for a spin - it's important to take a look at your indexes. For example, in your case, INDEX(user_id) is implicitly INDEX(month, user_id), where the month is defined by the partitioning. Unfortunately, MySQL doesn't allow you to have a functional component in an index definition. (If it did, that would already work way better than partitioning, since 'month' as the first part of the key would make this index look a lot more sequential.) To get good performance, you'd need a different index than INDEX(user_id).

You didn't say what your queries were like, but assuming that you are doing equality tests on user_id you could define INDEX(user_id,d1). That should work well, and it would pose no indexing problem for TokuDB. Even better would be CLUSTERING INDEX(user_id,d1), because that index will cut out all the random I/O from your queries. No more slow lookups into the primary table. It's sequential I/O all the way.

Do you have a data set and set of queries that are available? We'd love to do a side-by-side comparison.

pachanga said...

Thanks for the post! BTW, what is the best way to measure the size of the index for InnoDB tables?

Yoshinori Matsunobu said...

@pachanga

Information schema has the statistics. I've added an example on the post.

pachanga said...

@Yoshinori

Thank you, just what I needed! And what do you think of usage of complex primary keys in InnoDB, e.g (player_id, id) ? According to your presentation, sequential primary keys is the best choice for inserts.

Anonymous said...

> From database size vs storage size point of view, you'll need to archive or purge older data.

What does "database size vs storage size" stand for ?

Yoshinori Matsunobu said...

@Anonymous

I intended disk capacity.

Yingkuan Liu said...

Hi Yoshinori,

I noticed the index size is bigger than data of this table. Is this table very heavily indexed? Given the table should have majority of data in body column which I believe you don't need to index. It's strange to see such big index size.

also the table size fluctuate a lot in relation to row count. April has 50% more rows but almost same size. May has half rows but size is a lot more than half. But I guess that probably related to your application change and feature change etc.

Anonymous said...

Typo:

message_id BINGINT UNSIGNED PRIMARY KEY,

BIGINT?

Yoshinori Matsunobu said...

@Yingkuan,

Index size and data size is just an example, so please don't take the numbers seriously. It is pretty common the index size gets bigger than data size if heavily indexed, but you're right data size should be much larger if body column is large, and highly depending on application changes.

Philip Zhong said...

@Yoshinori,

now the MYSQL does not support online partition drop;the index is global index,so if we use the MYSQL partition,we can't use it like oracle partition,can you provide a good idea?

domain hosting said...

I think that, you are expert in database management. I am not that much expert in database but I like database very much. Thanks for sharing this useful and detail information.

Baby ear infection said...

I would like to thank you for the efforts you have made in writing this article. Raw food diet Toenail fungus infection Common digestive disorders Common digestive system diseases and disorders

rs gold said...

Really impressed! Everything is very, very clear, open is a description of the problem. It contains the information.

Runescape Gold
MapleStory Mesos

wow gold said...

Just how much has this told her i Buy Tera Goldwould WOW Gold bring forth during the past? Work concerning the item earlier than you purchasing the point. BoEs may well change well inwards evaluate by server that will help you waiter, to’MT striving Crummy Aion Golden adequate how of import a to discover the industry unity’ray working with. Lay out conseverative with RS Gold the help of everyone of the payments. Wearable’metric ton run off and buying a lot of BoEs that were enough products only usually requires nearly a year to trade first earnings. BoEs ar parlous, high up wages pieces, gum olibanum an excellent make a motion can result in briny network,Inexpensive Aion Amber World Of Warcraft Gold only a tight relocation could make you recede riot golden.

Runescape Gold said...

This really is our first-time my spouse and i click here. I discovered numerous engaging products with your website, especially its debate. In the tons of feedback in your content, I assume I'm not really the only one having all the leisure below! Continue the nice work.

http://buyteraonline.com/
http://rsgoldfsbest.com/

tea length wedding dresses said...

Great article. I merely found your blog along with planned to declare i get genuinely liked looking at your blog blogposts. A reservoir of innovation. Reaped from the most popular designers and the new prominent. You are invited to be impressed with Amber’s selection of lace wedding gowns, by virtue of an incredible amount of self-motivated ideas that sparkling around. Have a digging-out on our span-new designer cheap wedding dresses plus size that are tagged into distinctive genre and style wedding dresses 2012.

Domain in India said...

This is my first time i visit here. I found so many interesting stuff in your blog especially its discussion. From the tons of comments on your articles, I guess I am not the only one having all the enjoyment here! keep up the good work.

ambersbridal said...

Thanks for the informative writing. Would mind updating some good tips about it. I still wait your next place.simple wedding dresses http://www.ambersbridal.com/wedding-dresses/simple-wedding-dresses.html
mermaid wedding dresses http://www.ambersbridal.com/wedding-dresses/mermaid-wedding-dresses.html
tea length wedding dresses http://www.ambersbridal.com/wedding-dresses/tea-length-wedding-dresses.html

Runescape Gold said...

Thank you for sharing. Glad to see you.It is really a good post.http://gamestop4u.com/

ambersbridal said...

A holy church, rings, a bouquet, a 3-floor wedding cake, lace wedding dress
champagne, and the moving promise “ I do”, make up the happiest moment in one’s life. Then it is the wedding.
lace wedding gowns
A gorgeous wedding dress has been inside a girl’s dream since she was five years old. It is widely said that the bride is the most beautiful woman all over the world.
Bridal gowns vary in different parts of the world. For example, cheap wedding dresses plus sizein traditional Chinese culture, the color red is regarded for centuries as the symbol of good luck; while in the Occident, wedding dresses 2012 a full-length white wedding dress is always associated with romance and sanctity. But, with the globalization, wedding dresses tend to be alike in the whole world.

Michael said...

MySql Hosting India

IndoUs Hosting is the leading MySql Hosting companies in India. The company provides dedicated as well as shared hosting services of MySql hosting. If you are looking for guaranteed MySql Hosting India, we can be the best option for you.

Anonymous said...

What was presented in this article has forced me to reconsider and redefine previously held viewpoints. This can only be a good thing. Kimberly@reverse phone lookup

GuildWars2Items said...

Would you take the pen in hand, plunge right in and just do it, struggling to keep up with the twists and turns of the torrents of words that take you where they take you Buy D3 Gold? Would your plans be so extensive that you never even got to the writing Cheap D3 Gold? But if you do decide to use it, what would you do with it? How would you play the game D3 Gold Sale?

GuildWars2Items said...

It's great to be great, but it's greater to be human Sell Runescape Gold, if I could rearrange the alphabet, I'd put buy guild wars 2 gold Y and I together guild wars 2 gold buy, A lifetime of happiness! No man alive could bear it; it would be hell on earth .

GuildWars2Items said...

Experience is a hard teacher because she gives the test first, the lesson afterwards cheap c9 gold, being happy doesn't mean that everything is perfect. It means that you've decided to look beyond the imperfections c9 money, Life is like a hot bath. It feels good while you're in it, but the longer you stay in, the more wrinkled you get WOW Power Leveling.

GuildWars2Items said...

You have a forever friend for life, and forever has no end D3 Gold Sale, and if you find such a friend, you feel happy and complete, because you need not worry Diablo iii Gold, your forever friend holds your hand and tells you that everything is going to be okay Buy Diablo iii Gold.

GuildWars2Items said...

And if you find such a friend, you feel happy and complete, because you need not worry Cheap Diablo iii Gold, Diablo iii Gold Sale your forever friend holds your hand and tells you that everything is going to be okay Diablo 3 Gold.

Maria Arredondo said...

Your future depends on your dreams. So go to sleep Diablo 3 Gold, he time of life is short ; to spend that shortness basely, it would be too long C9 Gold, It's great to be great, but it's greater to be human C9 Gold Sale. People who are serious about the relation are moody as they have devoted a lot that makes them worry about gains and losses.

Maria Arredondo said...

One meets its destiny on the road he takes to avoid it Cheap C9 Gold, If I could rearrange the alphabet, I'd put Y and I together c9 gold, People who are serious about the relation are moody as they have devoted a lot that makes them worry about gains and losses c9 online gold

Maria Arredondo said...

Suppose someone gave you a pen — a sealed, solid-colored pen.You could not see how much ink it had D3 Gold, think Positive Thoughts Every Day Buy D3 Gold, the brightest future will always be based on a forgotten past, you can't go on well in life until you let go of your past failures and heartaches Cheap D3 Gold.

Maria Arredondo said...

You have a forever friend for life, and forever has no end D3 Gold Sale, and if you find such a friend, you feel happy and complete, because you need not worry Diablo iii Gold, your forever friend holds your hand and tells you that everything is going to be okay Buy Diablo iii Gold.

dte said...

The consequences of today are determined by the actions of the past scarlet blade gold. To change your future, alter your decisions today scarlet blade gold, Experience is a hard teacher because she gives the test first, the lesson afterwards scarlet blade gold, but it takes character to keep you there.

buyGuildwars2gold said...

Empty what's full. Scratch where it itches. Don't go around saying the world owes you a living rs gold, The world owes you nothing. It was here first. Being happy doesn't mean that everything is perfect Runescape Gold, It means that you've decided to look beyond the imperfections. Do not pray for tasks equal to your powersrs gold.

Jess said...

They managed to provide a service for people and charge for money. People who do not want to spend time grind levels will need the cheap runescape gold service. You can take the chance to make a profit. You have another way to make money—buying and selling. You can buy items needed by other players and sell to them. They will pay more than the price you buy cheap ffxiv gil.

steve7876 said...

I am quite sure they will learn lots of new stuff here than anybody else. Thanks for sharing this informationSupplements in Australia

sadia sulaman said...

I just wanted to say that I found your site via Goolge and I am glad I did. Keep up the good work and I will make sure to bookmark you for when I have more free time away from the books. Thanks again!www.hastings-online.co.uk |

www.steppescaribbean.co.uk |

www.hawnbyhotel.co.uk |

www.partridgeplace.co.uk |

www.newelfin.co.uk |

www.subito2.co.uk |

www.kingsofbassline.co.uk |

www.beckyandneil.co.uk |

www.antichew.co.uk |

www.oldtimegrates.co.uk |

Post a Comment