Thursday, July 4, 2013

Sharding (refresher)

Scaling is an eventuality with any growing business. It comes in many forms. For the online business, database scaling is the first big technical hurdle.

The new generation of webapps demand a near-persistent connection with the database, doing multiple read/writes over a single session; the more LIVE data you throw, the longer "they" stick with you. Couple that with the "viral" aspect, and you know that soon as your app gets noticed, the database demands are gonna surge. Databases are the foremost when it comes to contingency planning.

DBs are moving to the cloud, much like our individual data. Clouds, for all the good they offer, aren't strong when it comes to sustained I/O ops.
A 0.9GB database takes 1 min to load.
A 30GB database takes 10 days to load...

Another aspect is that the newer-gen NoSQL DBs are single threaded. Which means the CPU is only handling a single op at a time. In case of a lock, it could take a long time to resolve. Everyone (communicating with the DB) takes the brunt.

Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is the number of rows in each table is reduced (this reduces index size, thus improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

Another place sharding can be used is to reduce contention on data entities. It is especially important when building scalable systems to watch out for those piece of data that are written often because they are always the bottleneck. A good solution is to shard off that specific entity and write to multile copies, then read the total. An example of this "sharded counter wrt GAE: http://code.google.com/appengine/articles/sharding_counters.html

If you have queries to a DBMS for which the locality is quite restricted (say, a user only fires selects with a 'where username = $my_username') it makes sense to put all the usernames starting with A-N on one server and all from M-Z on the other. By this you get near linear scaling for some queries.
Resources
[1] http://en.wikipedia.org/wiki/Shard_(database_architecture)
[2] http://stackoverflow.com/questions/992988/what-is-sharding-and-why-is-it-important
[3] http://www.slideshare.net/rightscale/rightscale-webinar-scaling-your-database-in-the-cloud
[4] https://vimeo.com/32541189