Optimizing Transaction Processing for In-Memory Databases on Large Multicore Machines

By Stephen Tu, MIT CSAIL

When people think of “big data” storage systems, they generally think of huge clusters with hundreds of machines working together to serve a large distributed datastore. To this end, lots of people in both academia and industry have focused on maximizing the performance of distributed databases (relational and non-relational) for online transaction processing (OLTP) workloads (see e.g., H-StoreVoltDB, Calvin).

However, less attention has been given to optimizing single node performance, which is equally as important:  with more-efficient single-node execution of the same workload, one can get away with fewer machines than before. Unfortunately, even though machines are becoming equipped with more processing cores and main memory, most database systems are simply not designed to take full advantage of the machine’s resources (see: e.g., “OLTP Through the Looking Glass,” Shore-MT).

Silo is a new in-memory database system we designed at MIT to address this problem.  On standard database benchmarks, Silo achieves excellent throughput and near linear scalability on a single 32-core machine. We will be presenting a paper on Silo at SOSP in November.

Challenges to Effective OLTP Execution on Multicore Machines

Generally speaking, there are two prevailing camps when it comes to how to execute OLTP on a modern multicore machine. One approach, taken by H-Store/VoltDB, Dora and PLP, is to use some form of data partitioning. This is a very simple idea: simply associate each core on a machine with its own subset of the database. In other words, treat a multicore machine with N cores as N distinct machines with a single core. To run a transaction, send the transaction to the core(s) from which it wants to read/modify data.

This has several advantages. In the H-Store/VoltDB design, because the execution engine for each partition is single-threaded, transactions that only touch one partition can run without any concurrency control. Furthermore, for a fixed database size, each individual partitioned index is smaller than an equivalent shared index. This can yield shorter access times and better cache locality. For these reasons, given perfectly partitionable workloads, a partitioned design yields the best performance.

However, real workloads are often not perfectly partitionable and instead exhibit skew. This makes sense when we think about real-world phenomenon. For instance, people shop at different times during the day, different items are popular at different times of the year, and interest on certain topics varies given current events. The implication for partitioned data stores is not ideal: if a partition suddenly becomes overloaded, the performance of the system cannot adjust, since partitioned systems generally do not offer much parallelism within a partition.

An additional complication is that transactions often need to access multiple partitions. In this case, partitioned systems generally must take out expensive partition locks, or undergo several rounds of communication in order to ensure serializability across partitions. Thus, the performance of partitioned datastores very much depends on how good the partitioning scheme is.

While there has been a lot of work in the database community on automated partitioning (see e.g., SchismAgrawal et al.), these approaches generally require that you know your workload well in advance, which is often hard to predict in practice. An alternative design to data partitioning is to use a shared database approach, such that each database worker has access to the entire database (much like the traditional RDBMS design). In this design, skew is much less of an issue, since any database worker can run any transaction. Furthermore, there is no need for cross-partition coordination, since, well, there aren’t any partitions! The tricky part, of course, is getting the concurrency control mechanism right so that database performance can indeed scale with the number of cores.

From work on multicore scalability (see e.g. Boyd-Wickizer et al.), we know that two guiding principles are necessary for achieving good scalability: (1) avoiding shared memory writes on operations that only perform reads and (2) avoiding global critical sections when possible, even if they are only a few instructions long. Notice that essentially every two-phase locking (2PL) protocol violates both these principles. The first is violated because 2PL requires taking out read locks (which involve shared memory writes to internal data bookkeeping structures) for records that are read. The second is violated because most 2PL implementations use global critical sections to assign globally unique transaction identifies (TIDs) which reflect the serial order of transactions. Thus, it is not surprising that most of the modern 2PL RDBMS implementations exhibit scalability bottlenecks even on workloads that logically have very little contention (see e.g. Shore-MT).

With Silo, our goal from the beginning was to deliver awesome performance for in-memory OLTP workloads, in terms of both raw throughput and scalability.

Silo: High Throughput + Scalability + Robust Performance

With Silo, our goal from the beginning was to deliver awesome performance for in-memory OLTP workloads, in terms of both raw throughput and scalability. We also did not want to sacrifice serializability. We were inspired by Masstree, a very scalable, high-throughput ordered index developed also at MIT. Masstree gave us a very high performance index data structure, but serializable transactions required much more than that.

Silo’s major contribution is a new serializable commit protocol optimized for multicores, based on optimistic concurrency control (OCC). The basic idea is quite simple. Transactions maintain read/write sets: the read set keeps track of the TID for each record that was read in the transaction, and the write set maintains all updates to records in the transaction (the database remains unmodified before commit). At commit time, each record in the read set is checked against the current version to detect concurrent modifications. If no record was modified, then it is safe for the transaction to commit because it observed the latest consistent view of the database (otherwise the transaction aborts). Committing a transaction involves installing all its modifications into the database, while aborting a transaction requires no further action.

What makes Silo’s OCC protocol different from other OCC protocols is that TID assignment is completely distributed; there is no global critical section. While this is great for scalability, this complicates durability, because TIDs in Silo do not necessarily obey the serial order of transactions. To solve this problem, we divide time into epochs, and design the protocol to ensure that epoch boundaries are consistent (that is, all TIDs include the epoch number of the system at commit time, and these epoch numbers are used to partially order TIDs). Our logging subsystem then logs and makes entire epochs durable at once by fsync-ing once per epoch, which is our form of group commit. The exact details of this protocol can be found in our upcoming SOSP paper.

To validate our design, we ran many variations of both the YCSB benchmark and the TPC-C benchmark. Both are very popular for benchmarking OLTP systems. Our major takeaways are that Silo achieves awesome raw throughput (~700k txns/sec on TPC-C on a single 32-core machine), near-linear scalability (per-core throughput at 32 cores is 91% per-core throughput at 8 cores on TPC-C), and very robust performance with skewed workloads when compared to a partitioned approach.

More details of these experiments can be found in the paper. We believe our results are very promising and show that serializable transactions and scalability can co-exist on multicore machines.

You can download the paper here.  Fork us on github here.

Author Bio:

Stephen Tu is a third-year graduate student in the database group at CSAIL at MIT. His current research adviser is Sam Madden. He was previously an undergraduate at the University of California, Berkeley, where he received dual degrees in computer science and mechanical engineering. At Berkeley, he was a member of the RADLab (now AMPLab).   His research interests are broadly in designing and implementing large-scale computer systems.


This entry was posted in Big Data Architecture, Databases and Analytics, DBMS, ISTC for Big Data Blog and tagged , , , , . Bookmark the permalink.

Leave A Reply

Your email address will not be published. Required fields are marked *

7 − one =