Computers are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current databases were not designed for. In order to keep performance increasing, databases will have to adapt to this new reality and be redesigned to scale to large core counts.
In this work, a collaboration between MIT CSAIL and Carnegie Mellon, we scaled a transaction processing database to 1,000 cores and showed that current database technology is not prepared for the many-core era: it fails to scale as the number of cores increases.
We implemented seven concurrency control algorithms (see Table 1) on a main-memory database and, using computer simulations, scaled our system to 1024 cores. Our extensive analysis shows that all algorithms fail to scale to this magnitude, though for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art database management systems (DBMSs) suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from the ground up and is tightly coupled with the hardware.
Concurrency Control Schemes
Online Transaction Processing (OLTP) database systems support the part of an application that interacts with end-users in the form of transactions. For example, an OLTP system may process a new order, respond to a page request, or perform a financial transaction. Each transaction is the execution of a sequence of one or more operations (e.g., SQL queries) on a shared database to perform some higher-level function.
An OLTP DBMS is expected to maintain four properties for each transaction: (1) atomicity, (2) consistency, (3) isolation, and (4) durability, referred to by the ACID acronym. Concurrency control permits end-users to access a database in a multi-programmed fashion while preserving the illusion that each user is executing their transaction alone on a dedicated system. It essentially provides the atomicity, consistency, and isolation guarantees in the system.
In this work, we evaluated seven basic concurrency control schemes on our many-core system. They are the building blocks of the state-of-the-art concurrency control schemes found in most databases. Table 1 summarizes the seven schemes evaluated in this work.
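To make the locking-based schemes concrete, here is a minimal sketch of the NO_WAIT variant of two-phase locking in Python threads. The `Record` class and `no_wait_transfer` function are illustrative names, not the paper's implementation: a transaction tries to acquire every lock without blocking and aborts immediately on any conflict, which rules out deadlock by construction.

```python
import threading

class Record:
    """A database record guarded by a per-record lock (illustrative)."""
    def __init__(self, value=0):
        self.value = value
        self.lock = threading.Lock()

def no_wait_transfer(src, dst, amount):
    """NO_WAIT 2PL sketch: never block on a lock; abort on any conflict.
    Returns True on commit, False on abort (the caller may retry)."""
    # Growing phase: acquire all locks, refusing to wait (NO_WAIT).
    if not src.lock.acquire(blocking=False):
        return False                      # conflict -> immediate abort
    if not dst.lock.acquire(blocking=False):
        src.lock.release()                # undo partial lock acquisition
        return False
    try:
        src.value -= amount               # transaction body
        dst.value += amount
    finally:
        # Shrinking phase: release all locks at commit.
        dst.lock.release()
        src.lock.release()
    return True

a, b = Record(100), Record(0)
while not no_wait_transfer(a, b, 25):
    pass                                  # retry aborted transactions
print(a.value, b.value)                   # 75 25
```

Because a transaction never waits, there are no lock-wait cycles to detect; the price, as discussed below, is a high abort rate under contention.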
Since many-core chips with 1000 cores do not yet exist, we performed our analysis through computer simulations using Graphite, a multi-core simulator that can scale up to 1024 cores. The simulated architecture is a tiled chip multi-processor where each tile contains a small processor core and two levels of cache, with a 2D-mesh network-on-chip connecting the tiles, as shown in Figure 1.
For the DBMS engine, we implemented a simplified main-memory database that contains only the basic functionality needed for our experiments. The motivation for using a custom DBMS is two-fold. First, we want to ensure that no bottlenecks other than concurrency control exist in the database. This allows us to study the fundamentals of each scheme in isolation, without interference from unrelated DBMS features. Second, experimenting with a full-featured, state-of-the-art DBMS would be impractical due to the considerable slowdown of simulators (e.g., Graphite has an average slowdown of 10,000×). Using our simplified engine allows us to keep the experiments to reasonable running times.
We evaluate the performance of the different concurrency control schemes on the TPC-C benchmark, which is the current industry standard for evaluating the performance of OLTP systems. It consists of nine tables that simulate a warehouse-centric order processing application. For a concurrency control algorithm that requires data partitioning (i.e., H-STORE), TPC-C is partitioned based on the warehouse ID of the WAREHOUSE table. Only two (Payment and NewOrder) of the five transactions in TPC-C are modeled in our simulation. Since these two transactions comprise 88% of the total TPC-C workload, this is a good approximation.
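The warehouse-based partitioning can be sketched as a simple mapping from warehouse ID to partition. The modulo scheme and the function name below are hypothetical; the key property is that all rows belonging to one warehouse land on one partition, so a single-warehouse transaction runs without cross-partition coordination.

```python
def partition_for(warehouse_id, num_partitions):
    """Hypothetical H-STORE-style partitioning: co-locate every row
    keyed by the same TPC-C warehouse ID on a single partition."""
    return warehouse_id % num_partitions

# A transaction that only touches warehouse 5 executes entirely
# on partition 1 and needs no coordination with other partitions.
print([partition_for(w, 4) for w in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```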
Our results show that all seven concurrency control schemes fail to scale to a large number of cores, but for different reasons. Table 2 summarizes the findings for each of the different schemes. We identified two main bottlenecks to scalability: (1) conflict detection and (2) timestamp allocation. The conflict detection bottleneck affects both DL_DETECT and OCC. For DL_DETECT, detecting and resolving deadlocks is a burdensome operation that consumes most of the transaction’s processing time as the number of cores increases, even for relatively low contention levels. Similarly, OCC spends all its time in the validation phase, trying to detect conflicts between transactions. All other T/O based schemes suffer from the timestamp allocation bottleneck.
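The validation work that dominates OCC can be illustrated with a backward-validation sketch. The function and data layout below are assumptions for illustration: each transaction records the version of every item it read, and at commit time it checks that none of those items has since been overwritten by a committed writer. Under high core counts, this per-commit check is where OCC's time goes.

```python
def occ_validate(read_set, committed_versions):
    """OCC validation sketch (illustrative): commit only if every item
    in the read set still has the version observed during the read phase.
    read_set and committed_versions both map key -> version number."""
    return all(committed_versions[k] == v for k, v in read_set.items())

committed = {"x": 3, "y": 7}
rs = {"x": 3, "y": 7}          # versions observed while reading
assert occ_validate(rs, committed)       # no conflicting writer: commit

committed["y"] = 8             # a concurrent transaction committed to y
assert not occ_validate(rs, committed)   # validation fails: abort and retry
```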
For a large number of cores, threads spend a significant amount of time waiting to acquire a global lock in order to increment the timestamp counter. This is a fundamental bottleneck inherent in the way these algorithms work. The NO_WAIT scheme is the only algorithm that has no centralized bottleneck. This scheme performed well for workloads with low contention, but with the side effect of very high abort rates. For certain workloads the cost of aborting a transaction may be prohibitively high, in which case this scheme would have poor performance.
One possible solution to address these bottlenecks is to use custom hardware to perform these tasks more efficiently. For example, a hardware-based deadlock detection component could uncover a deadlock the moment it occurs by tracking accesses to memory locations. A timestamp allocator implemented in hardware could advance the global timestamp in just a few clock cycles, far faster than using mutexes in software. With either of these capabilities, concurrency control algorithms could become far more scalable, with the additional benefit of lower power consumption.