By Lin Ma, Carnegie Mellon University; Joy Arulraj, Carnegie Mellon University; Sam Zhao, Brown University; Andrew Pavlo, Carnegie Mellon University; Subramanya R. Dulloor, Intel Labs; Michael J. Giardino, Georgia Institute of Technology; Jeff Parkhurst, Jason L. Gardner, and Kshitij Doshi, Intel Labs; and Stanley Zdonik, Brown University
In-memory database management systems (DBMSs) outperform disk-oriented systems for online transaction processing (OLTP) workloads because in-memory DBMSs eliminate legacy components that inhibit performance, such as buffer pool management and concurrency control. But this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system.
To overcome this limitation, several approaches have been developed that allow in-memory DBMSs to support larger-than-memory databases without sacrificing their performance advantage over disk-oriented systems. The crux of these approaches is that they rely on the skewed access patterns exhibited by OLTP workloads, where certain "hot" tuples are accessed far more frequently than the remaining "cold" tuples.
For example, on a website like eBay, users frequently check and bid on an auction as it nears its end. After the auction closes, its data is rarely accessed and almost never updated. If that data is moved to cheaper secondary storage, the system can still deliver high performance for transactions that operate on hot in-memory tuples while retaining the ability to access the cold data later if needed.
Although there have been several implementations proposed for this type of cold data storage, there has not been a thorough evaluation of the design decisions in implementing this approach, such as policies for when to evict tuples and how to bring them back when they are needed. These choices are further complicated by the varying performance characteristics of different storage devices, including future non-volatile memory technologies. The DBMS cannot fully leverage the properties of modern storage hardware if these policies are not chosen correctly.
We explore these issues in a new paper and discuss several approaches to solve them.
We implemented all of these approaches in an in-memory DBMS and evaluated them using five different storage technologies: (1) HDD, (2) shingled magnetic recording (SMR) HDD, (3) NAND-based SSD, (4) an emulator for 3D XPoint-like technologies, and (5) byte-addressable NVRAM. Our results show that choosing the best strategy based on the hardware improves throughput by 92%–340% over a generic configuration.
We identify three policies that are tightly coupled to the storage technology used for the DBMS’s cold-data storage. We analyze how the characteristics of the hardware device relate to each of these policies.
The first policy is how the DBMS should move tuples back into DRAM from secondary storage. One method is to abort the transaction that touches cold tuples, merge those tuples asynchronously into memory, and then restart the transaction. We call this method abort-and-restart (AR). It moves the overhead of reading the data off a transaction’s critical path, which is important if the device has a high read latency. The alternative is synchronous retrieval (SR), where the DBMS stalls a transaction that accesses evicted tuples while the data is brought back into memory. SR avoids the overhead of aborting and restarting the transaction, but retrieving data from secondary storage delays the execution of other transactions. This policy is ideal when using smaller eviction block sizes on devices with low latencies. We compared the two policies on the five storage devices listed above in the H-Store DBMS, using a 10 GB YCSB workload with 1.25 GB of available DRAM. The results show that the AR policy achieves the best performance for the HDD and SMR devices. For the SSD and the 3D XPoint emulator, the best performance of the two policies is similar. For NVRAM, the best policy is SR with the smallest block size.
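The difference between the two retrieval policies can be sketched as follows. This is a minimal illustration, not H-Store's actual implementation: the names (`read_tuple`, `COLD_STORE`, `TxnRestart`, the merge queue) are all hypothetical, and the device read cost is modeled with a simple sleep.

```python
import time

class TxnRestart(Exception):
    """Signals that the transaction must be aborted and re-queued (AR policy)."""

# Hypothetical cold store: evicted tuple id -> payload.
COLD_STORE = {"t1": b"cold-payload"}
DEVICE_READ_LATENCY_S = 0.0  # e.g., high for HDD, near-zero for NVRAM

def fetch_from_cold(tid):
    time.sleep(DEVICE_READ_LATENCY_S)  # models the device's read cost
    return COLD_STORE[tid]

def read_tuple(memory, tid, policy, merge_queue):
    """Resolve a tuple read under abort-and-restart (AR) or synchronous
    retrieval (SR)."""
    if tid in memory:
        return memory[tid]
    if policy == "AR":
        # AR: abort immediately; a background task fetches the tuple off
        # the critical path, after which the transaction is restarted.
        merge_queue.append(tid)
        raise TxnRestart(tid)
    # SR: stall this transaction while the tuple is fetched synchronously.
    memory[tid] = fetch_from_cold(tid)
    return memory[tid]
```

Under SR the caller simply blocks inside `read_tuple`; under AR the executor catches `TxnRestart`, lets an asynchronous worker drain the merge queue into `memory`, and then re-runs the transaction, which now finds the tuple in DRAM.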
Another important policy is where the DBMS should put tuples when they are brought back into memory. It could merge them into the regular table storage (i.e., the heap) immediately after a transaction accesses them so that future transactions can use them as well. This is important if the cost of reading from the device is high. But given that OLTP workloads are often skewed, it is likely that the data that was just retrieved is still cold and will soon be evicted again. This thrashing degrades the DBMS’s performance. One way to reduce this oscillation is to delay the merging of cold tuples: when a transaction accesses evicted tuples, the DBMS stores them in a temporary in-memory buffer instead of merging them back into the table. When that transaction finishes, the DBMS discards the buffer and reclaims the space. Our evaluation shows that by correctly setting the access-frequency threshold that controls when cold tuples are merged back into regular table storage, the eviction interval can be increased by up to 6x. A higher merging threshold forces the DBMS to perform more reads from secondary storage, so this setting benefits storage technologies with lower read costs.
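The delayed-merging idea can be illustrated with a small sketch. All names here (`ColdAccessBuffer`, `finish_txn`, the dict-based table and cold store) are illustrative assumptions, not H-Store's API; the point is only the threshold logic.

```python
from collections import Counter

class ColdAccessBuffer:
    """Sketch of delayed merging: a retrieved cold tuple is merged into
    the table only after it has been accessed `threshold` times; until
    then it lives in a per-transaction buffer discarded at transaction end."""

    def __init__(self, table, cold_store, threshold):
        self.table = table            # regular in-memory table storage
        self.cold_store = cold_store  # evicted tuples on secondary storage
        self.threshold = threshold
        self.counts = Counter()       # per-tuple access frequency
        self.txn_buffer = {}          # temporary per-transaction buffer

    def read(self, tid):
        if tid in self.table:
            return self.table[tid]
        self.counts[tid] += 1
        value = self.cold_store[tid]  # a read from secondary storage
        if self.counts[tid] >= self.threshold:
            # Accessed often enough: merge back into regular table storage.
            self.table[tid] = value
            del self.cold_store[tid]
        else:
            # Presumed still cold: keep it only for this transaction.
            self.txn_buffer[tid] = value
        return value

    def finish_txn(self):
        # Discard the buffer; still-cold tuples stay evicted.
        self.txn_buffer.clear()
```

With `threshold=1` this degenerates to immediate merging; raising the threshold trades extra secondary-storage reads for less thrashing, which is why it suits devices with cheap reads.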
The last policy is how the DBMS should access the cold-data storage. Up until now, we have assumed that the DBMS manages cold tuples in a block-oriented manner; that is, the system reads and writes data on the secondary storage device in batches that are written sequentially. This block-oriented model may be inappropriate for future NVRAM storage that supports byte-level operations. Instead of organizing tuples into blocks, an alternative approach is to map a portion of the DBMS’s address space to files on the storage device using the mmap system call and then move the cold data into the mapped region. In our experiments, adopting this approach with a file system designed for byte-addressable NVRAM (PMFS) increases the DBMS’s performance by up to 31%.
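The mmap-based access path can be sketched as follows. This is a simplified stand-in: an ordinary temporary file plays the role of a region on an NVRAM-backed file system such as PMFS, and the tuple layout is hypothetical.

```python
import mmap
import os
import tempfile

# Map a file (standing in for a region on an NVRAM-backed file system)
# into the process address space, then access cold tuples at byte
# granularity instead of going through a block-oriented read path.
path = os.path.join(tempfile.mkdtemp(), "cold_region")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)  # pre-size the cold-data region

fd = os.open(path, os.O_RDWR)
region = mmap.mmap(fd, 4096)  # MAP_SHARED by default on Unix

# "Evict" a tuple by copying its bytes into the mapped region...
tuple_bytes = b"tuple-42"
region[0:len(tuple_bytes)] = tuple_bytes

# ...and later read it back with a plain byte-level access.
assert region[0:8] == b"tuple-42"

region.close()
os.close(fd)
```

Because the mapping is shared, stores into `region` reach the backing file without an explicit block write, which is the property that byte-addressable NVRAM exploits.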
We also measured the performance of the DBMS using the best policy configuration for each storage device on three other OLTP workloads: Voter, TPC-C, and TATP. We compared each optimized configuration with a single “default” configuration from the original anti-caching paper for H-Store (see Figure 1). For the Voter workload, where transactions only insert tuples that are never read, the DBMS’s performance with the optimized and generic configurations is similar on all devices. In TPC-C, only 4% of the transactions access evicted tuples; the improvement over the generic configuration is 5%–36%. But for the TATP workload, which has many accesses to evicted tuples and smaller tuple sizes, tailoring the strategy to each storage technology improves throughput by up to 3x over the generic configuration.
We will present our paper on this work, “Larger-than-Memory Data Management on Modern Storage Hardware for In-Memory OLTP Database Systems” at the forthcoming DaMoN 2016 workshop on June 27, 2016.