In a collaboration between Carnegie Mellon University and Intel Labs, we explore the changes required in the logging and recovery algorithms of database management systems (DBMSs) running on non-volatile memory. The results of this work are described in our paper on “Write-Behind Logging” to be presented at VLDB’17 (August 28-September 1, 2017, Munich, Germany).
The design of the logging and recovery components of database systems has always been influenced by the difference in the performance characteristics of volatile (DRAM) and non-volatile storage devices (e.g., SSDs). The key assumption has been that non-volatile storage is much slower than DRAM and only supports block-oriented reads and writes. But the arrival of new non-volatile memory (NVM) storage that is almost as fast as DRAM and supports fine-grained reads and writes invalidates these previous design choices.
We make the case for a new logging and recovery protocol, called write-behind logging (WBL), that enables the DBMS to recover nearly instantaneously from system failures. The key idea is that the DBMS logs what parts of the database have changed rather than how it was changed. In contrast to the ubiquitous write-ahead logging (WAL) protocol, the DBMS directly flushes the changes to the database before recording them in the log when it employs the WBL protocol.
In our experiments, we showed that using the WBL protocol reduces system recovery time across different OLTP workloads by 100× in comparison to using the write-ahead logging protocol. Given this, we contend that it is better to employ logging and recovery algorithms that are designed for NVM.
To appreciate why WBL is better than WAL when using NVM, let’s look at how WAL is implemented in DBMSs.
The most well-known recovery method based on WAL is the ARIES protocol developed by IBM in the 1990s. ARIES is a physiological logging protocol where the DBMS combines a physical redo process with a logical undo process. During normal operations, the DBMS records transactions’ modifications in a durable log that it uses to restore the database after a crash.
The traditional WAL recovery algorithm (see Figure 1) comprises three phases: (1) analysis, (2) redo and (3) undo. In the analysis phase, the DBMS processes the log starting from the latest checkpoint to identify the transactions that were active at the time of failure and the modifications associated with those transactions. In the subsequent redo phase, the DBMS processes the log forward from the earliest log record that needs to be redone. Some of these log records could be from transactions that were active at the time of failure as identified by the analysis phase. During the final undo phase, the DBMS rolls back uncommitted transactions (i.e., transactions that were active at the time of failure) using the information recorded in the log.
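The three phases above can be sketched in a few lines of Python. This is a hypothetical illustration only: the record layout, field names, and the dictionary standing in for the database are our own assumptions, not ARIES's or Peloton's actual structures.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    kind: str          # assumed kinds: "BEGIN", "UPDATE", or "COMMIT"
    txn_id: int
    tuple_id: int = 0
    before: object = None   # physical before-image (used by undo)
    after: object = None    # physical after-image (used by redo)

def recover_wal(log, database, active_at_checkpoint=()):
    # 1. Analysis: scan forward from the checkpoint to find the
    #    transactions that were active at the time of failure.
    active = set(active_at_checkpoint)
    updates = []
    for rec in log:
        if rec.kind == "BEGIN":
            active.add(rec.txn_id)
        elif rec.kind == "COMMIT":
            active.discard(rec.txn_id)
        elif rec.kind == "UPDATE":
            updates.append(rec)
    # 2. Redo: reapply every logged change, including those made by
    #    transactions that never committed.
    for rec in updates:
        database[rec.tuple_id] = rec.after
    # 3. Undo: scan backward and roll back the uncommitted
    #    transactions using the before-images in the log.
    for rec in reversed(updates):
        if rec.txn_id in active:
            database[rec.tuple_id] = rec.before
    return database

# Txn 1 committed, so its update survives; txn 2 did not, so it is undone.
log = [LogRecord("BEGIN", 1),
       LogRecord("UPDATE", 1, 7, before="old", after="new"),
       LogRecord("COMMIT", 1),
       LogRecord("BEGIN", 2),
       LogRecord("UPDATE", 2, 8, before="x", after="y")]
db = recover_wal(log, {7: "old", 8: "x"})
```

Note that the redo phase briefly applies the loser transaction's change before the undo phase removes it, which is exactly the repeating-history behavior that makes full-log replay, and hence recovery time, proportional to the log's length.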
Although WAL supports efficient transaction processing when memory is volatile and durable storage cannot support fast random writes, it is inefficient for NVM storage. Consider a transaction that inserts a tuple into a table. The DBMS first records the tuple’s contents in the log, and it later propagates the change to the database. With NVM, the logging algorithm can avoid this unnecessary data duplication and thereby better support data-intensive applications.
We now describe the design of such an algorithm geared towards a DBMS running on a hybrid storage hierarchy comprising DRAM and NVM.
Write-behind logging leverages fast, byte-addressable NVM to reduce the amount of data that the DBMS records in the log when a transaction modifies the database. The reason why NVM enables a better logging protocol than WAL is three-fold. First, the write throughput of NVM is more than an order of magnitude higher than that of an SSD or HDD. Second, the gap between sequential and random write throughput of NVM is smaller than that of older storage technologies. Third, individual bytes in NVM can be accessed by the processor, and hence there is no need to organize tuples into pages or go through the I/O subsystem.
WBL reduces data duplication by flushing changes to the database in NVM during regular transaction processing. For example, when a transaction inserts a tuple into a table, the DBMS records the tuple’s contents in the database before it writes any associated meta-data in the log. Thus, the log is always (slightly) behind the contents of the database, but the DBMS can still restore it to the correct and consistent state after a restart.
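The write ordering described above can be sketched as follows. This is a simulation for illustration, assuming our own names throughout; the `persist` placeholder stands in for whatever cache-line write-back and fence primitive the hardware provides, and the dictionaries stand in for NVM-resident structures.

```python
nvm_table = {}   # simulated durable table heap on NVM
nvm_log = []     # simulated durable log on NVM

def persist(region):
    # Placeholder: on real NVM hardware this would flush the CPU
    # cache lines covering `region` and issue a fence.
    pass

def wbl_insert(tuple_id, contents):
    # 1. Write the tuple's contents directly into the durable table
    #    heap and make them persistent first.
    nvm_table[tuple_id] = contents
    persist(nvm_table)
    # 2. Only afterwards record the associated meta-data in the log,
    #    so the log is always slightly behind the database contents.
    nvm_log.append(("insert", tuple_id))
    persist(nvm_log)

wbl_insert(42, "hello")
```

The key invariant is the order of the two persist steps: if a crash happens between them, the database holds a dirty tuple with no log entry, which is precisely the case the commit timestamp gaps described below are designed to detect.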
WBL differs from WAL in many ways. Foremost is that the DBMS does not construct log records that contain tuple modifications at runtime. This is because the changes made by transactions are guaranteed to be already present on durable storage before they commit. Relaxing the ordering of writes to durable storage complicates WBL’s commit and recovery protocols. When the DBMS restarts after a failure, it needs to locate the modifications made by transactions that were active at the time of failure so that it can undo them. But these changes can reach durable storage even before the DBMS records the associated meta-data in the log. This is because the DBMS is unable to prevent the CPU from evicting data from its volatile caches to NVM. Consequently, the recovery algorithm must scan the entire database to identify the dirty modifications, which is prohibitively expensive and increases the recovery time.
The DBMS avoids this problem by recording meta-data about the clean and dirty modifications made to the database: it tracks two commit timestamps in the log. First, it records the timestamp cp of the latest committed transaction whose changes, along with those of all prior transactions, are safely persisted on durable storage. Second, it records a commit timestamp cd (where cp < cd) that the DBMS promises not to assign to any transaction before the subsequent group commit finishes. This ensures that any dirty modifications flushed to durable storage were made only by transactions whose commit timestamps are earlier than cd. When the DBMS restarts after a failure, it considers all transactions with commit timestamps earlier than cp as committed, and ignores the changes of transactions whose commit timestamps are later than cp and earlier than cd. In other words, if a tuple’s begin timestamp falls within the pair (cp, cd), then the DBMS’s transaction manager ensures that it is not visible to any transaction executed after recovery.
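One way this timestamp bookkeeping could look is sketched below; the class, the fixed window size, and the log record shape are all illustrative assumptions, not the paper's actual implementation.

```python
GROUP_COMMIT_WINDOW = 100   # assumed size of each timestamp window

class GroupCommitter:
    def __init__(self):
        self.cp = 0                        # latest persisted commit ts
        self.cd = GROUP_COMMIT_WINDOW      # promised upper bound
        self.log = []                      # durable WBL log (simulated)

    def assign_commit_ts(self, ts):
        # The DBMS promises never to hand out a timestamp at or above
        # cd before the next group commit finishes.
        assert self.cp < ts < self.cd
        return ts

    def group_commit(self, latest_committed_ts):
        # All changes by transactions with commit timestamps up to
        # latest_committed_ts are now safely persisted, so cp advances,
        # and a new promise cd is made for the next window.
        self.cp = latest_committed_ts
        self.cd = self.cp + GROUP_COMMIT_WINDOW
        self.log.append((self.cp, self.cd))   # the WBL log record

gc = GroupCommitter()
gc.assign_commit_ts(50)
gc.group_commit(50)
```

After a crash, any dirty tuple on NVM must carry a timestamp inside the last recorded (cp, cd) window, which is what lets recovery identify loser transactions without scanning a separate undo log.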
Before describing WBL’s recovery algorithm, we first introduce the notion of a commit timestamp gap: the range of timestamps defined by the pair (cp, cd). The DBMS must ignore the effects of transactions that fall within such a gap while determining tuple visibility. This is equivalent to undoing the effects of any transaction that was active at the time of failure. The set of commit timestamp gaps that the DBMS needs to track grows with every system failure. To limit the amount of work performed while determining the visibility of tuples, the DBMS’s garbage collector thread periodically scans the database to undo the dirty modifications associated with the currently present gaps. Once all the modifications in a gap have been removed by the garbage collector, the DBMS stops checking for that gap in tuple visibility checks and no longer records it in the log.
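A minimal sketch of such a gap-aware visibility check, using the (cp, cd) pairs defined in the text; the function signature and the snapshot-visibility rule at the end are simplifying assumptions for illustration.

```python
def is_visible(tuple_begin_ts, reader_ts, gaps):
    # A tuple whose begin timestamp falls inside any (cp, cd) gap was
    # written by a transaction that was active at the time of a failure,
    # so its effects are ignored: the logical equivalent of undo.
    for cp, cd in gaps:
        if cp < tuple_begin_ts < cd:
            return False
    # Otherwise, ordinary snapshot visibility applies (simplified here).
    return tuple_begin_ts <= reader_ts

gaps = [(101, 199), (301, 399)]
assert is_visible(100, 500, gaps)       # committed before the failure
assert not is_visible(150, 500, gaps)   # inside a gap: must be ignored
```

Because this check runs on every tuple access, shrinking the gap set via garbage collection directly reduces the per-read overhead, which is why the DBMS drops a gap as soon as its dirty versions are removed.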
The example in Figure 2 depicts a scenario where successive failures result in multiple commit timestamp gaps. At the end of the first group commit operation, there are no such gaps and the current commit timestamp is 101. The DBMS promises to not issue a commit timestamp higher than 199 in the time interval before the second commit. When the DBMS restarts after a system failure, it adds (101, 199) to its set of gaps. The garbage collector then starts cleaning up the effects of transactions that fall within this gap. Before it completes the scan, there is another system failure. The system then also adds (301, 399) to its gap set. Finally, when the garbage collector finishes cleaning up the effects of transactions that fall within these two gaps, it empties the set of gaps that the DBMS must check while determining the visibility of tuples.
With WBL, the DBMS does not need to periodically construct WAL-style physical checkpoints to speed up recovery. This is because each WBL log record contains all the information needed for recovery: the list of commit timestamp gaps and the commit timestamps of long-running transactions that span across a group commit operation. The DBMS only needs to retrieve this information during the analysis phase of the recovery process. It can safely remove all the log records located before the most recent log record. This ensures that the log’s size is always bounded.
As shown in Figure 3, the WBL recovery protocol consists only of an analysis phase. During this phase, the DBMS scans the log backward to the most recent log record to determine the currently present commit timestamp gaps and the timestamps of long-running transactions. There is no need for a redo phase because all the modifications of committed transactions are already present in the database. WBL also does not require a WAL-style undo phase. Instead, the DBMS uses the information in the log to ignore the effects of uncommitted transactions. After this brief analysis phase, the DBMS can immediately start handling transactions again.
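The analysis-only recovery path is short enough to sketch directly. We assume, as an illustration, that each log record carries the full gap set and the long-running transaction timestamps, as the text describes; the tuple layout of the records is our own.

```python
def recover_wbl(log):
    # Each WBL log record is assumed to be a pair:
    #   (commit_timestamp_gaps, long_running_txn_timestamps)
    # The most recent record already contains everything recovery
    # needs, so there is no redo or undo pass over the log.
    if not log:
        return [], []
    gaps, long_running = log[-1]
    # The DBMS resumes transaction processing immediately; the garbage
    # collector later removes dirty versions inside `gaps` asynchronously.
    return gaps, long_running

log = [([(101, 199)], []),
       ([(101, 199), (301, 399)], [])]
gaps, long_running = recover_wbl(log)
```

Contrasting this with the WAL sketch earlier in the text makes the 100× recovery-time difference intuitive: WAL replays work proportional to the log length, while WBL reads a constant amount of meta-data.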
We implemented both WAL and WBL in Peloton, an in-memory HTAP DBMS that supports NVM. We compare the DBMS’s runtime performance, recovery times, and storage footprint on the YCSB benchmark, a widely used key-value store workload from Yahoo! that is representative of the transactions handled by web-based companies. The workload consists of two transaction types: (1) a read transaction that retrieves a single tuple using its primary key, and (2) an update transaction that modifies a single tuple based on its primary key. The distribution of the transactions’ access patterns follows a Zipfian skew. We present the results for the write-heavy workload mixture, which consists of 10% reads and 90% updates.
We performed these experiments using Intel Labs’ persistent memory evaluation platform (PMEP) hardware emulator. It contains two Intel Xeon E5-4620 CPUs (2.6 GHz), each with eight cores and a 20 MB L3 cache. The PMEP contains 256 GB of DRAM. It dedicates 128 GB of DRAM for the emulated NVM. We configured the NVM latency to be 4× that of DRAM and validated these settings using Intel’s memory latency checker. The PMEP also includes two additional storage devices: (1) HDD: Seagate Barracuda (3 TB, 7200 RPM, SATA 3.0), and (2) SSD: Intel DC S3700 (400 GB, SATA 2.6).
We modified Peloton to use the PMEP’s allocator and filesystem interfaces to store its logs, checkpoints, and table heap on NVM. When employing WAL, the DBMS maintains the log and the checkpoints on the filesystem, and uses fsync to ensure durability. When it adopts WBL, the DBMS uses the allocator for managing the durable table heap and indexes. Internally, it stores indexes in persistent B+trees. It relies on the allocator’s sync primitive to ensure database durability. All the transactions execute with the same snapshot isolation level and durability guarantees.
We begin with an analysis of the recovery protocols’ impact on the DBMS’s runtime performance. To obtain insights that are applicable for different storage technologies, we run the YCSB benchmark in Peloton while using either the WAL or WBL. For each configuration, we scale up the number of worker threads that the DBMS uses to process transactions. The clients issue requests in a closed loop. We execute the workload three times under each setting and report the average throughput.
Figure 4 shows the throughput of the DBMS while executing YCSB with a varying number of worker threads. The most notable observation from this experiment is that while the DBMS’s throughput with the SSD-WAL configuration is higher than that with the SSD-WBL configuration, its performance with the NVM-WBL configuration is comparable to that obtained with the NVM-WAL configuration. This is because NVM, unlike SSD, supports fast random writes.
We observe that the NVM-WBL configuration delivers 1.3× higher throughput than the NVM-WAL configuration because of the former’s lower logging overhead. That is, under WBL the DBMS does not construct as many log records as it does with WAL and therefore it writes less data to durable storage. The performance gap between the NVM-based and SSD-based configurations is prominent on this write-intensive workload. The NVM-WBL configuration delivers 12.1× higher throughput than the SSD-WBL configuration.
We next evaluate the recovery time of the DBMS using the different logging protocols and storage devices. We first execute a fixed number of transactions and then force a hard shutdown of the DBMS (SIGKILL). We then measure the amount of time for the system to restore the database to a consistent state. That is, a state where the effects of all committed transactions are durable and the effects of uncommitted transactions are removed. We note that the number of transactions that the DBMS processes after restart in WAL depends on the frequency of checkpointing. With WBL, the DBMS performs garbage collection to clean up the dirty effects of uncommitted transactions at the time of failure. This garbage collection step is done asynchronously and does not have a significant impact on the throughput of the DBMS.
The results in Figure 5 present the recovery measurements for the YCSB benchmark. The recovery times of the WAL-based configurations grow linearly with the number of transactions that the DBMS must recover, because it needs to replay the log to restore the effects of committed transactions. In contrast, with WBL, we observe that the recovery time is independent of the number of transactions executed. The system only reverses the effects of transactions that were active at the time of failure, as the changes made by all committed transactions are already persisted. The WBL-based configurations therefore have short recovery times.
We presented the write-behind logging protocol for emerging non-volatile storage technologies. We examined the impact of this redesign on the transactional throughput, availability, and storage footprint of the DBMS. Our evaluation of this recovery algorithm in Peloton showed that across different OLTP workloads it reduces the system’s recovery time by 100× and shrinks the storage footprint by 1.5× in comparison to the write-ahead logging protocol.