Non-volatile memory (NVM) has been a game-changing memory technology. In contrast to traditional block-based durable storage devices, it provides low latency comparable to DRAM and byte-addressability. Although NVM is promising, it is still non-trivial for DBMSs to fully utilize its benefits. In this article, we present NVMRocks, a variant of the open-source RocksDB key-value database system that we enhanced to be NVM-aware.
RocksDB’s basic structure is a log-structured merge (LSM) tree, in which on-disk data is immutable and append-only, thus making it efficient on SSD. With NVM, however, performance no longer benefits from this feature, but LSM tree still has some promising features when storing in NVM. Generally, an LSM tree has less space amplification than a B-tree, which is valuable to NVM systems. In addition, the level-based structure of LSM tree is natural for heterogeneous storage, a popular trend of NVM systems. Therefore, it is still worthy to optimize LSM tree to leverage the benefits from NVM.
A DBMS can either treat NVM as persistent storage and access it through a file system, or map it into virtual memory address and manipulate it directly. In the first approach, one can run a legacy DBMS on NVM by storing data and log files on a persistent memory file system. Some of the DBMS’s components, however, are unnecessary with NVM, such as optimizations to reduce the overhead of erase operations on SSD. In the second approach, memory mapping enables more fine-grained control and efficient usage of NVM. But this means we must redesign certain in-memory components. For example, if data resides in persistent memory, the DBMS’s logging protocol to recover data in DRAM after a failure is unnecessary.
Figure 1 illustrates the high-level architecture of NVMRocks. The database can run on a heterogeneous storage environment containing DRAM, NVM, and SSD. User requests are buffered in the MemTable and then flushed into on-disk SSTable. All components in the DBMS are managed by allocators for memory and file systems for persistent storage. The allocator and file systems are specially designed to utilize the characteristics of NVM. Specifically, SSTables on NVM are stored on a file system optimized for NVM (PMFS) and use a special table format (PlainTable). MemTables are allocated by persistent allocator, dramatically reducing the cost of logging and recovery.
We also added a read cache to improve read performance and to handle skewed workload. Since NVM is faster than SSDs, NVM can also serve as a cache. In NVMRocks, recently accessed data is retained in a multi-tiered cache spanning different storage media. The NVM cache layer, also managed by persistent allocator, resembles the DRAM read cache to fully utilize the byte-addressability of NVM.
For NVMRocks, we design a lock-free persistent allocator for MemTable. In the original RocksDB, a MemTable’s access pattern is relatively simple. It does not reclaim an allocated block until the MemTable is full, at which point all of its allocated memory is freed. The design of a persistent allocator for the MemTable can be simplified because it does not need to maintain a free list for reclaimed block. This allocator incurs fewer L3 cache misses and leads to less performance overhead.
Our NVM allocator maps a file in the persistent memory file system into virtual memory address, and uses a pointer to represent current offset of allocated blocks. This pointer is stored in the header of the allocator file, so that the allocator can start from the previously allocated position after a restart. Concurrent modification of this pointer can be protected by lock-free compare-and-swap primitives or by a fast in-lined spin lock.
After closing and remapping a persistent allocator file, the base pointer of mmap-ed region will change. This means that pointers to memory are meaningless after reopening the file. Therefore, when the DBMS writes a pointer value onto NVM, it should transform it into a location relative to the base pointer of its corresponding allocator. And when the pointer is accessed, it should be converted back into the absolute value.
Since NVM itself is a persistent storage, data written to NVM is already durable. Therefore, there is no concern about data loss in MemTable after a database restart. When transaction is enabled, undo logs are still necessary to achieve atomicity. Note that undo logs are not always required because RocksDB can be used as the storage manager for full-featured DBMSs, like MySQL or MongoDB, which manage their own transactions in their upper levels.
NVMRocks’ Persistent MemTable dramatically accelerates database recovery by saving the time to rebuild the MemTable. Once the DBMS restores the MemTable from the persistent allocator by reading data from header area, the recovery process is complete. Furthermore, since the DBMS only needs the log record for undo operations, it is sufficient to store only the non-volatile pointer of the data, avoiding the need to store two copies in both MemTable and log files. This further reduces log size and improves runtime performance.
We use the persistent allocator described above to manage MemTable allocation. We retain the original pointer types and handle the absolute-relative pointer conversion within the scope of Node structure in the skip list. The next pointers of a node are stored in relative pointers and converted into absolute pointer when being accessed. External operations in the skip list are unaware of this conversion. To ensure data persistence into NVM before we return an insert function, we need to call the persistence primitives to write back data from CPU cache line to memory subsystem.
In terms of recovery, we maintain a database-wide metadata file on PMFS indicating the mapping from column families to their MemTables and allocators. Column families keep track of active MemTables and their assigned persistent allocators. When a MemTable is flushed into disk and destructed, the system also clears its column family mapping. This serves as a global entry point for recovery to retrieve active allocators before shutdown.
NVM-Aware Persistent Cache
RocksDB stores recently accessed data in a DRAM-resident read cache to improve read performance. But when more storage layers are introduced, all intermediate layers can also serve as a cache for the slower ones. To cache recently accessed data on SSD, although DRAM might provide lower latency, it is acceptable to use NVM as a supplemental cache layer for SSD besides DRAM, since it is still much faster to read data from NVM than SSD. In multi-tier cache, when the k-th layer evicts data, it does not immediately drop it, but instead inserts it into the k+1-th layer. By spanning read cache across multiple storage media, the database can utilize more space to cache its data.
RocksDB includes a prototype for multi-tiered read cache. In this tiered cache structure, each tier can either be a volatile tier using typical LRU eviction policy, or a persistent tier using file-level cache control scheme. Unfortunately, the multi-tiered cache module does not optimize for NVM, but treats NVM as a block device. Block cache is originally designed for a device without byte addressability. For a cache hit, the block cache will read the whole data block from disk. When the key size is small, a large portion of the read operations are unnecessary. Furthermore, block cache stores cached data in multiple files, and its eviction can only be executed in a file level, a very coarse-grained control for what to evict.
We implemented an NVM-aware cache tier for multi-tier read cache. Given that we already have a hash table + linked list implementation for volatile LRU cache, we can use a persistent allocator to convert the volatile LRU cache into its non-volatile equivalent. For performance, we do not need to put all data structure into NVM, but leave metadata in DRAM. Our NVM-aware cache tier does not require data recovery, as obsolete cache data can be immediately dropped. This also means the synchronization primitives can be relaxed.
Experiments and Evaluation
Our experiments are all performed on an NVM hardware emulator provided by Intel Labs. It contains a dual-socket Intel Xeon E5-4620 processor with 16 cores running at 2.6GHz. It uses a 128GB DRAM for NVM emulation and a 400GB Intel SSD DC S3700 Series on 2.56 Gb/s SATA. Experiments are conducted on Linux 18.104.22.168. For experiments on NVM, we run the database on PMFS; we run SSD experiments on EXT3 filesystem.
We first compare the database performance with the persistent MemTable. We test the database with the following five configurations:
- mt-nvm-1x and mt-nvm-4x are RocksDB instances using persistent MemTable running on the DRAM/NVM system in which NVM latency is set as 1x and 4x DRAM latency, respectively.
- mt-dram-1x and mt-dram-4x are RocksDB instances that place MemTable in DRAM and store log records in NVM. Similarly, we also tune NVM latency to observe its impact on database performance.
- Finally, we provide a baseline implementation using SSD as persistent storage for data and write-ahead logs.
When NVM latency is set as 1x DRAM latency, our optimized Persistent MemTable outperforms the DRAM version. In this setting, persistent MemTable does not have overhead in accessing NVM, so the logging factor dominates the performance. The extra step of logging in mt-dram series becomes relatively time-consuming. When NVM latency is set to 4x DRAM, the two MemTables perform similarly. However, we still benefit from fast recovery.
A benefit from the persistent MemTable is that it provides instantaneous recovery. In this experiment (see Figure 3), we measure the improvement in recovery time of RocksDB. We first execute a fixed number of operations to fill the database and close the database by sending a SIGKILL. We then restart the database and measure the recovery time using database statistics. In this experiment, the value size we used for insertion is 80 bytes.
As expected, the recovery time of a database using persistent MemTable (shown as mt-nvm) is not affected by the number of entries in the MemTable when the database quits. The original MemTable, however, might lose data in a failure, and thus need to read the write-ahead log at startup. This makes the database recovery time proportional to the size of MemTable. Within a range of keys in the database, persistent MemTable leads to a dramatically faster recovery speed, and that the improvement is bound by total MemTable size in database option.
Finally, we evaluate the optimized persistent cache. This experiment (see Figure 4) compares the optimized NVM-aware persistent cache (nvm-lru) with the original block-oriented cache (nvm-block). We reuse and improve the persistent cache benchmark in RocksDB.
The block implementation has higher latency because the block implementation also needs to read unnecessary block content. Moreover, it must go through a file system interface, which is proved to be slower than the allocator interface.
We start this experiment with limiting the total size of read cache to test eviction policy. In this experiment, we set the cache size to 4GB and each key-value pair is 4KB, so the cache can hold about 1 million (220) entries. We denote the number of entries in the cache as range(cache). We first prepopulate to fill the cache, then start to read keys from the cache within range, noted as range(key). If the key is not found, we issue a write for the key. This will trigger an eviction from the cache.
Figure 4 shows the cache hit rate of the two implementations. We can see that nvm-lru always has a higher cache hit rate because of its eviction policy. nvm-block evicts a whole file when a single block is picked to be evicted. In this case, more entries will be missed after an eviction.
The eviction policy also reflects in the latency. Figure 5 shows a latency histogram of two caches when read key range is 1.5x of cache size. nvm-lru has a lower read latency because the nvm-block must re-insert more entries due to its coarse-grained eviction.
In this project, we explored policies to optimize a database on NVM. We implemented and evaluated them on RocksDB, a key-value store designed for SSD. We found that redesigning in-memory components to store in NVM can avoid the logging overhead and potentially achieve best throughput. Our results also show that on-disk structure for traditional databases should be redesigned to be byte-addressable to maximize performance.
We look forward to exploring more optimizations for NVM. For example, we do not have a good data placement policy for heterogeneous storage architecture. A future project for NVM optimization on RocksDB could be a strategy to dynamically allocate storage for on-disk data according to their access pattern.