The next generation of high-performance RDMA-capable networks requires a fundamental rethinking of the design of modern distributed database systems (DDBMS).
Current distributed databases are commonly designed under the assumption that the network is the dominant bottleneck. Consequently, these systems aim to avoid communication between machines at almost all cost, using techniques such as locality-aware partitioning schemes, semi-reductions for joins, and complicated preprocessing and load balancing steps. Even worse, this rule of thumb has created mantras like “distributed transactions do not scale,” which to this day affect the way applications are designed.
The next generation of networks presents an inflexion point on how we should design distributed systems. With the nascent of modern network technologies, the assumption that the network is the bottleneck no longer holds. Even today, with InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel.
DDR3 memory bandwidth currently ranges from 6.25 GB/s (DDR3-800) to 16.6 GB/s (DDR3-2133) per channel, whereas InfiniBand has a specified bandwidth of 1.7 GB/s to 37.5GB/s (EDR 12x) per NIC port. However, modern systems typically support 4 memory channels per socket. For example, a machine with DDR3-1600 memory has 12.8GB/s per channel, with a total aggregate memory bandwidth of 51.2GB/s. Therefore we need, for example, four dual-port FDR 4x NICs to provide roughly the same bandwidth. Perhaps surprising: the CPU-memory bandwidth is half-duplex, while InfiniBand and PCIe are full-duplex, such that only 2 NICs per socket could saturate the memory bandwidth of a read/write workload. In addition, Mellanox recently announced new InfiniBand EDR products that achieve 25GB/s on dual-port cards, enough to achieve the same bandwidth even with a single NIC per socket (again assuming simultaneous reads and writes).
To test if the specifications hold true in practice, we recently performed an experiment. We equipped our cluster consisting of dual sockets Xeon E5v2 CPUs and DDR3-1600 main memory with 4 NICs per machine (2 per socket as outlined above). The results in Figure 1 show the theoretical (left) and measured (right) total memory and network throughput and clearly demonstrate that the network bandwidth already available today can be on par with the main memory bandwidth.
Another important factor is that with major advances in remote-direct-memory-access (RDMA), the network latency also improves quickly. Our recent experiments with InfiniBand FDR 4x showed that the system requires ≈1μs to transfer 1KB of data using RDMA, compared to ≈0.08μs for the CPU to read the same amount of data from memory. With only 256KB, there is virtually no difference between the access time since the bandwidth starts to dominate the transfer time.
Yet, it is wrong to assume that the fast network changes the cluster to a NUMA architecture because: (1) the RDMA-based memory access patterns are very different from a local memory access in a NUMA architecture; (2) the latency between machines is still higher to access a single (random) byte than with today’s NUMA systems; and (3) hardware-embedded coherence mechanisms ensure data consistency in a NUMA architecture, which is not supported with RDMA. Clusters with RDMA-capable networks are most similar to a hybrid shared-memory and message-passing system: it is neither a shared-memory system (several address spaces exist) nor a pure message-passing system (data can be directly accessed via RDMA).
Even worse. our recent experiments also showed that, if existing clusters are upgraded with InfiniBand from Ethernet-based networks using techniques like IP over InfiniBand (IPoIB) without redesigning the software, the performance of the system can actually decrease. We’ll discuss this later in this blog post.
Consequently, we believe there is a need to critically rethink the entire distributed DBMS design – from the architecture to the guts – to take full advantage of the next generation of network technology.
“We believe there is a need to critically rethink the entire distributed DBMS design – from the architecture to the guts – to take full advantage of the next generation of network technology.”
In a recent VLDB paper (a pre-print version can be found here), we therefore proposed the Network-Attached-Memory (NAM) architecture together with a new distributed transaction protocol to demonstrate what is possible to achieve with the current network technology for database workloads.
The NAM Architecture
The main problem with using RDMA in distributed systems is the complexity that comes with it.
First and foremost, the complexity of setting up the sharing mechanism through low-level primitives like queue pairs is pretty high, and there is no easy way to do even the simplest task (such as memory-management/garbage collection) in this environment. Second, accessing main memory without any cache-coherence protocol creates a whole bunch of new problems. Third, standard ways of achieving fault-tolerance are not directly applicable over RDMA (e.g., there is no way to use RDMA and at the same time guarantee that something is flushed to disk).
To overcome these problems, we propose a novel Network-Attached-Memory (NAM) architecture, which logically decouples compute and storage. In this architecture, the NAM servers provide a shared distributed memory pool abstraction that can be accessed from any compute node. However, the storage nodes are not aware of any database-specific operations (e.g., joins or consistency protocols). These are implemented by the compute nodes.
In contrast to alternative architectures, like distributed shared-memory or shared-nothing, this new architecture has several advantages: One, this logical separation helps to control the complexity of RDMA operations, which can be overwhelming to handle even for experienced programmers. At the same time, unlike in a shared-memory system, the logical separation helps to make the system aware of the different types of main memory, which is crucial for good performance (there is still a factor of 8x in latency for small requests). Note that it is also still possible to physically co-locate storage nodes and compute nodes on the same machine to further improve performance, but in contrast to a shared-memory architecture, the system gains more control over what data is copied and how copies are synchronized. Two, the storage nodes can take care of tasks such as garbage collection, data-reorganization, or metadata management to find the appropriate remote-memory address of a data page. Three, the NAM architecture can efficiently handle data imbalance since any node can access any remote partition without the need to re-distribute the data before. Moreover, storage and compute nodes can be scaled independently. Our paper contains more details on the NAM architecture and how it relates to other systems.
On the down side, the architecture also requires rethinking essentially any part of a distributed database system. For example, a transaction processing workload can only take full advantage of the NAM architecture if it also leverages the RDMA network primitives. In order to show what is possible, we also designed a transactional system based on the NAM architecture with snapshot Isolation (SI) consistency guarantees (as described further below).
OLTP for NAM
The essential differences between a traditional distributed snapshot isolation transaction protocol for a shared-nothing architecture and the protocol for the NAM architecture is shown in Figure 3(a) and (b).
In the traditional architecture, a transaction is first sent by the client (e.g., an application server) to a transaction manager (TM), which then initiates a 2-phase-commit protocol using messages. In contrast, in the NAM architecture the client modifies directly the remote memory addresses using special primitives. Furthermore, clients have the choice of modifying the main memory directly or sending messages to the NAM-servers, which then execute the request on their behalf. Which method is better highly depends on the type of the request. There are many details on how the remote memory has to be laid out and how requests are synchronized to guarantee good performance (again, we reference our paper). Though we found that, if these details are addressed, the architecture together with our new protocol can achieve orders-of-magnitude higher throughput with linear scalability if the workload itself does not contain a bottleneck (e.g., a workload where every transaction modifies the same record still wouldn’t be scalable).
Distributed Transactions Do Scale
Figure 4 shows the distributed transaction throughput results of our new snapshot isolation protocol for RDMA and the NAM architecture (RSI – RDMA), as well as the transaction throughput of a more traditional distributed database architecture with a more classic message-based snapshot isolation protocol over the same network using IP over IB (Trad-SI – IPoIB) and a 1Gbps Ethernet network (Trad-SI – IPoEth). The benchmark was executed on an 8 node cluster with dual socket Xeon E5v2 CPUs per machine, 2TB of DDR3-1600 memory, and ONE dual-port InfiniBand FDR 4x NIC. As workload we used a transaction similar to the buy transaction of the TPC-W benchmark. To increase the readability, we show the results on a linear-scale (left) and on a log-scale (right).
As Figure 4 shows, the redesigned architecture achieves a stunning 1.79 million distributed transactions with 70 clients and shows a linear scaling behavior with an increasing number of clients, whereas the more traditional approaches achieve not more than 32 thousand distributed transactions. Even more surprising, as the log-scale graph (b) shows, the faster network without a redesign of the system decreases the transaction throughput from 32 thousand distributed transactions to 24 thousand transactions. The reason: distributed transactions involve many small messages and the Ethernet-based TCP/IP stack is just better optimized for them.
For more details on the experimental setup and results, please see our paper: http://arxiv.org/abs/1504.01048
“The redesigned architecture [NAM] achieves a stunning 1.79 million distributed transactions with 70 clients and shows a linear scaling behavior with an increasing number of clients, whereas the more traditional approaches achieve not more than 32,000 distributed transactions. Even more surprising, the faster network without a redesign of the system decreases the transaction throughput from 32,000 distributed transactions to 24,000 transactions.”
In summary, our early results show that the next generation of networks present an inflexion point in how we should design distributed systems. The rule of thumb – that avoiding the network always helps or, more specifically, that distributed transactions do not scale – no longer holds true. Instead, clusters become far more balanced than they used to be and bottlenecks quickly shift from one issue (e.g., the network bandwidth) to another ( e.g., CPU efficiency). Furthermore, as our results also show, just upgrading an existing distributed system the easy way to InfiniBand using the IPoIB layer can actually degrade the performance. As a result we have to fundamentally rethink the way we build and optimize distributed systems and design new abstractions (such as the proposed NAM abstraction) and techniques to take full advantage of the new capabilities the network provides.