Modern data-driven analytics often deals with datasets that do not fit in the main memory of the processing system. In such a scenario, the limiting factor in performance is usually transporting data from the backing store to where it can be processed, rather than computing speed. The problem is aggravated when processing requires a lot of random access to the data. Proposed solutions often attempt to reduce data transfer, for example by using clever caching methods or by moving computation to the data. Other solutions simply use a fast but expensive backing store, such as a PCIe-attached flash storage device. We propose a new storage solution that is scalable, multi-access, low-latency, and high-bandwidth, and that also supports hardware acceleration in its controllers.
To address the data transfer problem, we are implementing BlueDBM, a fast distributed storage system based on flash. BlueDBM provides a large-capacity flash store by distributing storage across multiple nodes, each of which has fast access to the whole store. Each node is a flash storage device with an FPGA-based reconfigurable controller, plugged into a PCIe slot of its host PC. The controllers are directly connected to each other via low-latency serial links. The architecture of BlueDBM is shown in the diagram below.
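One way to picture a node-spanning flash store is a global page address space striped across the nodes. The round-robin mapping below is a hypothetical sketch for illustration, not BlueDBM's actual page-mapping policy; the node and capacity constants are assumptions.

```python
# Hypothetical round-robin striping of a global page address space
# across flash nodes. Illustrative only, not BlueDBM's actual policy.

NUM_NODES = 4          # matches the four-node prototype
PAGES_PER_NODE = 1024  # assumed per-node capacity, in pages

def locate_page(global_page: int) -> tuple[int, int]:
    """Map a global page number to (node id, local page number)."""
    if not 0 <= global_page < NUM_NODES * PAGES_PER_NODE:
        raise ValueError("global page number out of range")
    return global_page % NUM_NODES, global_page // NUM_NODES

# Consecutive global pages land on different nodes, so a sequential
# scan is automatically spread across all controllers in parallel.
```

Because every node can reach every other node's flash over the serial links, any host can resolve and fetch any global page without involving the other hosts' software stacks.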
A key aspect of the BlueDBM architecture is the controller-to-controller link, which is implemented using low-latency serial communication links in the reconfigurable flash controller. Allowing the flash controllers to communicate directly with each other eliminates the overhead of a general-purpose networking stack such as TCP/IP. This results in sub-microsecond latency per network hop, which is negligible compared to the access latency of the flash chip and the software stack. As a result, the entire network of flash storage performs as if it were a single, large, uniform-latency storage device. Algorithms that use large heaps as a data structure should benefit from this fast random-access capability. The raw page-access performance characteristics of BlueDBM, measured on our four-node prototype, are shown in the graph below.
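The uniform-latency claim can be checked with back-of-the-envelope arithmetic. The figures below are illustrative assumptions (flash page reads on the order of 100 µs, sub-microsecond hops), not measurements from the prototype:

```python
# Back-of-the-envelope latency comparison, all times in microseconds.
# These figures are illustrative assumptions, not prototype measurements.

FLASH_READ_US = 100.0  # assumed flash page read latency
SOFTWARE_US = 20.0     # assumed host software-stack overhead
HOP_US = 0.5           # assumed per-hop latency of the serial link

def access_latency(hops: int) -> float:
    """Total page-access latency through the given number of network hops."""
    return FLASH_READ_US + SOFTWARE_US + hops * HOP_US

local = access_latency(0)              # page on the local node
remote = access_latency(3)             # page three hops away
overhead = (remote - local) / local    # ~1% extra for a remote page
```

Under these assumptions a page three hops away costs only about one percent more than a local page, which is why the network behaves like a single uniform-latency device. A TCP/IP hop, by contrast, would add tens to hundreds of microseconds and dominate the total.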
The throughput of the prototype is low compared to modern systems because our custom flash boards are four years old. The 16- to 20-node system we are building now will have much better performance.
Another aspect of the BlueDBM system is the reconfigurable flash controller, which permits us to implement hardware accelerators inside the controller itself. Accelerators near the data store reduce data transfers as well as offload computation from the host PC.
The BlueDBM system provides an extremely low-latency backplane communications network directly between the flash devices. However, current file systems and database applications are optimized for storage devices with slow seek times and for high-latency network access. To exploit the characteristics of BlueDBM, we are (1) designing a new file system optimized for fast random-access storage, and (2) developing database systems that take advantage of these features.
One of the changes to the DBMS is to offload operations to the reconfigurable controller when possible. For example, by offloading filtering operations to the controller, we can reduce the amount of data that has to be sent to the host PC to be processed by the DBMS software. The diagram below shows the two possible paths data can take through the DBMS stack.
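The data-reduction benefit of pushing a filter into the controller can be sketched as follows. The record size, selectivity, and function names are hypothetical, chosen only to make the comparison concrete:

```python
# Sketch of filter pushdown: compare bytes shipped across PCIe when the
# predicate runs on the host versus in the flash controller. Record size
# and function names are hypothetical, for illustration only.

RECORD_SIZE = 128  # assumed bytes per record

def host_side_filter(records, predicate):
    """Baseline path: every record crosses PCIe; the host filters."""
    transferred = len(records) * RECORD_SIZE
    result = [r for r in records if predicate(r)]
    return result, transferred

def controller_side_filter(records, predicate):
    """Offloaded path: the controller filters; only matches cross PCIe."""
    result = [r for r in records if predicate(r)]
    transferred = len(result) * RECORD_SIZE
    return result, transferred

records = list(range(10_000))
pred = lambda r: r % 100 == 0  # 1% selectivity

_, baseline_bytes = host_side_filter(records, pred)
_, offloaded_bytes = controller_side_filter(records, pred)
# With 1% selectivity, the offloaded path moves 1% of the bytes.
```

The win scales with selectivity: the less selective the predicate, the smaller the benefit, so a query planner would choose the path per query rather than always offloading.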
We are currently looking for big data applications that are storage-latency bound. Please contact us if you have such applications.