Organizations face the challenge of managing and deriving value from an ever-growing amount of data. Beyond its sheer size, this data is often varied in both structure (e.g., relational, linked, numerical, and streaming data) and target analytic tasks (e.g., SQL, graph processing, machine learning, and complex event processing). As a result, traditional relational database engines are no longer capable of fully serving an organization's entire data processing needs.
This realization that 'one size does not fit all' has led to an explosion of specialized data processing frameworks over the last decade. The resulting proliferation of database engines leaves organizations struggling to manage a wide variety of systems, given that (i) identifying the right engine for a task is not always obvious, (ii) multiple types of analytics may be run on the same dataset, (iii) workload requirements can change, and (iv) managing each system introduces operational complexity and overhead. These challenges call for a polystore that transparently integrates disparate databases and data models, enabling users to expressively query their data without worrying about the underlying engines.
A team of ISTC researchers is developing a new polystore, BigDAWG, to incorporate a wide variety of engines and data models, such as the relational model with PostgreSQL, an array model with SciDB, a key-value model with Accumulo, and a streaming model with S-Store. A promising alternative approach is to use just-in-time operators on raw data, as in NoDB and ViDa, to virtualize data sources.
To balance the trade-off between functionality and database transparency, a BigDAWG query specifies Islands of Information, each of which provides a data model and a set of operations. The system manages data transformation between models and, subsequently, engine placement. Decisions about data placement consider required functionality, workload behavior, and expected performance. Previous ISTC blog posts have introduced BigDAWG and discussed polystore query modeling and optimization. Other researchers have recently proposed another class of polystores that incorporates MapReduce and relational database systems. These systems – such as MISO, HadoopDB, and PolyBase – use a relational engine as an accelerator for certain computations.
While polystores resemble federated database systems, two key differences separate the approaches. First, polystores explicitly consider multiple data models, whereas most federated systems focused on the relational model. Second, and more importantly, polystores control the placement of data, both short-term and long-term. Short-term data placement occurs when a polystore moves data between engines either for performance reasons (e.g., to accelerate an operation) or to satisfy functionality requirements (e.g., the host engine lacks a required operator). Long-term data placement occurs when the system permanently moves or copies a dataset between engines for load balancing (e.g., skewed server utilization) or in response to a change in the workload (e.g., data typically analyzed with relational operators is now frequently queried with graph operators). For these reasons, polystores require a migration component that can efficiently transform and migrate portions of a dataset (a data object) between engines on demand.
“Polystores require a migration component that can efficiently transform and migrate portions of a dataset (a data object) between engines on demand.”
A simple way to enable migration would be to use CSV (comma-separated values), as many databases support bulk loading and exporting of data in CSV format. However, this process is compute-bound: parsing (finding newline and field delimiters) and de-serialization (transforming text into a binary representation) are the biggest consumers of CPU cycles. Instead, we target migrating data in a binary format, which, when concise, offers better performance than a CSV-based migration. Because each database has a custom binary format, this requires either a canonical intermediate binary representation or the ability for a migration component to understand the binary format of its migration counterpart.
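To make the parsing and de-serialization cost concrete, the sketch below (field layout and values are purely illustrative) contrasts a CSV export, which must be scanned for delimiters and converted back from text on load, with a fixed-width binary record that the loader can unpack directly:

```python
import csv
import io
import struct

# The same tuples exported as CSV text versus a fixed-width binary
# record format (an int32 plus a float64 per record).
rows = [(1, 3.14), (2, 2.71), (3, 1.41)]

# CSV export: every value becomes text; loading requires delimiter
# scanning plus text-to-binary de-serialization for each field.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_bytes = buf.getvalue().encode()

# Binary export: each record has a fixed size, so the loader can copy
# values into place with no parsing step at all.
record = struct.Struct("<id")  # little-endian int32 + float64, 12 bytes
bin_bytes = b"".join(record.pack(i, v) for i, v in rows)

# Loading the binary form is a direct unpack, not a parse.
decoded = [record.unpack(bin_bytes[i:i + record.size])
           for i in range(0, len(bin_bytes), record.size)]
print(decoded)  # [(1, 3.14), (2, 2.71), (3, 1.41)]
```

The binary path skips the text round-trip entirely, which is where a CSV-based migration burns most of its CPU cycles.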
To illustrate the performance overhead of a CSV-based migration, we measured the migration time for various database sizes when moving data from PostgreSQL to SciDB. As shown in Figure 1, a shared intermediate binary representation is approximately three times faster than a CSV-based migration, and having PostgreSQL directly generate SciDB's binary format is about four times faster than using CSV.
Therefore, to address this critical need for efficient polystore migration, ISTC researchers at the University of Chicago are developing an open-source framework: Portage, named after the practice of carrying a boat across land when moving between bodies of water.
Portage’s migration framework is broken into four distinct logical components, as shown in Figure 2:
- The extractor receives migration instructions from the polystore controller and initializes the migration. It checks the local metadata to identify the tables being migrated. A process then either reads the data directly from the storage medium or, if filtering is needed, relies on a query execution engine to select the necessary records.
- The output is written to a local shared buffer, where a collection of transformers prepares the data. Each transformer first performs any necessary logical transformation, such as mapping a relational tuple to a graph node and its edges. Because the logical transformation depends on the polystore's desired operations, Portage requires that such transformations be provided as a UDF. If the transformation is not a 1-to-1 tuple mapping, Portage relies on a query engine to materialize the aggregated tuples that are transformed into a new single tuple. Next, the transformer converts the data to the destination's binary format and optionally applies compression.
- The converted chunks are passed to the migrator, which, depending on the available resources and the location of the destination database, selects the physical migration form and data transfer rate. Examples of these forms include shared memory, local pipes, local files, SCP, and RDMA.
- Lastly, the importer receives, decompresses, and loads this data into the destination engine.
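The four stages above can be sketched as a small in-process pipeline. All names, the `repr()` stand-in for a real binary encoding, and the queue-based buffers are illustrative assumptions, not Portage's actual API:

```python
import ast
import queue
import zlib

def extractor(rows, buf):
    """Read source records and place them in the shared buffer."""
    for row in rows:
        buf.put(row)
    buf.put(None)  # end-of-stream marker

def transformer(buf, out, udf, compress=True):
    """Apply the logical-transformation UDF, encode each record for the
    destination (repr() stands in for a real binary format here), and
    optionally compress the result."""
    while (row := buf.get()) is not None:
        chunk = repr(udf(row)).encode()
        out.put(zlib.compress(chunk) if compress else chunk)
    out.put(None)

def migrator_and_importer(out, dest, compressed=True):
    """Transfer chunks (in-process here; could be pipes, SCP, RDMA, ...)
    and load them into the destination."""
    while (chunk := out.get()) is not None:
        data = zlib.decompress(chunk) if compressed else chunk
        dest.append(ast.literal_eval(data.decode()))

rows = [(1, "a"), (2, "b")]
buf, out, dest = queue.Queue(), queue.Queue(), []
extractor(rows, buf)
transformer(buf, out, udf=lambda r: (r[0] * 10, r[1]))  # toy logical transform
migrator_and_importer(out, dest)
print(dest)  # [(10, 'a'), (20, 'b')]
```

In a real deployment the stages would run concurrently, with the shared buffers providing back-pressure between them.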
In order to enable fast transformation and migration, Portage will need to address several open challenges.
First, various and shifting resource bottlenecks can emerge. CPU cycles are needed for transformation, parsing, validation, and serialization. Disk I/O is needed for extracting and importing the physical data. Network I/O is required for data transfer. A key requirement of Portage is to adapt its process to both constrained and available resources. For example, if network I/O is limited due to an over-utilized network and CPU cycles are abundant, Portage should aggressively apply compression to reduce the amount of data transferred over the network. On the other hand, when a process is CPU-bound, for example when re-dimensioning from a flat array to a target array in SciDB (presented in the following graph), and the network bandwidth is underutilized (below the maximum of about 125 MB/sec), then data compression should not be applied.
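The compression decision just described could be sketched as a simple heuristic. The thresholds, utilization inputs, and compression-ratio estimate below are all assumptions for illustration, not measured Portage values:

```python
# Compress only when the network, not the CPU, is the bottleneck.
NETWORK_MAX_MBPS = 125.0  # approximate ceiling noted for the test network

def should_compress(net_util_mbps, cpu_util, ratio=3.0):
    """Return True when spending CPU cycles on compression likely helps.

    net_util_mbps : current network throughput, in MB/sec
    cpu_util      : CPU utilization of the migration process, 0.0-1.0
    ratio         : assumed compression ratio for this data
    """
    network_saturated = net_util_mbps >= 0.9 * NETWORK_MAX_MBPS
    cpu_available = cpu_util < 0.7
    # Only worthwhile if the data actually shrinks appreciably.
    return network_saturated and cpu_available and ratio > 1.5

print(should_compress(120.0, 0.3))  # network-bound, idle CPU -> True
print(should_compress(40.0, 0.95))  # CPU-bound (e.g., re-dimensioning) -> False
```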
Second, given the increase in the number of cores available on machines and in newer storage devices, Portage will need to control the degree of parallelization applied to a migration. Too little and the process will be slow; too much will result in resource thrashing. Similarly, the data transfer rate may need to be throttled to accommodate slower loading or an overloaded transport medium. To address these issues, we are exploring the use of adaptive controllers to manage the migration process. Lastly, we are investigating recent hardware and software trends, such as SIMD for data conversion, a unified automata processor (UAP) for fast data compression, and JIT compilation for efficient data transformation code.
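One possible shape for such an adaptive controller is an AIMD-style rule (an assumption for illustration, not Portage's actual controller): add a worker while measured throughput keeps improving, and halve the worker count when throughput regresses, treating the drop as the thrashing signal described above:

```python
def next_workers(n, throughput, prev_throughput, lo=1, hi=32):
    """One control step: return the worker count for the next interval.

    n               : current number of parallel migration workers
    throughput      : throughput measured over the last interval
    prev_throughput : throughput measured over the interval before that
    """
    if throughput >= prev_throughput:
        return min(n + 1, hi)   # additive increase while things improve
    return max(n // 2, lo)      # multiplicative decrease on regression

print(next_workers(4, 50.0, 40.0))  # still improving -> 5
print(next_workers(8, 60.0, 80.0))  # thrashing -> back off to 4
```

The same step function could throttle the transfer rate instead of the worker count, using destination load rather than throughput as the feedback signal.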
While Portage focuses on building efficient custom end-points for a set of specific database engines, another team of ISTC researchers, at the University of Washington, is concurrently pursuing an exciting, orthogonal approach. PipeGen is a project exploring a generalized synthesis approach that automatically builds transformers from existing importers and exporters (CSV, JSON, and others). You can expect to read more about PipeGen here in the near future. PipeGen and Portage are complementary, and we plan to incorporate ideas from each after the initial prototypes are built.
In closing, data transformation and migration are critical requirements for polystores. We are currently building Portage to provide high-performance, just-in-time transformation and migration of data between engines, adaptively controlling the level of compression and parallelism while exploiting recent hardware and software trends.