by Michael Maddox, MIT CSAIL, and Aaron J. Elmore, University of Chicago*
In Big Data’s wake has come demand for tools to curate, manage, and analyze shared datasets collaboratively. For instance, consider researchers at a social media company concurrently building user models from snapshots of a social graph, or scientists in a biology lab collaboratively cleaning and analyzing genomic data. These analysis tasks may need to run concurrently while operating on a consistent snapshot of the data. Moreover, the results of some operations may need to be visible to all users, while other results may never need to be merged back into the main dataset.
The methods many data scientists currently use to coordinate these kinds of operations are often ad hoc and rely on making full, redundant copies of an entire dataset in their individual workspaces. This not only wastes storage but also severely restricts collaboration: users cannot easily share patches or modifications to datasets, cannot easily track which versions of a dataset were used for particular experiments, and have no easy way to determine who is using a given version of a dataset. In short, data scientists in industry and academia currently lack good support for collaborative lifecycle management of their data.
To remedy this problem, a team of ISTC researchers has introduced DataHub, a hosted, collaborative, online analytics platform that allows users to easily upload, modify, query, and share datasets. The DataHub project is exploring a wide range of challenges, including data cleaning, visualization, data integration, and data ingestion. In developing DataHub, the researchers encountered a need to manage versioned data. We explored using traditional version control systems (VCS) such as git, but found that they have performance constraints for both very large files and large numbers of files, two common scenarios in building a hosted data management platform.
Therefore, at the core of DataHub we are building a dataset branching system, Decibel. Decibel is a relational storage engine designed to manage large collections of versioned datasets. Conceptually, Decibel merges a traditional relational storage engine and query processor with the version control models from software version control systems like Git or Mercurial. In particular, Decibel allows users to create logical working copies of a dataset (i.e., a branch) based on a present or prior snapshot of the dataset state (i.e., a commit). Similar to software version control systems, Decibel allows users to merge datasets of different lineages (e.g., create a commit or branch with multiple parents). Crucially, Decibel also exposes an API to query and modify data from either a single branch or across multiple branches.
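To make the branch-and-commit model concrete, here is a minimal in-memory sketch of the workflow described above. The class and method names are invented for illustration; they are not Decibel’s actual API, and a real engine would of course store deltas rather than full snapshots per commit.

```python
# Illustrative sketch of a Decibel-style branching workflow.
# Names are hypothetical; this is not Decibel's actual interface.

class Dataset:
    def __init__(self):
        # each commit maps a commit id to a snapshot: {key: row}
        self.commits = {"root": {}}
        self.branches = {"master": "root"}  # branch -> head commit
        self._next = 0

    def commit(self, branch, changes):
        # create a new commit on `branch` that applies `changes`
        parent = self.branches[branch]
        snapshot = dict(self.commits[parent])
        snapshot.update(changes)
        self._next += 1
        cid = f"c{self._next}"
        self.commits[cid] = snapshot
        self.branches[branch] = cid
        return cid

    def branch(self, new_branch, from_branch):
        # a branch is a logical working copy: just a pointer to a commit
        self.branches[new_branch] = self.branches[from_branch]

    def scan(self, branch):
        # query the current state of a single branch
        return self.commits[self.branches[branch]]

ds = Dataset()
ds.commit("master", {1: "alice", 2: "bob"})
ds.branch("experiment", "master")
ds.commit("experiment", {2: "bob-cleaned"})
print(ds.scan("master")[2])      # "bob": master is unaffected
print(ds.scan("experiment")[2])  # "bob-cleaned"
```

Note how the branch is cheap to create (a pointer copy) and edits on it stay invisible to `master` until an explicit merge, which is the isolation property the text describes.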
Decibel is designed to support a general-purpose versioning and provenance query language called VQuel. As a declarative query language, VQuel extends its traditional equivalents (e.g., Quel, SQL) by allowing users to specify one or more versions of database tables in their operations. For instance, as the queries in the table below illustrate, users may perform a join or “diff” (difference) operation across two or more versions of a dataset.
Beyond introducing the notion of a table version, VQuel also leverages version lineage to afford the user a richer set of queries. For instance, users can now ask questions like “Which versions contain a record with a particular primary key?” or “What are all the versions that contain over 100 records with a particular column value?” or “What was the last version of a table where some predicate holds true?”
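The lineage queries above can be expressed over a toy model of a version history. The sketch below is ours and uses plain Python rather than VQuel syntax; the data and function names are purely illustrative.

```python
# Illustrative sketch: version-lineage queries over a toy model where
# each version is a list of (key, value) records. Not VQuel syntax.

versions = {
    "v1": [(1, 10), (2, 20)],
    "v2": [(1, 10), (2, 25), (3, 30)],
    "v3": [(2, 25), (3, 35)],
}
order = ["v1", "v2", "v3"]  # creation order along one lineage

def versions_with_key(pk):
    # "Which versions contain a record with a particular primary key?"
    return [v for v in order if any(k == pk for k, _ in versions[v])]

def last_version_where(pred):
    # "What was the last version where some predicate holds true?"
    for v in reversed(order):
        if pred(versions[v]):
            return v
    return None

print(versions_with_key(1))                         # ['v1', 'v2']
print(last_version_where(lambda rs: len(rs) >= 3))  # 'v2'
```

The point of a declarative language like VQuel is that users state such questions directly, and the engine chooses an efficient execution plan over whichever physical layout is in use.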
To support a broad range of versioned queries, choosing the right physical data layout is critical for achieving good performance and storage efficiency. Consider a naive physical design that stores each version in its entirety: if versions substantially overlap (which they generally will), such a scheme will be hugely wasteful of space. Moreover, data duplication could prove costly when performing cross-version operations, like diff, as it sacrifices the potential for shared computation over tuples that belong to multiple versions. When designing Decibel, we chiefly considered two physical storage models, the “version-first” model and the “tuple-first” model, each designed for a distinct subclass of versioned queries.
The version-first storage model stores modifications made to each branch in a separate file (or file-segment), and pointers between files mark direct ancestors in the version lineage. A linear chain of such files thus comprises the state of a branch. Since modifications to a branch are co-located within individual files, it is easier to read the contents of a single branch or version by traversing a linear path of files. However, this structure makes it difficult to perform queries that compare versions, e.g., that ask which versions satisfy a certain property or contain a particular tuple.
Figure 1 (below) shows a sample data layout, where each file (or file segment) contains tuples for that branch and its descendant branches. A tuple can be overwritten in a descendant branch by a tuple with the same key appearing again. Conceptually, then, when a version needs to be scanned or read, we must trace through the ancestor branches in reverse order, emitting only tuples whose keys have not already been encountered. For example, to scan branch C, we would read all of File 03, then the part of File 01 up to the branch point. When the tuple with key 3 is read in File 01, it would be ignored because it was overwritten in File 03. For additional details on performance optimizations and on scanning complex paths with multiple parents, keep an eye out for our forthcoming publication on Decibel.
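The branch-C scan just described can be sketched in a few lines. The file contents below are invented to mirror the example (File 03 holds branch C’s edits, including a rewrite of key 3; File 01 is the ancestor segment up to the branch point).

```python
# Sketch of a version-first scan: walk a branch's chain of files from
# newest to oldest, suppressing keys already emitted by a newer file.
# File contents are illustrative.

file_03 = [(3, "c-new"), (4, "d")]                   # branch C's own segment
file_01_prefix = [(1, "a"), (2, "b"), (3, "c-old")]  # ancestor file, up to the branch point

def scan_branch(files_newest_first):
    seen = set()
    for segment in files_newest_first:
        for key, value in segment:
            if key not in seen:   # newer files shadow older tuples
                seen.add(key)
                yield key, value

rows = dict(scan_branch([file_03, file_01_prefix]))
print(rows[3])  # "c-new": the tuple with key 3 in File 01 is ignored
```

This illustrates both properties in the text: reading one branch is a linear walk over a chain of files, but answering “which versions contain tuple X?” would require repeating such a walk for every branch.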
As an alternative, we also considered a tuple-first storage model, in which every tuple that has ever existed in any version of a table is stored in a single file. Each tuple in this file is annotated with the versions it belongs to. This approach is very efficient for queries that ask about the versions of some set of tuples (such queries can be answered simply by inspecting the version annotations), but it can be inefficient for queries that read a single version, since data from many versions is interleaved within the single file.
Figure 2 depicts how all tuples are stored in a heap file (on the left), while an index file (on the right) stores the version annotations for each tuple in the heap file. Note that the index file can be stored physically as a series of bitmaps, oriented either by row (tuple) or by column (branch).
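A small sketch makes the trade-off visible. Here the bitmaps are plain Python lists and the data is invented; a real engine would use compressed bitmaps over much larger heaps.

```python
# Sketch of a tuple-first layout: one heap of all tuples ever written,
# plus a per-branch bitmap marking which heap positions are live there.
# Data is illustrative.

heap = [(1, "a"), (2, "b"), (2, "b'"), (3, "c")]  # every tuple, all versions
live = {
    "master":  [1, 1, 0, 1],  # bit i = 1 if heap[i] is live in this branch
    "feature": [1, 0, 1, 1],
}

def scan(branch):
    # a single-version scan must touch the whole heap (the weakness noted above)
    return [t for t, bit in zip(heap, live[branch]) if bit]

def branches_containing(key):
    # a cross-version query is cheap: just inspect the bitmaps
    positions = [i for i, (k, _) in enumerate(heap) if k == key]
    return sorted(b for b, bits in live.items() if any(bits[i] for i in positions))

print(scan("feature"))         # [(1, 'a'), (2, "b'"), (3, 'c')]
print(branches_containing(2))  # ['feature', 'master']
```

The bitmap-intersection style of `branches_containing` is exactly why tuple-first excels at version-membership queries, while `scan` shows why single-version reads pay for the interleaving.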
To get the best of both worlds, we also propose a hybrid scheme that stores tuples across a set of distinct files as in the version-first scheme, but also annotates tuples in shared files as in the tuple-first scheme. For the operations we consider, the hybrid scheme performs as well as or better than both schemes above, and it also affords natural parallelism across most query types because we can easily identify which file-segments are required for a particular version.
Figure 3 (below) depicts how the two approaches are united. As in version-first, tuples can be overwritten by descendant branches. However, if the contents of a file-segment are shared by multiple branches, a small segment index records which tuples in that file are active in each descendant branch. Additionally, a global branch-segment index indicates which files hold data (i.e., active tuples) for each branch. This index enables more efficient version scans: we need not read file-segments in any particular order or track emitted tuples, and files that contain no relevant data for a given branch can be skipped entirely.
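The two-level index structure can be sketched as follows; all names and data are illustrative stand-ins for the structures in Figure 3.

```python
# Sketch of the hybrid layout: tuples live in file-segments, a small
# per-segment index records which of the segment's tuples are active in
# each branch, and a global branch-segment index lets a scan skip
# irrelevant files. All names and data are illustrative.

segments = {
    "s1": [(1, "a"), (2, "b")],
    "s2": [(2, "b'"), (3, "c")],
}
# per-segment index: branch -> bitmap over that segment's tuples
segment_index = {
    "s1": {"master": [1, 1], "feature": [1, 0]},  # feature overwrote key 2
    "s2": {"feature": [1, 1]},
}
# global index: which segments hold any active tuples for each branch
branch_segments = {"master": ["s1"], "feature": ["s1", "s2"]}

def scan(branch):
    rows = []
    for seg in branch_segments[branch]:  # skip unrelated files entirely
        bits = segment_index[seg][branch]
        rows += [t for t, b in zip(segments[seg], bits) if b]
    return rows

print(scan("master"))   # [(1, 'a'), (2, 'b')]
print(scan("feature"))  # [(1, 'a'), (2, "b'"), (3, 'c')]
```

Because the bitmaps already resolve which tuple wins in each branch, the scan needs no ordering or shadow-key bookkeeping, and each segment can be scanned in parallel, which is the source of the parallelism mentioned above.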
The graphs below demonstrate how each of the storage models described above scales as more branches are added to the system. The graph on the left shows latencies for scanning a single version of the table. As described above, tuple-first underperforms here because it stores data from all versions in a single table. However, tuple-first performs strongly in the graph on the right, which additionally queries which versions a set of records belongs to. The hybrid scheme outperforms the others on both query types.
As part of the Decibel project, we have been creating a benchmark to evaluate versioned datasets, and we will release an open-source module that other researchers can use to develop and explore the interesting problems that arise from a versioned storage manager. Expect to hear more about Decibel, the versioning benchmark, and detailed results soon.
As the engine of the DataHub platform, Decibel is an important step toward collaborative data science. Adoption of systems like Decibel can improve the cost-effectiveness and reliability of data management solutions in data-sharing environments, and broaden the scope of analysis techniques available to analysts using a more collaborative platform. Our initial prototype of Decibel explores the fundamental performance and storage trade-offs among several basic physical representations, and serves as a starting point for future research into tools for collaborative data science.
“VQuel”: A. Chavan, S. Huang, A. Deshpande, A. J. Elmore, S. Madden, and A. G. Parameswaran. Towards a unified query language for provenance and versioning. In TaPP, 2015.
“DataHub”: A. P. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, and A. G. Parameswaran. DataHub: collaborative data science & dataset version management at scale. In CIDR, 2015.
“DataHub, A Hosted Data Platform for Large-Scale Analytics, in Process of Being Deployed at MIT.” ISTC for Big Data Blog
*Michael Maddox is a computer science student at MIT interested in computer systems. Aaron J. Elmore is an Assistant Professor in the Department of Computer Science at the University of Chicago and a faculty member of the Intel Science and Technology Center for Big Data.