The Changing Landscape of Data Systems


By Anant Bhardwaj, MIT CSAIL*

Everyone continues to chase “Big Data”, from businesses to start-up companies to academic researchers. To better understand the changing landscape of data systems, I spent the last few months trying to learn about every data project I could find in research, start-ups and established companies.

Data systems have always been among the most important infrastructure components behind any software application, but they have rarely been designed with the end-user experience in mind. While data systems have changed in recent years, with some focus on usability, current data systems research is still too caught up in the hammers (powerful, scalable infrastructure solutions for volume, velocity and variety problems) and pays too little attention to the nails: the real problems that end-users face while performing their tasks.

Meanwhile, in the last decade, we witnessed tremendous growth in social media, consumer devices and sensors. Organizations in every industry (healthcare/medicine, retail, urban planning, education, agriculture and more) now have the ability to collect data about almost everything that matters, and there is an obvious appetite to find more value in that data as it proliferates. While we need data systems to manage the volume, velocity and variety, the true value of the data won’t be realized until people and organizations can use it to answer their questions.


We need an interface for data systems that will cater to all types of users, much as operating systems such as Mac OS X and MS Windows made it possible for anyone to use computers by abstracting away the complexity and providing an interface for user-facing applications. On one hand, such an interface for data systems would allow non-technical users such as product managers and financial analysts to explore, navigate and understand the data without requiring them to be familiar with the underlying complex data infrastructure, schemas and information models. On the other hand, it would let technical users such as expert data analysts and programmers work on increasingly sophisticated tasks with even more efficiency.

This was the central point that influenced the design of DataHub: we wanted to make data accessible to everyone. For non-technical users, the interface should be simple and non-intimidating, giving them immediate confidence that they can succeed; for expert users, it should be powerful enough to create sophisticated, complete solutions.


Figure 1: A high-level view of DataHub architecture. (Courtesy of A. Bhardwaj, MIT CSAIL.)

Like GitHub, DataHub allows datasets to be forked and branched, enabling different collaborators to work on their own versions of a dataset and later merge with other versions. Users can, for example, add new records, delete records, apply transformations, add derived columns, and delete redundant or useless columns, all in their own private version of the data, without having to create a complete copy or lose the correspondence with the original dataset.
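To make the fork-and-merge idea concrete, here is a minimal sketch in Python. It is not DataHub's implementation or API: the `DatasetVersion` class, its method names, and the "other branch wins" merge policy are all assumptions made for illustration. Each version stores only its changes relative to its parent, so a fork never copies the full dataset and always keeps its link to the original.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set

Record = Dict[str, object]


@dataclass
class DatasetVersion:
    """Hypothetical dataset version that stores only deltas against a parent."""
    parent: Optional["DatasetVersion"] = None
    added: Dict[int, Record] = field(default_factory=dict)  # id -> new/updated record
    deleted: Set[int] = field(default_factory=set)          # ids removed in this version

    def fork(self) -> "DatasetVersion":
        # A fork is an empty delta on top of this version; no data is copied.
        return DatasetVersion(parent=self)

    def records(self) -> Dict[int, Record]:
        # Materialize the view: parent's records, minus deletions, plus additions.
        view = dict(self.parent.records()) if self.parent else {}
        for rid in self.deleted:
            view.pop(rid, None)
        view.update(self.added)
        return view

    def add(self, rid: int, record: Record) -> None:
        self.deleted.discard(rid)
        self.added[rid] = dict(record)

    def delete(self, rid: int) -> None:
        self.added.pop(rid, None)
        self.deleted.add(rid)

    def add_derived_column(self, name: str, fn: Callable[[Record], object]) -> None:
        # Rewrites each visible record in this version only; the parent is untouched.
        for rid, rec in self.records().items():
            self.add(rid, {**rec, name: fn(rec)})

    def merge(self, other: "DatasetVersion") -> None:
        # Assumed policy: both versions share a parent, and `other` wins on conflict.
        if other.parent is not self.parent:
            raise ValueError("this sketch only merges versions forked from the same parent")
        for rid in other.deleted:
            self.delete(rid)
        for rid, rec in other.added.items():
            self.add(rid, rec)


main = DatasetVersion()
main.add(1, {"user": "ann", "rating": 4})
main.add(2, {"user": "bob", "rating": 2})

alice = main.fork()
alice.delete(2)
alice.add_derived_column("liked", lambda r: r["rating"] >= 3)

bob = main.fork()
bob.add(3, {"user": "cy", "rating": 5})

alice.merge(bob)  # alice now sees records 1 and 3; main still sees 1 and 2
```

A real system would need conflict detection and a more compact storage layout than per-record dictionaries; the sketch only shows why private edits do not require a complete copy.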

The DataHub app ecosystem (see Figure 1) hosts apps for various data-processing activities such as ingestion, curation, querying, visualization, data science, and machine intelligence. The apps could be designed for either novices (“point-and-click” interface) or expert users (an interface for writing SQL and scripts). While we provide many pre-installed apps for common tasks, the app ecosystem is open for third-party developers and vendors so that users can choose the apps that fit their needs. A new DataHub app can be written and published to the DataHub App Center using our language-agnostic APIs (see Figure 2).


Figure 2:  Third-party apps can provide services for datasets hosted on DataHub – for example, a recommendation service for (user, item, ratings). (Courtesy of A. Bhardwaj, MIT CSAIL.)
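As a rough illustration of the kind of third-party app Figure 2 describes, the Python sketch below defines a toy recommendation service over (user, item, rating) rows. The `DataApp` contract, the `TopRatedRecommender` class, and the `run` method are hypothetical names invented for this example; they are not the DataHub SDK, whose actual interface is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Protocol, Tuple

Rating = Tuple[str, str, float]  # (user, item, rating)


class DataApp(Protocol):
    """Hypothetical contract an app might satisfy; not the real DataHub API."""
    name: str

    def run(self, rows: Iterable[Rating]) -> Dict[str, List[str]]: ...


class TopRatedRecommender:
    """Toy recommender: suggest the best-rated items a user has not rated yet."""
    name = "top-rated-recommender"

    def __init__(self, k: int = 2) -> None:
        self.k = k

    def run(self, rows: Iterable[Rating]) -> Dict[str, List[str]]:
        totals: Dict[str, float] = defaultdict(float)
        counts: Dict[str, int] = defaultdict(int)
        seen: Dict[str, set] = defaultdict(set)
        for user, item, rating in rows:
            totals[item] += rating
            counts[item] += 1
            seen[user].add(item)
        # Rank items by average rating (descending), breaking ties by item name.
        ranked = sorted(totals, key=lambda i: (-totals[i] / counts[i], i))
        return {
            user: [i for i in ranked if i not in items][: self.k]
            for user, items in seen.items()
        }


ratings = [
    ("ann", "a", 5.0), ("ann", "b", 3.0),
    ("bob", "b", 4.0), ("bob", "c", 2.0),
    ("cy", "a", 4.0),
]
app: DataApp = TopRatedRecommender(k=2)
recommendations = app.run(ratings)
# recommendations == {"ann": ["c"], "bob": ["a"], "cy": ["b", "c"]}
```

Keeping the contract to a single method over plain rows is one way an app could stay independent of how the hosting platform stores the dataset.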

The current DataHub platform provides the following:

  1. For non-technical users
    • ingest your data in DataHub from spreadsheets, files, web pages, etc.
    • seamlessly share your data with friends and colleagues
    • access a suite of tools from the DataHub app store to explore, manage and process your data with a simple point-and-click interface
  2. For expert users
    • SQL for querying and JDBC/ODBC for connecting programmatically with other analytics and BI tools
    • a powerful language-agnostic notebook for data science
    • many choices of apps from the DataHub app store for data processing, including ingestion, curation, integration, discovery, query, analytics, visualization and machine learning
  3. For developers
    • a hosted backend for your mobile, web, and IoT apps
    • an SDK for writing apps on top of DataHub in 20+ languages
    • built-in support for complex analytics

For more details, please see our VLDB 2015 demo paper. Check out the DataHub code here.

The MIT DataHub project is supported by grants from the National Science Foundation and the Intel Science & Technology Center for Big Data.

*Anant Bhardwaj is a Ph.D. student in the Computer Science & Artificial Intelligence Laboratory (CSAIL) at MIT, co-advised by David Karger and Samuel Madden. He received an M.S. in Computer Science from Stanford University and a B.E. in Computer Engineering from the University of Pune. At Stanford, he worked with Scott Klemmer and Jeff Heer. His primary interest these days is in developing systems and tools for data management. His projects draw on ideas from fields such as databases, distributed systems, algorithms, machine learning, and human-computer interaction.

 

 
