Simplifying and Scaling Data Discovery

By Raul Castro Fernandez, MIT CSAIL

People who rely on data for their jobs are spending more and more time searching for the data relevant to the task at hand. This is particularly true in data-driven companies, where the heterogeneity of data sources and storage systems is broader than ever before and where data volumes keep growing. We are building a data discovery system to solve this challenge, increasing the productivity and quality of data-based jobs. The system has an interface similar to a search engine, which allows anyone, regardless of technical skill level, to find data of interest. The interface accepts keywords and special tokens to assist users in discovering relevant data.

For example, if an analyst wants to answer the question "How many employees are there per department and per gender?", a query such as must_match_schema("employee", "department", "gender") will return the tables, or combinations of tables, that may contain the desired output. If the user needs access to all product IDs because a new formatting rule applies to them, a query such as content_similar_to(&lt;productID&gt;) will return every data source that contains product IDs, avoiding a manual search over possibly millions of data sources.
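To make the two queries above concrete, here is a toy sketch of what such a client could look like. The class, method names, and matching logic are purely illustrative assumptions, not the system's actual API; the real engine ranks matches over profiles rather than scanning raw values.

```python
# Hypothetical discovery-client sketch; names and logic are illustrative.

class DiscoveryClient:
    """Toy in-memory stand-in for the discovery query engine."""

    def __init__(self, tables):
        # tables: dict mapping table name -> dict of column name -> values
        self.tables = tables

    def must_match_schema(self, *keywords):
        """Return tables whose column names cover all given keywords."""
        hits = []
        for name, cols in self.tables.items():
            colnames = {c.lower() for c in cols}
            if all(any(k in c for c in colnames)
                   for k in map(str.lower, keywords)):
                hits.append(name)
        return hits

    def content_similar_to(self, values, threshold=0.5):
        """Return (table, column) pairs whose values overlap with `values`."""
        target = set(values)
        hits = []
        for name, cols in self.tables.items():
            for col, vals in cols.items():
                overlap = len(target & set(vals)) / max(len(target), 1)
                if overlap >= threshold:
                    hits.append((name, col))
        return hits

tables = {
    "hr_employees": {"employee_id": [1, 2],
                     "department": ["sales", "it"],
                     "gender": ["f", "m"]},
    "catalog": {"product_id": ["P-10", "P-11"], "price": [9, 12]},
}
client = DiscoveryClient(tables)
print(client.must_match_schema("employee", "department", "gender"))
print(client.content_similar_to(["P-10", "P-11"]))
```

Running this returns `hr_employees` for the schema query and the `catalog.product_id` column for the content query, mirroring the two examples above.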

Before providing more details on how we are building this system, let me illustrate why with a common business scenario.

Unexpected Challenges

During the first meeting with your new boss (you are a recent hire at a big company), you receive a list of questions the company is interested in answering. These questions will help the company understand how to optimize some important business processes. The questions look great on paper, but as you prepare to answer the first one (whether there is a correlation between an internal variable and sales), you realize you don't know what data to use. You have an intuition, of course, but you don't know how to navigate the thousands of different data sources (e.g., tables and files) available across company departments. Your only option is to ask someone, which takes you through a seemingly infinite chain of social interactions with multiple employees. After a long process, you finally find some data that seems interesting. But how can you be sure that this is all the data relevant to answering the question? What if you are missing important evidence? What you have in front of you is called…

A Data Discovery Problem

Modern organizations keep data in multiple relational databases, data lakes, files and other kinds of repositories, with each department or business unit responsible for parts of it. In this scenario, it becomes difficult to understand the data under the domain of your department, and much more difficult to know the data maintained by different groups and departments. Given the heterogeneity of data sources and ever-increasing volumes of data, finding relevant data is a tedious, time-consuming process. Our capacity to make sense of big data diminishes if we don’t have systems in place to help us navigate through this data. Discovering relevant data to answer questions is fast becoming the most time-consuming task for analysts, reducing their productivity.

Is This a New Problem?

Data discovery isn’t a new problem; it’s related to other work in the database and data mining communities. For decades, researchers have been trying to solve the problem of data integration, which aims to build a global mapping of all data sources in an organization so that, among other things, it becomes easier to find relevant data. Unfortunately, data integration is hard when an organization consists of many different departments, each one requiring permissions to access external sources relevant to local tasks. To maximize the chances of success for any data integration initiative, it’s necessary to enforce policies for creating new datasets. This, however, requires additional effort by employees, plus systems in place to check that all policies are applied correctly. This is intrusive and may hinder employees’ agility.

For this reason, we want a solution that works out of the box, without modifying or interfering with employees’ daily routines. Instead of following the top-down approach of data integration solutions, we propose a bottom-up approach: our solution learns everything it can about the existing data by mining all existing relationships and representing them in a concise data structure that can be queried by different users to discover relevant data.


A Fresh Approach

Our approach to a data discovery system is based on a simple observation: if X is relevant to Y, then some relationship exists between X and Y. Hence, we envision a system that mines all connections found in the data (similarity in content, schema names, value overlap, etc.) and organizes them so they can be efficiently queried and ranked for user consumption. We believe that by defining an API and a query engine on top of this knowledge representation, we give users more freedom to express the kinds of questions they have, as opposed to building bespoke systems that help only with a given set of tasks.

With that goal in mind, our data discovery system consists of three cooperating components (see Figure 1). The first component (bottom) is a high-performance profiler and data summarizer. It transforms data from multiple data sources into concise representations, or summaries, and operates at the field level, i.e., attributes of tables or columns of semi-structured data. The second component (middle) is in charge of representing all the acquired knowledge, i.e., the summaries extracted by the profiler. We use a multigraph for this, with nodes representing fields and edges representing the different relationships mined from the summaries, e.g., content similarity, schema similarity, or primary-key/foreign-key relationships. Finally, the third component (top) is a query engine that combines techniques from information retrieval with graph-traversal algorithms to find significant relationships and sources in the multigraph. This allows users to find the relevant subsets of data they are interested in.
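The middle component can be pictured as follows. This is a minimal sketch under my own assumptions (the field identifiers, profile contents, and relation labels are invented for illustration); the actual system's representation is more elaborate. The key property shown is that a multigraph admits several edges, one per mined relationship, between the same pair of field nodes.

```python
# Illustrative multigraph over column profiles; names are hypothetical.

from collections import defaultdict

def profile(column_values):
    """Profiler/summarizer: reduce a column to a concise summary."""
    vals = [str(v) for v in column_values]
    return {"cardinality": len(set(vals)), "sample": set(vals[:100])}

class DiscoveryMultigraph:
    def __init__(self):
        self.nodes = {}                  # field id -> profile
        self.edges = defaultdict(list)   # field id -> [(field id, relation)]

    def add_field(self, field_id, values):
        self.nodes[field_id] = profile(values)

    def link(self, a, b, relation):
        # A multigraph allows several labeled edges between two nodes.
        self.edges[a].append((b, relation))
        self.edges[b].append((a, relation))

    def neighbors(self, field_id, relation=None):
        return [b for b, r in self.edges[field_id]
                if relation is None or r == relation]

g = DiscoveryMultigraph()
g.add_field("hr.employees.dept", ["sales", "it", "hr"])
g.add_field("finance.budget.department", ["sales", "it"])
g.link("hr.employees.dept", "finance.budget.department", "content_similarity")
g.link("hr.employees.dept", "finance.budget.department", "schema_similarity")
print(g.neighbors("hr.employees.dept", "content_similarity"))
```

The query engine on top would then answer discovery queries by traversing these typed edges, filtered by the relationship a user cares about.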

Figure 1: A bottom-up automated system for simplifying data discovery. (Source: Raul Castro Fernandez, MIT CSAIL).


Users of this system will be able to write simple keyword queries (similar to those in a search engine) and get back a list of data sources that match the keywords. They can also use similarity functions such as content_similar_to, which returns data sources whose content (i.e., values) is similar to the ones provided, or schema_similar_to, which exploits information about the data source schema, when available, to return the best matches. We have defined a number of these data discovery primitives and are using them to build more complex functionality, such as must_match_schema, which finds the best combination of tables matching a schema similar to the one provided. This is powerful: users do not need to know the exact schema of all the sources. Instead, they can learn the relevant schema by interacting with the system through simple queries. We also have a function, add_column, that enriches the provided schema with a desired column drawn from other data sources in the organization, provided such a column exists. This is useful for discovering new variables or quickly assembling datasets with additional features, a common need for the machine learning tasks that are omnipresent in organizational processes today.
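An add_column-style primitive might be composed from simpler pieces roughly like this. Everything here (the signature, the join-key heuristic, the data) is a hypothetical sketch of the idea, not the system's implementation, which would search join paths over the multigraph rather than raw rows.

```python
# Hypothetical sketch of an "add_column"-style enrichment primitive.

def add_column(base, base_key, candidates, wanted):
    """Enrich `base` rows with column `wanted` found in a candidate table.

    base:       list of dict rows to enrich
    base_key:   join column present in `base`
    candidates: dict of table name -> list of dict rows
    wanted:     name of the column to add, if any source provides it
    """
    for name, rows in candidates.items():
        if rows and wanted in rows[0] and base_key in rows[0]:
            lookup = {r[base_key]: r[wanted] for r in rows}
            return [dict(row, **{wanted: lookup.get(row[base_key])})
                    for row in base]
    return None  # no source in the organization provides the column

employees = [{"dept": "sales"}, {"dept": "it"}]
candidates = {"budgets": [{"dept": "sales", "budget": 100},
                          {"dept": "it", "budget": 80}]}
print(add_column(employees, "dept", candidates, "budget"))
```

The call enriches each employee row with a `budget` value joined from the `budgets` table; if no candidate table carries the wanted column, the primitive reports that no source exists.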

We are working hard to solve the many challenges of building such a discovery system. To tame the scale of the problem (millions of data sources and complex algorithms for finding relationships), we are building a distributed solution that shares the work among many processors. We are carefully choosing algorithms that let us scale pairwise operations, e.g., answering "which data sources are similar to this one?". To speed up query execution, we are designing new indices on the multigraph. We are deploying our prototype on real data from organizations and open repositories, including the MIT data warehouse, a big pharma company, and large open-government-data repositories. We are incorporating user feedback and iterating toward the best API to permit discovery at scale.
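The post doesn't name the algorithms used to scale pairwise similarity, but a standard way to avoid comparing millions of full columns pairwise is to compare small MinHash signatures, whose agreement rate estimates the Jaccard similarity of the underlying value sets. The sketch below assumes that technique; it is not necessarily what the prototype uses.

```python
# MinHash signatures as one plausible way to scale pairwise similarity.

import hashlib

def minhash(values, num_hashes=64):
    """Signature whose per-slot agreement estimates Jaccard similarity."""
    sig = []
    for i in range(num_hashes):
        # Simulate num_hashes hash functions by salting one hash with i.
        sig.append(min(
            int(hashlib.md5(f"{i}:{v}".encode()).hexdigest(), 16)
            for v in values))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash({"P-10", "P-11", "P-12"})
b = minhash({"P-10", "P-11", "P-13"})
print(round(estimated_jaccard(a, b), 2))  # estimate near the true Jaccard, 0.5
```

With fixed-size signatures, each column is summarized once by the profiler, and the pairwise comparisons become cheap enough to distribute across many processors.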


We believe that by mining all relationships between often-disconnected data sources, we can learn about the data in an organization and offer new insights. This would not only be useful to data scientists but would also shed new light on some traditional problems of the database community, such as data integration and data quality. The path is long and full of challenges, but we have a first prototype of the tool and the results are encouraging. We’ll keep you posted!


Editor’s Note: A paper on this work will be presented at ExploreDB 2016, the 3rd International Workshop on Exploratory Search in Databases and the Web, July 1, 2016, San Francisco, Calif. (co-located with SIGMOD/PODS)

  • “Towards Large-Scale Data Discovery.” Raul Castro Fernandez, Ziawasch Abedjan, Samuel Madden and Michael Stonebraker


