Enabling Interactive Data Exploration over Big Data

By Ugur Cetintemel, Brown University

Interactive Data Exploration (IDE) has been a recent focus area of the Brown Data Management Group. This is an emerging form of data-intensive analytics in which users ask questions over a data set to make sense of the data, identify interesting patterns and relationships, and bring aspects of interest into focus for further analysis.

Users typically start with some high-level unknowns and hypotheses in mind and iteratively reformulate or revise them as they learn from the data. As a result, Interactive Data Exploration is fundamentally a multi-step, non-linear process with underspecified end goals. It’s both labor-intensive and inefficient, as it may require users to ask a large number of queries as they navigate potentially large, amorphous data sets.

IDE is quickly becoming a key ingredient of discovery-oriented applications in diverse areas, including scientific computing, financial analysis, evidence-based medicine, and genomics. At the same time, traditional database systems and tools do not offer adequate support for IDE. They are not designed for human-in-the-loop (interactive) usage, and they are built on the assumption that users already have a good understanding of the structure and contents of the database, as well as of the questions to be asked.

We have recently initiated a number of complementary research projects to address these limitations and make progress toward enabling IDE over large data sets:

Query Steering: In this project, we’re developing techniques to build and leverage user profiles (i.e., models of user interests, goals, and database interaction patterns) to improve query performance and offer customized data navigation and visualization support to users. Traditional database systems are agnostic about what’s “above” them, offering generic, one-size-fits-all (aka one-size-fits-none) services for all users and applications. By widening the narrow application-database interface with usage information, we allow the database system to customize its operation at a fine-grained level (per user or per application). A long-term goal of this project is to build a “data navigation system” that would act as a tour guide for non-expert users, effectively and efficiently guiding them through the data space while highlighting interesting data features or trajectories.

Initially, we’ve developed profile-driven prefetching and caching techniques to improve the interactivity of big data visualization. This project, which we are pursuing in collaboration with the MIT Database Group, will be the topic of next week’s blog post by Justin DeBrabant of Brown and Leilani Battle of MIT CSAIL.

TupleWare: We are also in the process of building a new data processing system for complex interactive analytics and visualization. Today, users are forced to choose between expressiveness (e.g., R, MATLAB, Python) and data scalability (e.g., Hadoop). With TupleWare, we aim to eliminate this artificially enforced choice and make it easy for users to incorporate big data processing primitives within their preferred computing environment by using state-of-the-art compilation and language-binding techniques. We’ll say more about this new project in a future post.

Semantic Windows: Finally, we are developing “exploration-oriented” query abstractions and associated constructs, which are lacking in existing query languages.

Suppose that we are studying the Sloan Digital Sky Survey (SDSS) and want to identify all “3° by 2°” celestial regions in which the average brightness of all “stars” is greater than “0.8.” Or suppose we are studying NYSE trading data and are interested in finding a time period of “1 to 5” months during which the average stock price of all tech stocks is greater than “$40” per share. Such queries are difficult to express and even harder to execute efficiently with standard DBMSs. For example, existing SQL constructs (e.g., GROUP BY and OVER) do not allow users to pose these queries directly, typically requiring the use of Common Table Expressions (CTEs) that are inconvenient to use and difficult to optimize. Furthermore, existing query execution models do not offer online results, forcing users to wait until the query is done (which can take a looong time if the data set is large).
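
To make the stock-price query concrete, here is a minimal brute-force sketch in Python. The function name windows_above and the sample prices are hypothetical and chosen only for illustration; nothing here comes from our system. It enumerates every contiguous period of 1 to 5 months and checks its average, which is the kind of exhaustive work that becomes expensive on large data sets.

```python
from itertools import accumulate


def windows_above(monthly_avg, min_len, max_len, threshold):
    """Return (start, end) month index pairs, inclusive, whose mean exceeds threshold."""
    # prefix[i] holds the sum of the first i monthly values, so any
    # window sum is a single subtraction.
    prefix = [0.0] + list(accumulate(monthly_avg))
    hits = []
    for start in range(len(monthly_avg)):
        for length in range(min_len, max_len + 1):
            end = start + length  # exclusive end index
            if end > len(monthly_avg):
                break  # longer windows from this start would run off the data
            mean = (prefix[end] - prefix[start]) / length
            if mean > threshold:
                hits.append((start, end - 1))
    return hits


# Hypothetical monthly average prices for a basket of tech stocks.
prices = [35.0, 38.0, 42.0, 45.0, 39.0, 36.0]
result = windows_above(prices, min_len=1, max_len=5, threshold=40.0)
print(result)
```

Even in this tiny one-dimensional case, every candidate window has to be examined before the full answer is known, and nothing is reported in order of promise.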

To better support such “structured search” queries, we proposed an approach called “Semantic Windows” (SWs), by which users can search for multidimensional “windows” (rectangular regions) of interest by specifying conditions on the structure (i.e., shape) and the contents of target windows. To identify windows of interest quickly in a large data set, we use a sampling-guided, data-driven search strategy that steers the search toward promising regions of the underlying data space. This search-based query execution model is also a natural fit for interactive query execution, as results can be presented to the user as soon as they are identified.
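
The general idea of a sampling-guided search with online results can be sketched in a few lines of Python. This is a deliberately simplified illustration under our own assumptions, not the actual Semantic Windows algorithm: the function names, the fixed window shape, the sample size k, and the toy grid are all hypothetical. Each candidate window gets a cheap estimate from a few randomly sampled cells, candidates are then verified exactly in order of their estimates, and a generator hands back each qualifying window as soon as it is confirmed.

```python
import heapq
import random


def exact_mean(grid, top, left, h, w):
    """Exact average over the h-by-w window whose upper-left cell is (top, left)."""
    cells = [grid[r][c] for r in range(top, top + h) for c in range(left, left + w)]
    return sum(cells) / len(cells)


def sampled_mean(grid, top, left, h, w, k, rng):
    """Cheap estimate of the window average from k randomly chosen cells."""
    total = 0.0
    for _ in range(k):
        r = top + rng.randrange(h)
        c = left + rng.randrange(w)
        total += grid[r][c]
    return total / k


def search_windows(grid, h, w, threshold, k=3, seed=0):
    """Yield (top, left) of windows whose exact mean exceeds threshold,
    visiting candidates in order of their sampled estimate (highest first)."""
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    heap = []
    for top in range(rows - h + 1):
        for left in range(cols - w + 1):
            est = sampled_mean(grid, top, left, h, w, k, rng)
            # heapq is a min-heap, so negate the estimate to pop the best first.
            heapq.heappush(heap, (-est, top, left))
    while heap:
        _, top, left = heapq.heappop(heap)
        # The estimate only sets the visiting order; the exact check decides.
        if exact_mean(grid, top, left, h, w) > threshold:
            yield (top, left)


# Toy "brightness" grid with one bright 2-by-2 region.
grid = [
    [0.1, 0.2, 0.1, 0.1],
    [0.2, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.9, 0.2],
    [0.1, 0.1, 0.2, 0.1],
]
found = list(search_windows(grid, h=2, w=2, threshold=0.8))
print(found)
```

Note that this sketch still verifies every candidate eventually; the sampling only changes the order, so that promising windows tend to be checked and reported early. A caller that consumes the generator lazily can show the first results to the user while the rest of the search is still running.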

As a proof of concept, we implemented the Semantic Windows model as a layer on top of PostgreSQL and SciDB, and showed that we can identify result SWs much faster than is possible with standard query execution and optimization techniques. A paper describing the Semantic Windows approach is currently under submission.

 
