Almost all areas of science are moving to a more data-driven analysis pipeline, in which large multidimensional datasets must be explored and analyzed for possible insights. Exploring large datasets is inherently interactive: the user reacts to what the data reveals, and those reactions in turn shape future queries.
Unfortunately, many visualization interfaces were designed with the assumption that much, if not all, of the data to be visualized would reside in memory. As both the quantity and quality of the tools used to collect data have improved, datasets have continued to grow in size, and this assumption often no longer holds. However, interactivity necessitates low-latency access, and the latency required to fetch data from disk for each interaction with the interface is unacceptable.
To address the challenge of scaling visualization to Big Data, we have implemented a data visualization system called ScalaR that provides a web-based, map-style interface (think Google Maps) for viewing large data sets. We presented a paper about ScalaR at the First Workshop on Big Data Visualization, held recently in Santa Clara, Calif. The rest of this post summarizes the new approaches to scaling data visualization that we are taking in ScalaR.
Why New Approaches Are Needed
There are two general approaches to scaling visualization techniques to Big Data. One is to take into account data size in the visualization frontend, an approach taken by our ISTC for Big Data collaborators in the Stanford Visualization Group for their imMens project. The goal of such an approach is to scale using visual summaries of the underlying data. While this works well for certain datasets and analysis environments, there exist cases where data cannot be meaningfully summarized, and deeper exploration is needed. In particular, we consider satellite imagery from the MODIS dataset, which is on the order of terabytes in size. While summarization techniques are certainly popular and widely used for MODIS data, the result is often still a larger-than-memory summarized dataset. Thus, a different approach is necessary.
A second approach is to attempt to hide the latency of the backend data store, whatever that data store may be. Hiding disk latency is by no means a unique problem, and many techniques have been studied over the years, foremost among them prefetching and caching of data that will be needed in the near future. To this end, we are developing a predictive middleware that will reside between the frontend visualization interface and the backend data store and will predict, prefetch and cache relevant data. However, deciding what to prefetch and cache is no simple task. Because of the multidimensional nature of scientific data, points that are adjacent in the data space are unlikely to be co-located in the disk's one-dimensional layout. Thus, more advanced techniques for predicting which data to prefetch are needed.
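To make the middleware's role concrete, here is a minimal sketch of a prefetching cache layer in Python. It is not ScalaR's actual implementation; the names (`PrefetchingMiddleware`, `fetch_from_store`, `predict_next`) and the LRU eviction policy are illustrative assumptions. The key idea is the same, though: serve repeated requests from memory, and use a pluggable prediction model to warm the cache before the user asks.

```python
from collections import OrderedDict

class PrefetchingMiddleware:
    """Illustrative sketch: sits between the visualization frontend and the
    backend data store, caching recently used tiles and prefetching
    predicted ones. Names and eviction policy are assumptions, not ScalaR's."""

    def __init__(self, fetch_from_store, predict_next, capacity=128):
        self.fetch_from_store = fetch_from_store  # backend query function
        self.predict_next = predict_next          # pluggable prediction model
        self.capacity = capacity
        self.cache = OrderedDict()                # LRU cache: tile_id -> data

    def get(self, tile_id):
        if tile_id in self.cache:                 # cache hit: no disk latency
            self.cache.move_to_end(tile_id)
            data = self.cache[tile_id]
        else:                                     # cache miss: go to the store
            data = self.fetch_from_store(tile_id)
            self._put(tile_id, data)
        # Warm the cache with whatever the model expects the user to ask for next.
        for predicted in self.predict_next(tile_id):
            if predicted not in self.cache:
                self._put(predicted, self.fetch_from_store(predicted))
        return data

    def _put(self, tile_id, data):
        self.cache[tile_id] = data
        if len(self.cache) > self.capacity:       # evict least recently used
            self.cache.popitem(last=False)
```

In a real deployment the prefetching would run asynchronously so it does not block the user's current request; it is shown inline here only to keep the sketch short.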
One method for predicting the data a user will need is to model the user's interactions with the system. Luckily, the domain of possible interactions is limited by the visualization interface, so the interface can define a query template for each possible interaction. These templates are instantiated with parameters based on the subset of the data the user is currently visualizing. Thus, prediction breaks down into two distinct parts: predicting the template and predicting the parameters. Because each template potentially has its own set of possible parameters, defined by the query semantics for that template, parameter prediction depends on template prediction. In this way, the model is hierarchical: the template is predicted first, followed by the parameters.
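The two-level structure described above can be sketched as follows. This is a deliberately simple stand-in for whatever model ScalaR ultimately uses: it predicts the next template from first-order transition counts over the interaction history, then predicts that template's parameters from their observed frequencies. The template names ("pan", "zoom_in") and tuple-valued parameters are hypothetical.

```python
from collections import defaultdict, Counter

class HierarchicalPredictor:
    """Illustrative two-level predictor: first the next query template,
    then that template's parameters. The real model could be far richer;
    this only shows the template-then-parameters hierarchy."""

    def __init__(self):
        # transition counts between templates, e.g. "pan" -> "zoom_in"
        self.template_transitions = defaultdict(Counter)
        # per-template counts of observed parameter choices
        self.param_history = defaultdict(Counter)
        self.last_template = None

    def observe(self, template, params):
        """Record one user interaction (an instantiated query template)."""
        if self.last_template is not None:
            self.template_transitions[self.last_template][template] += 1
        self.param_history[template][params] += 1
        self.last_template = template

    def predict(self):
        """Return the most likely (template, params) pair, or None."""
        if self.last_template is None:
            return None
        transitions = self.template_transitions[self.last_template]
        if not transitions:
            return None
        template = transitions.most_common(1)[0][0]   # step 1: the template
        params_counts = self.param_history[template]  # step 2: its parameters
        params = params_counts.most_common(1)[0][0] if params_counts else None
        return template, params
```

Note how the parameter lookup is keyed on the *predicted* template, reflecting the dependency described above: each template carries its own parameter space.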
Another method is to find similarities in the underlying data. Suppose we can divide the dataset into a set of discrete, non-overlapping data tiles, and the user explores the data set by viewing subsets of these data tiles. You can think of this as creating a Google Maps-style interface over the dataset, where each map tile is replaced with a data tile from our dataset. Interactions with the visualizations then translate to a stream of requests for subsets of data tiles on the backend, as the user moves around and zooms through the data set.
When interpreting visualizations, users are drawn to various statistical properties of a dataset, such as outliers or trends. We can take advantage of this by computing a statistical signature for each data tile the user has recently viewed, and comparing these signatures to signatures for new tiles the user hasn't seen yet. Our goal is to identify new tiles that share the same statistical properties, and suggest them as potential points of interest for the user. You can think of a tile's statistical signature as a set of numbers summarizing the distribution of the data stored in the tile. For example, computing the mean and standard deviation of a data tile would be a very simple statistical signature. To identify similar tiles, we would look for tiles with a similar mean and standard deviation.
ScalaR at Work
The ScalaR data visualization system provides a web-based, map-style interface for viewing large data sets stored in SciDB. ScalaR takes an SQL query as input, and returns a visualization of the query results. The user can then move and zoom through the dataset by interacting with the visualization interface.
Figures 1 and 2 are examples of ScalaR’s interface, modified for a user study we will be conducting with domain scientists using NASA MODIS satellite imagery data. More details on ScalaR are provided in our BigDataVis ‘13 paper. We plan to extend ScalaR’s architecture to incorporate the above data prefetching models.
A diagram of the new architecture is provided in Figure 3.