by Zhicheng Liu, Stanford University
Interactive visualization of large datasets is key to making big data technologies accessible to a wide range of data users. However, as datasets grow in size, they strain traditional methods of interactive visual analysis, forcing data analysts and enthusiasts to spend more time on “data munging” and less time on analysis, or to abandon certain analyses altogether.
At the Stanford Visualization Group, as part of the Intel Science and Technology Center for Big Data, we are developing imMens, a system that enables real-time interaction with databases of a billion or more elements using scalable visual summaries. These scalable visual representations are based on binned aggregation and support a variety of data types: ordinal, numeric, temporal, and geographic. To achieve interactive brushing & linking between the visualizations, imMens precomputes multivariate data projections and stores them as data tiles. The browser-based front end dynamically loads the appropriate data tiles and uses WebGL to perform both data processing and rendering on the GPU.
The first challenge we faced in designing imMens was how to make visualizations with a huge number of data points interpretable. Over-plotting is a typical problem even with just thousands of data points. We considered various data reduction techniques. Sampling, for example, picks a subset of the data, but remains prone to visual clutter; more importantly, sampling can miss interesting patterns and outliers. Another approach is binned aggregation: we define bins over each dimension, count the number of data points falling within each bin, and then visualize the density of the data distribution using histograms or heatmaps. Binned aggregation gives a complete overview of the data without omitting local features such as outliers.
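As a concrete sketch, binned aggregation over a single numeric dimension reduces to a fixed-width histogram count. The data values and bin settings below are purely illustrative, not imMens's actual configuration:

```python
def binned_counts(values, lo, hi, num_bins):
    """Count how many values fall into each of num_bins equal-width bins over [lo, hi)."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if lo <= v < hi:  # ignore values outside the binned range
            counts[int((v - lo) / width)] += 1
    return counts

# Example: a 1-D histogram over hypothetical hour-of-day values
hours = [0.5, 1.2, 1.9, 13.0, 13.4, 23.9]
print(binned_counts(hours, 0, 24, 24))
```

The same idea extends to two dimensions (a heatmap) by counting into a grid of (x, y) bins; no matter how many raw points there are, the visualization only ever draws one mark per bin.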
A geographic heatmap showing 4.5 million location-based checkins on Brightkite
The number of bins we can define is constrained by the number of pixels on the screen; at the limit, we can map one bin to one pixel. When the number of bins exceeds the available screen real estate, zooming and panning let users navigate the larger visual space. In addition to zooming and panning, brushing & linking is a powerful interaction technique for exploring the relationships between different data dimensions.
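One way to reconcile bin resolution with pixel resolution is multiscale binning in the spirit of map tiles: each zoom level doubles the number of bins per dimension, so zooming in reveals finer-grained aggregates. A minimal sketch, where the base resolution of 256 bins per dimension is an assumption for illustration:

```python
def bins_at_zoom(base_bins, z):
    """Number of bins per dimension at zoom level z, doubling per level."""
    return base_bins * (2 ** z)

def bin_index(value, lo, hi, num_bins):
    """Map a value in [lo, hi) to its bin index at the current resolution."""
    return int((value - lo) / (hi - lo) * num_bins)

# e.g. a longitude of -122.17 at zoom level 3, starting from 256 base bins
idx = bin_index(-122.17, -180.0, 180.0, bins_at_zoom(256, 3))
```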
Four linked visualizations showing checkin distributions by location, month, day and hour with a selection from January to June in the month histogram. All the other visualizations show corresponding distributions.
Supporting these interaction techniques constitutes the second challenge in developing imMens. Consider brushing & linking: to compute the aggregated values in each bin across the five dimensions of the four linked visualizations, we could naively construct a five-dimensional data cube. The resulting cube would contain billions of rows, too big to fit in memory and too slow to query.
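To see why precomputed aggregates make brushing & linking fast, consider a hypothetical (month, hour) count table: a brush over a range of months is answered by summing the selected rows, with no pass over the raw records. The fake count data below is for illustration only:

```python
# Fake precomputed counts binned by (month, hour) -- 12 rows of 24 hour bins
month_hour = [[(m * 24 + h) % 5 for h in range(24)] for m in range(12)]

def linked_hour_counts(cube, month_lo, month_hi):
    """Hour histogram restricted to a brushed month range [month_lo, month_hi]."""
    return [sum(cube[m][h] for m in range(month_lo, month_hi + 1))
            for h in range(24)]

# Brushing January through June updates the linked hour histogram
jan_to_jun = linked_hour_counts(month_hour, 0, 5)
```

The cost of answering the brush depends only on the number of bins, never on the number of underlying data points, which is what makes the approach scale.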
At the crux of our solution for interactive scalability is the data tile concept. Inspired by the map tiles used in systems like Google Maps, imMens computes multiple three- and four-dimensional data cubes and decomposes them into smaller cubes called data tiles. The rationale is similar to that of map tiles: load only the data needed to render the current visualizations. The important difference is that data tiles are not pre-rendered images; they are multidimensional data projections used for both querying and rendering.
From data cube to data tiles
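The savings from decomposition can be seen with back-of-the-envelope arithmetic. The bin counts and the particular set of projections below are illustrative only, not the actual thirteen tiles imMens computes:

```python
from functools import reduce
from operator import mul

# Hypothetical bin counts per dimension (illustrative only)
bins = {"x": 100, "y": 100, "month": 12, "day": 7, "hour": 24}

def cube_size(dims):
    """Number of cells in a data cube over the given dimensions."""
    return reduce(mul, (bins[d] for d in dims), 1)

full = cube_size(["x", "y", "month", "day", "hour"])  # the naive 5-D cube
# Lower-dimensional projections covering only the dimension pairs that are
# actually brushed and linked together:
projections = [["x", "y", "month"], ["x", "y", "day"], ["x", "y", "hour"],
               ["month", "day"], ["month", "hour"], ["day", "hour"]]
decomposed = sum(cube_size(p) for p in projections)
print(full, decomposed)  # the projections are far smaller than the full cube
```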
This decomposition greatly reduces the data size, from 2.3 billion cells in the full five-dimensional cube to 17.6 million cells across the thirteen data tiles used by the visualizations. To further optimize performance, imMens packs these data tiles as pixel values in images and uses the GPU for both querying and rendering. In a comparative benchmark against Profiler, which uses an in-memory data cube, imMens sustains 50 frames per second of brushing and linking over 1 billion data points across 25 visualizations.
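Packing aggregates as pixel values can be sketched as splitting each count across the four RGBA byte channels of a texture, so a data tile ships to the GPU as an ordinary image. The exact encoding imMens uses may differ; this shows only the idea:

```python
def encode_count(count):
    """Split a 32-bit count into four bytes: (R, G, B, A) channels of one pixel."""
    return [(count >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def decode_count(rgba):
    """Reassemble the count from the four channel bytes (done on the GPU in a shader)."""
    r, g, b, a = rgba
    return (r << 24) | (g << 16) | (b << 8) | a

# Round-trip check with a count on the scale of the Brightkite example
assert decode_count(encode_count(4500000)) == 4500000
```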
imMens will be released on GitHub soon. Stay tuned!
Zhicheng Liu is a postdoctoral scholar in the Department of Computer Science at Stanford University, working with Professor Jeffrey Heer. He completed his PhD in the Human-Centered Computing program at Georgia Tech in spring 2012, advised by Professor John Stasko. His research involves developing tools to enable data enthusiasts to perform visual data analysis more effectively. His work spans three research threads: (1) novel interfaces for data modeling; (2) user-centered visual analytic systems for domain experts; and (3) computational methods for scaling interactive visualizations to big data.