In many discussions with scientists across a variety of specialties, we have found that interactive visualizations are important tools for helping people make sense of massive amounts of data. In particular, interactive visualizations are critical in the early stages of data analysis, when a scientist is browsing a new, unfamiliar dataset. In a research project, we studied how scientists explore dense multidimensional arrays, such as satellite imagery, so we could learn about technical barriers in their way.
We have observed that scientists explore dense array data in a particular way. First, they browse the data using a coarse-grained aggregated view (i.e., a low-resolution view), searching for interesting regions to analyze in more detail. Once they find a region of interest (or ROI), they “zoom in” by retrieving a fine-grained view (i.e., high-resolution view) of this smaller region from the dataset. Using detail-on-demand interfaces in their exploration tools, scientists can thus apply panning and zooming interactions to explore large arrays.
However, most interactive exploration tools are unable to scale up to massive datasets. One major goal of this project, and of my thesis work, is therefore to make visual exploration of large arrays interactive, so that the user (e.g., a scientist) receives visual feedback from the system within an acceptable response time bound (e.g., within 500 ms). A critical challenge is that database management systems (DBMSs) are not designed for retrieving results at interactive speeds, making them too slow to provide the fast preliminary results needed by a scalable interactive exploration interface.
To push beyond the limitations of current DBMSs and support interactivity, we developed the ForeCache visual exploration system (see Figure 1). ForeCache uses a client-server architecture: The user interacts with a visualization interface running on the client machine (i.e., the user’s laptop), and the client retrieves the corresponding data by issuing requests to a DBMS running on a remote server. For its extensive support for scientific analysis operations, we use the array-based DBMS SciDB as our back-end. To further boost back-end performance, ForeCache includes a server-side middleware layer inserted in front of the DBMS, which pre-fetches data into a main memory cache in anticipation of the user’s future interactions. The middleware layer comprises two components: a) the prediction engine, which identifies what data to pre-fetch and b) the tile cache manager, which manages data retrieval from the DBMS and storage in the main memory cache.
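The role of the tile cache manager can be illustrated with a minimal sketch. It is not ForeCache's actual implementation: the class name `TileCache`, the `fetch_tile` callback standing in for a DBMS query, the `(zoom, row, col)` tile keys, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class TileCache:
    """Hypothetical main-memory tile cache with LRU eviction (assumed policy)."""

    def __init__(self, fetch_tile, capacity):
        self.fetch_tile = fetch_tile  # stand-in for a slow DBMS query
        self.capacity = capacity      # maximum number of tiles held in memory
        self.tiles = OrderedDict()    # tile key -> tile data, oldest first
        self.hits = 0
        self.misses = 0

    def _store(self, key, tile):
        self.tiles[key] = tile
        self.tiles.move_to_end(key)
        while len(self.tiles) > self.capacity:
            self.tiles.popitem(last=False)  # evict the least recently used tile

    def prefetch(self, predicted_keys):
        """Load tiles the prediction engine expects the user to request next."""
        for key in predicted_keys:
            if key not in self.tiles:
                self._store(key, self.fetch_tile(key))

    def get(self, key):
        """Serve a client request, falling back to the DBMS on a cache miss."""
        if key in self.tiles:
            self.hits += 1
            self.tiles.move_to_end(key)
            return self.tiles[key]
        self.misses += 1
        tile = self.fetch_tile(key)
        self._store(key, tile)
        return tile


# Usage: record which requests actually reach the (simulated) DBMS.
dbms_calls = []


def slow_fetch(key):
    dbms_calls.append(key)
    return f"tile{key}"


cache = TileCache(slow_fetch, capacity=4)
cache.prefetch([(1, 0, 0), (1, 0, 1)])  # predicted ahead of the user
cache.get((1, 0, 1))                    # served from memory
cache.get((1, 1, 1))                    # not predicted, so the DBMS is queried
```

In this sketch, a correct prediction turns a slow DBMS round trip into a memory lookup. That is why prediction accuracy matters so much for response time.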
To compute the aggregate views required to support zooming (i.e., zoom levels), we apply windowed aggregation queries to the underlying data. For example, suppose we have a 2D SciDB array A with dimensions i and j. To produce a coarse view of array A with one quarter of the original resolution, we can tell SciDB to aggregate every two array cells along dimension i and every two array cells along dimension j. By doing this, we can control the final resolution of each zoom level by adjusting the size of the aggregation window of the aggregation query (e.g., changing from a 2×2 window to a 4×4 window to aggregate array A). To break down zoom levels into easily manageable units, we partition zoom levels into data tiles, or fixed-size sub-arrays. In this context, data tiles are the general-purpose equivalent of Google Maps tiles. Any interaction in the visualization interface thus can be mapped to specific data tiles, making the retrieval and rendering of data more efficient.
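The windowed aggregation and tiling steps can be sketched in plain Python. This is an illustration of the idea, not SciDB query syntax. The function names `aggregate` and `split_into_tiles`, the choice of the mean as the aggregate, and the toy 4×4 array are assumptions.

```python
from statistics import mean


def aggregate(array, window):
    """Average each non-overlapping window x window block of a 2D list."""
    rows, cols = len(array), len(array[0])
    return [
        [
            mean(
                array[i][j]
                for i in range(r, min(r + window, rows))
                for j in range(c, min(c + window, cols))
            )
            for c in range(0, cols, window)
        ]
        for r in range(0, rows, window)
    ]


def split_into_tiles(level, tile_size):
    """Partition one zoom level into fixed-size tiles keyed by (tile_row, tile_col)."""
    tiles = {}
    for r in range(0, len(level), tile_size):
        for c in range(0, len(level[0]), tile_size):
            tiles[(r // tile_size, c // tile_size)] = [
                row[c:c + tile_size] for row in level[r:r + tile_size]
            ]
    return tiles


# Toy 4x4 array A with dimensions i (rows) and j (columns).
A = [[i * 4 + j for j in range(4)] for i in range(4)]

coarse = aggregate(A, 2)         # 2x2 window: one quarter as many cells
tiles = split_into_tiles(A, 2)   # full-resolution level cut into 2x2 tiles
```

Widening the window from 2 to 4 would collapse the same array into a single cell, which is how the window size controls the resolution of each zoom level.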
Leveraging this tile-based data model, ForeCache boosts performance even further by making predictions about which tiles will be requested by the client in the future, and caching these tiles ahead of time on the server. To accurately model how users behave as they explore dense arrays, we developed a two-level prediction engine, where at the top level we identify the user’s current goals, and at the bottom level we predict what behavioral patterns the user may apply to achieve those goals. ForeCache tracks the user’s interactions with the visualization interface, and uses the interaction data as input for the prediction engine.
At the top level, we predict the user’s current frame of mind, or the user’s current analysis phase, which hints at what the user’s goals might be in interacting with the interface. For pan-zoom interfaces, we map user interactions to one of three analysis phases: 1) Foraging (searching for a new region of interest to explore at a low-resolution zoom level); 2) Sensemaking (exploring a specific region of interest in more detail at a high-resolution zoom level); and 3) Navigation (moving between the previous two phases).
At the bottom level, we identify the low-level browsing patterns that best represent the current analysis phase, and use these patterns to predict which data tiles to cache. For example, in the Navigation phase, the user’s goal is to move between the coarse-grained zoom levels of the Foraging phase and fine-grained zoom levels of the Sensemaking phase. As such, zooming in multiple times in a row is one possible low-level browsing pattern during the Navigation phase. In contrast, during the Foraging phase, the user wants to stay at coarse zoom levels to find new ROIs quickly, making the consecutive zoom-ins pattern irrelevant during this analysis phase. Thus, we need the ability to switch between different exploration patterns as the user alternates through the three analysis phases.
We use a suite of prediction algorithms, which we call recommendation models, to learn and predict low-level patterns. We apply two kinds of recommendation models in ForeCache: action-based (predicting interactions, which are then mapped to tiles) and signature-based (predicting visually similar tiles). Our action-based model assumes that users frequently use the same set of interaction patterns to navigate through the data, such as the consecutive zoom-ins pattern described above. We learn these interaction patterns by training Markov chain models on interaction data collected from past users. In contrast, the signature-based model ignores raw interaction patterns entirely, and instead assumes that the user interacts with the interface in order to explore tiles that are visually similar to the ROIs that the user has visited recently. To measure tile similarity, we first calculate a set of metrics for each tile, which we call a tile signature. Signature metrics range from simple histograms to more sophisticated computer vision features computed over the tile. We can then compute the similarity of two tiles by comparing their signatures.
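Both kinds of recommendation model can be sketched in a few lines. The sketch below is a simplification under stated assumptions, not the models used in ForeCache: it assumes a first-order Markov chain over action names, a four-bin histogram as the only signature metric, and Euclidean distance as the comparison. All function names, action labels, and tile values are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def train_markov(traces):
    """Count first-order transitions between consecutive actions in past traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequently observed next action, or None if unseen."""
    followers = counts.get(current)
    return followers.most_common(1)[0][0] if followers else None


def histogram_signature(tile, bins=4, lo=0.0, hi=1.0):
    """Normalized histogram of the cell values in a 2D tile (assumed signature)."""
    cells = [value for row in tile for value in row]
    hist = [0] * bins
    for value in cells:
        index = int((value - lo) / (hi - lo) * bins)
        hist[max(0, min(index, bins - 1))] += 1  # clamp out-of-range values
    return [count / len(cells) for count in hist]


def signature_distance(a, b):
    """Euclidean distance between two signatures; smaller means more similar."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Action-based: learn from two hypothetical past traces.
traces = [
    ["zoom_in", "zoom_in", "pan_left"],
    ["zoom_in", "zoom_in", "zoom_in"],
]
model = train_markov(traces)
next_action = predict_next(model, "zoom_in")

# Signature-based: rank candidate tiles by similarity to a recent ROI.
roi = [[0.9, 0.8], [0.95, 0.85]]
candidates = {
    "snowy": [[0.8, 0.9], [0.9, 0.8]],
    "bare": [[0.1, 0.2], [0.1, 0.0]],
}
roi_signature = histogram_signature(roi)
ranked = sorted(
    candidates,
    key=lambda name: signature_distance(
        roi_signature, histogram_signature(candidates[name])
    ),
)
```

The action-based prediction still has to be mapped to a tile (for example, the tile one zoom level below the current one), whereas the signature-based ranking names candidate tiles directly.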
To evaluate ForeCache and verify how domain scientists visually explore array data, we conducted a user study with 18 scientists exploring satellite imagery. In the study, we used the interface seen in Figure 2. Study participants explored snow cover imagery calculated from data collected by the NASA MODIS instrument, where snow pixels were colored red in the ForeCache interface and non-snow pixels were colored green to blue. Using interaction logs recorded from the study, we retroactively applied our prediction techniques and compared ForeCache to two existing prediction techniques. We found a strong linear correlation between prediction accuracy and response times, making prediction accuracy a reliable gauge for average response time. Furthermore, we found that ForeCache provides: a) over 400% faster response times compared with non-prefetching systems and b) 88% faster response times compared with existing prefetching techniques.
By demonstrating that it can serve the demanding needs of scientists, ForeCache offers potential as a general-purpose exploration system for browsing large datasets.
We will present our paper on this work, “Dynamic Prefetching of Data Tiles for Interactive Visualization,”* at the forthcoming SIGMOD 2016 conference.
*“Dynamic Prefetching of Data Tiles for Interactive Visualization.” Leilani Battle, Remco Chang and Michael Stonebraker. SIGMOD 2016.