In the Big Data era, professionals across scientific areas need efficient, interactive ad-hoc data analysis. Ideally, they need generic and reusable systems tools for interactive search, exploration and mining over large data sets. But exploring large data sets interactively requires advanced data-driven search techniques that go well beyond the conventional database-querying capabilities, whereas state-of-the-art search technologies are not designed and optimized to work for large out-of-core data sets. As a result, users are forced to roll their own custom solutions, typically by gluing together existing libraries, databases and custom scripts─only to end up with a solution that is difficult to develop, scale, optimize, maintain and reuse.
For this problem, we offer Searchlight, an approach that integrates Constraint Programming (CP) with traditional DBMS technologies to enable efficient ad-hoc, interactive data exploration of big data. Searchlight supports data- and search-intensive applications on large multidimensional data sets. The rest of this post discusses the challenges facing users followed by how Searchlight addresses those challenges, our experimental results, and future areas for additional research.
Wanted: Big Search over Big Data
As an example, consider an astronomer looking at various astronomical data sets, like the Sloan Digital Sky Survey (SDSS) or the Large Synoptic Survey Telescope (LSST). The astronomer’s tasks may vary from relatively simple retrieval of the information about different celestial objects, to complex search queries. A complex search query might include searching for an interesting rectangular region in the sky, satisfying constraints on the shape and content. The astronomer might describe the perimeter and area of the region or give the ranges of lengths in various dimensions. She might also add constraints for various astronomical measures of objects within such a region, e.g., in terms of photometric magnitudes available in SDSS. (See Figure 1.)
Another example is a researcher working with a medical database such as the MIMIC data set. MIMIC contains signal waveforms for Intensive Care Unit (ICU) patients over a large period of time. The researcher might look for different time intervals (i.e., one-dimensional regions) that satisfy certain conditions. For example, she might be looking for the time interval similar to some predefined pattern or containing a signal spike that makes it an anomaly. Such region-based data exploration has a remarkable expressibility for a plethora of interesting tasks. (See Figure 2.)
When users perform this kind of data exploration, they expect additional support from the system beyond query performance. The key requirement is interactivity. This does not necessarily mean small query latency. Data exploration problems often require complex computation that might result in considerable time for finding and computing the whole result, e.g., finding all interesting regions. The data exploration system should provide some feedback to the user during query evaluation. For example, it might prioritize finding individual regions fast and outputting them to the user as soon as possible, even if it results in slightly increased latency for the whole result. Another example is to periodically output current results for optimization queries, e.g., finding regions maximizing some objective function.
Another notoriously challenging problem is finding the right questions to ask, given the users’ lack of familiarity with the structure and contents of the underlying data sets as well as the inherently fuzzy goals in many exploration-oriented tasks. Often, the users might have some idea about the properties they want to study (e.g., the arterial blood pressure in MIMIC or photometric properties in SDSS), but they might not necessarily have ideas about the exact query parameters. A user often wants to get a representative, easy-to-study result from the system, but instead might have to engage in a cumbersome and frustrating “guessing game” in which some queries do not return any results at all while other queries return a very large number of results, which is hard to impossible to analyze.
Limitations of Current DBMSs
Perhaps surprisingly, traditional DBMSs provide little to no support for such search-related tasks. While it is possible to express some of the search queries in a DBMS natively (e.g., using SQL), such queries are very hard for the user to express and for the optimizer to process. Moreover, DBMSs treat such queries in the traditional enumerate-and-filter fashion, where all possible objects of interest are enumerated and checked for satisfying the query constraints. Such an approach, especially for complex search queries, impedes interactivity. In other words, traditional DBMSs lack any useful support for search queries.
We believe data exploration, as described above, has the following key characteristics. First, exploration queries involve large search spaces. The number of possible regions in the sky or time intervals in large historical data might be humongous. Thus, a brute-force enumeration-based solution is often infeasible. The system must perform extensive pruning of the search space to facilitate excellent overall query performance. At the same time, it must identify and explore promising parts of the search space fast to provide interactive results by incorporating information from query constraints or applying general heuristics to steer the search in the promising direction. Second, exploration queries are performed over large data sets. Some constraints might reference a lot of data, which implies not only high CPU costs to compute complex ad-hoc functions, but also significant I/O costs due to a large number of requests to out-of-core data. Thus, we are looking at a “Big Search over Big Data” problem.
Searchlight: Specialized Search from Inside DBMS Engines
While traditional DBMSs store and manage large data sets quite efficiently, they lack search constructs on the user level and search semantics at the query executor level to handle large search spaces. We believe this opens opportunities for introducing specialized search techniques inside DBMS engines. In our Searchlight work, we specifically look at Constraint Programming (CP) solvers. Several salient features make CP solvers particularly well suited for generic search and exploration:
- Efficiency: CP solvers are very proficient at exploring large search spaces. Similar to query optimizers, they incorporate a collection of sophisticated optimizations including pruning and symmetry breaking. The first is a core technique that safely and quickly removes parts of the search space that cannot contain any results.
- Interactivity: CP solvers are designed to identify individual solutions fast and incrementally. Users can pause or stop the search, or ask for more solutions. This is fundamentally different from conventional query optimization, which aims to minimize the completion time of queries.
- Extensibility: CP solvers are designed to be modular and extensible with new constraint types and functions. Moreover, users can introduce their own search heuristics that govern the search process.
These features render CP solvers a very powerful tool for interactive, human-in-the-loop data exploration and mining. Unfortunately, solvers commonly assume that all the required data fits into main memory and thus only optimize for the compute bottleneck. While this assumption has served well for many traditional CP problems (in which the primary challenge is to deal with very large search spaces), it has recently become obsolete with the availability of very large data sets in many disciplines. Our experimental results indeed show that solvers become unacceptably slow when operating over large, out-of-core data sets.
Searchlight overcomes this problem by integrating CP and DBMS functionality to operate on multidimensional data, enabling interactive search of data- and search-intensive applications at large scale. Searchlight’s design is guided by the following goals:
- Integrated search and query: Searchlight offers both standard DBMS functionality (e.g., storage and query processing) and sophisticated search. This integrated functionality simplifies application development and reduces data movement between the systems.
- Modularity and extensibility: Our design is minimally invasive and works with different solver implementations with very little modification. Similarly, the underlying DBMS engine will not require major modifications. This allows Searchlight to gracefully and automatically evolve as the underlying systems evolve. Moreover, the same framework can be applied to compatible systems.
- Optimized data-intensive search and exploration: Searchlight provides optimized execution strategies for CP, integrated with regular queries, on large data sets.
- Distributed, scalable operation on modern hardware: Searchlight supports efficient distributed and parallel operation on modern clusters of multi-core machines.
Users can invoke existing solvers along with their built-in search heuristics using an array DBMS language. Under the hood, Searchlight seamlessly connects the solver logic with query execution and optimization, allowing the former to enjoy the benefits of DBMS features such as efficient data storage, retrieval and buffering, as well as access to DBMS query operators. Searchlight runs as a distributed system, in which multiple solvers will work in parallel on different parts of the data- and search-space.
An important challenge is to enable existing CP solvers’ logic (without any modifications) to work efficiently on large data. To achieve this goal, Searchlight uses a two-stage Solve-Validate approach (see Figure 3 below). At the solving stage, Solvers perform speculative search on main-memory synopsis structures, instead of the real data, producing candidate results. A synopsis is a condensed representation of the data, containing information sufficient to perform pruning and to verify query constraints. The candidates are guaranteed to contain all real results, but possibly include false positives. Validators efficiently check the candidates over the real data, eliminating the false positives and producing the final solutions while optimizing I/O. Solvers and Validators invoke different instances of the same unmodified CP solver; yet the former will be directed to work on the synopses and the latter on the real data through an internal API that encapsulates all array accesses. This two-stage approach is transparent to the CP solvers and the users.
Searchlight can also transparently parallelize query execution across a modern cluster of multi-core machines. Searchlight supports both static and dynamic balancing. During the static phase, the search space and the data space are distributed between Solvers and Validators before the query begins. During execution, Searchlight redistributes work between idle and busy Solvers to address hot spots.
We have implemented Searchlight as a fusion of an existing CP solver from Google’s or-tools and a popular array DBMS SciDB. The CP solver is integrated inside the DBMS engine as a User-Defined Operator (UDO), which allows it to use internal DBMS facilities. We use the existing distributed and networking DBMS infrastructure for distributed query processing, internal DBMS functions to access the data and DBMS buffering to manage additional search data. Such integration significantly decreases the overhead and avoids duplication of existing functionality. At the same time, the integration is done in a modular way, such that the CP solver and DBMS engine can be substituted and upgraded with minimal effort, if required.
Our experimental evaluation showed that Searchlight’s CP-DBMS integration offers remarkable benefits over the individual approaches. The table below shows results for one of the queries run over a 120GB data set in an 8-node cluster of AWS c3.xlarge machines. The individual approaches are CP, where a solver was run without modification over the data set, and SciDB, where the query was run as a common DBMS query (on an array DBMS, SciDB). As can be seen, Searchlight considerably outperforms both approaches. We saw similar speed-ups across multiple queries for different data sets. Moreover, in many cases the individual approaches are not even able to produce results for many hours. For the same queries Searchlight can produce results in several minutes or even seconds, which further signifies the necessity of specialized data exploration solutions.
Improving Exploration for Users via Automatic Query Relaxation
As we pointed out at the beginning, it is important to provide support to the users who struggle to come up with proper exploration queries. While the problems of too few and too many results have been extensively addressed in the literature, the existing methods (e.g., for relational systems) are not directly applicable to search queries. This is due to an arbitrary and complex structure of the search space, and complex, ad-hoc user constraints.
Searchlight takes advantage of its extensive use of search techniques to provide better support to users in case their queries do not return the desired number of results. In addition to the query itself, the user can specify the desired cardinality of the result. If the search query does not return the required number of results, Searchlight automatically explores previously pruned parts of the search space, taking into account information about possible values of the constraint functions and relaxing the constraints appropriately. This results in dynamic and adaptive exploration, where different parts of the search space are explored with modified (relaxed) constraints. At the same time Searchlight does not have to retry previously explored parts of the search space, which results in considerable savings. Searchlight guarantees to return optimal results with respect to the user-specified distance function. Searchlight addresses the dual problem of query tightening, where queries produce too many results, with a similar approach: by introducing additional ranking constraints during the search to prune inferior results according to the user-specified ranking function. Searchlight guarantees to output top-k optimal results according to the function, where k is the user’s desired cardinality. Such dynamic constraint manipulation, which is natural for Searchlight, is extremely hard to incorporate into a traditional query execution engine.
Our experiments have shown remarkable results compared with the manual user approach, where the user has to guess the correct query. The table below shows one of the results for a query for a real-world MIMIC data set. It compares the Searchlight approach with the manual user relaxation, where the user had to go through a series of 2 or 3 queries until the required number of results was found. It can be seen that even in the case of 2 queries, which is quite optimistic (and unrealistic in practice), Searchlight was able to provide significant performance gains. We have seen similar results across a range of queries for different data sets.
We believe that integrating CP and DBMS technologies has a remarkable potential for ad-hoc, interactive data exploration of big data. We have addressed some of the issues─including interactivity, overall query performance, and query relaxation and tightening─for which our approach results in considerable performance improvements. At the same time, we believe there are a large number of interesting and fruitful directions for further research, including more complex integration of CP and DBMS, adaptive search query optimization, multi-part search queries, and further exploring the idea of dynamically modified constraints.
 Alexander Kalinin, Ugur Cetintemel and Stan Zdonik. Interactive data exploration using semantic windows. In SIGMOD, pages 505–516, 2014.
 Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. Searchlight: Enabling integrated search and exploration over large multidimensional data. In VLDB, pages 1094-1105, 2015.
 Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. Interactive search and exploration of waveform data with searchlight. In SIGMOD, pages 2105–2108, 2016.