Approximate Analytics: Keeping Pace with Big Data using Parallel Locality Sensitive Hashing

By Narayanan Sundaram and Nadathur Satish, Intel Parallel Computing Lab

We tend to think of big data problems as those involving finding a needle in a haystack. In practice, however, many big data problems are about estimating the shape or statistics of the haystack (or an interesting subset of it) rather than searching for an individual needle.

For example, folks searching for their favorite TV show on Twitter are not really looking for one specific tweet; they are trying to gauge the overall sentiment toward the show. A retail store cares more about the sum total of a customer’s purchasing habits than about their shopping list on any single day. Most big data problems are about finding the right trends in large amounts of information. For such problems, trading a small hit in accuracy for faster processing is reasonable and acceptable. For applications that require real-time responses, this trade-off may even be inevitable.


Problems like finding music that matches my tastes, or finding tweets similar to my search phrase, rely on searching for nearest neighbors in large datasets. Nearest-neighbor search has become an important database operation, with applications in text search, multimedia indexing, and many other areas. However, the cost of this operation scales linearly with the size of the data, which quickly becomes unmanageable when data sizes are growing at an accelerating rate. Approximate and randomized algorithms promise sub-linear scaling in exchange for a small accuracy hit. Locality Sensitive Hashing (LSH) is one such algorithm for approximate nearest-neighbor search.

Limitations of Locality Sensitive Hashing with Big Data

Locality-Sensitive Hashing (LSH) is a framework for constructing data structures that enable searching for near neighbors in a collection of vectors. Each vector represents a data item as a point in a multidimensional space (e.g., for text data, the dimensions could be the words of the vocabulary).

Given a large dataset of multidimensional input vectors, the goal is to construct a data structure that, for any given query, reports the points that are “close” in this multidimensional space. The data structure is randomized: each near neighbor is reported with high probability. LSH works by creating several hash tables using locality-sensitive hash functions, which hash two points to the same bucket with higher probability if the points are similar. Retrieving the nearest neighbors for a query involves hashing the query, looking it up in each hash table, and verifying which of the retrieved points are indeed within a given distance of the query.
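To make this concrete, here is a minimal sketch of the classic random-hyperplane LSH scheme for cosine similarity. The parameters below (bits per table, number of tables, distance threshold) are illustrative choices for a toy example, not the configuration used in PLSH:

```python
import numpy as np

rng = np.random.default_rng(42)
K, L, DIM = 8, 4, 16  # bits per key, number of tables, vector dimension (toy values)

# One set of K random hyperplanes per table; the sign pattern of the
# projections forms a K-bit locality-sensitive hash for cosine similarity.
planes = rng.standard_normal((L, K, DIM))

def lsh_key(table, vec):
    bits = planes[table] @ vec > 0
    return bits.tobytes()

def build_index(points):
    tables = [dict() for _ in range(L)]
    for idx, p in enumerate(points):
        for t in range(L):
            tables[t].setdefault(lsh_key(t, p), []).append(idx)
    return tables

def query(tables, points, q, max_dist=0.3):
    # Gather candidates from every table, then verify true cosine distance.
    candidates = set()
    for t in range(L):
        candidates.update(tables[t].get(lsh_key(t, q), []))
    def cos_dist(a, b):
        return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return [i for i in candidates if cos_dist(points[i], q) <= max_dist]

points = rng.standard_normal((1000, DIM))
tables = build_index(points)
neighbors = query(tables, points, points[0])
assert 0 in neighbors  # a point always hashes to its own buckets
```

Each table uses a different random hash, so a true near neighbor that misses one bucket still has several more chances to collide with the query.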

The goal of our project was to create a fast version of LSH that can scale to a large number of nodes, letting us analyze very large streaming datasets in real time with performance that scales to thousands of concurrent queries.

For this work, we set out to index and query a corpus of approximately 1 billion tweets, with a very high update rate (400 million new tweets per day). Initially we believed that existing LSH algorithms would provide a straightforward means to tackle this problem, as tweets are naturally modeled as high-dimensional objects in the space of all words in the English vocabulary. Surprisingly, applying LSH to very large collections of documents, or document sets that change very frequently, proved to be quite difficult. This can be ascribed to the following factors:

  1. LSH becomes memory-latency-limited due to irregular hash table accesses
  2. Parallel, distributed LSH with near-perfect scaling to a large number of nodes is unavailable
  3. LSH is unable to handle large amounts of streaming data
  4. It’s difficult to set the right algorithmic parameters to minimize query runtime

Introducing Parallel LSH

To address these challenges, we developed a new version of LSH, which we call “Parallel LSH” (PLSH). PLSH is designed to be extremely efficient, capable of scaling out across multiple nodes and multiple cores, and able to support high-throughput streaming of new data. We can now perform similarity search on a dataset of more than 1 billion tweets, with hundreds of millions of new tweets arriving per day, and achieve query times of 1–2.5 milliseconds. This is an order of magnitude faster than existing indexing schemes, such as inverted indexes.

To the best of our knowledge, this is the fastest variant of LSH. The following section describes the algorithmic contributions needed to achieve this performance.

More details can be found in our paper.

1.  Algorithmic improvements for single-node efficiency: We introduce a novel cache-efficient variant of the LSH hashing algorithm, which significantly improves index construction time. (See Figure 1.) It uses a 2-level hashing approach in which hash tables share partial hash results, and the hashing stage is structured to be SIMD-friendly. This gives us a 3.7x speedup over a basic implementation.
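The partial-result sharing can be illustrated with a standard trick: compute m short “half-hashes” once per vector and concatenate pairs of them to form keys for m(m-1)/2 tables, so the expensive projections are amortized across tables. This is a generic sketch of the idea with made-up constants, not PLSH’s actual implementation:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
HALF_BITS, M, DIM = 4, 4, 16  # illustrative: M half-hashes -> M*(M-1)/2 tables

# One hyperplane block per half-hash; each table key concatenates two
# half-hashes, so M projection passes serve M*(M-1)/2 tables.
blocks = rng.standard_normal((M, HALF_BITS, DIM))

def half_hashes(vec):
    # Compute all M partial hashes once; they are shared across tables.
    return [(blocks[i] @ vec > 0).tobytes() for i in range(M)]

def table_keys(vec):
    halves = half_hashes(vec)
    return [halves[i] + halves[j] for i, j in combinations(range(M), 2)]

vec = rng.standard_normal(DIM)
keys = table_keys(vec)
assert len(keys) == M * (M - 1) // 2  # 6 table keys from only 4 hash passes
```

With M = 4 half-hashes we get keys for 6 tables while doing only 4 rounds of projections, which is where the construction-time savings come from.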


In order to support high-throughput parallel querying, we also developed a number of optimizations. These include: i) processing queries in small batches of a few hundred, trading latency for throughput; ii) a bit-vector-based optimization for eliminating duplicate matches found in different hash tables; and iii) a technique that exploits software prefetching and large pages to look up matching items once their identifiers have been found. Compared to a basic implementation, querying is about 8.3x faster because of these optimizations. We were able to verify that our initialization and query performance is limited only by the hardware and not by any inefficiencies in our software implementation. Figure 2 shows the query performance for 1,000 queries over a dataset of 10.5 million tweets.
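The bit-vector deduplication idea can be sketched as follows (a simplified stand-in for the real data structure): one bit per indexed item records whether the item has already been seen, so a match that appears in several hash tables is verified only once.

```python
import numpy as np

def dedup_candidates(bucket_lists, n_points):
    # One bit (here, one bool) per indexed point: a match found in an earlier
    # hash table is skipped when it reappears in a later one, so the costly
    # distance verification runs at most once per candidate.
    seen = np.zeros(n_points, dtype=bool)
    unique = []
    for bucket in bucket_lists:
        for idx in bucket:
            if not seen[idx]:
                seen[idx] = True
                unique.append(idx)
    return unique

# Point 3 and 7 each appear in two tables but are emitted only once.
assert dedup_candidates([[3, 7], [7, 1], [3, 9]], 10) == [3, 7, 1, 9]
```

Because the bit vector is dense and tiny relative to the dataset, checking and setting a bit is far cheaper than a hash-set lookup per candidate.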
2.  Parallelism: Both hash table construction and hash table search in PLSH are distributed across multiple cores and multiple nodes. Within a single node, multiple cores concurrently access the same set of hash tables. We develop techniques to batch and rearrange accesses to data to maximize cache locality and improve throughput. On a single 8-core Intel® Xeon® processor E5-2670, we achieve a 7.2x speedup for initialization and 7.8x for querying. Scaling up to 100 nodes is close to ideal (<1% of time spent in communication). Figure 3 shows weak scaling up to 100 nodes and a billion tweets, with around 10.5 million tweets per node. Note that initialization of hash tables happens much less frequently than querying existing data.
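One way to rearrange accesses for cache locality, sketched here as a generic illustration rather than PLSH’s actual scheme, is to sort a batch of (query, bucket) lookups by bucket key so that lookups touching the same hash-table region run back-to-back:

```python
def locality_ordered_lookups(query_keys):
    # Sort (bucket key, query id) pairs by bucket key so consecutive lookups
    # hit nearby hash-table memory, improving cache and TLB reuse.
    pairs = [(key, qid) for qid, key in enumerate(query_keys)]
    pairs.sort(key=lambda p: p[0])
    return [(qid, key) for key, qid in pairs]

# Queries 0 and 2 both probe bucket "b" and now run consecutively.
order = locality_ordered_lookups(["b", "a", "b", "c"])
assert [k for _, k in order] == ["a", "b", "b", "c"]
```

This is the latency-for-throughput trade mentioned above: individual queries wait for their batch, but the batch as a whole completes faster.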

3.  Streaming: To handle the streaming arrival of new documents and the expiration of old ones, we propose a new hybrid approach that buffers inserts in an insert-optimized LSH delta table and periodically merges them into our main LSH structure. With this streaming structure, we can guarantee query performance within 1.3x of the static scenario in the worst case, with no performance degradation in the average case. About 2% of the runtime is spent on insertions and merging (assuming 400 million tweets are added per day).
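The delta-table pattern can be sketched in miniature as follows; the class, threshold, and single-table layout are hypothetical simplifications of the real multi-table LSH structures:

```python
class StreamingIndex:
    """Buffer inserts in a small insert-optimized delta table and merge them
    into the main read-optimized structure when the buffer fills up."""

    def __init__(self, merge_threshold=4):
        self.main = {}    # stands in for the static, query-optimized LSH tables
        self.delta = {}   # insert-optimized buffer for newly arrived items
        self.merge_threshold = merge_threshold

    def insert(self, key, doc_id):
        self.delta.setdefault(key, []).append(doc_id)
        if sum(len(v) for v in self.delta.values()) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Periodic bulk merge keeps the main structure compact and fast to query.
        for key, ids in self.delta.items():
            self.main.setdefault(key, []).extend(ids)
        self.delta.clear()

    def query(self, key):
        # Until the next merge, a query must consult both structures.
        return self.main.get(key, []) + self.delta.get(key, [])
```

Queries pay a small extra cost for probing the delta table, which is the source of the bounded (at most 1.3x) worst-case slowdown relative to the static case.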

4. Performance modeling: We develop a detailed analytical model that projects the performance of our single- and multi-node LSH algorithms to within 25% of the measured performance. This model allows us to estimate the optimal settings of PLSH parameters on modern hardware and to understand how close our observed performance is to the expected performance.
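The flavor of such a model can be conveyed with a toy cost function: query time decomposes into hashing, table lookups, and candidate verification, with the candidate count shrinking as keys get longer. The cost constants and the formula below are made-up placeholders, not the model from the paper:

```python
def query_time_model(n, k, l, dim, t_hash=1e-9, t_lookup=1e-7, t_verify=1e-8):
    """Toy analytical model of per-query time (seconds) for LSH with
    n indexed points, k-bit keys, l tables, and dim-dimensional vectors."""
    hashing = k * l * dim * t_hash            # compute k bits for each of l tables
    lookups = l * t_lookup                    # one bucket probe per table
    verification = l * (n / 2 ** k) * t_verify  # expected bucket size shrinks as 2^-k
    return hashing + lookups + verification
```

Even this toy version captures the key tension the real model resolves: longer keys (larger k) cut verification work but raise hashing cost, so the optimal parameters depend on the dataset size and the hardware’s relative costs.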

Figure 4 shows the accuracy of our performance model on two different datasets (Twitter and Wikipedia). Estimating the performance accurately is essential to making PLSH easy to use, as it takes the guesswork out of figuring out the best-performing algorithmic parameters.

As datasets continue to grow at exponential rates and more applications emerge that demand real-time responses, we need new approaches to keep pace with the data. Algorithms such as LSH, which achieve high-quality (if not perfect) results quickly, will be essential to the future of Big Data Analytics.

More information about this project can be found on our project page.

The PLSH project team includes Narayanan Sundaram, Nadathur Satish and Pradeep Dubey of Intel’s Parallel Computing Lab and Aizana Turmukhametova, Todd Mostak, Piotr Indyk and Sam Madden of MIT CSAIL. The team will present the paper on PLSH at the VLDB 2014 conference.
