Compiling Queries for High-Performance Computing

By Brandon Myers and Bill Howe, University of Washington

High performance computing (HPC) is traditionally about compute, but its users have data management problems, too. In a recent paper, we demonstrated a promising technique for bringing ad hoc query processing into HPC languages.

Let’s look at just one example of a data problem—from astrophysics. Astrophysical simulations of galaxies output massive snapshots detailing particles and their properties, and these snapshots need to be analyzed. For example, astrophysicists want to find meaningful clusters of particles[1]. (See Figure 1.)

Figure 1: An example of how HPC users – in this case, astrophysicists – typically query large, complex data sets.

Before analyzing a snapshot with a database, however, we would typically need to dump it to disk, read it back into a database, and potentially even transmit it across a network to the machines hosting the database. Instead, couldn’t we use an off-the-shelf in-memory dataflow system (IMDS) to analyze snapshots in situ? Yes, but that approach has limitations. First, IMDSs do not use HPC platforms efficiently because they rely on different messaging software and higher-level languages. Second, IMDSs require data transformation to exchange data with the HPC program, because the two programs represent collections of data differently.

With these deficiencies in mind, we were motivated to implement querying capabilities in the same HPC environment in which simulations run. This environment features fast interconnects, fast messaging libraries, and a high ratio of CPU to I/O capacity. Furthermore, we could analyze the data in its original format, avoiding expensive translation. We asked: how should we build a query processor on such a system? We considered building directly on the high-performance messaging library MPI, but high-level parallel languages proved more powerful.

A class of high-performance parallel languages called global-view partitioned global address space (PGAS) is an attractive choice for building query processing on HPC. Think of PGAS as shared memory programming except that memory is partitioned. The programmer optimizes his or her PGAS program for a distributed system by carefully placing its data structures and tasks (computation) across the partitions. By using PGAS, we exploit (1) language constructs for flexible parallel execution, data layout, and task migration, (2) runtimes built upon the high-performance messaging standard MPI, and (3) compilers that are parallel-aware and offer optimizations that complement those of databases.
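To make the partitioned-memory idea concrete, here is a small sketch (our own illustration in C++, not the syntax of any particular PGAS language) of how a block-distributed global array can be modeled: every element has exactly one owner partition, computed from its global index, which is what lets the programmer reason about where data lives.

```cpp
#include <cstddef>
#include <iostream>

// Conceptual sketch of a block-distributed global array: each element lives
// on exactly one partition, and the owner is computed from the index.
// (Illustrative only; real PGAS languages provide this mapping natively.)
struct GlobalIndex {
    std::size_t partition;     // which partition owns the element
    std::size_t local_offset;  // offset within that partition's block
};

GlobalIndex locate(std::size_t global_index,
                   std::size_t num_elements,
                   std::size_t num_partitions) {
    // Evenly sized, contiguous blocks: partition p owns
    // indices [p * block, (p + 1) * block).
    std::size_t block = (num_elements + num_partitions - 1) / num_partitions;
    return {global_index / block, global_index % block};
}

int main() {
    // 1,000,000 particles spread over 16 partitions.
    GlobalIndex where = locate(123456, 1000000, 16);
    std::cout << "partition " << where.partition
              << ", offset " << where.local_offset << "\n";
}
```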

An architecture for query processing on PGAS

Having chosen PGAS as our interface, we next needed a strategy for executing queries. It is well known that a conventional iterator strategy for query processing[2] is inefficient for in-memory analytical workloads.[3] One approach, query compilation, increases performance by eliminating overheads like function calls for tuple access and by generating code that is more palatable to an optimizing compiler.[4],[5],[6] Existing query compilation systems generate sequential code. Query compilation is used in some parallel query processors, too. These systems generate sequential program fragments and stitch them into a parallel program using calls to a messaging library.[7],[8],[9]
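To see why compilation helps, consider a minimal sketch (our own illustration, not code from any of the systems cited) of a filter-and-count query in both styles: the iterator version pays a virtual call per tuple, while the compiled version fuses scan, filter, and aggregation into one tight loop.

```cpp
#include <cstdint>
#include <vector>

// Iterator (Volcano-style) execution: each operator exposes next(), so every
// tuple crosses at least one virtual-function boundary.
struct Iterator {
    virtual const double* next() = 0;   // returns nullptr when exhausted
    virtual ~Iterator() = default;
};

struct Scan : Iterator {
    const std::vector<double>& data;
    std::size_t pos = 0;
    explicit Scan(const std::vector<double>& d) : data(d) {}
    const double* next() override {
        return pos < data.size() ? &data[pos++] : nullptr;
    }
};

struct Filter : Iterator {
    Iterator& child;
    double threshold;
    Filter(Iterator& c, double t) : child(c), threshold(t) {}
    const double* next() override {
        while (const double* v = child.next())
            if (*v > threshold) return v;
        return nullptr;
    }
};

std::int64_t count_iterator(const std::vector<double>& data, double threshold) {
    Scan scan(data);
    Filter filter(scan, threshold);
    std::int64_t count = 0;
    while (filter.next() != nullptr) ++count;
    return count;
}

// What a query compiler would emit instead: scan, filter, and aggregate are
// fused into one loop with no per-tuple function calls.
std::int64_t count_compiled(const std::vector<double>& data, double threshold) {
    std::int64_t count = 0;
    for (double v : data)
        if (v > threshold) ++count;
    return count;
}

int main() {
    std::vector<double> brightness = {0.1, 0.9, 0.5, 0.7};
    return count_iterator(brightness, 0.4) == count_compiled(brightness, 0.4) ? 0 : 1;
}
```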

However, this approach masks further optimization opportunities related to parallelism and communication. PGAS compilers can take advantage of such information, so we’d like to preserve it. To take full advantage of the parallel awareness of PGAS compilers, we set out to build a new kind of query processor: one that works by generating parallel code from a query plan.

This architecture presented a number of challenges. Expressing query execution in a distributed system using the shared memory programming model of PGAS results in expensive fine-grained global memory operations. Here is a simplified example of a PGAS program that builds a histogram of brightness of particles (from a snapshot of a galaxy simulation).

Figure 2: Simplified example of a PGAS program that builds a histogram of particle brightness (from a snapshot of a galaxy simulation).

Lines 1 and 4 allocate global arrays, that is, arrays whose elements are laid out contiguously and distributed evenly across all the partitions. Line 5 indicates that all particles can be processed in parallel. Line 6 increments the count in the appropriate bucket of the histogram, using an atomic update so that conflicting updates from different iterations are not lost.

This code is quite flexible in at least two ways. First, the compiler and runtime can decide how the forall is executed and dynamically load-balanced, which is important for skewed data. Second, line 6 reads and writes a distributed data structure, yet this does not stop the parallel-aware compiler from optimizing the program as a whole.
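As a concrete stand-in for Figure 2, here is a rough single-node analogue in C++ with OpenMP (our own sketch, not the PGAS syntax used in the paper): ordinary arrays stand in for the global arrays, the parallel for plays the role of forall, and an atomic update corresponds to the atomic increment on line 6. In the PGAS version, both arrays are partitioned across nodes and the loop body may run on any partition.

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t num_particles = 1'000'000;
    const std::size_t num_buckets   = 64;

    // Stand-ins for the global arrays allocated in Figure 2 (lines 1 and 4):
    // particle brightness values and the histogram buckets.
    std::vector<double> brightness(num_particles);
    std::vector<long>   histogram(num_buckets, 0);

    // Synthetic brightness values in [0, 1) just to make the sketch runnable.
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (double& b : brightness) b = dist(gen);

    // Analogue of the forall over particles (line 5): iterations run in parallel.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(num_particles); ++i) {
        std::size_t bucket = static_cast<std::size_t>(brightness[i] * num_buckets);
        // Analogue of the atomic increment (line 6): conflicting updates from
        // different iterations must not be lost.
        #pragma omp atomic
        histogram[bucket]++;
    }

    std::cout << "bucket 0 count: " << histogram[0] << "\n";
}
```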

Despite the advantages, there are challenges in making this code perform well. Given that we do not know the brightness of a particle a priori, line 6 causes a random access to the histogram array. On a distributed system, this random access of a single cell will be inefficient for two reasons. First, performing the update might involve multiple round trips over the network: one round trip reads the current value and another round trip writes the incremented value. To improve this situation, we make line 6 a remote procedure call to the destination partition, limiting the communication to one message. We have to consider this transformation for every access to a global data structure. All the data that the remote procedure call touches must reside in the same partition, so we must partition our global data structures according to how they are accessed. The paper has more detail about these solutions.
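The transformation can be pictured with a small mock-up (partition ownership and message delivery are simulated within a single process here, purely for illustration): instead of a remote read followed by a remote write, the caller ships a self-contained increment operation to the partition that owns the bucket, where it is applied locally.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Conceptual mock of the remote-procedure-call transformation; partitions and
// message delivery are simulated within one process for illustration.
struct Partition {
    std::vector<long> local_buckets;  // this partition's slice of the histogram
};

struct Increment {
    std::size_t bucket;  // global bucket index to increment
};

// Block distribution: partition p owns buckets [p * block, (p + 1) * block).
std::size_t owner_of(std::size_t bucket, std::size_t block) { return bucket / block; }

// Naive version: two "round trips" per update -- read the value, then write it back.
void update_read_write(std::vector<Partition>& parts, std::size_t bucket, std::size_t block) {
    Partition& p = parts[owner_of(bucket, block)];
    long value = p.local_buckets[bucket % block];  // round trip 1: remote read
    p.local_buckets[bucket % block] = value + 1;   // round trip 2: remote write
}

// RPC version: one message carries the whole operation; the increment is
// applied where the data lives, so no value travels back over the network.
void update_rpc(std::vector<Partition>& parts, Increment msg, std::size_t block) {
    Partition& p = parts[owner_of(msg.bucket, block)];
    p.local_buckets[msg.bucket % block] += 1;      // executed "at" the owner
}

int main() {
    const std::size_t num_buckets = 64, num_partitions = 4;
    const std::size_t block = num_buckets / num_partitions;
    std::vector<Partition> parts(num_partitions, Partition{std::vector<long>(block, 0)});

    update_read_write(parts, 17, block);
    update_rpc(parts, Increment{17}, block);
    std::cout << "bucket 17 count: "
              << parts[owner_of(17, block)].local_buckets[17 % block] << "\n";
}
```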

The second reason for inefficiency: network interfaces require big packets to reach their peak throughput, but incrementing a single integer generates tiny packets. The IMDS implementation of histogram uses a global shuffle and faces the same problem. The solution in IMDSs is to batch tuples. In our system, we relegate the problem to the PGAS language runtime, which batches small messages headed for the same partition.[10]
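Here is a rough sketch of that batching idea (our own illustration, not the runtime's actual implementation[10]): increments bound for the same partition accumulate in a per-destination buffer, and a full buffer is sent as a single large message.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Sketch of per-destination message aggregation: instead of sending each tiny
// increment on its own, buffer increments per destination partition and flush
// a whole buffer as one large message when it fills.
class Aggregator {
public:
    Aggregator(std::size_t num_partitions, std::size_t batch_size)
        : buffers_(num_partitions), batch_size_(batch_size) {}

    // Queue an increment for the partition that owns `bucket`.
    void send_increment(std::size_t destination, std::size_t bucket) {
        std::vector<std::size_t>& buf = buffers_[destination];
        buf.push_back(bucket);
        if (buf.size() >= batch_size_) flush(destination);
    }

    // Flush any partially filled buffers (e.g., at the end of the forall).
    void flush_all() {
        for (std::size_t d = 0; d < buffers_.size(); ++d) flush(d);
    }

private:
    void flush(std::size_t destination) {
        if (buffers_[destination].empty()) return;
        // In a real runtime this would be one network message carrying the
        // whole batch; here we just report its size.
        std::cout << "sending " << buffers_[destination].size()
                  << " increments to partition " << destination << "\n";
        buffers_[destination].clear();
    }

    std::vector<std::vector<std::size_t>> buffers_;  // one buffer per destination
    std::size_t batch_size_;
};

int main() {
    Aggregator agg(/*num_partitions=*/4, /*batch_size=*/3);
    for (std::size_t bucket = 0; bucket < 10; ++bucket)
        agg.send_increment(bucket % 4, bucket);
    agg.flush_all();
}
```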

We named our overall approach for generating parallel programs from queries Compiled Parallel Pipelines (CPP).  We built a compiler that uses CPP to generate PGAS programs (like the one shown in Figure 2) from SQL queries.

Comparison

We compared CPP to a conventional iterator execution strategy written in PGAS using the same infrastructure. To eliminate known overheads of tuple-at-a-time execution, like those from function calls and tuple accessors, we also compiled the iterator-based code. Our workload was the individual queries of TPC-H, and we ran the queries on 16 16-core AMD Opteron nodes, connected by a QDR Infiniband network. Prior to each query, we randomly partitioned the rows of the tables across the memory of the nodes. The plot below (see Figure 3) shows the speedup of CPP over iterators.

Figure 3: Speedup of TPC-H queries based on Compiled Parallel Pipelines (CPP) over conventional methods based on iterators.

CPP is always at least 2X faster and on average 5.5X faster. To make sure that our overall system performed well on analytical queries, we also compared running times with a commercial parallel database.

Conclusion

We demonstrated a promising technique for bringing ad hoc query processing into HPC languages. Using PGAS to efficiently execute queries required designing efficient data structures, generating code that avoids extra messages, and mitigating the overhead of an execution model based on fine-grained tasks.

This result is the first step toward parallel language-integrated queries, which will be useful for writing high-performance analytics codes. Such a system would enable the programmer to write data analyses with little effort within the HPC code, wherever they are needed, and the compiler could co-optimize HPC code with generated code. Finally, because our system excels on certain relational queries, it will be useful as an additional back end to a polystore like BigDawg.

Footnotes       

[1] Loebman, Sarah et al. “Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help?.” Cluster Computing and Workshops, 2009. CLUSTER’09. IEEE International Conference on 31 Aug. 2009: 1-10.

[2] Graefe, Goetz. “Volcano-an extensible and parallel query evaluation system.” Knowledge and Data Engineering, IEEE Transactions on 6.1 (1994): 120-135.

[3] Boncz, Peter A, Marcin Zukowski, and Niels Nes. “MonetDB/X100: Hyper-Pipelining Query Execution.” CIDR 4 Jan. 2005: 225-237.

[4] Krikellas, Konstantinos, Stratis D Viglas, and Marcelo Cintra. “Generating code for holistic query evaluation.” Data Engineering (ICDE), 2010 IEEE 26th International Conference on 1 Mar. 2010: 613-624.

[5] Neumann, Thomas. “Efficiently compiling efficient query plans for modern hardware.” Proceedings of the VLDB Endowment 4.9 (2011): 539-550.

[6] Sompolski, Juliusz, Marcin Zukowski, and Peter Boncz. “Vectorization vs. compilation in query execution.” Proceedings of the Seventh International Workshop on Data Management on New Hardware 13 Jun. 2011: 33-40.

[7] Murray, Derek Gordon, Michael Isard, and Yuan Yu. “Steno: automatic optimization of declarative queries.” ACM SIGPLAN Notices 4 Jun. 2011: 121-131.

[8] Crotty, Andrew et al. “An architecture for compiling udf-centric workflows.” Proceedings of the VLDB Endowment 8.12 (2015): 1466-1477.

[9] Seo, Jiwon et al. “Distributed socialite: a datalog-based language for large-scale graph analysis.” Proceedings of the VLDB Endowment 6.14 (2013): 1906-1917.

[10] Nelson, Jacob et al. “Latency-tolerant software distributed shared memory.” 2015 USENIX Annual Technical Conference (USENIX ATC 15) 2015: 291-305.
