Scaling Databases to 1000 Cores

By Sungjin Lee, Postdoctoral Associate, Carnegie Mellon University

At the Intel Science and Technology Center for Big Data’s annual Research Retreat at Intel’s Jones Farm campus in Hillsboro, Oregon, Xiangyao Yu of MIT described his team’s work in developing solutions for the bottlenecks that occur in many-core architectures.

The work assumed a tiled chip multi-processor which is based on a 2D-Mesh network and employs 1,000 low-power cores with private L1 caches (48 KB) and shared L2 caches (2MB). To find bottlenecks, the team performed a series of evaluations with four system configurations: Ideal, Real, Inf. BW, and Perfect Net:

  • Ideal is a many-core architecture with perfect memory and networks – there are no DRAM latencies and no network latencies or contentions.
  • Real is a realistic configuration with realistic network and DRAM overheads.
  • Inf. BW is the same as Real but with the network contentions excluded.
  • Perfect Net assumes an ideal network without network latencies or contentions.

For all the evaluations, the team used PageRank, one of the well-known graph analytics benchmarks. Based on the code-level analysis of PageRank, the team found that an existing hardware prefetcher works well with graph analytics applications. The prefetcher is designed for streaming access patterns (i.e., sequential memory accesses), so it does not effectively handle random memory accesses. Because random memory accesses are common in graph analytics, the prefetcher incurs lots of cache misses. Xiangyao Yu described an indirect prefetcher that avoids this problem by loading future data in advance by referring to the indexes of arrays instead of the address values of arrays.

The team also found that as the number of cores increases, the network bandwidth becomes a serious bottleneck. To reduce network traffic, the team developed a cache-line accessing policy called partial cache-line accessing, which allows for applications to read part of the data in cache-lines, instead of reading all the data in cache-lines. This simple approach greatly reduces on-chip data traffic, thereby lowering overall network traffic.

Xiangyao Yu mentioned that additional performance gains could be obtained by reducing network contentions.

Additional Reading:

Paper: “Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores.”

The Many-Core Database Systems Project Page at CMU

Blog article: “Concurrency Control in the Many-core Era: Scalability and Limitations

Blog article: “Open Problems in Transaction Processing (Part 2 of 3): Many-core CPU Architectures” (by Andy Pavlo, Carnegie Mellon University)

This entry was posted in Big Data Architecture, Computer Architecture, DBMS, ISTC for Big Data Blog and tagged , , , . Bookmark the permalink.

Leave A Reply

Your email address will not be published. Required fields are marked *


+ 3 = four