Benchmarking Graph Databases

ByAlekh Jindal, MIT CSAIL

Graph data management has recently received a lot of attention, particularly with the explosion of social media and other complex, inter-dependent datasets. As a result, a number of graph data management systems have been proposed. But this brings us to the question: What happens to the good old relational database systems (RDBMSs) in the context of graph data management?

Do RDBMSs become obsolete or are they still relevant for large-scale graph data management? Do users now need to dump their data from the relational database to a graph database, or can they store and query graphs (with comparable or better performance) from within the relational engine?

Unfortunately, there’s an absence of a comparative study of modern relational databases in the context of graph data management. This is the problem we hope to solve at the MIT CSAIL Database Group with our study to benchmark and understand the performance of relational databases over very-large graphs. Our early results are promising.

Graph Data Management Systems

The new graph data management systems address roughly two kinds of query workloads: (i) low-latency online graph query processing, e.g., social network transactions, and (ii) high-throughput offline graph analytics, e.g., PageRank computation. Typical examples of systems for online graph processing include RDF stores (such as Jena and AllegroGraph) and key value stores (such as Neo4j and HypergraphDB). Other recent graph processing systems wrap around relational databases to provide efficient online query processing. These include TAO from Facebook and FlockDB from Twitter, both of which wrap around the MySQL database to build a distributed and scalable graph processing system.

On the other hand, Hadoop has gained a lot of popularity as a platform for large-scale data analytics in recent years. And indeed there have been several works to improve upon iterative graph queries in Hadoop. These include HaLoop, Twister, and PrItr. Though these works improve the performance of iterative queries, users still need to \textit{think} their analytical graph queries as MapReduce jobs.

Alternatively, other approaches allow programmers different interfaces for expressing their analytical graph queries. For instance, Pegasus allows users to express their graph queries as matrix operations and Pregel employs a vertex-centric model to express graph queries. Both Pegasus and Giraph (an open source implementation of Pregel), however, translate the user queries to MapReduce jobs. Trinity and GraphLab, also based on a vertex-centric computation model, bypass the MapReduce query layer and talk directly to the underlying distributed data store (key-value store and HDFS, respectively).

RDBMS and Graph Analytics

Interestingly, relational databases have undergone several path-breaking changes in recent years. These include the advent of column-oriented databases, main-memory databases, and array-oriented databases. Could these recent advances be used for efficient graph analytics as well? How good or bad are relational databases for graph analytics anyways?

In our benchmarking study, we compared the performance of relational databases on graph analytics as follows. We picked a leading graph database system (Neo4j) and three relational databases: a row-oriented database (MySQL), a column-oriented database (Vertica), and a main-memory database (VoltDB).

We implemented two queries, PageRank and Shortest Paths, on each of these systems. We created the appropriate sort orders and indexes for all systems.

We used two datasets from the Stanford large network dataset collection:

  • a Facebook dataset having 4K nodes and 88K edges, and
  • a Twitter dataset having 81K nodes and 1.8M edges.

Figures 1(a) and 1(b) show the result (below).

We can see that relational databases outperform Neo4j on PageRank by up to two orders of magnitude. This is because PageRank involves full scanning and joining of the nodes and edges table, something that relational databases are very good at doing. Finding Shortest Paths involves starting from a source node and successively exploring its outgoing edges, a very different access pattern from PageRank. Still, we see from Figure 1(b) that relational databases match or outperform Neo4j in most cases. In fact, Vertica is more than twice faster than Neo4j. The only exception is VoltDB over Twitter dataset. We are currently working on extending our benchmark to larger datasets as well as on including more queries.

Figure 1: Comparing Relational Databases over Graph Analytics (click to enlarge)

A key criticism of relational databases could be forcing programmers to use SQL, the de facto query interface in relational databases. Indeed, implementing graph queries in SQL is tricky and time-consuming. Thus, the question is whether we can have other graph-friendly query interfaces on top of a relational database.

To this end, we developed a vertex-centric computation interface, as in Pregel, on top of VoltDB. The idea is that the users simply write their vertex-centric compute functions, same as in Pregel, and our interface automatically maps it to the underlying SQL engine. As a result, users do not have to deal with SQL at all. This is very similar to Giraph, which maps vertex-centric computations to MapReduce jobs, without having users to deal with them explicitly. Table 1 shows the performance of VoltDB with SQL and vertex-centric query interfaces.

Table 1: VoltDB with SQL and Vertex-centric Interfaces

We can see from Table 1 that the performance of VoltDB degrades a bit when switching to a vertex-centric interface, due to the added query interface abstraction. However, with a vertex-centric interface, users can now easily express several other graph analytic queries such as connected components, random walk with restart, and other message passing algorithms – right \textit{in} their good old relational database.

Update as of 10/7/2013: Please see these updates to this work and this post.

Author Bio

Alekh Jindal is a postdoc associate in the Database Group at MIT CSAIL, working with Professors Sam Madden and Michael Stonebraker.  His current research focuses on improving the performance of novel large-scale data-intensive systems. He received his PhD from Saarland University in August 2012, working with Prof. Jens Dittrich. During his PhD, he worked on flexible and scalable data storage for traditional databases as well as for MapReduce.

This entry was posted in Analytics, Benchmarks, Big Data Applications, Big Data Architecture, Databases and Analytics, DBMS, Graph Computation, ISTC for Big Data Blog and tagged , , , , , , , , . Bookmark the permalink.

Comments are closed.