Graph Analytics: The New Use Case for Relational Databases

ByAlekh JindalMIT CSAIL

Graph analytics are becoming increasingly popular with several new big-data application domains such as social networks, transportation networks, ad networks, e-commerce, and web search. However, these graph analytics workloads are seen as quite different from traditional database analytics workloads, largely due to the iterative nature of many graph-analytics computations, and the perceived awkwardness of expressing graph analytics as SQL queries.

Interestingly, in many real-world scenarios, the raw graph data is collected and it resides in a relational database in the first place. The data is later extracted and fed to a specialized graph processing system. Given that relational databases are general-purpose data processing systems built with extensability and tuning in mind, how difficult or easy is it to perform graph analytics on relational databases?

How difficult or easy is it to perform graph analytics on relational databases?

Graphs on Databases

One could store a graph as a set of nodes and a set of edges in a relational database. This simple representation allows us to leverage built-in relational operators such as selection, projection, and join and express a range of graph analytics operations. It turns out that relational algebra is powerful enough to capture all basic graph operations such as node/edge access, neighborhood access, and graph traversal, which can then be composed into more complex analytics. Likewise, the iterative nature of graph algorithms could be captured using stored procedures as driver programs.

Performance Optimizations

Expressing graph analytics as SQL queries typically involves multiple self-joins on tables of nodes and edges. This could have very bad performance. Relational databases provide rich support for creating physical designs in order to boost query performance. For instance, sort orders and indexes can significantly boost the performance of selection and join predicate.

Modern databases also have built-in parallelism, i.e., they make use of multiple threads during query processing. For example, a group by clause can partition the data amongst different CPU cores and each core can then process the groups in parallel. We can exploit such parallelism for parallelizing graph operations as well. Similarly, we can induce query pipelining by expressing graph operations as nested queries, allowing the query optimizer to employ pipelining between the inner and outer query whenever possible. Such logical and physical query optimizations yield significant performance improvements.

Figure 1 below shows the performance of a column-oriented relational database and Apache Giraph on PageRank and Shortest Paths. We ran the experiment on four machines and two social network datasets: (i) LiveJournal having 4.8 million nodes and 68 million edges and (ii) Twitter having 41 million nodes and 1.4 billion edges. We can see that the column-oriented relational database can match or outperform the specialized graph processing system. Thus, relational databases can indeed be tuned to have good performance over graph analytics.

Figure 1: Performance of column-oriented relational database and Apache Giraph on PageRank and Shortest Paths queries. Experiment run on four machines and two social network datasets: (1) LiveJournal, having 4.8 million nodes and 68 million edges; and (2) Twitter, having 41 million nodes and 1.4 billion edges. Source: Alekh Jindal, MIT CSAIL

Combining Graph Analysis With Relational Analysis

As graphs get bigger, users would frequently want to (or will have to) select a subset of a graph before performing analysis on it. For example, it is unlikely that a user will run a single-source Shortest Paths query on the entire trillion-node Facebook graph — this would be prohibitively slow on any system. Rather, it is more likely that users will run several Shortest Paths queries over different subsets of the graph (e.g., the N-hop neighbors of some particular user.)

Furthermore, real-world graphs have vertices and edges accompanied by several other attributes. For example, edges in a social network may be of different types such as friends, family, or classmates. Similarly, nodes may have several attributes to describe the properties of each person in the social network, e.g., username, birth date, and so on.

Given such metadata, an analyst would typically do some ad hoc relational analysis in addition to the graph analysis. For instance, the analyst may want to pre-process and shape the graph before running the actual graph algorithm, e.g., filtering edges based on timestamp or limiting the graph to just close friends of a particular user. Similarly, he may want to analyze the output of the graph analysis, e.g., computing aggregates to count the number of edges or nodes satisfying some property, or other statistics.

Furthermore, in many cases, the graphs may be implicit in the relational data and need to be extracted in the first place. For example, e-commerce sales data might be used to extract the users-items bipartite graph which could be later used for collaborative filtering. Such pre- and post-processing of graph data would typically require relational operators such as selection, projection, aggregation, join, for which relational database systems are highly optimized.

Figure 2 below shows PageRank and Shortest Paths queries (on the LiveJournal dataset), in the presence of projection, selection, and aggregation over node and edge metadata. As we can see, the results are very different now and the column-oriented relational database outperforms Apache Giraph significantly. Thus, relational databases have a clear edge when combining graph analysis with relational analysis.

Figure 2: Performance of column-oriented relational database and Apache Giraph on PageRank and Shortest Paths queries on a LiveJournal dataset, in the presence of projection, selection and aggregation over node and edge metadata. Experiment run on four machines and two social network datasets, including a LiveJournal dataset having 4.8 million nodes and 68 million edges (seen here). Source: Alekh Jindal, MIT CSAIL.

The SQL Advantage

Let us now consider whether there are analytical graph queries which are more suited to implement in a relational interface as opposed to a vertex-centric one. It turns out that graph queries, such as those that require nodes to access their 1-hop neighborhood (i.e., neighbors’ neighbors), are very difficult to run in a vertex-centric model. Such queries can be easily executed in SQL in a relational database. For example, counting the number of triangles in a graph involves looking at three edges adjacent to each other. Similarly, finding all pairs of nodes having strong overlap (e.g., common neighbors) between them requires to inspect a node’s neighbors’ neighbors. While SQL is traditionally perceived as awkward for graph analytics, such queries are in fact more suited to SQL. This is because the programmers can easily manipulate sets of nodes and edges. On the other hand, it is very difficult to express such analysis on vertex-centric graph processing systems.

Update Handling

Graphs are not static in nature. Over time, we need to add, remove, or modify the nodes and edges in the graph. Similarly, we may need to update the metadata associated with the nodes and edges. Relational databases are naturally suited to handle updates. This is not necessarily true for graph processing systems. For example, Giraph, an open source graph processing system, has no clear method of updating the graphs it analyzes. One might think of using HBase for updates and Giraph for analytics, but then we have the additional overhead of stitching two systems together and coordinating between them.

Box Full of Goodies

Finally, relational databases come with many features that are not present (yet) in the next generation of graph processing systems. For instance, graphs are likely to be updated over time. Furthermore, graphs may also be updated while they are being analyzed. Many of the graph processing systems are simply not equipped to handle these situations, whereas relational systems are quite good at them. Of course, with time, graph-engines may evolve these features, but our results suggest that these graph systems should be layered on relational databases, not built outside of them.

Author Biography

Alekh Jindal is a postdoc associate in the Database Group at MIT CSAIL, working with Professors Sam Madden and Michael Stonebraker.  His current research focuses on improving the performance of novel large-scale data-intensive systems. He received his PhD from Saarland University in August 2012, working with Prof. Jens Dittrich. During his PhD, he worked on flexible and scalable data storage for traditional databases as well as for MapReduce.

This entry was posted in Analytics, Databases and Analytics, Graph Computation, ISTC for Big Data Blog and tagged , , , , , , , . Bookmark the permalink.

3 Responses to Graph Analytics: The New Use Case for Relational Databases

  1. Tony Aponte says:

    Are the test setup and execution instructions available for independent reproduction?

    Regards,

  2. Dmitry Davletbaev says:

    I disagree with the following author’s statement: “it is unlikely that a user will run a single-source Shortest Paths query on the entire trillion-node Facebook graph [because] this would be prohibitively slow on any system”. It is very likely, if the system allows to do it. But this statement is definitely true for relational databases.

  3. Jing says:

    I wonder why the author considers column store as relational database. Column store is one type of NoSQL database that cannot be considered a relational database. While it can be implemented to run relational queries, it can also be implemented to run graph queries as well. In fact, some graph databases are built above column store. There is no point of arguing relational database over graph database if they are both implemented over column store and none is native to its operations and require certain degree of customization to fit the specific need.

Leave A Reply

Your email address will not be published. Required fields are marked *


7 + = fourteen