Devs are from Venus, Ops are from Mars, Big Data: Neo4j

neo4j logoIf you’re just joining this column, it is one aspect of a response to the gap between how development and how operations view technology and measure their success – it is wholly possible for development and operations to be individually successful, but for the organization to fail.

So, what can we do to better align development and operations so that they can speak the same language and work towards the success of the organization as a whole? This article series attempts to address a portion of this problem by presenting operation teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.

The current article series is reviewing Big Data and the various solutions that have been built to capture, manage, and analyze very large amounts of data. Unlike the relational databases of the past, Big Data is not one-size-fits-all, but rather individual solutions have been built that address specific problem domains.

The last handful of articles reviewed Hadoop, MapReduce, HBase, and MongoDB so now we turn our attention to a completely different domain of Big Data stores: graph databases. This article reviews Neo4j, which is probably the most popular open source graph database.

Introduction to Neo4j

Graph databases are far different from everything that we have looked at thus far. Big Data solutions can be summarized as falling into one of the following four buckets:

  • Key / Value stores, such as Redis and Memcached
  • Column-oriented databases, such as Cassandra and HBase
  • Document-oriented databases, such as MongoDB and CouchDB
  • Graph databases, such as Neo4j and OrientDB

When working with relational databases is it not uncommon to see developers use an Object Relational Mapping (ORM) tool like Hibernate or iBatis. This is not just for ease of coding, but rather it is to bridge a fundamental gap between object-oriented programming and relational databases.

Putting aside functional programming for the moment, historically developers think in terms of objects and their relationships, whether those relationships are one-to-one, one-to-many, many-to-many, or even more advanced relationships like inheritance or polymorphism.

Relational databases are primarily focused on defining types (tables), attributes (columns), and relationships (foreign keys and join tables) as well as normalizing data to reduce redundancy. The net effect is that developers work with objects that do not cleanly map to database tables and thus a translation layer can simplify development.

All of this background is to say that graph databases, for a large number of problem domains, allow developers to think in terms of objects (nodes) and their relationships (edges). Let’s consider an example that I am paraphrasing from Manning’s Neo4j in Action book that defines friend relationships in a social network. If we were to create users and wanted them to be able to define friend relationships between one another, we could do so with two tables in a relational database:

  • User: defines the user himself
  • User_Friend: defines a friendship between two users

The relationship between these two tables is shown in figure 1.

relationship between user and user_friend

Data for this table would look something like the following:

user and user_friend data tables

This data tells us that there are several users and several friendship relations between those users. Figure 2 shows a fuller graph of these users and relationships.

users and their friends chart

If we want to find out the number of friends that Linda has we can do so with a query like the following:

SELECT COUNT(distinct uf.*) FROM User_Friend uf WHERE uf.user_1 = 2;

Likewise, if we wanted to find the friends of Linda’s friends we would execute the following query:

SELECT COUNT(distinct uf2.*) FROM User_Friend uf1
INNER JOIN User_Friend uf2 on uf1.user_1 = uf2.user_2
WHERE uf1.user_id = 2

This was a little more complicated, but still do-able. Now let’s look at friends of friends of friends of a user:

SELECT COUNT(distinct uf3.*) FROM User_Friend uf1
INNER JOIN User_Friend uf2 on uf1.user_1 = uf2.user_2
INNER JOIN User_Friend uf3 on uf2.user_1 = uf3.user_2
WHERE uf1.user_id = 2

A question that you might have at this point is: why would you do something like this? We might want to build a recommendation engine that makes suggestions to the user about something like movies or book reviews that the user might like based on the opinions of friends in their social network.

This SQL isn’t all that complicated, so why change the paradigm from a relational model to a graph model? Again to borrow from Neo4j in Action, the authors setup this data model with 1,000 users and 50 relationships per user, so a total of 50,000 User_Friend records, and then started searching for friends at depths of two through five, where a depth of two would be friends of friends, and so forth. They recorded the number of records returned and the response times as follows:


You can checkout their machine configurations and testing criteria in chapter 1, but the point here is that relational databases are very good at joining data (that is what they were designed for), but once you start computing the Cartesian product of tables multiple times, the response time really falls apart! In other words, finding friends of friends and even friends of friends of friends is very fast to do relationally, but taking it any further saw marked degradation.

With a graph database, you identify a starting node and then traverse across its relationships to retrieve your results. Considering the graph in figure 2, assume that we name each of the relationships (edges) as “IS_FRIEND_OF”, and then using Neo4j’s traversal API, we could retrieve the same results as the SQL queries above as follows:

TraversalDescription traversalDescription =
.relationships( “IS_FRIEND_OF”, Direction.OUTGOING )
.evaluator( Evaluators.atDepth( 2 ) )
.uniqueness( Uniqueness.NODE_GLOBAL );
Interable nodes =

Don’t worry about the syntax right now, essentially this is saying that from our starting node we want to traverse across all “IS_FRIEND_OF” relationships (edges) starting from the originating node and heading out to a friend, we want the traversal to go out to a depth of 2 (friends of friends), and we want the results to be unique across the entire graph (don’t include duplicates). The execution results are shown below:


The striking differences is that the execution time is directly related to the number of results, so at a depth of 5 we have about the same number of nodes so our 92 second SQL query drops to 0.07 seconds!

To jump to the end of their analysis, let’s consider a table to 1 million users and 50 relationships between users, so 50 million User_Friend rows. The summary of repeating the exercise in both MySQL and Neo4j are shown below:


What these results show is that the traversal execution time of a graph database is directly proportional to the size of the result set. And the execution time a SQL query with multiple joins is impacted by the size of the Cartesian product of the results it needs to analyze. This is why the Neo4j execution time is relatively consistent (it is only considering the results and not all of the possible results) and the SQL execution time increases exponentially (50M * 50M * 50M … results to examine).

Finally, graph databases are not the solution to all problems, but you might be surprised at how many problems lend themselves well to graph relationships. Building a social graph is one example, defining an access control list is another example. If you can define your domain model in terms of nodes and relationships and your queries are primarily interested in traversing relationships then a graph database can be very efficient.

Inside Neo4j

Now that we have made the case for Neo4j, this section looks inside Neo4j to see how it defines nodes and edges to organize data.

Nodes define the entities in your model and might be things like users, addresses, movies, books, or even groups, permissions, and roles. Inside Neo4j, nodes contain a collection of keys and values that host your data. For example, you might define a key of “name” and a value of “Linda”. Nodes can also be labeled, which can be seen as roughly analogous to defining the node’s type, but a node can contain multiple labels so the analogy is loose.

Relationships define the edges between nodes, relationships are directed, and relationships are named. Furthermore, key/value data can be associated with relationships. Consider a relationship between a user and a movie called HAS_SEEN. We might want to add a key of “rating” and a value of 1 through 5 to define the number of stars that the user rates the movie.

Figure 3 shows an example of these concepts.

sample Neo4j graph

Figure 3 defines two users: Steven and Linda that are labeled with a label named “USER”. Figure 3 also contains a movie named “Cinderella” that is labeled with a label named “MOVIE”. Steven defines a relationship named IS_FRIEND_OF with Linda. Note that the relationship is directed from Steven to Linda.

All relationships in Neo4j are directed, so you could create another relationship from Linda to Steven to show that Steven is also a friend of Linda, but the direction does not negate the ability to traverse it using Neo4j. For example, although the IS_FRIEND_OF relationship is outgoing from Steven to Linda, you can start at Linda and find all IS_FRIEND_OF relationships both ingoing to Linda and outgoing from Linda.

Linda and Steven both saw “Cinderella”, where Steven gave the movie a 4 star rating and Linda gave it a 5 star rating.

Finding Data

Once you have defined your data (nodes and relationships), the next big challenge is finding your data. Neo4j defines three primary strategies for retrieving data:

  • Core Java API
  • Traversal API
  • Cypher Queries

The Core Java API is a low-level way of manually walking across relationships programmatically. For example, starting at a node, a Java program can retrieve all of its relationships, examine the types of those relationships, and then retrieve the nodes at the other end of those relationships. This is a very natural way for Java programmers to think about nodes and their relationships, but it leads to very large search methods because it is up to the programmer to implement each step in the traversal herself.

The Traversal API is a higher-level programmatic API that Java programmers can use to find data. The sample query defined above shows an example of using the Traversal API:

TraversalDescription traversalDescription =
.relationships( “IS_FRIEND_OF”, Direction.OUTGOING )
.evaluator( Evaluators.atDepth( 2 ) )
.uniqueness( Uniqueness.NODE_GLOBAL );
Iterable nodes =

The program defines a TraversalDescription object that tells Neo4j how build its result set and then executes it by passing it the starting node and requesting an iterable interface to reading its nodes.

The final interface that Neo4j provides is a custom query language called Cypher. Cypher allows you to specify matching criteria for the starting nodes and then the rules for how to traverse from the starting nodes to the destination nodes and what data to extract along the way. For example, the following Cypher query shows how to find all movies that a specific user has seen:

start user=node(1)
match (user)-[:HAS_SEEN]->(movie)
return movie;

This Cypher query starts with the node with an ID of 1 and then traverses across all of its HAS_SEEN relationships (all relationships are enclosed in braces and prefaced with a colon; you can prefix the colon with an identifier so that you can reference the relationship later in the query.) If you notice the arrowhead on the right, you can see that we’re looking for a relationship emanating from user and going out to the movie (->).

We could look for incoming relationships by reversing the arrowhead: (user)<-[:IS_FRIEND_OF]-(friend), or we could look for either incoming or outgoing relationships by not providing an arrowhead: (user)-[:IS_FRIEND_OF]-(friend). The resultant node is named in the Cypher query (“movie” in this example) so that we can reference it later.

The return statement references the movie node and adds it to the result set, which contains a collection of all movies that the user has seen. We could further refine this by returning to just return the name of each movie and not the full nodes themselves.

This article provided a brief introduction to Neo4j and attempted to show you why developers like graph databases by reviewing the following:

  • The benefits to using a graph database instead of a relational database for certain problem domains
  • The performance difference between SQL JOINs and graph traversals for those problem domains
  • The internal representation of data as nodes and relationships in Neo4j, including the fact that both nodes and relationships can contain a set of key/value pairs that represent the data
  • The three types of traversal options to retrieve data from Neo4j: Core Java API, Traversal API, and Cypher

In the next article we’ll review indexes, which are used to help you find your starting nodes, more advanced Cypher queries, and production deployment strategies.

Leave a Reply

Your email address will not be published. Required fields are marked *