If you’re just joining this column, it is one aspect of a response to the gap between how development and how operations view technology and measure their success – it is wholly possible for development and operations to be individually successful, but for the organization to fail.
So, what can we do to better align development and operations so that they can speak the same language and work towards the success of the organization as a whole? This article series attempts to address a portion of this problem by giving operations teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.
The current article series is reviewing Big Data and the various solutions that have been built to capture, manage, and analyze very large amounts of data. Unlike the relational databases of the past, Big Data is not one-size-fits-all, but rather individual solutions have been built that address specific problem domains.
The last handful of articles reviewed Hadoop, MapReduce, HBase, and MongoDB, so now we turn our attention to a completely different domain of Big Data stores: graph databases. We’ve looked at Neo4j a bit already, so this is the second installment in which we explore what is probably the most popular open source graph database.
In the last article we reviewed:
- Motivations behind using Neo4j and its performance benefits for certain problem domains
- The internal representation of data inside Neo4j as nodes and relationships
- The three mechanisms provided to find data in Neo4j: Core Java API, Traversal API, and Cypher, the custom Neo4j query language
This article reviews indexes, which are used to help you find your starting nodes; production deployment strategies are left for a later installment.
You are probably familiar with indexes in relational databases: when data is inserted, updated, or removed from a table, any columns that are indexed, such as the primary key, are maintained in a separate search construct called an index.
The benefit to the index is that if you are searching a table with constraints on an indexed column, such as SELECT * FROM user WHERE age >= 21, where age is an indexed column, then the database can quickly find matching rows in its index and then return the corresponding rows from the table. The drawback is that for each indexed column, any time data is inserted, updated, or deleted, the index must also be updated.
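The trade-off can be sketched in a few lines of Java. This is a toy model built on plain collections, not a real database engine: the class name AgeIndexSketch and its methods are made up for illustration, and the sorted map merely stands in for whatever structure a database actually uses. It shows that an insert has to touch both the table and the index, while a query such as age >= 21 can skip straight to the matching index entries instead of inspecting every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a table with a secondary index on "age" (hypothetical names, not a real engine).
class AgeIndexSketch {
    // The "table": the position in this list acts as the row id, the value is the age column.
    private final List<Integer> ages = new ArrayList<>();
    // The "index": age -> row ids holding that age, kept sorted by age.
    private final TreeMap<Integer, List<Integer>> byAge = new TreeMap<>();

    // An insert pays twice: once for the table and once for the index.
    int insert(int age) {
        int rowId = ages.size();
        ages.add(age);
        byAge.computeIfAbsent(age, k -> new ArrayList<>()).add(rowId);
        return rowId;
    }

    // Full scan: inspects every row, as a query on an unindexed column would.
    List<Integer> scanAtLeast(int minAge) {
        List<Integer> hits = new ArrayList<>();
        for (int rowId = 0; rowId < ages.size(); rowId++) {
            if (ages.get(rowId) >= minAge) {
                hits.add(rowId);
            }
        }
        return hits;
    }

    // Index lookup: jumps to the first qualifying age and reads only matching entries.
    List<Integer> indexAtLeast(int minAge) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> rowIds : byAge.tailMap(minAge, true).values()) {
            hits.addAll(rowIds);
        }
        return hits;
    }

    public static void main(String[] args) {
        AgeIndexSketch users = new AgeIndexSketch();
        for (int age : new int[] {17, 34, 21, 19, 52}) {
            users.insert(age);
        }
        // Both calls find the same rows; the index returns them ordered by age.
        System.out.println("scan:  " + users.scanAtLeast(21));
        System.out.println("index: " + users.indexAtLeast(21));
    }
}
```

Updates and deletes would need the same double bookkeeping as insert, which is the maintenance cost described above.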
Neo4j makes use of indexes to help you find your starting nodes. You’ll recall from the previous article that traversal and Cypher queries begin at one or more starting nodes and then define rules for matching results.
Therefore, it is important to define a quick look-up strategy for commonly searched node properties so that you can find your starting node(s) without having to scan every object in your database (akin to a relational database table scan).
Neo4j provides the following mechanisms to allow you to index your data:
- Manually managing your index
- Automatic indexing
Neo4j leverages Apache Lucene as its search engine and exposes methods that allow developers to manually manage search indexes. Neo4j defines indexes by name, so developers are free to associate indexed data with whatever name makes sense. For example, the index for users may be named “users” and accessed programmatically from the graph database’s IndexManager:
IndexManager indexManager = graphDB.index();
Index<Node> userIndex = indexManager.forNodes( "users" );
With this index in hand, the developer can create a node and then add an index record for the node.
Node person = graphDB.createNode();
person.setProperty( "name", "Steven" );
person.setProperty( "email", "email@example.com" );
// Index the node under the same email value that was stored on it
userIndex.add( person, "email", "email@example.com" );
We created a person node, set its name and email, and then added an index record for that node keyed on its email address.
Once the index record exists, finding the node is much faster than scanning the entire graph of data:
IndexHits<Node> indexHits = userIndex.get( "email", "email@example.com" );
Node user = indexHits.getSingle();
We ask the user index to get all nodes whose email is “email@example.com”. This returns an IndexHits instance through which we can access all nodes with the specified email address. In this case we only expect a single result, so we invoke getSingle() to retrieve the user node.
The challenge in manually managing indexes is that the developer is wholly responsible for creating each index record, updating it when the indexed value changes (such as when the user’s email address changes), and deleting it when the node goes away. This forces a lot of responsibility on the application, which is not necessarily a good idea.
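To make those three responsibilities concrete, here is a minimal sketch that models a hand-maintained index with two plain Java maps. It deliberately does not use the Neo4j API: the class ManualIndexSketch, its methods, and the example addresses are all hypothetical. The point is only that every write path has to remember the index, and forgetting any one of the marked lines leaves a stale or missing record.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a hand-maintained index (plain maps, hypothetical names, not the Neo4j API).
class ManualIndexSketch {
    // The "graph": node id -> email property.
    private final Map<Long, String> emailByNode = new HashMap<>();
    // The "index": email -> node id. Nothing keeps it in sync except the code below.
    private final Map<String, Long> nodeByEmail = new HashMap<>();

    void createUser(long nodeId, String email) {
        emailByNode.put(nodeId, email);
        nodeByEmail.put(email, nodeId);      // responsibility 1: add the index record
    }

    void changeEmail(long nodeId, String newEmail) {
        String oldEmail = emailByNode.put(nodeId, newEmail);
        nodeByEmail.remove(oldEmail);        // responsibility 2: drop the stale record...
        nodeByEmail.put(newEmail, nodeId);   // ...and add one for the new value
    }

    void deleteUser(long nodeId) {
        String email = emailByNode.remove(nodeId);
        nodeByEmail.remove(email);           // responsibility 3: clean up on delete
    }

    // Returns the node id for the email, or null when no index record exists.
    Long findByEmail(String email) {
        return nodeByEmail.get(email);
    }

    public static void main(String[] args) {
        ManualIndexSketch graph = new ManualIndexSketch();
        graph.createUser(1L, "old@example.com");
        graph.changeEmail(1L, "new@example.com");
        System.out.println("old address: " + graph.findByEmail("old@example.com"));
        System.out.println("new address: " + graph.findByEmail("new@example.com"));
    }
}
```

If the remove call in changeEmail were omitted, a search on the old address would still return the node, which is exactly the kind of bug that manual index management invites.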
Automatic indexing comes in two flavors:
- Schema Indexing
- Auto Indexing
Schema indexing works closely with the notion of node labels and allows you to automatically create an index for a node label and a set of properties. For example, if we have a label named “USER” then we could tell Neo4j to create a schema index on the “email” property of all nodes with the “USER” label. Neo4j will then manage the index for you, much like a relational database.
When you add a new node with the “USER” label, Neo4j adds a new index record for that node’s email property. If you update the node’s email address, Neo4j updates the index. And if you delete the node, Neo4j deletes the index record for you.
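The behavior just described can be modeled with a small, self-contained Java sketch. It is a conceptual illustration only and says nothing about how Neo4j implements schema indexes internally: the class SchemaIndexSketch and its methods are hypothetical, and each node carries a single label to keep the model short. The index is declared once for a label and property, and from then on the store itself keeps it current on every create, update, and delete.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy store that maintains one (label, property) index by itself (hypothetical, not Neo4j internals).
class SchemaIndexSketch {
    private static final class Node {
        final String label;
        final Map<String, String> props = new HashMap<>();
        Node(String label) { this.label = label; }
    }

    private final Map<Long, Node> nodes = new HashMap<>();
    private long nextId = 0;
    // Declared once up front: which label and which property the index covers.
    private final String indexedLabel;
    private final String indexedProperty;
    // The index itself: property value -> ids of matching nodes.
    private final Map<String, Set<Long>> index = new HashMap<>();

    SchemaIndexSketch(String indexedLabel, String indexedProperty) {
        this.indexedLabel = indexedLabel;
        this.indexedProperty = indexedProperty;
    }

    long createNode(String label) {
        long id = nextId++;
        nodes.put(id, new Node(label));
        return id;
    }

    // Callers only set the property; the store updates the index as a side effect.
    void setProperty(long id, String key, String value) {
        Node node = nodes.get(id);
        String old = node.props.put(key, value);
        if (node.label.equals(indexedLabel) && key.equals(indexedProperty)) {
            if (old != null) {
                unindex(old, id);
            }
            index.computeIfAbsent(value, v -> new HashSet<>()).add(id);
        }
    }

    // Deleting a node also removes its index record.
    void deleteNode(long id) {
        Node node = nodes.remove(id);
        String value = node.props.get(indexedProperty);
        if (node.label.equals(indexedLabel) && value != null) {
            unindex(value, id);
        }
    }

    Set<Long> find(String value) {
        return index.getOrDefault(value, Collections.emptySet());
    }

    private void unindex(String value, long id) {
        Set<Long> ids = index.get(value);
        if (ids != null) {
            ids.remove(id);
            if (ids.isEmpty()) {
                index.remove(value);
            }
        }
    }

    public static void main(String[] args) {
        SchemaIndexSketch db = new SchemaIndexSketch("USER", "email");
        long user = db.createNode("USER");
        db.setProperty(user, "email", "old@example.com");
        db.setProperty(user, "email", "new@example.com");
        System.out.println("old address: " + db.find("old@example.com"));
        System.out.println("new address: " + db.find("new@example.com"));
    }
}
```

Compared with the manual approach, the application code never mentions the index after declaring it, so there is no write path on which a developer can forget to update it.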
Finally, Neo4j offers auto indexing, which can automatically create indexes for nodes and/or for relationships and for specific properties. For example, you can tell Neo4j to create an automatic index for nodes with a name property. The drawback is that there is currently no way to specify the schema (or node label) on which to create the automatic index, so the automatic index will intermix nodes of all labels.
Or stated in other terms, if you have nodes labeled MOVIE and nodes labeled USER, both of which contain a name property, then the automatic index will mix together movies and users based on their names. The result is that if you name your daughter Cinderella then you’ll find both the movie and your daughter ☺
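A short Java sketch shows why the results get mixed. Again this is a toy model with hypothetical names (AutoIndexSketch and its methods), not the Neo4j API, and it gives each node a single label for brevity. Because the index key is the property value alone, the label plays no part in the lookup, and any caller who wants only users has to filter the hits afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of an index keyed by property value only (hypothetical names, not the Neo4j API).
class AutoIndexSketch {
    static final class Node {
        final String label;
        final String name;
        Node(String label, String name) {
            this.label = label;
            this.name = name;
        }
        @Override
        public String toString() {
            return label + ":" + name;
        }
    }

    // The label is not part of the key, so nodes of every label share one index.
    private final Map<String, List<Node>> byName = new HashMap<>();

    Node create(String label, String name) {
        Node node = new Node(label, name);
        byName.computeIfAbsent(name, k -> new ArrayList<>()).add(node);
        return node;
    }

    // Returns every node with this name, whatever its label.
    List<Node> findByName(String name) {
        return byName.getOrDefault(name, Collections.emptyList());
    }

    // Narrowing to one label is left to the caller as a post-filter.
    List<Node> findByName(String name, String label) {
        List<Node> hits = new ArrayList<>();
        for (Node node : findByName(name)) {
            if (node.label.equals(label)) {
                hits.add(node);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AutoIndexSketch db = new AutoIndexSketch();
        db.create("MOVIE", "Cinderella");
        db.create("USER", "Cinderella");
        System.out.println("all hits:   " + db.findByName("Cinderella"));
        System.out.println("users only: " + db.findByName("Cinderella", "USER"));
    }
}
```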
To close out our discussion of indexing, let’s consider the benefits and drawbacks. The benefit is faster searching on indexed properties. The drawback is the additional overhead of maintaining the index: every time a node or relationship with an indexed property is inserted, updated, or deleted, the index must be updated as well. So choose your indexing strategy, and the properties you index, with that write cost in mind.
I recommend that you look at your search criteria to determine the most appropriate properties to index and, at the time of this writing, I recommend using schema indexing because it provides more flexibility than auto-indexing and reduces the amount of work your developers will have to do.
Neo4j is one of the most popular graph databases and the community version is available at no cost.
The first article in this series provided a brief introduction to Neo4j and attempted to show you why developers like graph databases by reviewing the following:
- The benefits to using a graph database instead of a relational database for certain problem domains
- The performance difference between SQL JOINs and graph traversals for those problem domains
- The internal representation of data as nodes and relationships in Neo4j, including the fact that both nodes and relationships can contain a set of key/value pairs that represent the data
- The three types of traversal options to retrieve data from Neo4j: Core Java API, Traversal API, and Cypher
This article reviewed an important core topic: indexes. That leaves us with an exciting post ahead to tackle finding your starting nodes and production deployment considerations.