Our Neo4j adventure has taken us through some Big Data architecture discovery in the first and second articles here at about:virtualization. Now that we have seen what comes together to power the Neo4j platform, it is time to take the last step: moving to production!
In order to properly configure a production environment, it would be beneficial to first understand the internal architecture of Neo4j so that you know what those different configuration options actually mean. Figure 1 shows the high-level architecture of Neo4j, as extracted from Neo4j in Action.
Figure 1: Neo4j High-Level Architecture
The base of the Neo4j high-level architecture is the physical disk (or disks) on which Neo4j runs. It is always best to serve data from memory, as we review below, but eventually Neo4j will need to read data from disk. It is therefore best to use disks with high I/O throughput, such as solid-state drives (SSDs), when possible.
In terms of how much storage you should allocate, I refer you to chapter 11 of Neo4j in Action for a detailed discussion and to Neo4j’s Sizing and Hardware Calculator to help you make your decision.
In order to reduce latency, Neo4j provides two levels of caching:
- Filesystem Cache
- Object Cache
The filesystem cache is an area of free memory (RAM that the operating system has not allocated to any process) that is accessed through memory-mapped I/O. When a process requests a file, or part of a file, that data is loaded into the filesystem cache.
Subsequent reads are served from the filesystem cache rather than from disk; likewise, writes are made to the filesystem cache and then flushed to disk. Neo4j uses Java New I/O (Java NIO) to advise the operating system so that data is loaded, read, and written to and from the physical disk efficiently.
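To make memory-mapped I/O concrete, here is a minimal Python sketch. It is only an analogy: Neo4j itself goes through Java NIO, and the file name and record contents below are hypothetical. The operating system backs the mapping with its filesystem cache, so reads and writes touch memory, and flush() asks the operating system to push the changed pages to disk.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a store file; real Neo4j store files use their own binary format.
path = os.path.join(tempfile.mkdtemp(), "example.store")
with open(path, "wb") as f:
    f.write(b"node-0001node-0002")   # two 9-byte records

with open(path, "r+b") as f:
    # Map the whole file; the OS backs this mapping with its filesystem (page) cache.
    with mmap.mmap(f.fileno(), 0) as mapped:
        first_record = mapped[0:9]     # read through the mapping, no explicit read() call
        mapped[9:18] = b"node-9999"    # the write lands in memory first
        mapped.flush()                 # ask the OS to write the changed pages back to disk

# Re-read the file normally to confirm the change reached it.
with open(path, "rb") as f:
    on_disk = f.read()
```

After the flush, reading the file back through ordinary file I/O shows the second record replaced.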
You can configure the filesystem cache through the following properties in the conf/neo4j.properties file:
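As an illustration, the sketch below embeds a sample of these settings and adds up how much memory they ask for. Treat it as an assumption-laden example: the setting names are the ones used in Neo4j 1.x/2.x releases and should be checked against your version, the sizes are placeholders rather than recommendations, and the helper functions are hypothetical.

```python
# Setting names as used in Neo4j 1.x/2.x (verify for your version); sizes are placeholders.
EXAMPLE_PROPERTIES = """
neostore.nodestore.db.mapped_memory=25M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neostore.relationshipstore.db.mapped_memory=50M
"""

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '90M' into a number of bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def mapped_memory_settings(properties_text):
    """Return {setting name: bytes} for every *.mapped_memory line."""
    settings = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        if key.strip().endswith(".mapped_memory"):
            settings[key.strip()] = parse_size(value)
    return settings


settings = mapped_memory_settings(EXAMPLE_PROPERTIES)
total_mapped_bytes = sum(settings.values())
```

Because the filesystem cache lives outside the JVM heap, a total like this has to fit in the RAM that remains once the heap and the operating system have taken their share.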
Chapter 11 of Neo4j in Action goes into more detail about what these are, but the high-level summary of the stores is:
- The node store hosts all of your nodes
- The property store hosts the properties that are defined on your nodes and relationships: primitives and short inline strings are kept in the main property store, longer strings in a separate string store, and arrays in a separate array store
- The relationship store hosts all of your relationships
The next cache that Neo4j provides is an object cache. Because Neo4j runs in a JVM, it maintains nodes and relationships as Java objects in memory. There are two memory configuration settings in Neo4j:
- Configuring the JVM heap with the -Xmx??m option, where ?? is the maximum heap size in megabytes
- Choosing the cache type
Heap configuration is well beyond the scope of this article, but in short, you can set the initial and maximum size of the heap, the breakdown of the heap (e.g. the survivor space size in a Sun JVM), and the garbage collection strategy. Unless you are running a JVM that avoids full stop-the-world garbage collections, such as the one provided by Azul, the tradeoff is between the frequency and the duration of garbage collections.
A large heap means that you can hold more data (or in our application, the number of cached items can grow larger), but the larger the heap, the longer garbage collection will take to run.
The next thing you can do is choose the cache type:
- None: don’t use a high-level cache; no objects will be cached
- Soft (default): provides optimal utilization for the available memory; good for high-performance traversal but might run into garbage collection problems under high load if frequently used parts of the graph do not fit in memory
- Weak: provides a short lifespan for cached objects; good for high-throughput applications where the frequently used portion of the graph is larger than what can fit in memory
- Strong: stores all data that is loaded into memory and never releases it; good for small data sets
- Hpc: a high-performance cache that allows you to dedicate a specific amount of memory to cache loaded nodes and relationships; only available in the enterprise version of Neo4j
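The difference between a strong and a weak cache can be sketched in a few lines of Python. This is an analogy rather than Neo4j's implementation: Neo4j's caches are built on JVM references, Python has no equivalent of a soft reference, and the Node class and cache names below are hypothetical.

```python
import gc
import weakref


class Node:
    """Hypothetical cached graph object."""

    def __init__(self, node_id):
        self.node_id = node_id


strong_cache = {}                            # keeps every loaded object alive
weak_cache = weakref.WeakValueDictionary()   # entry disappears once nothing else uses the object


def load(node_id, cache):
    """Return the cached object, creating and caching it on a miss."""
    node = cache.get(node_id)
    if node is None:
        node = Node(node_id)                 # stands in for reading from the store files
        cache[node_id] = node
    return node


a = load(1, strong_cache)
b = load(1, weak_cache)
del a, b        # the application no longer holds either object
gc.collect()    # force a collection so the effect is visible immediately

strong_still_cached = 1 in strong_cache
weak_still_cached = 1 in weak_cache
```

The strong cache still holds its entry and will keep growing as more of the graph is loaded, while the weak cache has already let go of an object nobody was using.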
In short, allocate enough memory to hold the frequently used parts of your graph in memory and choose the best caching strategy for your business need.
Transaction logs and recoverability
I have not mentioned it thus far, but Neo4j, just like relational databases, provides ACID transaction support for its operations, where ACID stands for Atomic, Consistent, Isolated, and Durable. ACID transactions ensure that your data remains consistent and provide commit/rollback functionality based on the success or failure of each transaction.
Neo4j supports ACID transactions by implementing a write-ahead log (WAL), in which it writes changes to the active transaction log file and physically flushes those changes to disk before they are applied to the underlying store files themselves.
Because every change is on disk in the log before it touches the store files, Neo4j can recover after a crash by replaying the log. You can find the transaction logs in the graph database directory, named nioneo_logical.log.*.
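The write-ahead idea can be sketched with a toy key/value store in Python. This is a simplified illustration, not Neo4j's log format or recovery procedure: the TinyStore class, the example.log file name, and the JSON encoding are all assumptions, and an in-memory dictionary stands in for the store files.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key/value store with a write-ahead log (hypothetical, not Neo4j's format)."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "example.log")
        self.data = {}      # stands in for the store files
        self.recover()

    def commit(self, changes):
        # 1. Append the changes to the log and force them onto disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(changes) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply them to the store.
        self.data.update(changes)

    def recover(self):
        # Replay every logged transaction, in order, to rebuild the store.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self.data.update(json.loads(line))


directory = tempfile.mkdtemp()
store = TinyStore(directory)
store.commit({"node:1": "Alice"})
store.commit({"node:2": "Bob", "node:1": "Alicia"})

# Simulate a crash: throw away the in-memory store and start again from the log.
recovered = TinyStore(directory)
```

The recovered store ends up with the same contents as the original because every committed change was in the log before it was applied.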
I won’t cover high availability (HA) in this article, as HA is only available in the enterprise (non-free) version of Neo4j, but I refer you to chapter 11 of Neo4j in Action for more details. The HA module supports clustering, with the full graph replicated to each instance in the cluster (the data itself is not sharded). If you have a serious application running in Neo4j, then you might want to consider the enterprise version.
Neo4j is one of the most popular graph databases and the community version is available at no cost.
The first article in this series provided a brief introduction to Neo4j and attempted to show you why developers like graph databases by reviewing the following:
- The benefits of using a graph database instead of a relational database for certain problem domains
- The performance difference between SQL JOINs and graph traversals for those problem domains
- The internal representation of data as nodes and relationships in Neo4j, including the fact that both nodes and relationships can contain a set of key/value pairs that represent the data
- The three types of traversal options to retrieve data from Neo4j: Core Java API, Traversal API, and Cypher
Our second article reviewed the very important topic of indexing. And to close out our Neo4j series, we have touched on important production deployment considerations.