Neo Performance Guide
From NeoWiki
This is the Neo4j performance guide. It will attempt to guide you in how to use Neo4j to achieve maximum performance.
Please note that this is a work in progress. As of March 2009, it contains correct information, but it needs to be a bit more polished. If you find something you disagree with, please join in the discussions on the mailing list or feel free to edit it here directly.
Neo4j primitives' lifecycle
Neo4j manages its primitives (nodes, relationships and properties) differently depending on how you use Neo4j. For example, if you never get a property from a certain node or relationship, that node or relationship will not have its properties loaded into memory. Likewise, a node that you never request any relationships from will not load any of its relationships into memory.
Nodes and relationships are cached using LRU caches. If you (for some strange reason) only work with nodes, the relationship cache will become smaller and smaller while the node cache is allowed to grow (if needed). Working with many relationships and few nodes results in a bigger relationship cache and a smaller node cache.
The Neo4j API specification does not say anything about the order of relationships, so invoking Node.getRelationships() may return relationships in a different order than the previous invocation. This allows us to make even heavier optimizations by returning the relationships that are most commonly traversed.
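To illustrate the eviction policy only, here is a minimal LRU cache sketch built on java.util.LinkedHashMap. It is not Neo4j's internal cache; the class name LruCacheSketch and its maxSize parameter are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of LRU eviction; not Neo4j's internal cache class.
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V>
{
    private final int maxSize;

    LruCacheSketch( int maxSize )
    {
        // accessOrder = true orders entries from least to most recently used
        super( 16, 0.75f, true );
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry( Map.Entry<K, V> eldest )
    {
        // Evict the least recently used entry once the cache exceeds its size
        return size() > maxSize;
    }

    public static void main( String[] args )
    {
        LruCacheSketch<Long, String> nodeCache = new LruCacheSketch<Long, String>( 2 );
        nodeCache.put( 1L, "node 1" );
        nodeCache.put( 2L, "node 2" );
        nodeCache.get( 1L );           // node 1 is now the most recently used
        nodeCache.put( 3L, "node 3" ); // evicts node 2
        System.out.println( nodeCache.keySet() ); // prints [1, 3]
    }
}
```

The entry that has gone longest without being read or written is the one dropped, which is why a primitive type you rarely touch gradually loses its place in the cache.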
All in all Neo4j has been designed to be very adaptive depending on how it is used. The (unachievable) overall goal is to be able to handle any incoming operation without having to go down and work with the file/disk I/O layer.
Passing in tuning parameters to Neo4j
When creating the embedded Neo4j instance it is possible to pass in parameters contained in a map where keys and values are strings. A utility method exists to convert a standard Java properties file to such a map:
Map<String,String> configuration =
EmbeddedGraphDatabase.loadConfigurations( "neo_config.props" );
GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "my-neo4j-db/", configuration );
The default configuration parameters can be found in Neo_default.props.
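The same kind of string map can also be built by hand with java.util.Properties. The sketch below is an assumption about what such a conversion involves, not the source of the Neo4j utility method. The class and method names are hypothetical, and the properties are read from an in-memory string so the example runs without a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper; shows one way to turn properties text into a Map<String,String>.
final class ConfigMapSketch
{
    static Map<String, String> toMap( String propertiesText ) throws IOException
    {
        Properties props = new Properties();
        props.load( new StringReader( propertiesText ) );
        Map<String, String> config = new HashMap<String, String>();
        for ( String key : props.stringPropertyNames() )
        {
            config.put( key, props.getProperty( key ) );
        }
        return config;
    }

    public static void main( String[] args ) throws IOException
    {
        // Parameter names are taken from this guide; the values are examples only.
        String text = "use_adaptive_cache=YES\nadaptive_cache_heap_ratio=0.77\n";
        Map<String, String> config = toMap( text );
        System.out.println( config.get( "adaptive_cache_heap_ratio" ) ); // prints 0.77
    }
}
```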
Tuning cache settings
The adaptive cache implementation in Neo4j (use_adaptive_cache=YES) will automatically cache more nodes, relationships and properties if there is heap space available. The overall goal is to cache as much as possible while keeping the heap in a state where the garbage collector can run efficiently. To avoid "GC thrashing" (collections that use too many cycles and run too often), the adaptive cache implementation tries to keep heap usage lower than (almost) full. By default adaptive_cache_heap_ratio is set to 0.77 (77%). This works well with small heaps (less than a few hundred MB); larger heaps can have a higher heap ratio setting.
When heap usage is under the configured heap ratio, Neo4j's node and relationship caches will grow, and when it is above they will shrink. How aggressively they grow and shrink can be controlled with adaptive_cache_manager_increase_ratio and adaptive_cache_manager_decrease_ratio (new cache size = cache size * increase/decrease ratio).
Monitor your heap usage and GC times. The goal is a high heap ratio (large caches), a short overall time spent in GC and a small change in heap usage (high cache hit ratio). Which parameters to choose also depends very much on the memory usage of the application using Neo4j. If the heap is so small that the layers above Neo4j use all heap memory, Neo4j will not cache any nodes or relationships, leading to poor performance. The min_node_cache_size and min_relationship_cache_size parameters can be used when adaptive caching is turned on to set a minimum size for the node and relationship caches.
Applications that have very high and frequent memory allocation and collection may work better with adaptive caches turned off (use_adaptive_cache=NO), using the max_node_cache_size and max_relationship_cache_size parameters to configure fixed-size LRU caches. Most graph types and data sets will in such a scenario get better performance with a relationship cache that is 3 times (or more) larger than the node cache.
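The resizing rule can be sketched as follows. This illustrates the formula only and is not Neo4j's cache manager; the method name and the example ratios 1.1 and 0.9 are assumptions.

```java
// Hypothetical illustration of: new cache size = cache size * increase/decrease ratio.
final class AdaptiveCacheSketch
{
    static int nextCacheSize( int cacheSize, double heapUsage, double heapRatio,
        double increaseRatio, double decreaseRatio )
    {
        // Grow while heap usage is under the configured ratio, shrink otherwise
        double ratio = heapUsage < heapRatio ? increaseRatio : decreaseRatio;
        return (int) Math.round( cacheSize * ratio );
    }

    public static void main( String[] args )
    {
        // 0.77 is the default heap ratio from this guide; 1.1 and 0.9 are assumed values
        System.out.println( nextCacheSize( 1000, 0.50, 0.77, 1.1, 0.9 ) ); // prints 1100
        System.out.println( nextCacheSize( 1000, 0.90, 0.77, 1.1, 0.9 ) ); // prints 900
    }
}
```

Ratios close to 1.0 make the caches adapt slowly and smoothly, while ratios further from 1.0 make them react faster at the cost of larger swings in heap usage.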
Tuning memory mapped I/O
Neo4j makes heavy use of the java.nio package that was introduced in Java 1.4. Native I/O may result in memory being allocated outside the normal Java heap, so that memory usage needs to be taken into consideration. A well configured OS with large disk caches will help a lot once we get cache misses in the node and relationship caches; therefore it is not a good idea to use all available memory as Java heap.
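As a general illustration of memory mapping with java.nio (not Neo4j's store code), the sketch below maps part of a temporary file and reads a value back through the mapping. The class name, file name prefix and the 4096-byte region size are assumptions for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical illustration of java.nio memory mapping; not Neo4j's store implementation.
final class MappedFileSketch
{
    static long writeAndReadBack( long value ) throws IOException
    {
        File file = File.createTempFile( "mapped-sketch", ".db" );
        file.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile( file, "rw" );
        try
        {
            FileChannel channel = raf.getChannel();
            // Map 4096 bytes of the file; the mapped region lives outside the Java heap
            MappedByteBuffer buffer = channel.map( FileChannel.MapMode.READ_WRITE, 0, 4096 );
            buffer.putLong( 0, value );
            return buffer.getLong( 0 );
        }
        finally
        {
            raf.close();
        }
    }

    public static void main( String[] args ) throws IOException
    {
        System.out.println( writeAndReadBack( 42L ) ); // prints 42
    }
}
```

Because the mapped region is not part of the heap, it is not limited by the -Xmx setting, which is why RAM has to be left over for it when sizing the heap.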
To tune memory mapped settings start by investigating the size of the different store files found in the directory of your Neo4j database. Here is an example of some of the files and sizes in a Neo4j database:
14M   neostore.nodestore.db
510M  neostore.propertystore.db
1.2G  neostore.propertystore.db.strings
304M  neostore.relationshipstore.db
The nodestore stores information about nodes. The propertystore stores information about properties and holds all simple properties such as primitive types (for both relationships and nodes). The propertystore strings file stores all string properties, and finally the relationshipstore holds all the relationships.
In this example the application is running on a machine with 4GB of RAM. We've reserved about 2GB for the OS and other programs. The Java heap is set to 1.5GB, which leaves about 500MB of RAM that can be used for memory mapping. If traversal speed is of highest priority, it is good to try to memory map as much as possible of the node and relationship stores. An example configuration focusing on traversal speed would then look something like:
neostore.nodestore.db.mapped_memory=15M
neostore.relationshipstore.db.mapped_memory=285M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.index.mapped_memory=0M
neostore.propertystore.db.index.keys.mapped_memory=0M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
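The arithmetic behind this configuration can be checked with a small sketch that sums the mapped_memory settings and compares them with the RAM left after the OS and the Java heap. The class and method names are hypothetical, and the parser only handles the M suffix used in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical budget check for the example configuration in this guide.
final class MappedMemoryBudgetSketch
{
    // Parses a size such as "285M" into megabytes; only the M suffix is handled
    static int megabytes( String size )
    {
        if ( !size.endsWith( "M" ) )
        {
            throw new IllegalArgumentException( "Expected a size ending in M: " + size );
        }
        return Integer.parseInt( size.substring( 0, size.length() - 1 ) );
    }

    static int totalMegabytes( Map<String, String> config )
    {
        int total = 0;
        for ( Map.Entry<String, String> entry : config.entrySet() )
        {
            if ( entry.getKey().endsWith( ".mapped_memory" ) )
            {
                total += megabytes( entry.getValue() );
            }
        }
        return total;
    }

    static Map<String, String> exampleConfig()
    {
        Map<String, String> config = new LinkedHashMap<String, String>();
        config.put( "neostore.nodestore.db.mapped_memory", "15M" );
        config.put( "neostore.relationshipstore.db.mapped_memory", "285M" );
        config.put( "neostore.propertystore.db.mapped_memory", "100M" );
        config.put( "neostore.propertystore.db.index.mapped_memory", "0M" );
        config.put( "neostore.propertystore.db.index.keys.mapped_memory", "0M" );
        config.put( "neostore.propertystore.db.strings.mapped_memory", "100M" );
        config.put( "neostore.propertystore.db.arrays.mapped_memory", "0M" );
        return config;
    }

    public static void main( String[] args )
    {
        int budget = 4096 - 2048 - 1536; // 4GB RAM minus 2GB for the OS minus 1.5GB heap
        int total = totalMegabytes( exampleConfig() );
        System.out.println( total + "M of " + budget + "M" ); // prints 500M of 512M
    }
}
```

Note that the 14M node store fits entirely in its 15M mapping and the 304M relationship store is almost fully covered by 285M, while the property stores are only partially mapped.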
Disks, RAM and other tips
As with any persistence solution, performance depends very much on the persistence media used. Better disks mean better performance.
If you have multiple disks or persistence media available, it may be a good idea to split the store files and transaction logs across those disks. Having the store files on disks with low seek time can do wonders for non-cached read operations. Today a typical mechanical drive has an average seek time of about 5ms, which can cause a query or traversal to be very slow when available RAM is too low or the caches and memory mapped settings are badly configured. A good new SATA-enabled SSD has an average seek time of less than 100 microseconds, meaning those scenarios will execute at least 50 times faster.
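The factor of 50 follows from 5ms / 100 microseconds. As a rough worked example, the sketch below estimates the time spent seeking under the simplifying assumption that every uncached read costs exactly one average seek. The class name and the figure of 10,000 uncached reads are made up for illustration.

```java
// Hypothetical back-of-the-envelope estimate; assumes one average seek per uncached read.
final class SeekTimeSketch
{
    static double seekSeconds( long uncachedReads, double avgSeekMicros )
    {
        return uncachedReads * avgSeekMicros / 1000000.0;
    }

    public static void main( String[] args )
    {
        long uncachedReads = 10000; // assumed number of cache misses in one traversal
        System.out.println( seekSeconds( uncachedReads, 5000.0 ) ); // mechanical drive: prints 50.0
        System.out.println( seekSeconds( uncachedReads, 100.0 ) );  // SSD: prints 1.0
    }
}
```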
To avoid hitting disk you need more RAM. On a standard mechanical drive you can handle graphs with a few tens of millions of primitives with 1-2GB of RAM. 4-8GB of RAM can handle graphs with hundreds of millions of primitives while you need a good server with 16-32GB to handle billions of primitives. However, if you invest in a good SSD you will be able to handle much larger graphs on less RAM.
Neo4j likes Java 1.6 JVMs running in server mode, so consider upgrading to that if you haven't yet (or at least use the -server flag). Use tools like vmstat or equivalent to gather information while your application is running. If you have high I/O waits and not that many blocks going in and out to disk when running read/write transactions, it's a sign that you need to tweak your Java heap, Neo4j cache and memory mapped settings (or maybe even get more RAM or better disks).
Second level caching
As mentioned in the Guidelines for Building a Neo App, "always assume the graph is in memory", but sometimes it is necessary to optimize certain performance critical sections. Compared to in-memory data structures, Neo4j adds a small overhead even if the node, relationship or property in question is cached. If this becomes an issue, use a profiler to find these hot spots and then add your own second-level caching. We believe second-level caching should be avoided to the greatest extent possible, since it forces you to take care of invalidation, which can sometimes be hard. But when all else fails you have to use it, so here is an example of how it can be done.
We have some POJO that wraps a node holding its state. In this particular POJO we've overridden the equals implementation.
public boolean equals( Object obj )
{
    return underlyingNode.getProperty( "some_property" ).equals( obj );
}
public int hashCode()
{
    return underlyingNode.getProperty( "some_property" ).hashCode();
}
This works fine in most scenarios, but in this particular scenario many instances of that POJO are being worked with in nested loops that add to, remove from, get from and search in collection classes. Profiling the application will show that the equals implementation is being called many times and can be viewed as a hot spot. Adding second-level caching for the equals override will in this particular scenario increase performance.
// Second-level cache for the property used by equals and hashCode
private Object cachedProperty = null;
public boolean equals( Object obj )
{
    if ( cachedProperty == null )
    {
        // Read from the node only on first use
        cachedProperty = underlyingNode.getProperty( "some_property" );
    }
    return cachedProperty.equals( obj );
}
public int hashCode()
{
    if ( cachedProperty == null )
    {
        cachedProperty = underlyingNode.getProperty( "some_property" );
    }
    return cachedProperty.hashCode();
}
The problem now is that we need to invalidate the cached property whenever some_property is changed (this may not be a problem in this scenario, since the state picked for equals and hash code computation often won't change).
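One way to handle the invalidation is to route every change of the property through a setter on the POJO that also clears the cached value. The sketch below is a simplified assumption of how that could look: a plain map stands in for the underlying Neo4j node so the example is self-contained, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical POJO; a HashMap stands in for the underlying Neo4j node.
final class CachedPropertySketch
{
    private final Map<String, Object> underlyingNode = new HashMap<String, Object>();
    private Object cachedProperty = null;

    CachedPropertySketch( Object value )
    {
        underlyingNode.put( "some_property", value );
    }

    Object getSomeProperty()
    {
        if ( cachedProperty == null )
        {
            // Cache miss: read from the underlying store and remember the value
            cachedProperty = underlyingNode.get( "some_property" );
        }
        return cachedProperty;
    }

    void setSomeProperty( Object value )
    {
        underlyingNode.put( "some_property", value );
        // Invalidate so the next read goes back to the underlying store
        cachedProperty = null;
    }

    public static void main( String[] args )
    {
        CachedPropertySketch pojo = new CachedPropertySketch( "first" );
        System.out.println( pojo.getSomeProperty() ); // prints first
        pojo.setSomeProperty( "second" );
        System.out.println( pojo.getSomeProperty() ); // prints second
    }
}
```

This only covers changes made through the POJO itself. If the property is modified directly on the node somewhere else, the cached value becomes stale, which is the kind of invalidation problem that makes second-level caching hard.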
To sum up, avoid second-level caching if possible and only add it when you really need it.

