Batch Insert

From Neo4j Wiki

Neo4j has a batch insert mode that drops support for transactions and concurrency in favor of insertion speed. This is useful when you have a big dataset that needs to be loaded once. In our experience, the batch inserter will typically inject data around five times faster than running in normal transactional mode.

Getting Started

Creating a batch inserter is similar to creating a GraphDatabaseService. Once it has been created, you can create nodes, relationships and properties directly. You don't have to open transactions, but remember that multiple threads can't use the same batch inserter concurrently without external synchronization. To get started, include the "neo4j-kernel" component version "1.0-rc" in your pom, or check out the kernel trunk and build the jar manually.

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

...

// create the batch inserter
BatchInserter inserter = new BatchInserterImpl( "neo4j-db/", BatchInserterImpl.loadProperties( "neo4j.props" ) );

// inject some data
Map<String,Object> properties = new HashMap<String,Object>();

properties.put( "name", "Mr. Andersson" );
properties.put( "age", 29 );
long node1 = inserter.createNode( properties );

properties.put( "name", "Trinity" );
properties.remove( "age" );
long node2 = inserter.createNode( properties );

inserter.createRelationship( node1, node2, DynamicRelationshipType.withName( "KNOWS" ), null );

// shutdown, makes sure all changes are written to disk
inserter.shutdown();

The batch inserter can be created with a configuration that should be optimized for the work you are about to perform. The "neo4j.props" file is just a normal Java properties file (see further down for an explanation of how to configure the batch inserter).
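
Because the configuration is an ordinary Java properties file, it can be inspected with the standard java.util.Properties class before it is handed to the batch inserter. The following is a minimal sketch under that assumption; the class and method names (BatchConfigSketch, parse) are made up for illustration and are not part of Neo4j.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of Neo4j: reads properties-file text into a map.
final class BatchConfigSketch
{
    // Parses text in Java properties format into a plain String-to-String map.
    static Map<String,String> parse( String text )
    {
        Properties props = new Properties();
        try
        {
            props.load( new StringReader( text ) );
        }
        catch ( IOException e )
        {
            // Reading from an in-memory string is not expected to fail.
            throw new IllegalStateException( e );
        }
        Map<String,String> config = new HashMap<String,String>();
        for ( String key : props.stringPropertyNames() )
        {
            config.put( key, props.getProperty( key ) );
        }
        return config;
    }

    public static void main( String[] args )
    {
        String text = "# sample batch inserter settings\n"
            + "neostore.nodestore.db.mapped_memory=90M\n"
            + "neostore.relationshipstore.db.mapped_memory=3G\n";
        Map<String,String> config = parse( text );
        System.out.println( config.get( "neostore.relationshipstore.db.mapped_memory" ) );
    }
}
```

Lines starting with # are treated as comments by the properties format, so settings can be annotated in the file itself.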

For the highest injection speed, you should pass in all the properties for a node or relationship when it is created. If the node or relationship doesn't have any properties, or you need to set its properties at a later time, just pass in null. All changes go to memory (when available), so it is very important that you call BatchInserter.shutdown() when you're done, since that forces all changes to be written to disk. Failing to do so may result in some of the changes being lost, or the store may even be left in a corrupted state!
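
One way to make sure shutdown() is always reached is to wrap the load in try/finally. The sketch below illustrates only that pattern. So that it runs without the Neo4j jar, it uses a hypothetical stand-in interface (MiniInserter) that mirrors the createNode and shutdown calls shown above, plus an in-memory fake (RecordingInserter). Neither is Neo4j code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ShutdownSketch
{
    // Hypothetical stand-in for the two BatchInserter calls used in this example.
    interface MiniInserter
    {
        long createNode( Map<String,Object> properties );

        void shutdown();
    }

    // In-memory fake that only records what happened; not Neo4j code.
    static final class RecordingInserter implements MiniInserter
    {
        final List<Map<String,Object>> nodes = new ArrayList<Map<String,Object>>();
        boolean shutDown = false;

        public long createNode( Map<String,Object> properties )
        {
            // null means "no properties"; copy the map so later caller changes do not leak in.
            nodes.add( properties == null
                ? new HashMap<String,Object>()
                : new HashMap<String,Object>( properties ) );
            return nodes.size() - 1;
        }

        public void shutdown()
        {
            shutDown = true;
        }
    }

    // Creates one node per row and always calls shutdown(), even if the load throws.
    static void load( MiniInserter inserter, List<Map<String,Object>> rows )
    {
        try
        {
            for ( Map<String,Object> row : rows )
            {
                inserter.createNode( row );
            }
        }
        finally
        {
            inserter.shutdown();
        }
    }

    public static void main( String[] args )
    {
        Map<String,Object> properties = new HashMap<String,Object>();
        properties.put( "name", "Trinity" );
        List<Map<String,Object>> rows = new ArrayList<Map<String,Object>>();
        rows.add( properties );

        RecordingInserter inserter = new RecordingInserter();
        load( inserter, rows );
        System.out.println( "nodes=" + inserter.nodes.size() + " shutDown=" + inserter.shutDown );
    }
}
```

With the real batch inserter, the same shape would apply: create the inserter, do all the work inside the try block, and call inserter.shutdown() in the finally block.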

Using batch inserter together with indexing

Often you have some property that needs to be indexed (typically a URI-like property), and for that we have created an index batch inserter using Lucene. This index service works like the normal transactional LuceneIndexService, but again without transactions or thread safety. To access the Lucene index batch inserter, include "neo4j-index" version "0.9" in your pom, or check out the index trunk and build the jar manually.

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
import org.neo4j.index.lucene.LuceneIndexBatchInserter;
import org.neo4j.index.lucene.LuceneIndexBatchInserterImpl;

...

BatchInserter inserter = new BatchInserterImpl( "neo4j-db/", BatchInserterImpl.loadProperties( "neo4j.props" ) );
// create the batch index service
LuceneIndexBatchInserter indexService = new LuceneIndexBatchInserterImpl( inserter );

// ... create nodes and index them
while ( haveNodesToCreate ) 
{
    properties.put( "uri", nextUri );
    long node = inserter.createNode( properties );
    indexService.index( node, "uri", nextUri );
}

// optimize the index
indexService.optimize();


// create relationships, make use of the index to find the right nodes
while ( haveRelationshipsToCreate )
{
    long node1 = indexService.getNodes( "uri", uri1 ).iterator().next();
    long node2 = indexService.getNodes( "uri", uri2 ).iterator().next();
    inserter.createRelationship( node1, node2, DynamicRelationshipType.withName( "KNOWS" ), null );
}

indexService.shutdown();
inserter.shutdown();

For best performance when using the index batch inserter, index everything that needs to be indexed first, then call indexService.optimize(). After that, the index can be used to find nodes. Depending on what your data looks like, a small LRU cache in front of the index may speed things up even more.
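
As an illustration of the LRU cache idea, here is a minimal sketch built on the standard java.util.LinkedHashMap in access-order mode. The class name UriNodeCache and the choice of String keys mapped to node ids are assumptions made for this example; the cache is not part of Neo4j.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Neo4j: a small LRU cache from an indexed value to a node id.
final class UriNodeCache extends LinkedHashMap<String,Long>
{
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    UriNodeCache( int maxEntries )
    {
        // The third argument switches the map to access order, so reads refresh an entry.
        super( 16, 0.75f, true );
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry( Map.Entry<String,Long> eldest )
    {
        // Evict the least recently used entry once the limit is exceeded.
        return size() > maxEntries;
    }

    public static void main( String[] args )
    {
        UriNodeCache cache = new UriNodeCache( 2 );
        cache.put( "http://example.org/a", 1L );
        cache.put( "http://example.org/b", 2L );
        cache.get( "http://example.org/a" );      // touch "a", so "b" is now the eldest
        cache.put( "http://example.org/c", 3L );  // exceeds the limit and evicts "b"
        System.out.println( cache.keySet() );
    }
}
```

In the relationship loop, the cache would be consulted first, and indexService.getNodes( "uri", uri ) would be called only on a miss, with the result stored in the cache afterwards. Like the batch inserter itself, this cache is not thread safe.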

How to configure the batch inserter properly

When creating the batch inserter, you should supply a Java properties file with a configuration that suits the data set you are about to inject. Let's say we have a random-like graph with 10M nodes and 100M relationships. Each node (and maybe some relationships) has different properties of string and Java primitive types (but no arrays). The important thing with a random graph is to give lots of memory to the relationship and node stores:

neostore.nodestore.db.mapped_memory=90M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=50M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

The configuration above will fit the entire graph (with the exception of properties) in memory. A rough formula for the memory needed by the node store is nr_of_nodes * 9 bytes, and for the relationship store nr_of_relationships * 33 bytes.
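
The rough formula can be applied to the example graph of 10M nodes and 100M relationships as follows. This is a worked sketch only; the class and method names are made up for illustration, and the per-record sizes are the rough figures quoted above.

```java
// Hypothetical helper, not part of Neo4j: applies the rough store size formula.
final class StoreMemoryEstimate
{
    static final long BYTES_PER_NODE = 9L;
    static final long BYTES_PER_RELATIONSHIP = 33L;

    // Rough number of bytes needed to hold the node store in memory.
    static long nodeStoreBytes( long nrOfNodes )
    {
        return nrOfNodes * BYTES_PER_NODE;
    }

    // Rough number of bytes needed to hold the relationship store in memory.
    static long relationshipStoreBytes( long nrOfRelationships )
    {
        return nrOfRelationships * BYTES_PER_RELATIONSHIP;
    }

    public static void main( String[] args )
    {
        System.out.println( nodeStoreBytes( 10000000L ) );          // prints 90000000
        System.out.println( relationshipStoreBytes( 100000000L ) ); // prints 3300000000
    }
}
```

That is about 90 MB for the node store and about 3.3 GB for the relationship store, which is the order of magnitude of the mapped_memory settings shown above.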

Properties will typically be injected only once and never read, so a few megabytes for the property store and string store are usually enough. If you have very large strings or arrays, you may want to increase the amount of memory assigned to the string and array store files.

An important thing to remember is that the above configuration needs a Java heap of 3.3G or more, since in batch inserter mode normal Java buffers allocated on the heap are used instead of memory-mapped ones.
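
To see where a heap figure of this size comes from, the mapped_memory settings can simply be added up. The sketch below does that for the sample configuration. It assumes the K, M and G suffixes are 1024-based multiples, and the class and method names are made up for illustration.

```java
// Hypothetical helper, not part of Neo4j: sums mapped_memory settings.
final class HeapEstimate
{
    // Converts a value such as "90M" or "3G" to bytes (assumes 1024-based suffixes).
    static long parseBytes( String value )
    {
        String v = value.trim().toUpperCase();
        long multiplier = 1L;
        if ( v.endsWith( "G" ) )
        {
            multiplier = 1024L * 1024L * 1024L;
        }
        else if ( v.endsWith( "M" ) )
        {
            multiplier = 1024L * 1024L;
        }
        else if ( v.endsWith( "K" ) )
        {
            multiplier = 1024L;
        }
        // Strip the suffix character only when one was recognized.
        String digits = multiplier == 1L ? v : v.substring( 0, v.length() - 1 );
        return Long.parseLong( digits ) * multiplier;
    }

    // Adds up all the given settings, in bytes.
    static long totalBytes( String... values )
    {
        long total = 0L;
        for ( String value : values )
        {
            total += parseBytes( value );
        }
        return total;
    }

    public static void main( String[] args )
    {
        long total = totalBytes( "90M", "3G", "50M", "100M", "0M" );
        System.out.println( total / ( 1024L * 1024L ) + " MiB" ); // prints 3312 MiB
    }
}
```

The total covers only the store buffers, so the heap limit (set with the standard JVM option -Xmx) would need to be above it to leave room for everything else the JVM allocates.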
