Batch Insert
From Neo4j Wiki
Note: failing to invoke the shutdown method when running the batch inserter may corrupt the store. The batch inserter is great for the initial import of data but should not be used in normal operation on an already existing store.
Batch Insert
Neo4j has a batch insert mode that drops support for transactions and concurrency in favor of insertion speed. This is useful when you have a big dataset that needs to be loaded once. In our experience, the batch inserter will typically inject data around five times faster than running in normal transactional mode.
Be aware that the BatchInserter:
- is intended for the initial import of data
- is not thread safe
- is not transactional
- leaves corrupt database files behind if shutdown is not invoked successfully
Getting Started
Creating a batch inserter is similar to how you create a GraphDatabaseService. After it has been created you can directly create nodes, relationships and properties. You don't have to open transactions, but remember that multiple threads can't use the same batch inserter concurrently without external synchronization. To get started, include the "neo4j-kernel" component version "1.1" in your pom, or check out the kernel trunk and build the jar manually.
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
...
// create the batch inserter
BatchInserter inserter = new BatchInserterImpl( "neo4j-db/",
    BatchInserterImpl.loadProperties( "neo4j.props" ) );
// inject some data
Map<String,Object> properties = new HashMap<String,Object>();
properties.put( "name", "Mr. Andersson" );
properties.put( "age", 29 );
long node1 = inserter.createNode( properties );
// reuse the map for the second node: new name, no age
properties.put( "name", "Trinity" );
properties.remove( "age" );
long node2 = inserter.createNode( properties );
// null means the relationship has no properties
inserter.createRelationship( node1, node2,
    DynamicRelationshipType.withName( "KNOWS" ), null );
// shutdown, makes sure all changes are written to disk
inserter.shutdown();
The batch inserter can be created with a configuration that is optimized for the work you are about to perform. The "neo4j.props" file is just a normal Java properties file (see the section on configuring the batch inserter further down).
For highest injection speed you should pass in all the properties for a node or relationship when it is created. If the node or relationship doesn't have any properties or you need to set its properties at a later time just pass in null. All changes go to memory (when available) so it is very important that you call BatchInserter.shutdown() when you're done since that will force all changes to be written to disk. Failing to do so may result in some of the changes getting lost or the store may even be left in a corrupted state!
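Because a missed shutdown can lose or corrupt data, it may help to put the call in a finally block so that it is attempted even when the import fails halfway. The following is only a minimal sketch of that pattern: Inserter is a hypothetical stand-in interface that mirrors the two BatchInserter methods used here, so that the example compiles without Neo4j on the classpath, and ShutdownSketch and importNames are illustrative names, not part of the library.

```java
import java.util.HashMap;
import java.util.Map;

final class ShutdownSketch
{
    /** Hypothetical stand-in for the two BatchInserter methods used below. */
    interface Inserter
    {
        long createNode( Map<String,Object> properties );

        void shutdown();
    }

    /** Creates one node per name and always attempts shutdown afterwards. */
    static int importNames( Inserter inserter, Iterable<String> names )
    {
        int created = 0;
        try
        {
            for ( String name : names )
            {
                // pass in all properties when the node is created
                Map<String,Object> properties = new HashMap<String,Object>();
                properties.put( "name", name );
                inserter.createNode( properties );
                created++;
            }
        }
        finally
        {
            // runs on success and on failure alike
            inserter.shutdown();
        }
        return created;
    }
}
```

This only ensures that shutdown is attempted. It does not make the import transactional, so nodes created before a failure are still in the store.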
Using batch inserter together with indexing
Often you have some property that needs to be indexed (typically a URI-like property), and for that we have created an index batch inserter using Lucene. This index service works like the normal transactional LuceneIndexService (but, again, with no transactions or thread safety). To access the Lucene index batch inserter, include "neo4j-index" version "1.1" in your pom, or check out the index trunk and build the jar manually.
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
import org.neo4j.index.lucene.LuceneIndexBatchInserter;
import org.neo4j.index.lucene.LuceneIndexBatchInserterImpl;
...
BatchInserter inserter = new BatchInserterImpl( "neo4j-db/",
    BatchInserterImpl.loadProperties( "neo4j.props" ) );
// create the batch index service
LuceneIndexBatchInserter indexService = new LuceneIndexBatchInserterImpl( inserter );
Map<String,Object> properties = new HashMap<String,Object>();
// ... create nodes and index them
// (haveNodesToCreate, nextUri, uri1 and uri2 stand for your own input data)
while ( haveNodesToCreate )
{
    properties.put( "uri", nextUri );
    long node = inserter.createNode( properties );
    indexService.index( node, "uri", nextUri );
}
// optimize the index
indexService.optimize();
// create relationships, make use of the index to find the right nodes
while ( haveRelationshipsToCreate )
{
    long node1 = indexService.getNodes( "uri", uri1 ).iterator().next();
    long node2 = indexService.getNodes( "uri", uri2 ).iterator().next();
    inserter.createRelationship( node1, node2,
        DynamicRelationshipType.withName( "KNOWS" ), null );
}
// shut down the index service first, then the inserter
indexService.shutdown();
inserter.shutdown();
For best performance when using the index batch inserter, everything that needs to be indexed should be indexed first followed by a call to indexService.optimize(). After that the index can be used to find nodes. Depending on how your data looks a small LRU cache for the index may speed things up even more.
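The small LRU cache mentioned above could, for instance, be built on java.util.LinkedHashMap in access order. This is a sketch under assumptions rather than anything shipped with the index component: the class name LruCache and the capacity you choose are illustrative, and whether a cache helps at all depends on how often the same keys are looked up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache: evicts the least recently accessed entry when full. */
final class LruCache<K,V> extends LinkedHashMap<K,V>
{
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    LruCache( int maxEntries )
    {
        // third argument true = iterate in access order instead of insertion order
        super( 16, 0.75f, true );
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry( Map.Entry<K,V> eldest )
    {
        // called by put(); returning true drops the least recently used entry
        return size() > maxEntries;
    }
}
```

In the relationship loop you would then consult an LruCache<String,Long> keyed by URI first and only fall back to indexService.getNodes( "uri", uri ) on a miss, storing the node id it returns in the cache.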
If you want to batch insert using a fulltext index, simply use code like this to create the corresponding index service:
LuceneFulltextIndexBatchInserter fulltextIndexService = new LuceneFulltextIndexBatchInserter( inserter );
How to configure the batch inserter properly
When creating the batch inserter you should supply a Java properties file with a configuration that suits the data set you are about to inject. Let's say we have a random-like graph with 10M nodes and 100M relationships. Each node (and maybe some relationships) has properties of string and Java primitive types (but no arrays). The important thing with a random graph is to give lots of memory to the relationship and node stores:
neostore.nodestore.db.mapped_memory=90M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=50M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
The configuration above will fit the entire graph (with the exception of properties) in memory. A rough formula for the memory needed by the node store is nr_of_nodes * 9 bytes, and for the relationship store nr_of_relationships * 33 bytes.
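The rough formula can be worked through for the example graph as follows. The class and method names are only illustrative; the 9 and 33 byte figures are the ones given above.

```java
final class StoreMemoryEstimate
{
    // per-record sizes taken from the rough formula in the text
    static final long BYTES_PER_NODE = 9L;
    static final long BYTES_PER_RELATIONSHIP = 33L;

    static long nodeStoreBytes( long nrOfNodes )
    {
        return nrOfNodes * BYTES_PER_NODE;
    }

    static long relationshipStoreBytes( long nrOfRelationships )
    {
        return nrOfRelationships * BYTES_PER_RELATIONSHIP;
    }

    public static void main( String[] args )
    {
        // the example graph: 10M nodes and 100M relationships
        System.out.println( "node store bytes: " + nodeStoreBytes( 10000000L ) );
        System.out.println( "relationship store bytes: " + relationshipStoreBytes( 100000000L ) );
    }
}
```

For 10M nodes this gives 90,000,000 bytes, matching the 90M assigned to the node store. For 100M relationships it gives 3,300,000,000 bytes, which is slightly more than the 3G assigned to the relationship store in the example.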
Properties will typically only be injected once and never read so a few megabytes for the property store and string store is usually enough. If you have very large strings or arrays you may want to increase the amount of memory assigned to the string and array store files.
An important thing to remember is that the above configuration needs a Java heap of 3.3G+, since in batch inserter mode normal Java buffers allocated on the heap are used instead of memory-mapped ones.
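To get a feel for the heap a given configuration asks for, you can add up its mapped_memory settings. The sketch below does this for the example configuration. It assumes that the K, M and G suffixes are binary multiples (1024-based), which is an assumption of this example and not something stated here; HeapEstimate and its methods are illustrative names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeapEstimate
{
    /** Parses values such as "90M" or "3G"; assumes 1024-based units. */
    static long parseBytes( String value )
    {
        String v = value.trim().toUpperCase();
        char unit = v.charAt( v.length() - 1 );
        long multiplier = 1L;
        if ( unit == 'K' ) multiplier = 1024L;
        else if ( unit == 'M' ) multiplier = 1024L * 1024L;
        else if ( unit == 'G' ) multiplier = 1024L * 1024L * 1024L;
        // strip the suffix only when one was recognized
        String digits = multiplier == 1L ? v : v.substring( 0, v.length() - 1 );
        return Long.parseLong( digits ) * multiplier;
    }

    /** Sums every setting whose key ends in ".mapped_memory". */
    static long totalMappedBytes( Map<String,String> config )
    {
        long total = 0L;
        for ( Map.Entry<String,String> entry : config.entrySet() )
        {
            if ( entry.getKey().endsWith( ".mapped_memory" ) )
            {
                total += parseBytes( entry.getValue() );
            }
        }
        return total;
    }

    /** The example configuration from the text. */
    static Map<String,String> exampleConfig()
    {
        Map<String,String> config = new LinkedHashMap<String,String>();
        config.put( "neostore.nodestore.db.mapped_memory", "90M" );
        config.put( "neostore.relationshipstore.db.mapped_memory", "3G" );
        config.put( "neostore.propertystore.db.mapped_memory", "50M" );
        config.put( "neostore.propertystore.db.strings.mapped_memory", "100M" );
        config.put( "neostore.propertystore.db.arrays.mapped_memory", "0M" );
        return config;
    }

    public static void main( String[] args )
    {
        long totalMb = totalMappedBytes( exampleConfig() ) / ( 1024L * 1024L );
        System.out.println( "mapped memory total (MB): " + totalMb );
    }
}
```

Under that assumption the settings add up to 3312M, in the neighborhood of the 3.3G+ figure. The maximum heap (set with the JVM's -Xmx option) has to be at least that large, with some room on top for everything else the import allocates.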

