Training Bulk Loading File Based - tomgeudens/practical-neo4j GitHub Wiki
Context: Cut-and-paste commands for the Bulk Loading training (file based edition).
Prerequisites
This document assumes you have a Neo4j instance up and running. You should also have downloaded the following four files into a movies subfolder of the import folder of your installation (and if that doesn't make sense, you're probably not doing the training, in which case ... b***** off).
What | Location |
---|---|
Movie Nodes | import/movies/basic/nodes/Movie.csv |
Person Nodes | import/movies/basic/nodes/Person.csv |
ACTED_IN Relationships | import/movies/basic/relationships/ACTED_IN.csv |
DIRECTED Relationships | import/movies/basic/relationships/DIRECTED.csv |
Create a new database
Take a moment to appreciate how much of a difference the 4.x release makes here. In the old days you'd have to stop your instance, move the current data out of the way (if you wanted to keep it) and restart the instance, whereas now you just ...
:use system
CREATE DATABASE basic;
:use basic
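If you want to confirm the database really exists and is online before moving on, something along these lines should do. It is a minimal sketch, assuming a 4.x instance and a client that understands :use (Neo4j Browser or cypher-shell).

```cypher
// Run against the system database; returns one row for the new database,
// including its current status.
:use system
SHOW DATABASE basic;
// Switch back before continuing with the rest of the commands.
:use basic
```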
Inspection
How many lines in the files?
LOAD CSV FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN count(*);
LOAD CSV FROM 'file:///movies/basic/nodes/Movie.csv'
AS row
RETURN count(*);
LOAD CSV FROM 'file:///movies/basic/relationships/DIRECTED.csv'
AS row
RETURN count(*);
LOAD CSV FROM 'file:///movies/basic/relationships/ACTED_IN.csv'
AS row
RETURN count(*);
What is in the files?
LOAD CSV FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN * LIMIT 5;
LOAD CSV FROM 'file:///movies/basic/nodes/Movie.csv'
AS row
RETURN * LIMIT 5;
LOAD CSV FROM 'file:///movies/basic/relationships/DIRECTED.csv'
AS row
RETURN * LIMIT 5;
LOAD CSV FROM 'file:///movies/basic/relationships/ACTED_IN.csv'
AS row
RETURN * LIMIT 5;
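Without WITH HEADERS every row comes back as a plain list of strings, and the header line is just another row. A small sketch of what that means in practice, using the same Person file; the aliases firstField and fieldCount are only illustrative names.

```cypher
// Fields are addressed by position because there are no column names yet.
LOAD CSV FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN row[0] AS firstField, size(row) AS fieldCount
LIMIT 5;
```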
What is in the files (again)?
LOAD CSV WITH HEADERS FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN row, keys(row) LIMIT 5;
LOAD CSV WITH HEADERS FROM 'file:///movies/basic/nodes/Movie.csv'
AS row
RETURN row, keys(row) LIMIT 5;
LOAD CSV WITH HEADERS FROM 'file:///movies/basic/relationships/DIRECTED.csv'
AS row
RETURN row, keys(row) LIMIT 5;
LOAD CSV WITH HEADERS FROM 'file:///movies/basic/relationships/ACTED_IN.csv'
AS row
RETURN row, keys(row) LIMIT 5;
I don't like strings
LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Person.csv" AS row
RETURN row.name AS name, toInteger(row.born) AS born
ORDER BY born ASC
LIMIT 10;
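LOAD CSV hands you every field as a string, which is why the toInteger call is there. A quick standalone check of how the function behaves on its own; the literal values are made up for illustration and do not come from the files.

```cypher
// The first value converts to the integer 1964.
// A null input or an unparseable string yields null instead of an error.
RETURN toInteger("1964") AS fromString,
       toInteger(null)   AS fromNull,
       toInteger("n/a")  AS fromGarbage;
```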
Plug
Your new bedside companion ... Cypher Reference Card
Bulk data loading in action
Finally loading those persons
LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Person.csv" AS row
CREATE (p:Person {name: row.name, born: toInteger(row.born)})
RETURN p;
Cleaning things up
CALL apoc.periodic.commit(
"MATCH ()-[r]->() WITH r LIMIT $limit DELETE r RETURN count(*)",
{limit:20000}
);
CALL apoc.periodic.commit(
"MATCH (n) WITH n LIMIT $limit DELETE n RETURN count(*)",
{limit:20000}
);
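For a data set as small as the one in this training, a single plain Cypher statement is probably enough; the batched APOC calls above earn their keep once a delete no longer fits comfortably in one transaction. A sketch of the simple variant:

```cypher
// Deletes every node together with its relationships in one transaction.
MATCH (n) DETACH DELETE n;
```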
Schema
Neo4j has schema ... you didn't see that coming, right?
CREATE CONSTRAINT uc_Movie_title ON (m:Movie) ASSERT m.title IS UNIQUE;
CREATE CONSTRAINT uc_Person_name ON (p:Person) ASSERT p.name IS UNIQUE;
CREATE INDEX ix_Movie_tagline FOR (m:Movie) ON (m.tagline);
CALL db.constraints();
CALL db.indexes();
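The commands above use the 4.x syntax this training was written for. If your instance turns out to be a later release, the following equivalents may be what you need; treat this as a sketch and check it against the version you are actually running.

```cypher
// Listing commands available from Neo4j 4.2 onwards.
SHOW CONSTRAINTS;
SHOW INDEXES;

// From 4.4 onwards the constraint syntax is FOR ... REQUIRE, for example:
// CREATE CONSTRAINT uc_Movie_title FOR (m:Movie) REQUIRE m.title IS UNIQUE;
```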
Loading nodes
Person
CALL apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Person.csv" AS row
RETURN row
','
CREATE (:Person {name: row.name, born: toInteger(row.born)});
',{batchSize:2000, parallel:true});
And verify
MATCH (p:Person) RETURN count(*);
Movie
CALL apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Movie.csv" AS row
RETURN row
','
CREATE (:Movie {title: row.title, released: toInteger(row.released), tagline: row.tagline});
',{batchSize:2000, parallel:true});
And verify
MATCH (m:Movie) RETURN count(*);
Also verify this
MATCH (m:Movie) WHERE m.title = "Something's Gotta Give" RETURN m;
Relationships
DIRECTED
CALL apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM "file:///movies/basic/relationships/DIRECTED.csv" AS row
RETURN row
','
MATCH (p:Person {name: row.person})
MATCH (m:Movie {title: row.movie})
MERGE (p)-[:DIRECTED]->(m);
',{batchSize:2000, parallel:false});
And verify
MATCH ()-[:DIRECTED]->() RETURN count(*);
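apoc.periodic.iterate also returns a summary row, which is handy for spotting batches that failed quietly. Below is a variant of the DIRECTED load that yields a few of those columns. Because the inner statement uses MERGE, running it a second time should not create duplicate relationships.

```cypher
CALL apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM "file:///movies/basic/relationships/DIRECTED.csv" AS row
RETURN row
','
MATCH (p:Person {name: row.person})
MATCH (m:Movie {title: row.movie})
MERGE (p)-[:DIRECTED]->(m)
',{batchSize:2000, parallel:false})
// failedBatches should be 0 and errorMessages empty on a clean run.
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages;
```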
ACTED_IN
CALL apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM "file:///movies/basic/relationships/ACTED_IN.csv" AS row
RETURN row
','
MATCH (p:Person {name: row.person})
MATCH (m:Movie {title: row.movie})
MERGE (p)-[a:ACTED_IN]->(m)
ON CREATE SET a.roles = split(row.roles,";");
',{batchSize:2000, parallel:false});
And verify
MATCH ()-[:ACTED_IN]->() RETURN count(*);
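The split call in the load turns the semicolon-separated roles string into a list. A short sketch to see that for yourself, first on a literal and then on the loaded relationships; the role names in the literal are placeholders, not values from the CSV.

```cypher
// Literal example with placeholder role names.
RETURN split("Role A;Role B", ";") AS roles;

// Against the loaded data: each roles property should be a list.
MATCH ()-[a:ACTED_IN]->()
RETURN a.roles AS roles, size(a.roles) AS roleCount
LIMIT 5;
```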
Tom Hanks mystery explained ...
MATCH (p:Person {name: "Tom Hanks"})-[a:ACTED_IN]->(m:Movie) RETURN p,a,m;