Training Bulk Loading File Based - tomgeudens/practical-neo4j GitHub Wiki

Context: Cut-and-paste commands for the Bulk Loading training (file based edition).

Prerequisites

This document will assume you have a Neo4j instance up-and-running. You should also have downloaded the following four files into a movies subfolder of the import folder of your installation (and if that doesn't make sense you're probably not doing the training, in which case ... b***** off).

What	Location
Movie Nodes	import/movies/basic/nodes/Movie.csv
Person Nodes	import/movies/basic/nodes/Person.csv
ACTED_IN Relationships	import/movies/basic/relationships/ACTED_IN.csv
DIRECTED Relationships	import/movies/basic/relationships/DIRECTED.csv

Create a new database

Take a moment to appreciate how much difference the 4.x release makes here. In the old days you'd have to stop your instance, move the current data out of the way (if you wanted to keep it) and restart the instance whereas now you just ...

:use system
CREATE DATABASE basic;
:use basic

Inspection

How many lines in the files?

LOAD CSV FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN count(*);

LOAD CSV FROM 'file:///movies/basic/nodes/Movie.csv'
AS row
RETURN count(*);

LOAD CSV FROM 'file:///movies/basic/relationships/DIRECTED.csv'
AS row
RETURN count(*);

LOAD CSV FROM 'file:///movies/basic/relationships/ACTED_IN.csv'
AS row
RETURN count(*);

What is in the files?

LOAD CSV FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN * LIMIT 5;

LOAD CSV FROM 'file:///movies/basic/nodes/Movie.csv'
AS row
RETURN * LIMIT 5;

LOAD CSV FROM 'file:///movies/basic/relationships/DIRECTED.csv'
AS row
RETURN * LIMIT 5;

LOAD CSV FROM 'file:///movies/basic/relationships/ACTED_IN.csv'
AS row
RETURN * LIMIT 5;

What is in the files (again)?

LOAD CSV WITH HEADERS FROM 'file:///movies/basic/nodes/Person.csv'
AS row
RETURN row, keys(row) LIMIT 5;

LOAD CSV WITH HEADERS FROM 'file:///movies/basic/nodes/Movie.csv'
AS row
RETURN row, keys(row) LIMIT 5;

LOAD CSV WITH HEADERS FROM 'file:///movies/basic/relationships/DIRECTED.csv'
AS row
RETURN row, keys(row) LIMIT 5;

LOAD CSV WITH HEADERS FROM 'file:///movies/basic/relationships/ACTED_IN.csv'
AS row
RETURN row, keys(row) LIMIT 5;

I don't like strings

LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Person.csv" AS row
RETURN row.name as name, toInteger(row.born) as born
ORDER BY born ASC
LIMIT 10;

Plug

Your new bedside companion ... Cypher Reference Card

Bulk data loading in action

Finally loading those persons

LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Person.csv" AS row
CREATE (p:Person {name: row.name, born: toInteger(row.born)})
RETURN p;

Cleaning things up

CALL apoc.periodic.commit(
 "MATCH ()-[r]->() WITH r LIMIT $limit DELETE r RETURN count(*)",
 {limit:20000}
);
CALL apoc.periodic.commit(
 "MATCH (n) WITH n LIMIT $limit DELETE n RETURN count(*)",
 {limit:20000}
);

Schema

Neo4j has schema ... you didn't see that coming, right?

CREATE CONSTRAINT uc_Movie_title ON (m:Movie) ASSERT m.title IS UNIQUE;
CREATE CONSTRAINT uc_Person_name ON (p:Person) ASSERT p.name IS UNIQUE;
CREATE INDEX ix_Movie_tagline FOR (m:Movie) ON (m.tagline);

CALL db.constraints();
CALL db.indexes();

Loading nodes

Person

CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Person.csv" AS row
  RETURN row
','
  CREATE (:Person {name: row.name, born: toInteger(row.born)});
',{batchSize:2000, parallel:true});

and verify

MATCH (p:Person) RETURN count(*);

Movie

CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///movies/basic/nodes/Movie.csv" AS row
  RETURN row
','
  CREATE (:Movie {title: row.title, released: toInteger(row.released), tagline: row.tagline});
',{batchSize:2000, parallel:true});

and verify

MATCH (m:Movie) RETURN count(*);

also verify this

MATCH (m:Movie) WHERE m.title = "Something's Gotta Give" RETURN m;

Relationships

DIRECTED

CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///movies/basic/relationships/DIRECTED.csv" AS row
  RETURN row
','
  MATCH  (p:Person {name: row.person })
  MATCH  (m:Movie  {title: row.movie})
  MERGE (p)-[:DIRECTED]->(m);
',{batchSize:2000, parallel:false});

And verify

MATCH ()-[:DIRECTED]->() RETURN count(*);

ACTED_IN

CALL apoc.periodic.iterate('
  LOAD CSV WITH HEADERS FROM "file:///movies/basic/relationships/ACTED_IN.csv" AS row
  RETURN row
','
  MATCH  (p:Person {name: row.person })
  MATCH  (m:Movie  {title: row.movie})
  MERGE (p)-[a:ACTED_IN]->(m)
  ON CREATE SET a.roles = split(row.roles,";");
',{batchSize:2000, parallel:false});

And verify

MATCH ()-[:ACTED_IN]->() RETURN count(*);

Tom Hanks mystery explained ...

MATCH (p:Person {name: "Tom Hanks"})-[a:ACTED_IN]->(m:Movie) RETURN p,a,m;