Graph Data Modeling: Fundamentals - srijan-singh/neo4j-lineage GitHub Wiki
Components of a Neo4j Graph
The Neo4j components that are used to define the graph data model are:
- Nodes
- Labels
- Relationships
- Properties
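As a minimal sketch, all four components can appear in a single Cypher statement. The labels, relationship type, and property names below are illustrative assumptions, not a prescribed schema.

```cypher
// Two nodes, each with a label (Person, Movie) and a property.
CREATE (p:Person {name: 'Tom Hanks'})
CREATE (m:Movie {title: 'Apollo 13'})
// A relationship with a type (ACTED_IN) and a property (role).
CREATE (p)-[:ACTED_IN {role: 'Jim Lovell'}]->(m)
RETURN p, m;
```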
Domain Understanding
- Identify the stakeholders and developers of the application.
- With the stakeholders and developers:
  - Describe the application in detail.
  - Identify the users of the application (people, systems).
  - Agree upon the use cases for the application.
  - Rank the importance of the use cases.
Data Modeling Process:
- Understand the domain and define specific use cases (questions) for the application.
- Develop the initial graph data model:
  - Model the nodes (entities).
  - Model the relationships between nodes.
- Test the use cases against the initial data model.
- Create the graph (instance model) with test data using Cypher.
- Test the use cases, including performance against the graph.
- Refactor (improve) the graph data model due to a change in the key use cases or for performance reasons.
- Implement the refactoring on the graph and retest using Cypher.
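The "create the graph with test data" and "test the use cases" steps can be sketched as follows. The Recipe and Ingredient labels, the USES relationship type, and the sample values are assumptions chosen for illustration.

```cypher
// Create a tiny instance model with test data.
MERGE (r:Recipe {name: 'Pancakes'})
MERGE (flour:Ingredient {name: 'Flour'})
MERGE (egg:Ingredient {name: 'Egg'})
MERGE (r)-[:USES]->(flour)
MERGE (r)-[:USES]->(egg);

// Test the use case "What ingredients are used in a recipe?"
MATCH (:Recipe {name: 'Pancakes'})-[:USES]->(i:Ingredient)
RETURN i.name AS ingredient;
```

Prefixing the second statement with PROFILE is one way to inspect how much work the query does when testing performance against the graph.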
Modeling Nodes
Defining labels
Entities are the dominant nouns in your application use cases:
- What ingredients are used in a recipe? (entities: ingredient, recipe)
- Who is married to this person? (entity: person)
Node properties
Node properties are used to:
- Uniquely identify a node.
- Answer specific details of the use cases for the application.
- Return data.
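The three roles of node properties might look like this in practice. This is a sketch that assumes Neo4j 4.4 or later for the constraint syntax; the constraint name and the tmdbId, name, and born properties are hypothetical.

```cypher
// Uniquely identify a node: enforce one Person per tmdbId.
CREATE CONSTRAINT person_tmdb_id IF NOT EXISTS
FOR (p:Person) REQUIRE p.tmdbId IS UNIQUE;

// Answer a specific detail of a use case (filter on name)
// and return data (name and born).
MATCH (p:Person {name: 'Tom Hanks'})
RETURN p.name AS name, p.born AS born;
```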
Modeling Relationships
- When you create a relationship in Neo4j, a direction must either be specified explicitly or inferred by the left-to-right direction in the pattern specified.
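A short sketch of relationship direction, using assumed Person nodes and a MARRIED_TO relationship type. The direction is specified explicitly when the relationship is created, and it can be ignored when querying.

```cypher
MERGE (a:Person {name: 'Alice'})
MERGE (b:Person {name: 'Bob'})
// Direction specified explicitly: from a to b.
MERGE (a)-[:MARRIED_TO]->(b);

// Omitting the arrow in MATCH finds the pair regardless of the stored direction.
MATCH (p:Person {name: 'Bob'})-[:MARRIED_TO]-(spouse:Person)
RETURN spouse.name AS spouse;
```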
Relationships are connections between entities
Connections are the verbs in your use cases:
- What ingredients are used in a recipe? (connection: used in)
- Who is married to this person? (connection: married to)
Properties for relationships
Properties for a relationship are used to enrich how two nodes are related. When you define a property for a relationship, it is because your use cases ask a specific question about how two nodes are related, not just that they are related.
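For example, a use case such as "How much of each ingredient does a recipe use?" asks how two nodes are related, so the answer belongs on the relationship. The sketch below assumes a USES relationship with hypothetical quantity and unit properties.

```cypher
MERGE (r:Recipe {name: 'Pancakes'})
MERGE (i:Ingredient {name: 'Flour'})
MERGE (r)-[u:USES]->(i)
// quantity and unit describe this particular recipe-ingredient pairing.
SET u.quantity = 200, u.unit = 'g';

MATCH (:Recipe {name: 'Pancakes'})-[u:USES]->(i:Ingredient)
RETURN i.name AS ingredient, u.quantity AS quantity, u.unit AS unit;
```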
Fanout
- Creating specialization within nodes, that is, moving a property out of a node into a separate node connected by a relationship.
For example, splitting last names onto separate nodes helps answer the question, “Who has the last name Scott?”. Similarly, having cities as separate nodes assists with the question, “Who lives in the same city as Patrick Scott?”.
The main risk of fanout is that it can lead to very dense nodes, or supernodes. These are nodes that have hundreds of thousands of incoming or outgoing relationships. Supernodes need to be handled carefully.
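A hedged sketch of fanout for the city example, assuming Person nodes that currently carry a city property and a hypothetical LIVES_IN relationship type. The last query is one simple way to spot nodes that are growing dense.

```cypher
// Fan the city property out into its own City node.
MATCH (p:Person)
WHERE p.city IS NOT NULL
MERGE (c:City {name: p.city})
MERGE (p)-[:LIVES_IN]->(c);

// Who lives in the same city as Patrick Scott?
MATCH (:Person {name: 'Patrick Scott'})-[:LIVES_IN]->(c:City)<-[:LIVES_IN]-(neighbor:Person)
RETURN neighbor.name AS neighbor;

// Potential supernodes: the cities with the most incoming relationships.
MATCH (c:City)<-[:LIVES_IN]-(p:Person)
RETURN c.name AS city, count(p) AS residents
ORDER BY residents DESC
LIMIT 10;
```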
Refactoring
Refactoring is the process of changing the data model and the graph.
There are three reasons why you would refactor:
- The graph as modeled does not answer all of the use cases.
- A new use case has come up that you must account for in your data model.
- The Cypher for the use cases does not perform optimally, especially when the graph scales.
Steps for refactoring
To refactor a graph data model and a graph, you must:
- Design the new data model.
- Write Cypher code to transform the existing graph to implement the new data model.
- Retest all use cases, possibly with updated Cypher code.
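// Step 1: create a Language node for each distinct value in m.languages
// and connect every movie that lists that language to it.
MATCH (m:Movie)
UNWIND m.languages AS language
WITH language, collect(m) AS movies
MERGE (l:Language {name: language})
WITH l, movies
UNWIND movies AS m
MERGE (m)-[:IN_LANGUAGE]->(l);
// Step 2: remove the now-redundant languages property from the Movie nodes.
MATCH (m:Movie)
SET m.languages = null;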
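The final step, retesting, could look like the sketch below. It assumes Movie nodes have a title property and that 'French' is one of the language values in the graph.

```cypher
// Retest the use case against the refactored model.
MATCH (m:Movie)-[:IN_LANGUAGE]->(:Language {name: 'French'})
RETURN m.title AS title;

// Sanity check: no Movie node should still carry the old property.
MATCH (m:Movie)
WHERE m.languages IS NOT NULL
RETURN count(m) AS moviesWithOldProperty;
```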
Special Relationships
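As a native graph database, Neo4j is built to traverse relationships quickly. In some cases, it is more performant to query the graph by relationship type than by node properties. For example, a year-specific relationship type such as ACTED_IN_1995 lets a query about 1995 follow only the matching relationships instead of reading the released property of every movie.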
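// For every ACTED_IN relationship, merge a year-specific relationship such as ACTED_IN_1995.
// Requires the APOC library; assumes m.released is a string that starts with the year.
MATCH (n:Actor)-[:ACTED_IN]->(m:Movie)
CALL apoc.merge.relationship(
  n,                                  // start node
  'ACTED_IN_' + left(m.released, 4),  // relationship type built from the release year
  {},                                 // identifying properties
  {},                                 // properties set on create
  m,                                  // end node
  {}                                  // properties set on match
) YIELD rel
RETURN count(*) AS `Number of relationships merged`;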
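Once the specialized relationships exist, a year-specific use case can be answered by relationship type alone. This sketch assumes the graph contains movies released in 1995 (so that ACTED_IN_1995 was created) and that Actor and Movie nodes have name and title properties.

```cypher
// Movies an actor appeared in during 1995, with no filter on m.released.
MATCH (a:Actor {name: 'Tom Hanks'})-[:ACTED_IN_1995]->(m:Movie)
RETURN m.title AS title;
```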
Intermediate Nodes
You sometimes find cases where you need to connect more data to a relationship than can be fully captured in its properties. In other words, you want a relationship that connects more than two nodes. Mathematics allows this with the concept of a hyperedge. Neo4j does not support hyperedges, because a relationship always connects exactly two nodes, but you can get the same effect by creating intermediate nodes.
You create intermediate nodes when you need to:
- Connect more than two nodes in a single context.
  - Hyperedges (n-ary relationships)
- Relate something to a relationship.
- Share data in the graph between entities.
// Find every actor that acted in a movie.
MATCH (a:Actor)-[r:ACTED_IN]->(m:Movie)
// Create a Role node from the role property of the ACTED_IN relationship.
MERGE (x:Role {name: r.role})
// Create the PLAYED relationship
// between the Actor and the Role nodes.
MERGE (a)-[:PLAYED]->(x)
// Create the IN_MOVIE relationship between
// the Role and the Movie nodes.
MERGE (x)-[:IN_MOVIE]->(m)
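After this refactoring, a use case can traverse the intermediate node. The query below is a sketch that assumes Actor nodes have a name property and Movie nodes have a title property.

```cypher
// Which roles did an actor play, and in which movies?
MATCH (a:Actor {name: 'Tom Hanks'})-[:PLAYED]->(r:Role)-[:IN_MOVIE]->(m:Movie)
RETURN r.name AS role, m.title AS movie;
```

Because the Role node is merged on name alone, two actors who played identically named roles in different movies share one Role node. That supports sharing data between entities, but the query above can then also return movies reached through the other actor's IN_MOVIE relationship. If that matters for your use cases, one option is to add a movie-specific property to the Role identity.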