Graph Database - kdwivedi1985/system-design GitHub Wiki

Neo4j

  • Neo4j is a graph database management system designed to store, manage, and query data in the form of a graph (nodes, relationships, properties).
  • Nodes represent entities (Person, Product, etc.).
  • Relationships connect nodes and have a direction and type (e.g., FRIENDS_WITH, PURCHASED).
  • Properties are key-value pairs attached to nodes and relationships.
  • Data is stored natively in a property graph format, optimized for traversal operations.
  • It is ACID compliant.
  • It is open source under the GPLv3 license (Community Edition) and also has an Enterprise Edition, which supports distributed deployment and offers clustering, fine-grained access control, monitoring, etc.
  • The Enterprise Edition supports clustering for high availability and scalability, with a leader-follower architecture and read replicas.

Example of Node, Relationship and Properties

  • Create Data

    • CREATE (:Person:Actor {name: 'Tom Hanks', born: 1956})-[:ACTED_IN {roles: ['Forrest']}]->(:Movie {title: 'Forrest Gump', released: 1994})<-[:DIRECTED]-(:Person {name: 'Robert Zemeckis', born: 1951})
  • It creates:

    • Nodes:
      • (:Person:Actor {name: 'Tom Hanks', born: 1956})
      • (:Movie {title: 'Forrest Gump', released: 1994})
      • (:Person {name: 'Robert Zemeckis', born: 1951})
    • Relationships:
      • (:Person:Actor)-[:ACTED_IN {roles: ['Forrest']}]->(:Movie)
      • (:Person)-[:DIRECTED]->(:Movie)
  • Each node stores its properties as key-value pairs (e.g., name: 'Tom Hanks', born: 1956)

  • Query - Find all movies Tom Hanks acted in:

    • Cypher is Neo4j’s declarative query language (like SQL for graphs):
      • MATCH (tom:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(movie:Movie)
      • RETURN movie.title, movie.released;
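
As a rough illustration of the property graph model above, the following Python sketch builds the same three nodes and two relationships in memory and answers the "movies Tom Hanks acted in" question by matching the same pattern. The `Node`, `Relationship`, and `match_acted_in` names are hypothetical helpers invented for this example; this is not how Neo4j or its drivers are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set        # e.g. {"Person", "Actor"}
    properties: dict   # key-value pairs attached to the node


@dataclass
class Relationship:
    start: Node        # relationships are directed: start -> end
    type: str          # e.g. "ACTED_IN"
    end: Node
    properties: dict = field(default_factory=dict)


# Same data as the CREATE statement above.
tom = Node({"Person", "Actor"}, {"name": "Tom Hanks", "born": 1956})
movie = Node({"Movie"}, {"title": "Forrest Gump", "released": 1994})
robert = Node({"Person"}, {"name": "Robert Zemeckis", "born": 1951})

relationships = [
    Relationship(tom, "ACTED_IN", movie, {"roles": ["Forrest"]}),
    Relationship(robert, "DIRECTED", movie),
]


def match_acted_in(rels, actor_name):
    """Rough equivalent of:
    MATCH (p:Person {name: $name})-[:ACTED_IN]->(m:Movie)
    RETURN m.title, m.released
    """
    return [
        (r.end.properties["title"], r.end.properties["released"])
        for r in rels
        if r.type == "ACTED_IN"
        and "Person" in r.start.labels
        and r.start.properties.get("name") == actor_name
        and "Movie" in r.end.labels
    ]


print(match_acted_in(relationships, "Tom Hanks"))
```

Note that this sketch scans every relationship linearly, which is only acceptable for a toy data set; it is meant to show the shape of the data and the pattern, not the cost of a real traversal.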

How does Neo4j work internally?

  • Neo4j uses a native graph storage engine which uses fixed-size record structures with pointers for high-performance traversal. It doesn’t store data in relational tables, JSON documents, or key-value pairs like many other databases.
    • Each node is stored as a fixed-size record in the neostore.nodestore.db file, which contains:
      • node ID,
      • pointer to the first relationship,
      • pointer to properties,
      • pointer to labels.
    • Each relationship is also a fixed-size record, stored in the neostore.relationshipstore.db file. It includes:
      • relationship ID,
      • start node ID and end node ID,
      • relationship type,
      • a pointer to properties,
      • pointers to the previous and next relationships for each of its two nodes (doubly linked lists).
      • Following these pointers makes each hop from a node to a connected node O(1), independent of the total graph size; visiting all of a node's neighbors takes time proportional to the number of its relationships.
    • Properties are stored separately in the neostore.propertystore.db file. Each property is a record of:
      • Property type (string, int, float, etc.),
      • Key ID (e.g., “name” → 101),
      • Value (e.g., "Alice"),
      • Pointer to next property.
    • Labels and relationship types are stored as integers internally.
    • Indexes are stored separately to speed up lookups. Indexes are built using B-trees, and full-text indexes are backed by Lucene. Constraints (e.g., uniqueness) are also tracked internally.
    • Neo4j uses a write-ahead log (WAL) for durability.
    • Neo4j uses property chains (linked lists) when there are multiple properties on a node or relationship.
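
The record layout described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not Neo4j's actual on-disk format: the field names, the `NODE_RECORD_SIZE` value, and all helper functions are hypothetical, list indexes stand in for record IDs, and self-loops (a relationship whose start and end are the same node) are not handled.

```python
from dataclasses import dataclass
from typing import Any, Optional

NODE_RECORD_SIZE = 15  # assumed bytes per node record, for illustration only


@dataclass
class NodeRecord:
    first_rel: Optional[int] = None   # head of this node's relationship chain
    first_prop: Optional[int] = None  # head of this node's property chain


@dataclass
class RelRecord:
    start: int
    end: int
    rel_type: int                     # relationship types are stored as integers
    start_prev: Optional[int] = None  # chain pointers for the start node
    start_next: Optional[int] = None
    end_prev: Optional[int] = None    # chain pointers for the end node
    end_next: Optional[int] = None


@dataclass
class PropRecord:
    key_id: int                       # e.g. "name" -> 101
    value: Any
    next_prop: Optional[int] = None   # next record in the property chain


nodes, rels, props = [], [], []  # list index plays the role of the record ID


def node_offset(node_id):
    # Fixed-size records: the byte offset is computed, not searched for.
    return node_id * NODE_RECORD_SIZE


def add_node():
    nodes.append(NodeRecord())
    return len(nodes) - 1


def add_property(node_id, key_id, value):
    # Push the new property onto the head of the node's property chain.
    props.append(PropRecord(key_id, value, nodes[node_id].first_prop))
    nodes[node_id].first_prop = len(props) - 1


def _link(rel_id, node_id):
    # Insert the relationship at the head of the node's doubly linked chain.
    rel, old_head = rels[rel_id], nodes[node_id].first_rel
    if rel.start == node_id:
        rel.start_next = old_head
    else:
        rel.end_next = old_head
    if old_head is not None:
        old = rels[old_head]
        if old.start == node_id:
            old.start_prev = rel_id
        else:
            old.end_prev = rel_id
    nodes[node_id].first_rel = rel_id


def add_relationship(start, end, rel_type):
    rels.append(RelRecord(start, end, rel_type))
    rel_id = len(rels) - 1
    _link(rel_id, start)
    _link(rel_id, end)
    return rel_id


def neighbors(node_id):
    # Walk the node's relationship chain; each step is a direct ID lookup.
    rel_id = nodes[node_id].first_rel
    while rel_id is not None:
        rel = rels[rel_id]
        if rel.start == node_id:
            yield rel.end
            rel_id = rel.start_next
        else:
            yield rel.start
            rel_id = rel.end_next


def properties(node_id):
    # Walk the property chain and collect key ID -> value.
    out, prop_id = {}, nodes[node_id].first_prop
    while prop_id is not None:
        out[props[prop_id].key_id] = props[prop_id].value
        prop_id = props[prop_id].next_prop
    return out


KEY_NAME, KEY_BORN = 101, 102   # hypothetical key IDs
ACTED_IN, DIRECTED = 0, 1       # hypothetical relationship type IDs
tom, movie, robert = add_node(), add_node(), add_node()
add_property(tom, KEY_NAME, "Tom Hanks")
add_property(tom, KEY_BORN, 1956)
add_relationship(tom, movie, ACTED_IN)
add_relationship(robert, movie, DIRECTED)
print(list(neighbors(movie)), properties(tom))
```

The point of the sketch is that `neighbors` never consults an index: it starts from the node record and follows stored IDs, so the work depends on how many relationships the node has rather than on the size of the whole graph.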

Amazon Neptune

  • Amazon Neptune is a fully managed graph database service offered by Amazon Web Services (AWS). It is designed to store and query highly connected data, making it ideal for applications like social networks, recommendation engines, fraud detection, knowledge graphs, and network/IT operations.
  • Amazon Neptune supports two popular graph models:
    • Property Graph (PG): Accessed using Gremlin (the query language from Apache TinkerPop) or openCypher. Use it for general-purpose graphs (e.g., social networks).
    • RDF (Resource Description Framework): Accessed using SPARQL. Use it for semantic web/linked data (e.g., knowledge graphs like Wikidata).
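
To make the difference between the two models more concrete, the sketch below represents the same small friendship fact in plain Python, once as a property graph and once as RDF-style triples. The `ex:` identifiers, the dictionary layout, and the `friends_pg` / `friends_rdf` helpers are made up for illustration; no Neptune, Gremlin, or SPARQL API is used here.

```python
# Property graph: nodes and edges carry a label plus key-value properties.
property_graph = {
    "nodes": {
        "n1": {"label": "Person", "properties": {"name": "Alice"}},
        "n2": {"label": "Person", "properties": {"name": "Bob"}},
    },
    "edges": [
        {"from": "n1", "to": "n2", "label": "FRIENDS_WITH",
         "properties": {"since": 2020}},
    ],
}

# RDF: every statement is a (subject, predicate, object) triple.
triples = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:alice", "ex:friendsWith", "ex:bob"),
}


def friends_pg(graph, name):
    """Names reachable over an outgoing FRIENDS_WITH edge."""
    nodes = graph["nodes"]
    return [
        nodes[e["to"]]["properties"]["name"]
        for e in graph["edges"]
        if e["label"] == "FRIENDS_WITH"
        and nodes[e["from"]]["properties"]["name"] == name
    ]


def friends_rdf(data, name):
    """Same question, answered by joining triples on shared terms."""
    subjects = {s for s, p, o in data if p == "ex:name" and o == name}
    friends = {o for s, p, o in data
               if p == "ex:friendsWith" and s in subjects}
    return sorted(o for s, p, o in data
                  if p == "ex:name" and s in friends)


print(friends_pg(property_graph, "Alice"))
print(friends_rdf(triples, "Alice"))
```

One practical difference shows up in the `since` value: a property graph attaches it directly to the edge, whereas plain RDF triples have no place for it without an extra modeling step such as reification, which is why it is absent from the triple set here.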

Use cases for graph databases

  • Social networks: Friend relationships, group memberships, interactions.
  • Recommendation engines: Suggesting products or content based on user relationships.
  • Fraud detection: Identifying suspicious activity via graph patterns.
  • Knowledge graphs: Managing complex domain knowledge (e.g., in healthcare or law).
  • Network & IT operations: Infrastructure modeling, impact analysis.