Deploy Neptune and Rhizomer

In this section, we walk you through deploying Neptune and Rhizomer in order to interactively explore knowledge graphs.

Neptune deployment

For instructions on deploying a Neptune database, see the Neptune User Guide. You can start with the smallest instance type available. Later, we demonstrate how to load large amounts of data, which requires a bigger instance.
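If you prefer the command line, the following is a minimal sketch of creating a small Neptune cluster and instance with the AWS CLI; the identifiers and the db.t3.medium instance class are placeholder values you should adapt to your own setup, and the Neptune User Guide remains the reference:

# Create a Neptune cluster and a small instance (identifiers are placeholders)
aws neptune create-db-cluster --db-cluster-identifier rhizomer-neptune --engine neptune
aws neptune create-db-instance --db-instance-identifier rhizomer-neptune-1 --db-instance-class db.t3.medium --engine neptune --db-cluster-identifier rhizomer-neptune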

Rhizomer deployment

Rhizomer is based on a server-side API and a client-side web application. Both are available as open-source projects from GitHub:

  • RhizomerEye is the front end. It's developed with Angular and consumes RhizomerAPI.

  • RhizomerAPI is the backend. It's developed using Spring and provides the API consumed by RhizomerEye.

To facilitate their deployment, they're also available as Docker images that you can launch using Amazon Elastic Container Service (Amazon ECS). Amazon ECS makes it easy to deploy, manage, and scale Docker containers from its command-line tool, which supports docker-compose configuration files. For more information, see Tutorial: Creating a Cluster with an EC2 Task Using the Amazon ECS CLI, which details how to install ecs-cli and configure it.
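For reference, on Linux the ECS CLI can be installed by downloading its binary and making it executable; this is a minimal sketch assuming the official download location for the amd64 build:

# Download the ECS CLI binary, make it executable, and check the version
sudo curl -Lo /usr/local/bin/ecs-cli https://amazon-ecs-cli.s3.amazonaws.com/ecs-cli-linux-amd64-latest
sudo chmod +x /usr/local/bin/ecs-cli
ecs-cli --version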

We also demonstrate how to use the AWS Command Line Interface (AWS CLI). For more information, see the AWS CLI User Guide.
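If the AWS CLI isn't configured yet, a minimal sketch of setting credentials and a default region from the command line looks like the following (the region is an example value):

# Check the AWS CLI and configure credentials and default region
aws --version
aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
aws configure set region us-east-1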

  1. When the ECS CLI tool is ready, we first create a cluster configuration:
ecs-cli configure --cluster rhizomer --region us-east-1 --default-launch-type EC2 --config-name rhizomer-config
  2. Then we configure a profile using your access key and secret key as detailed in configure ecs-cli:
ecs-cli configure profile --access-key $AWS_ACCESS_KEY_ID --secret-key $AWS_SECRET_ACCESS_KEY --profile-name rhizomer-profile

Now you can create the cluster where the containers are launched.

  1. First, we need a security group for the cluster that opens port 80 for the client and 8080 for the API. We can do so from the command line:
aws ec2 create-security-group --group-name rhizomer-security-group --description "Rhizomer security group" --vpc-id vpc-1234567a
  2. We open input traffic to ports 80 and 8080 from anywhere:
aws ec2 authorize-security-group-ingress --group-name rhizomer-security-group --protocol tcp --port 80 --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress --group-name rhizomer-security-group --protocol tcp --port 8080 --cidr 0.0.0.0/0
  3. Now we can finally create the cluster, associated with the previous security group through its returned identifier. Additionally, we configure the identifier of the VPC where the Neptune cluster is (and the subnets corresponding to this VPC) plus the instance type, ECS configuration, and the profile to use:
ecs-cli up --security-group sg-0123456789101112 --capability-iam --vpc vpc-1234567a --subnets subnet-1abcd234 --instance-type t2.micro --size 1 --cluster rhizomer --cluster-config rhizomer-config --ecs-profile rhizomer-profile --force
  4. We use a docker-compose file to define and configure the Docker images to load into our cluster. The following content should be available in a file called docker-compose.yml:
version: '3'
services:
   rhizomer-api:
      image: rhizomik/rhizomer-api
      ports:
         - "8080:8080"
      environment:
         - ALLOWED_ORIGINS=http://${HOSTNAME}
         - RHIZOMER_DEFAULT_PASSWORD=password
   rhizomer:
      image: rhizomik/rhizomer-eye
      ports:
         - "80:80"
      environment:
         - API_URL=http://${HOSTNAME}:8080

In addition to the details provided by the docker-compose file, Amazon ECS requires some additional details about memory usage limits per container. We have roughly 0.90 GB available in a t2.micro instance, which we share among the containers as detailed in a file called ecs-params.yml in the same folder, whose content should be as follows:

version: 1
task_definition:
  services:
    rhizomer:
      mem_limit: 0.20GB
    rhizomer-api:
      mem_limit: 0.70GB

Now we can launch the docker-compose through the ECS CLI. However, we need to set the HOSTNAME variable used in the docker-compose.yml to the public DNS name of the EC2 instance in our cluster.

  1. Set the environment variable HOSTNAME to the public DNS name of the cluster's EC2 instance, which you can look up on the Amazon ECS console. Alternatively, enter the following command to set it from the command line:
export HOSTNAME=$(aws ecs list-container-instances --cluster rhizomer --query "containerInstanceArns" --output text | xargs aws ecs describe-container-instances --cluster rhizomer --container-instances --query "containerInstances[].ec2InstanceId" --output text | xargs aws ec2 describe-instances --instance-ids --query 'Reservations[].Instances[].PublicDnsName' --output text)
  2. We start the containers from the same folder where the docker-compose.yml and ecs-params.yml files are located with the following command:
ecs-cli compose up --cluster-config rhizomer-config --ecs-profile rhizomer-profile --force-update
  3. After the command is complete, we can list the running containers:
ecs-cli ps --cluster rhizomer

This command outputs something similar to the following code, which shows containers as RUNNING, one corresponding to the Rhizomer client attached to port 80 and one for the API at port 8080:

Name                                               State    Ports                        TaskDefinition  Health
847bf8a8-954e-4ce6-957a-9e26e23d2425/rhizomer-api  RUNNING  54.87.29.177:8080->8080/tcp  roberto:8       UNKNOWN
847bf8a8-954e-4ce6-957a-9e26e23d2425/rhizomer      RUNNING  54.87.29.177:80->80/tcp      roberto:8       UNKNOWN

Now we can start interacting with Neptune through the newly deployed Rhizomer. When we're finished, we can easily remove all the involved resources (such as the cluster and instances) using the following command:

ecs-cli down --cluster-config rhizomer-config --ecs-profile rhizomer-profile

Enhanced Security

To enhance the security of your Rhizomer deployment, we recommend enabling encryption of the EBS volumes of the instances running the front end and the back end. The easiest way to do so is by enabling EBS encryption by default, as detailed in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html#encryption-by-default
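For instance, the following AWS CLI commands enable EBS encryption by default in the current region and verify the setting:

aws ec2 enable-ebs-encryption-by-default
aws ec2 get-ebs-encryption-by-default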

Additionally, we recommend using HTTPS when deploying Rhizomer. The easiest way to accomplish this is to use an AWS load balancer that secures the connection with a certificate provided by AWS Certificate Manager (ACM). To do so, follow the instructions at https://docs.aws.amazon.com/es_es/elasticloadbalancing/latest/classic/elb-create-https-ssl-load-balancer.html
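The following is a rough sketch of those steps with the AWS CLI, assuming a Classic Load Balancer; the domain name, certificate ARN, and instance identifier are placeholders, and the ACM certificate must be validated before the HTTPS listener can use it:

# Request a certificate for the domain Rhizomer will be served from (placeholder domain)
aws acm request-certificate --domain-name rhizomer.example.org --validation-method DNS

# Create a load balancer with an HTTPS listener forwarding to the Rhizomer container on port 80
aws elb create-load-balancer --load-balancer-name rhizomer-lb \
  --listeners "Protocol=HTTPS,LoadBalancerPort=443,InstanceProtocol=HTTP,InstancePort=80,SSLCertificateId=arn:aws:acm:us-east-1:1234567890:certificate/example" \
  --subnets subnet-1abcd234 --security-groups sg-0123456789101112

# Register the EC2 instance running the containers with the load balancer (placeholder instance ID)
aws elb register-instances-with-load-balancer --load-balancer-name rhizomer-lb --instances i-0123456789abcdef0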

Interact with Rhizomer

Now that we have deployed Rhizomer through Amazon ECS, we can start using its web user interface. It's available from the public DNS name we previously stored in the HOSTNAME environment variable, which we can retrieve with the following code:

echo $HOSTNAME

Enter the DNS name in your preferred browser. You should see Rhizomer's About page.

To manage users and registered datasets, sign in with the username admin and the default password set through RHIZOMER_DEFAULT_PASSWORD in the docker-compose.yml during deployment (password in the example above).

Define a new dataset

We can now define a new dataset to explore.

  1. In the Rhizomer web interface, choose Datasets.
  2. Choose New dataset.
  3. For Name, enter a name for your dataset.
  4. For Query Type, choose your type of query (in this example, we choose Detailed).
  5. For SPARQL Server type, choose Amazon Neptune.

    This allows you to generate SPARQL queries optimized for Neptune.

  6. Provide the details of the SPARQL endpoint.

    You don't need to define a separate SPARQL update endpoint because Neptune uses the same endpoint for queries and updates.

  7. Optionally, you can make the endpoint writable or password protected.

    You can retrieve your Neptune endpoint on the Instances page of the Neptune console, or with the AWS CLI as shown in the sketch after this list.

  8. In the Dataset Graphs section, we load data into the new graph.

    We can use a URI (either a URL or a URN) to identify the new graph (in this example, we enter urn:game_of_thrones).

  9. Choose Add Graph.

    We can now see our graph in the list of dataset graphs.

  10. Choose Load data to load data into the new graph.

    For this example, we load semantic data about the Game of Thrones characters from the file got.ttl.

  11. Choose Submit to load the data into Neptune.

    After we load the data, we can explore the graph urn:game_of_thrones.
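If you prefer the command line over the console, the Neptune endpoint can also be retrieved with the AWS CLI; a minimal sketch for the current region follows:

aws neptune describe-db-instances --query 'DBInstances[].Endpoint.[Address,Port]' --output text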

Explore a small dataset

When we choose Explore on the dataset detail page, we can start inspecting the data we loaded into Neptune. The first thing Rhizomer does when interacting with a dataset is present an overview of the data:

  • A word cloud generated from the classes in the dataset, if the dataset Query Type was set to Optimized. Each word in the cloud corresponds to a class, and its size is relative to the number of instances of the class in the dataset.
  • A network overview of the main classes and relationships among them, if the dataset Query Type was set to Detailed.

These visualizations are generated automatically by sending SPARQL queries to the endpoint associated with the explored dataset. The queries include in the FROM clause the graph or graphs selected for exploration. For instance, to retrieve the classes:

SELECT ?class (COUNT(?instance) AS ?n)
FROM <urn:game_of_thrones>
WHERE
  { ?instance a ?class
    FILTER ( ! isBlank(?class) )
  }
GROUP BY ?class

As shown in the following network overview, we have four classes: FictionalCharacter, Noble, Book, and Organisation. All the characters in the dataset are instances of FictionalCharacter, but some of them are also Noble. They appear in Books and have allegiance with houses, which are Organisations.

We can choose a class to explore it further. For example, if we choose Noble, we see the following faceted view of all Game of Thrones characters classified as nobles.

This visualization shows the number of instances for the selected class, initially unconstrained so all 430 of 430 nobles are listed. Rhizomer uses the following query to retrieve the count from Neptune:

SELECT  (COUNT(?instance) AS ?n)
FROM <urn:game_of_thrones>
WHERE
  { ?instance a <http://dbpedia.org/ontology/Noble> }

The instances are displayed using pagination, which the underlying SPARQL query implements with OFFSET and LIMIT on an embedded SELECT query that retrieves the instances to display. To retrieve all the triples describing the resources included in the page, a DESCRIBE SPARQL query is used:

DESCRIBE ?instance
FROM <urn:game_of_thrones>
WHERE
  { { SELECT DISTINCT  ?instance
      WHERE
        { ?instance a <http://dbpedia.org/ontology/Noble> }
      OFFSET  0
      LIMIT   10
    }
  }

The visualization lists the facets for the class. Each facet corresponds to a property used to describe instances, and you can see how many times the corresponding property is used. You can also see how many different values are used and whether all are literals or not. This view is generated automatically by Rhizomer using the following SPARQL query:

PREFIX hint: <http://aws.amazon.com/neptune/vocab/v01/QueryHints#>

SELECT  ?property (COUNT(?instance) AS ?uses) (COUNT(DISTINCT ?object) AS ?values) (MIN(?isLiteral) AS ?allLiteral)
FROM <urn:game_of_thrones>
WHERE
  {
    hint:Query hint:joinOrder "Ordered"
    { SELECT  ?instance
      WHERE
        { ?instance a <http://dbpedia.org/ontology/Noble> }
    }
    ?instance  ?property  ?object
    BIND(isLiteral(?object) AS ?isLiteral)
  }
GROUP BY ?property

You can expand each facet to show the 10 most common values for the selected instances using the following query, which also retrieves more readable labels (preferably in English, if available):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX hint: <http://aws.amazon.com/neptune/vocab/v01/QueryHints#>

SELECT  ?value ?label (COUNT(?value) AS ?count)
FROM <urn:game_of_thrones>
WHERE
  {
    hint:Query hint:joinOrder "Ordered"
    { SELECT DISTINCT  ?instance
      WHERE
        { ?instance a <http://dbpedia.org/ontology/Noble> }
    }
    ?instance  rdfs:comment  ?resource
    OPTIONAL
      { ?resource  rdfs:label  ?label
        FILTER langMatches(lang(?label), "en")
      }
    OPTIONAL
      { ?resource  rdfs:label  ?label }
    BIND(str(?resource) AS ?value)
  }
GROUP BY ?value ?label
ORDER BY DESC(?count)
LIMIT   10

From this visualization, you can further filter the available instances by one or more specific facet values. There is also an input form for each facet that allows filtering by any of its values.

As we mentioned earlier, Rhizomer also works as a linked data browser: if you choose any resource in the instance descriptions, its description is retrieved and presented, either from the local dataset or, if not available locally, remotely from the resource URL. You don't need any prior knowledge about the dataset to explore it, because the overview and faceted views show which classes are present in the dataset and how they're described using properties and values. Rhizomer does all the hard work through SPARQL queries, so you don't need to worry about them.
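Outside Rhizomer, the same linked data mechanism can be reproduced with curl by dereferencing a resource URL and asking for an RDF serialization through content negotiation; the DBpedia resource below is just an illustrative example:

curl -L -H 'Accept: text/turtle' http://dbpedia.org/resource/House_Stark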

Bulk load DBpedia in Neptune

The mechanism provided by Rhizomer to load data into Neptune is only suitable for small data files. For bigger data files, in the order of millions of triples, we recommend using the bulk loader provided by Neptune.

The data to upload should be available in an Amazon Simple Storage Service (Amazon S3) bucket, to which Neptune needs read and list access. To prepare the S3 bucket with DBpedia data, follow the instructions in Prepare DBpedia Dataset.

DBpedia is a large dataset; it's a semantic version of Wikipedia featuring millions of triples. Therefore, we use a bigger instance for Neptune.

  1. We start a new instance with instance class db.r5.2xlarge, which has 8 vCPU and 64 GiB RAM.

    Next, Neptune requires permission to access the S3 storage bucket. This is granted through an AWS Identity and Access Management (IAM) role that has access to the bucket.

  2. Create the new role.

  3. Add the role to your Neptune cluster.

  4. Create a VPC endpoint for Amazon S3 (for the Neptune loader to use).

    Now the Neptune loader has access to the S3 bucket with the DBpedia data. Next, we instruct the loader to get the data from Amazon S3. The loader provides a web API that, for security reasons, is accessible just from machines connected to the VPC where our Neptune instance is.

  5. The easiest way to get such a machine is to start a new t2.micro instance and, during the networking part of the configuration process, connect it to the same VPC as Neptune.

    In our case, this vpc-id is vpc-1234567a, the same one we used when launching Rhizomer's containers.

  6. From the command line of the new instance, we can use curl to interact with the loader web API. In the following code, replace the URL with your own Neptune endpoint and iamRoleArn with the ARN of the new IAM role:

curl -X POST \
-H 'Content-Type: application/json' \
https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/loader -d '
{
   "source" : "s3://your-bucket/dbpedia/",
   "format" : "turtle",
   "iamRoleArn" : "arn:aws:iam::1234567890:role/NeptuneLoadFromS3",
   "region" : "us-east-1",
   "failOnError" : "FALSE",
   "parserConfiguration" : { "namedGraphUri": "http://aws.amazon.com/neptune/vocab/v01/DefaultNamedGraph" }
}'

The response from the loader indicates the identifier for the load process just triggered:

{
   "status" : "200 OK",
   "payload" : { "loadId" : "d4f889ca-f5b3-47b7-a873-2d4343896a77" }
}
  7. Use the loadId to check the status of the loading process:
curl -G 'https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/loader/d4f889ca-f5b3-47b7-a873-2d4343896a77?details=true'

For more information about the Neptune loader, see Neptune Loader Reference. The status, when the load is complete, should look like the following code:

{
   "status" : "200 OK",
   "payload" : {
      "feedCount" : [ { "LOAD_COMPLETED" : 18 } ],
      "overallStatus" : {
         "fullUri" : "s3://your-bucket/dbpedia/",
         "runNumber" : 1,
         "retryNumber" : 0,
         "status" : "LOAD_COMPLETED",
         "totalTimeSpent" : 3622,
         "totalRecords" : 131861846,
         "totalDuplicates" : 21012430,
         "parsingErrors" : 0,
         "datatypeMismatchErrors" : 0,
         "insertErrors" : 0
      }
   }
}

The status contains the number of files loaded from the S3 bucket and the total number of triples finally stored. For the subset of DBpedia contained in the bucket, 131,861,846 triples are loaded, but 21,012,430 of them are duplicates (the same triple is present in more than one file). This means that we actually have 110,849,416 distinct triples in Neptune. The load took 3,622 seconds, slightly more than one hour.
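If jq is available, the same figure can be derived directly from the loader status shown above; this is a small sketch reusing the sample endpoint and loadId from the previous steps:

curl -s 'https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/loader/d4f889ca-f5b3-47b7-a873-2d4343896a77?details=true' | jq '.payload.overallStatus | .totalRecords - .totalDuplicates'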

  8. We can check this by sending a SPARQL query to Neptune that gets the total number of triples stored:
curl -X POST --data-binary 'query=SELECT (COUNT(?s) AS ?n) WHERE { ?s ?p ?o }' https://neptune.abcdefghi1cba.us-east-1.neptune.amazonaws.com:8182/sparql

The following code is the result:

{
   "head" : { "vars" : [ "n" ] },
   "results" : { "bindings" : [ {
      "n" : {
         "datatype" : "http://www.w3.org/2001/XMLSchema#integer",
         "type" : "literal",
         "value" : "110849416" }
      } ]
   }
}

However, it's inconvenient to explore the loaded data through manually crafted SPARQL queries sent from the command line. It is better to use Rhizomer to interactively explore all the data we loaded: https://neptune-rhizomer.rhizomik.net/datasets/dbpedia

Prepare DBpedia dataset

From an EC2 instance, we use wget to download the following files from the DBpedia 2016-10 dump (an example command follows the table). They are just a subset of DBpedia (about 130 million triples), but the most useful for data exploration.

| File | Dataset | Description | Size (triples) |
|------|---------|-------------|----------------|
| article_categories_en.ttl.bz2 | Article Categories | Links from concepts to categories using the SKOS vocabulary. | 23,990,514 |
| category_labels_en.ttl.bz2 | Category Labels | Labels for categories. | 1,475,015 |
| disambiguations_en.ttl.bz2 | Disambiguations | Links extracted from Wikipedia disambiguation pages. Because Wikipedia has no syntax to distinguish disambiguation links from ordinary links, DBpedia uses heuristics. | 1,537,180 |
| geo_coordinates_en.ttl.bz2 | Geo Coordinates | Geographic coordinates extracted from Wikipedia. | 2,323,568 |
| geo_coordinates_mappingbased_en.ttl.bz2 | Geo Coordinates Mappingbased | Geographic coordinates extracted from Wikipedia originating from mapped infoboxes in the mappings wiki. | 2,450,527 |
| geonames_links_en.ttl.bz2 | Geonames Links | This file contains the back-links (owl:sameAs) to the Geonames dataset. | 535,380 |
| homepages_en.ttl.bz2 | Homepages | Links to homepages of persons, organizations, etc. | 688,563 |
| images_en.ttl.bz2 | Images | Main image and corresponding thumbnail from the Wikipedia article. | 11,869,354 |
| instance_types_en.ttl.bz2 | Instance Types | Contains triples of the form $object rdf:type $class from the mapping-based extraction. | 5,150,432 |
| labels_en.ttl.bz2 | Labels | Titles of all Wikipedia articles in the corresponding language. In Wikidata, it contains all the languages available in the mappings wiki; labels_nmw contains the rest. | 12,845,252 |
| long_abstracts_en.ttl.bz2 | Long Abstracts | Long abstracts (full abstracts) of Wikipedia articles, usually the first section. | 4,935,279 |
| mappingbased_literals_en.ttl.bz2 | Mappingbased Literals | High-quality data extracted from infoboxes using the mapping-based extraction (literal properties only). The predicates in this dataset are in the ontology namespace. This data is of much higher quality than the raw infobox properties in the property namespace. | 14,388,537 |
| mappingbased_objects_en.ttl.bz2 | Mappingbased Objects | High-quality data extracted from infoboxes using the mapping-based extraction (object properties only). The predicates in this dataset are in the ontology namespace. This data is of much higher quality than the raw infobox properties in the property namespace. | 18,746,174 |
| mappingbased_objects_uncleaned_en.ttl.bz2 | Mappingbased Objects Uncleaned | The DBpedia dataset mappingbased_objects_uncleaned. | 18,806,500 |
| short_abstracts_en.ttl.bz2 | Short Abstracts | Short abstracts (about 600 characters long) of Wikipedia articles. | 4,935,279 |
| skos_categories_en.ttl.bz2 | SKOS Categories | Information about which concept is a category and how categories are related using the SKOS vocabulary. | 6,083,029 |
| specific_mappingbased_properties_en.ttl.bz2 | Specific Mappingbased Properties | Infobox data from the mapping-based extraction, using units of measurement more convenient for the resource type, such as square kilometers instead of square meters for the area of a city. | 915,714 |
| topical_concepts_en.ttl.bz2 | Topical Concepts | Resources that describe a category. | 186,680 |
| Total | | | 131,862,977 |
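For example, downloading one of these files with wget would look like the following; the URL assumes the usual layout of the DBpedia 2016-10 downloads server:

wget http://downloads.dbpedia.org/2016-10/core-i18n/en/instance_types_en.ttl.bz2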

Neptune can only load input files from Amazon S3. Therefore, the DBpedia files should be uploaded to an S3 bucket first:

aws s3 cp . s3://your-bucket/ --recursive --exclude "*" --include "*.bz2"

Finally, we're ready to load DBpedia into Neptune and explore our knowledge graph following the instructions in Bulk load DBpedia in Neptune.

Alternative: DBpedia Databus

The DBpedia files can be also downloaded from DBpedia Databus using the Databus Client.

First, download the client as a JAR file from its GitHub releases:

wget https://github.com/dbpedia/databus-client/releases/download/v0.3.1/databus-client-1.0-SNAPSHOT.jar

Then, execute the client (Java required) to retrieve the "Rhizomer Dump 2021.09.01-en" collection, which selects the subset of DBpedia files that makes it browsable using Rhizomer:

java -jar databus-client-1.0-SNAPSHOT.jar -f ttl -c bz2 -t ./databus-download/ -s "https://databus.dbpedia.org/rogargon/collections/browsable_core"

This will download the following 12 files in the collection (about 2 GB of data) into the databus-download folder:

| Dataset | Download | Variant | Format |
|---------|----------|---------|--------|
| Cleaned object properties extracted with mappings (2021.09.01) | 175.5 MB | en | ttl |
| | 383 KB | disjointDomain, en | ttl |
| | 685 KB | disjointRange, en | ttl |
| Numeric Literals converted to designated units with class-specific property mappings (2021.09.01) | 8.3 MB | en | ttl |
| Extracted facts from Wikipedia Infoboxes (2021.09.01) | 820 MB | en | ttl |
| DBpedia Ontology instance types (2021.09.01) | 42.4 MB | en, specific | ttl |
| Geo-coordinates extracted with mappings (2021.09.01) | 17.1 MB | en | ttl |
| geo-coordinates dataset (2021.09.01) | 32.7 MB | en | ttl |
| images dataset (2021.09.01) | 604.9 MB | en | ttl |
| Wikipedia page title as rdfs:label (2021.09.01) | 153.3 MB | en | ttl |
| homepages dataset (2021.09.01) | 11.5 MB | en | ttl |
| Literals extracted with mappings (2021.09.01) | 139.1 MB | en | ttl |