Importing and Updating Data on Virtuoso - DDMAL/linkedmusic-datalake GitHub Wiki

Overview

This document shows how to use the bulk loader to:

  1. Prepare your host folders and Virtuoso container
  2. Transfer RDF files (e.g., Turtle .ttl) from your local machine to Virtuoso
  3. Create per-folder .graph files to assign each set of TTLs to its own graph IRI
  4. Use isql to register and bulk-load each subfolder into Virtuoso
  5. Verify that both graphs contain data

It also contains information on updating/reuploading RDF files into Virtuoso.

Table of Contents:

This guide is divided in two sections. The first is for importing new databases, and the second is for updating data for an already present database.

Note: As of 9 June 2025, Liam is not sure why the directory structure is different on staging and production. This should be investigated. See #347.

We will be following the process outlined in this guide, which describes how to use Virtuoso's bulk loader for data import. Note that it requires the data to already be in one of the following RDF formats:

For this project, we are currently mainly using .ttl for its general ease-of-use, although other formats have not been extensively tested.

Importing Data

The instructions for staging and production are the same, except for the path to the Virtuoso files. On production, it is /srv/virtuoso/my_virtdb/ while on staging it is /srv/webapps/virtuoso/my_virtdb/.

1. SSH into Virtuoso and Verify Docker Permissions

First, SSH into the Virtuoso server (see How to SSH into Virtuoso).

Next, you will need Docker permission to add to the database. Reach out to the Virtuoso admin (Liam) and give him your username. This is the same username you use to SSH into the server. Alternatively, a system admin (Hanwen, Dylan) can add you using the following commands:

sudo usermod -aG docker <USERNAME>
sudo usermod -aG virtuoso-users <USERNAME>

You can verify everything is working by running docker ps.

2. Create and Organize Host Directories

For production, create a new directory under /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME> with:

mkdir -p /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

For staging, create a new directory under /srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME> with:

mkdir -p /srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

Remember to replace <NEW_DATABASE_NAME> with the name of your new database.

3. Transfer RDF Files from Local Machine

Next, from your local machine (i.e., Terminal), use rsync or scp to transfer your RDF files from your local machine to Virtuoso. The data files are located in the /srv/virtuoso/my_virtdb/data/ directory (for staging they are in /srv/webapps/virtuoso/my_virtdb/data/). In the Docker container, this folder is visible at /database/data.

For production:

rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
  "LOCAL_PATH_TO_DATA" \
  ddmal.prod_virtuoso:/srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

For staging:

rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
  "LOCAL_PATH_TO_DATA" \
  ddmal.staging_virtuoso:/srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

Then, confirm it's on the server.

For production:

ssh ddmal.prod_virtuoso
ls /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

For staging:

ssh ddmal.staging_virtuoso
ls /srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

4. Create Per-Folder .graph Files

Virtuoso uses .graph files to assign uploaded data to its target graph IRI.

For production:

nano /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>/global.graph

For staging:

nano /srv/webappsvirtuoso/my_virtdb/data/<NEW_DATABASE_NAME>/global.graph

Then, paste the new database's graph name. This should be https://linkedmusic.ca/graphs/<NEW_DATABASE_NAME>. Make sure there are no extra spaces or newlines. For example, for DIAMM, this would be:

https://linkedmusic.ca/graphs/diamm/

When you run the bulk loader, it will by default look for a global.graph file in the same directory as the data and assign the global.graph file's IRI to all data found in that directory. If you would like to override this default, you can create a separate .graph file. For example, if you have SAMPLE_SPECIAL_DATA.ttl, you can make a SAMPLE_SPECIAL_DATA.ttl.graph file and specify a different IRI in that .graph file. Ensure that the names of the files are identical except for the .graph file extension.

5. Run the Bulk Loader

  1. Enter the isql shell. The credentials here are the same as what you use for Virtuoso Conductor on the web. Note that your Virtuoso account will need dba permissions. Contact Liam, Dylan, or Hanwen for access or to make a new account.
docker exec -it my_virtdb isql 1111 <DBA_USERNAME>

Then enter your password.

  1. Register the new data. If you get "Access denied," see Troubleshooting.
DB.DBA.ld_dir(
  '/database/data/<NEW_DATABASE_NAME>',  -- container path
  '*.ttl',                               -- match all TTLs in that folder (replace with desired extension)
  NULL                                   -- NULL ⇒ look for <NEW_DATABASE_NAME>.ttl.graph
);
  1. Run the bulk loader
DB.DBA.rdf_loader_run();
  1. Verify the status of the new database

Run the following command to see the status of all the files for the database. It will show each file, the named graph into which all triples were imported, and will show the import status. You should see a value of 2 in the ll_state column for each file, this means that the file was fully imported. If you do not see a value of 2, you can add the ll_error column to the selected columns to see any errors that arose.

SELECT ll_file, ll_graph, ll_state
  FROM DB.DBA.load_list
 WHERE ll_file LIKE '%/<NEW_DATABASE_NAME>/%';
  1. Commit loaded triples
checkpoint;

6. Add Prefixes for all Entity Types

Add a prefix for each entity type in the database. See this section of the Virtuoso setup guide for instructions.

7. Verify Graph Contents with SPARQL

At the SQL> prompt, run the following. Remember to replace both <NEW_DATABASE_NAME> and <GRAPH_IRI>. Note that the graph IRI should remain enclosed in < > brackets.

SPARQL
  SELECT (COUNT(*) AS ?count)
    WHERE { GRAPH <<GRAPH_IRI>> { ?s ?p ?o } };

For example, for DIAMM, this would be:

SPARQL
  SELECT (COUNT(*) AS ?count)
    WHERE { GRAPH <https://linkedmusic.ca/graphs/diamm/> { ?s ?p ?o } };

You should see a non-zero number representing the number of triples in your new database.

Updating data

The instructions for staging and production are the same, except for the path to the Virtuoso files. On production, it is /srv/virtuoso/my_virtdb/ while on staging it is /srv/webapps/virtuoso/my_virtdb/.

You should have 2 terminals, one where you SSH into the server, and on just on your local machine. Starting in step 2, all commands will be run on the server.

1. Clear the Turtle files on Virtuoso and upload the new ones

Run the following command on the server to delete all Turtle files for the database:

For production:

rm /srv/virtuoso/my_virtdb/data/<DATABASE_NAME>/*.ttl

For staging:

rm /srv/webapps/virtuoso/my_virtdb/data/<DATABASE_NAME>/*.ttl

Then, run the following command on your machine to upload the new Turtle files:

For production:

rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
  "LOCAL_PATH_TO_DATA" \
  ddmal.prod_virtuoso:/srv/virtuoso/my_virtdb/data/<DATABASE_NAME>

For staging:

rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
  "LOCAL_PATH_TO_DATA" \
  ddmal.staging_virtuoso:/srv/webapps/virtuoso/my_virtdb/data/<DATABASE_NAME>

And confirm it's on the server by running the following command on the server:

For production:

ls /srv/virtuoso/my_virtdb/data/<DATABASE_NAME>

For staging:

ls /srv/webapps/virtuoso/my_virtdb/data/<DATABASE_NAME>

2. Enter the isql shell.

From now on, all commands will be run on the server.

The credentials here are the same as what you use for Virtuoso Conductor on the web. Note that your Virtuoso account will need dba permissions. Contact Liam, Dylan, or Hanwen for access or to make a new account.

docker exec -it my_virtdb isql 1111 <DBA_USERNAME>

Then enter your password.

3. Remove the Turtle files from the list of loaded files on Virtuoso

Virtuoso keeps a list of Turtle files that it's loaded to avoid re-loading files, but we need to remove the files for the database that we want to update from that list so that they actually get imported. Run the following command in the isql prompt to clear all files for the graph that you want to update:

DELETE FROM DB.DBA.LOAD_LIST WHERE ll_file LIKE '%/<DATABASE_NAME>/%';

Ensure that DATABASE_NAME follows the same capitalization as the folder name for that database (ideally all lowercase). For DIAMM, this would be:

DELETE FROM DB.DBA.LOAD_LIST WHERE ll_file LIKE '%/diamm/%';

4. Delete the graph on Virtuoso if you're modifying existing data

When you will run the bulk loader, Virtuoso will skip any triples that already exist in the database. If all you're doing is adding new triples, skip this step. However, if you're modifying any triples in the graph, you need to clear the graph on Virtuoso to get rid of the old data.

To do this, you'll need to grant yourself permission to update the graph. To give yourself this permission, run the following command:

DB.DBA.RDF_GRAPH_USER_PERMS_SET('<GRAPH_IRI>', '<DBA_USERNAME>', 7);

DBA_USERNAME is the same username that you use on Conductor and that you used to enter the isql shell in step 2. For DIAMM and for Simon's user, this would be:

DB.DBA.RDF_GRAPH_USER_PERMS_SET('https://linkedmusic.ca/graphs/diamm/', 'sngassam', 7);

Then, delete the graph:

log_enable(2, 1);
SPARQL CLEAR GRAPH <<GRAPH_IRI>>;

The log_enable function temporarily disables the transaction logging and the manual commit, which prevents clearing the graph from taking up all the memory on the server. The second argument having a value of 1 means that the disabling will only affect the next function call. As such, these 2 commands need to be run back-to-back.

For DIAMM, this would be:

log_enable(2, 1);
SPARQL CLEAR GRAPH <https://linkedmusic.ca/graphs/diamm/>;

5. Run the bulk loader

First, register the new data. If you get "Access denied," see Troubleshooting.

DB.DBA.ld_dir(
  '/database/data/<DATABASE_NAME>',      -- container path
  '*.ttl',                               -- match all TTLs in that folder
  NULL                                   -- NULL ⇒ look for <DATABASE_NAME>.ttl.graph
);

Then, run the bulk loader

DB.DBA.rdf_loader_run();

Then, verify the status of the new database. Run the following command to see the status of all the files for the database. It will show each file, the named graph into which all triples were imported, and will show the import status. You should see a value of 2 in the ll_state column for each file, this means that the file was fully imported. If you do not see a value of 2, you can add the ll_error column to the selected columns to see any errors that arose.

SELECT ll_file, ll_graph, ll_state
  FROM DB.DBA.load_list
 WHERE ll_file LIKE '%/<NEW_DATABASE_NAME>/%';

Then, commit the loaded triples

checkpoint;

6. Verify that the data was successfully imported

Run the following command to count the triples for the database, you should see a non-zero number.

SPARQL
  SELECT (COUNT(*) AS ?count)
    WHERE { GRAPH <<GRAPH_IRI>> { ?s ?p ?o } };

For example, for DIAMM, this would be:

SPARQL
  SELECT (COUNT(*) AS ?count)
    WHERE { GRAPH <https://linkedmusic.ca/graphs/diamm/> { ?s ?p ?o } };

Troubleshooting

  • "Access denied" when running ld_dir()

This is likely because Virtuoso restricts directories via DirsAllowed. To fix it, you need to modify virtuoso.ini on the host.

Production:

sudo -i
nano /srv/virtuoso/my_virtdb/virtuoso.ini

Staging:

sudo -i
nano /srv/webapps/virtuoso/my_virtdb/virtuoso.ini

Then find the [Parameters] section and make sure it has:

[Parameters]
  ...
  DirsAllowed = ., ../vad, /usr/share/proj, /database, /database/data
  ...

Save (Ctrl+O, Enter, Ctrl+X) and restart the Docker container:

docker restart my_virtdb

If this still doesn't work, try whitelisting my_virtdb instead:

[Parameters]
  ...
  DirsAllowed = ., ../vad, /usr/share/proj, /`my_virtdb`, /`my_virtdb`/data
  ...
  • Number of triples is zero when running DB.DBA.load_list

This could happen for a number of reasons. However, it means that no RDF files were registered. Double check that your .graph files are not missing, misnamed, or in the wrong place. The filenames must either be <NEW_DATABASE_NAME>.<EXTENSION>.graph (i.e., musicbrainz.ttl.graph) for per-file .graph loading, or you must have a single global.graph file in the root directory for your new database.

  • rsync not able to transfer the files to the server

Note, you may need to add --rsync-path="sudo rsync" to your rsync command if it says you don't have the right permissions. If this is the case, let an admin know to create an issue for it.

For production:

rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
  "LOCAL_PATH_TO_DATA" \
  --rsync-path="sudo rsync" \
  ddmal.prod_virtuoso:/srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

For staging:

rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
  "LOCAL_PATH_TO_DATA" \
  --rsync-path="sudo rsync" \
  ddmal.staging_virtuoso:/srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>

Future Work

  • There is a way to load .csv files instead of RDF directly. However, as of 4 June 2025, Liam has not investigated how to do this (see #294 and #319).
⚠️ **GitHub.com Fallback** ⚠️