Importing and Updating Data on Virtuoso - DDMAL/linkedmusic-datalake GitHub Wiki
This document shows how to use the bulk loader to:
- Prepare your host folders and Virtuoso container
- Transfer RDF files (e.g., Turtle
.ttl
) from your local machine to Virtuoso - Create per-folder
.graph
files to assign each set of TTLs to its own graph IRI - Use
isql
to register and bulk-load each subfolder into Virtuoso - Verify that both graphs contain data
It also contains information on updating/reuploading RDF files into Virtuoso.
This guide is divided in two sections. The first is for importing new databases, and the second is for updating data for an already present database.
Note: As of 9 June 2025, Liam is not sure why the directory structure is different on staging and production. This should be investigated. See #347.
We will be following the process outlined in this guide, which describes how to use Virtuoso's bulk loader for data import. Note that it requires the data to already be in one of the following RDF formats:
-
.grdf
– Geospatial RDF -
.nq
– N-Quads -
.nt
– N-Triples -
.owl
– OWL -
.rdf
– RDF/XML -
.trig
– TriG -
.ttl
– Turtle -
.xml
– RDF/XML
For this project, we are currently mainly using .ttl
for its general ease-of-use, although other formats have not been extensively tested.
The instructions for staging and production are the same, except for the path to the Virtuoso files. On production, it is /srv/virtuoso/my_virtdb/
while on staging it is /srv/webapps/virtuoso/my_virtdb/
.
First, SSH into the Virtuoso server (see How to SSH into Virtuoso).
Next, you will need Docker permission to add to the database. Reach out to the Virtuoso admin (Liam) and give him your username. This is the same username you use to SSH into the server. Alternatively, a system admin (Hanwen, Dylan) can add you using the following commands:
sudo usermod -aG docker <USERNAME>
sudo usermod -aG virtuoso-users <USERNAME>
You can verify everything is working by running docker ps
.
For production, create a new directory under /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
with:
mkdir -p /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
For staging, create a new directory under /srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
with:
mkdir -p /srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
Remember to replace <NEW_DATABASE_NAME> with the name of your new database.
Next, from your local machine (i.e., Terminal), use rsync
or scp
to transfer your RDF files from your local machine to Virtuoso. The data files are located in the /srv/virtuoso/my_virtdb/data/
directory (for staging they are in /srv/webapps/virtuoso/my_virtdb/data/
). In the Docker container, this folder is visible at /database/data
.
For production:
rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
"LOCAL_PATH_TO_DATA" \
ddmal.prod_virtuoso:/srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
For staging:
rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
"LOCAL_PATH_TO_DATA" \
ddmal.staging_virtuoso:/srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
Then, confirm it's on the server.
For production:
ssh ddmal.prod_virtuoso
ls /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
For staging:
ssh ddmal.staging_virtuoso
ls /srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
Virtuoso uses .graph
files to assign uploaded data to its target graph IRI.
For production:
nano /srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>/global.graph
For staging:
nano /srv/webappsvirtuoso/my_virtdb/data/<NEW_DATABASE_NAME>/global.graph
Then, paste the new database's graph name. This should be https://linkedmusic.ca/graphs/<NEW_DATABASE_NAME>
. Make sure there are no extra spaces or newlines. For example, for DIAMM, this would be:
https://linkedmusic.ca/graphs/diamm/
When you run the bulk loader, it will by default look for a global.graph
file in the same directory as the data and assign the global.graph
file's IRI to all data found in that directory. If you would like to override this default, you can create a separate .graph
file. For example, if you have SAMPLE_SPECIAL_DATA.ttl
, you can make a SAMPLE_SPECIAL_DATA.ttl.graph
file and specify a different IRI in that .graph
file. Ensure that the names of the files are identical except for the .graph
file extension.
- Enter the
isql
shell. The credentials here are the same as what you use for Virtuoso Conductor on the web. Note that your Virtuoso account will needdba
permissions. Contact Liam, Dylan, or Hanwen for access or to make a new account.
docker exec -it my_virtdb isql 1111 <DBA_USERNAME>
Then enter your password.
- Register the new data. If you get "Access denied," see Troubleshooting.
DB.DBA.ld_dir(
'/database/data/<NEW_DATABASE_NAME>', -- container path
'*.ttl', -- match all TTLs in that folder (replace with desired extension)
NULL -- NULL ⇒ look for <NEW_DATABASE_NAME>.ttl.graph
);
- Run the bulk loader
DB.DBA.rdf_loader_run();
- Verify the status of the new database
Run the following command to see the status of all the files for the database. It will show each file, the named graph into which all triples were imported, and will show the import status. You should see a value of 2 in the ll_state
column for each file, this means that the file was fully imported. If you do not see a value of 2, you can add the ll_error
column to the selected columns to see any errors that arose.
SELECT ll_file, ll_graph, ll_state
FROM DB.DBA.load_list
WHERE ll_file LIKE '%/<NEW_DATABASE_NAME>/%';
- Commit loaded triples
checkpoint;
Add a prefix for each entity type in the database. See this section of the Virtuoso setup guide for instructions.
At the SQL>
prompt, run the following. Remember to replace both <NEW_DATABASE_NAME>
and <GRAPH_IRI>
. Note that the graph IRI should remain enclosed in < >
brackets.
SPARQL
SELECT (COUNT(*) AS ?count)
WHERE { GRAPH <<GRAPH_IRI>> { ?s ?p ?o } };
For example, for DIAMM, this would be:
SPARQL
SELECT (COUNT(*) AS ?count)
WHERE { GRAPH <https://linkedmusic.ca/graphs/diamm/> { ?s ?p ?o } };
You should see a non-zero number representing the number of triples in your new database.
The instructions for staging and production are the same, except for the path to the Virtuoso files. On production, it is /srv/virtuoso/my_virtdb/
while on staging it is /srv/webapps/virtuoso/my_virtdb/
.
You should have 2 terminals, one where you SSH into the server, and on just on your local machine. Starting in step 2, all commands will be run on the server.
Run the following command on the server to delete all Turtle files for the database:
For production:
rm /srv/virtuoso/my_virtdb/data/<DATABASE_NAME>/*.ttl
For staging:
rm /srv/webapps/virtuoso/my_virtdb/data/<DATABASE_NAME>/*.ttl
Then, run the following command on your machine to upload the new Turtle files:
For production:
rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
"LOCAL_PATH_TO_DATA" \
ddmal.prod_virtuoso:/srv/virtuoso/my_virtdb/data/<DATABASE_NAME>
For staging:
rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
"LOCAL_PATH_TO_DATA" \
ddmal.staging_virtuoso:/srv/webapps/virtuoso/my_virtdb/data/<DATABASE_NAME>
And confirm it's on the server by running the following command on the server:
For production:
ls /srv/virtuoso/my_virtdb/data/<DATABASE_NAME>
For staging:
ls /srv/webapps/virtuoso/my_virtdb/data/<DATABASE_NAME>
From now on, all commands will be run on the server.
The credentials here are the same as what you use for Virtuoso Conductor on the web. Note that your Virtuoso account will need dba
permissions. Contact Liam, Dylan, or Hanwen for access or to make a new account.
docker exec -it my_virtdb isql 1111 <DBA_USERNAME>
Then enter your password.
Virtuoso keeps a list of Turtle files that it's loaded to avoid re-loading files, but we need to remove the files for the database that we want to update from that list so that they actually get imported. Run the following command in the isql
prompt to clear all files for the graph that you want to update:
DELETE FROM DB.DBA.LOAD_LIST WHERE ll_file LIKE '%/<DATABASE_NAME>/%';
Ensure that DATABASE_NAME
follows the same capitalization as the folder name for that database (ideally all lowercase). For DIAMM, this would be:
DELETE FROM DB.DBA.LOAD_LIST WHERE ll_file LIKE '%/diamm/%';
When you will run the bulk loader, Virtuoso will skip any triples that already exist in the database. If all you're doing is adding new triples, skip this step. However, if you're modifying any triples in the graph, you need to clear the graph on Virtuoso to get rid of the old data.
To do this, you'll need to grant yourself permission to update the graph. To give yourself this permission, run the following command:
DB.DBA.RDF_GRAPH_USER_PERMS_SET('<GRAPH_IRI>', '<DBA_USERNAME>', 7);
DBA_USERNAME
is the same username that you use on Conductor and that you used to enter the isql
shell in step 2. For DIAMM and for Simon's user, this would be:
DB.DBA.RDF_GRAPH_USER_PERMS_SET('https://linkedmusic.ca/graphs/diamm/', 'sngassam', 7);
Then, delete the graph:
log_enable(2, 1);
SPARQL CLEAR GRAPH <<GRAPH_IRI>>;
The log_enable
function temporarily disables the transaction logging and the manual commit, which prevents clearing the graph from taking up all the memory on the server. The second argument having a value of 1
means that the disabling will only affect the next function call. As such, these 2 commands need to be run back-to-back.
For DIAMM, this would be:
log_enable(2, 1);
SPARQL CLEAR GRAPH <https://linkedmusic.ca/graphs/diamm/>;
First, register the new data. If you get "Access denied," see Troubleshooting.
DB.DBA.ld_dir(
'/database/data/<DATABASE_NAME>', -- container path
'*.ttl', -- match all TTLs in that folder
NULL -- NULL ⇒ look for <DATABASE_NAME>.ttl.graph
);
Then, run the bulk loader
DB.DBA.rdf_loader_run();
Then, verify the status of the new database. Run the following command to see the status of all the files for the database. It will show each file, the named graph into which all triples were imported, and will show the import status. You should see a value of 2 in the ll_state
column for each file, this means that the file was fully imported. If you do not see a value of 2, you can add the ll_error
column to the selected columns to see any errors that arose.
SELECT ll_file, ll_graph, ll_state
FROM DB.DBA.load_list
WHERE ll_file LIKE '%/<NEW_DATABASE_NAME>/%';
Then, commit the loaded triples
checkpoint;
Run the following command to count the triples for the database, you should see a non-zero number.
SPARQL
SELECT (COUNT(*) AS ?count)
WHERE { GRAPH <<GRAPH_IRI>> { ?s ?p ?o } };
For example, for DIAMM, this would be:
SPARQL
SELECT (COUNT(*) AS ?count)
WHERE { GRAPH <https://linkedmusic.ca/graphs/diamm/> { ?s ?p ?o } };
This is likely because Virtuoso restricts directories via DirsAllowed
. To fix it, you need to modify virtuoso.ini
on the host.
Production:
sudo -i
nano /srv/virtuoso/my_virtdb/virtuoso.ini
Staging:
sudo -i
nano /srv/webapps/virtuoso/my_virtdb/virtuoso.ini
Then find the [Parameters]
section and make sure it has:
[Parameters]
...
DirsAllowed = ., ../vad, /usr/share/proj, /database, /database/data
...
Save (Ctrl+O
, Enter
, Ctrl+X
) and restart the Docker container:
docker restart my_virtdb
If this still doesn't work, try whitelisting my_virtdb
instead:
[Parameters]
...
DirsAllowed = ., ../vad, /usr/share/proj, /`my_virtdb`, /`my_virtdb`/data
...
This could happen for a number of reasons. However, it means that no RDF files were registered. Double check that your .graph
files are not missing, misnamed, or in the wrong place. The filenames must either be <NEW_DATABASE_NAME>.<EXTENSION>.graph
(i.e., musicbrainz.ttl.graph
) for per-file .graph
loading, or you must have a single global.graph
file in the root directory for your new database.
Note, you may need to add --rsync-path="sudo rsync"
to your rsync
command if it says you don't have the right permissions. If this is the case, let an admin know to create an issue for it.
For production:
rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
"LOCAL_PATH_TO_DATA" \
--rsync-path="sudo rsync" \
ddmal.prod_virtuoso:/srv/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>
For staging:
rsync -rtvz --chmod=Fa=r,ug+w -e ssh \
"LOCAL_PATH_TO_DATA" \
--rsync-path="sudo rsync" \
ddmal.staging_virtuoso:/srv/webapps/virtuoso/my_virtdb/data/<NEW_DATABASE_NAME>