Virtuoso Setup Guide - DDMAL/linkedmusic-datalake GitHub Wiki

Current Staging Instance

The staging server (https://virtuoso.staging.simssa.ca) was set up according to the instructions below. For information on the server itself, see the DDMAL internal Wiki.

Set up docker

(official Virtuoso Docker setup guide here)

  1. Pull the Docker image (line 1) and optionally check the image version (line 2).

    sudo docker pull openlink/virtuoso-opensource-7
    sudo docker run openlink/virtuoso-opensource-7 version
  2. Start a docker container.

    sudo mkdir my_virtdb
    cd my_virtdb
    sudo docker run \
        --name my_virtdb \
        --interactive \
        --tty \
        --env DBA_PASSWORD=mysecret \
        --publish 1111:1111 \
        --publish 8890:8890 \
        --volume `pwd`:/database \
        openlink/virtuoso-opensource-7:latest

This creates a new Virtuoso database in the my_virtdb subdirectory and starts a Virtuoso instance with the HTTP server listening on port 8890 and the ISQL data server listening on port 1111.

Note that you should change DBA_PASSWORD to your desired password.

Add data to the local instance

This can be done before or after the configuration.

  1. Get into the local instance.

    docker exec -it <container id> bash

The <container id> can be retrieved by running docker ps.

  2. Download the compact JSON-LD data to the local instance.

    #!/bin/bash

    # Download and rename the compact JSON-LD files from the data lake repository
    wget -O simssadb.jsonld https://raw.githubusercontent.com/DDMAL/linkedmusic-datalake/main/simssadb/jsonld/compact.jsonld
    wget -O cantusdb.jsonld https://raw.githubusercontent.com/DDMAL/linkedmusic-datalake/main/cantusdb/jsonld/compact.jsonld

  3. Upload data.

Open the isql CLI (use the correct username and password):

isql -U dba -P mysecret

Then load the JSON-LD files (see details of rdf_load_json_ld() here):

rdf_load_json_ld (file_to_string('simssadb.jsonld'),'', 'urn:simssadb');
rdf_load_json_ld (file_to_string('cantusdb.jsonld'),'', 'urn:cantusdb');
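To verify the load, you can count the triples per named graph from the same ISQL session. A minimal sketch (it assumes only the urn: graph names used above):

    SPARQL SELECT ?g (COUNT(*) AS ?triples) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g;

Each graph loaded above should appear with a non-zero count.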

Add packages to Virtuoso

  1. Go to the local server at http://localhost:8890/. Log into Conductor using:

    username: dba
    password: mysecret

  2. Go to System Admin > Packages. Install conductor, fct, iSPARQL, and rdf_mappers (download rdf_mappers [here](http://download3.openlinksw.com/uda/vad-vos-packages/7.2/rdf_mappers_dav.vad) and install it from upload). You can find the rest of the packages here if not previously installed.

  3. Check that faceted search works at http://localhost:8890/fct/. Try SPARQL at http://localhost:8890/sparql/.

  4. Configure data and permissions.

    Open the ISQL CLI:

    -- Permission for Sponging (optional)
    -- see https://github.com/openlink/virtuoso-opensource/issues/1180
    
    DB.DBA.RDF_DEFAULT_USER_PERMS_SET ('SPARQL', 7); 
    DB.DBA.RDF_DEFAULT_USER_PERMS_SET ('nobody', 7); 
    
    -- Post Installation Setup for Virtuoso Faceted Browser
    -- see: https://vos.openlinksw.com/owiki/wiki/VOS/VirtFacetBrowserInstallConfig#Post%20Installation
    RDF_OBJ_FT_RULE_ADD (null, null, 'All');
    VT_INC_INDEX_DB_DBA_RDF_OBJ ();
    urilbl_ac_init_db();
    s_rank();
    
    -- For federated SPARQL query search, see https://community.openlinksw.com/t/sparql-federated-query/4162/4
    grant execute on "DB.DBA.SPARQL_SINV_IMP" to "SPARQL";
    grant select on "DB.DBA.SPARQL_SINV_2" to "SPARQL";
    
    -- Grant privileges to user "SPARQL"; this might not be needed
    -- TODO: See if this is actually needed
    grant SPARQL_SELECT to "SPARQL";
    grant SPARQL_UPDATE to "SPARQL";
    grant SPARQL_SPONGE to "SPARQL";
    
    

Note: Make sure to rerun the following lines after loading a new JSON-LD file (they rebuild the full-text index and the entity label table):

    VT_INC_INDEX_DB_DBA_RDF_OBJ ();
    urilbl_ac_init_db();

Other configurations:

1. Add namespace prefixes to facilitate SPARQL queries

  • From Conductor, navigate to "Linked Data" > "Namespaces" to view the list of configured prefixes and to add your own, e.g.:

wd: http://www.wikidata.org/entity/
wdt: http://www.wikidata.org/prop/direct/
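With the prefixes registered server-wide, queries no longer need PREFIX declarations. A hypothetical sketch (wdt:P31 is only an illustration; substitute whichever Wikidata properties actually occur in the loaded graphs):

    SPARQL
    SELECT ?s ?o
    FROM <urn:cantusdb>
    WHERE { ?s wdt:P31 ?o }
    LIMIT 10;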

2. Sponger

Optional: sponge URLs within the JSON-LD.

Note: The current Virtuoso staging instance does not sponge external information. This documentation is kept in case we decide to do so in the future.

This is for retrieving external RDF data that can be reached from the loaded JSON-LD (e.g. Wikidata RDF). After discussing with Ich, this might or might not be what we want.

(See more about sponging here)

In interactive SQL (ISQL), run the following (adjust the grab-depth and grab-limit values as needed):

SPARQL
define input:grab-all "yes" define input:grab-depth 2 define input:grab-limit 100
SELECT * 
FROM NAMED <urn:test>
WHERE { GRAPH ?g { ?s ?p ?o } };

What the query above does:

Upon execution, new named graphs appear in the local Virtuoso instance, named after resources found in the <urn:test> graph. For each resource that is a dereferenceable URL (i.e. a visitable web page), the sponger fetches the URLs linked from that page, converts the retrieved documents into RDF, and stores the results in the new named graphs.

To focus on sponging Wikidata fields:

SPARQL
define input:grab-all "yes"
define input:grab-depth 5
define input:grab-limit 20

SELECT ?s ?p ?o
FROM NAMED <urn:test>
WHERE {
  GRAPH ?g {
    ?s ?p ?o .
    FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/"))
  }
};

virtuoso.ini Configurations

Below is a complete list of modifications made to the default virtuoso.ini file on the production server. Staging uses the default settings as of 18 June 2025 (with the exception of the DirsAllowed change). The referenced documentation was found on this page; however, it appears to be significantly out of date: the default virtuoso.ini file shown in the documentation is much smaller than the default that shipped on production, some parameters have been changed or removed, and many of the default values differ.

| Parameter | Default Value | New Value | Reason |
| --- | --- | --- | --- |
| DirsAllowed | ., ../vad, /usr/share/proj | ., ../vad, /usr/share/proj, /database, /database/data | In order to use the bulk loader, you must enable access to the database directory. Resolves an "access denied" issue when running ld_dir. See this Wiki page. |
| NumberOfBuffers | 10000 | 400000 | virtuoso.ini suggests that, when running with large data sets, the Virtuoso process should be configured to use between 2/3 and 3/5 of free system memory. For the 6 GB available on production, this would be 510000 (linearly interpolating the values suggested in virtuoso.ini for 4 GB and 8 GB). Resolves #383. Further reduced to 400000 to lower memory usage; see #392. |
| MaxDirtyBuffers | 6000 | 300000 | Same guidance as NumberOfBuffers: for the 6 GB available on production, linear interpolation gives 375000. Resolves #383. Further reduced to 300000 to lower memory usage; see #392. |
| FileExtend | 100 | 5000 | Increased to improve performance during database growth. The database file is now extended in larger 40 MB chunks (8 KB per buffer), reducing the frequency of small I/O operations, which is more efficient for large or growing RDF datasets. |
| CheckpointAuditTrail | 0 | 1 | Enabled so that each checkpoint generates a new transaction log file, preserving a complete history of database changes. This provides a reliable audit trail and improves recovery options in the event of system failure or data corruption. |
| FreeTextBatchSize | 100000 | 10000000 | Controls how much text data (in bytes) is processed per indexing batch. Increased to allow larger chunks of text to be indexed per batch during full-text indexing, reducing overhead and improving performance for large RDF loads and reindexing operations. Increase further to speed up indexing at the cost of RAM. ChatGPT 4o recommended 30 MB instead of 10 MB, but we are generally tight on RAM and not very concerned about speed, so the number was lowered. |
| AdjustVectorSize | 0 | 1 | Enabled to let Virtuoso dynamically increase the number of rows processed per batch during query execution. This improves performance for large or fragmented queries by reducing random I/O and increasing cache and disk locality, even on a single disk, while responding adaptively to the access pattern without wasting resources on small queries. |
| HTTPLogFile | logs/http.log (commented out) | logs/http.log | Enables logging to logs/http.log. This is the default path, although it is commented out by default. |
| HTTPLogFormat | N/A (new variable) | %h %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" "%{NetId}e" | This logging format is the default suggested in this page. |
| SQL_PREFETCH_ROWS | 100 | 1000 | The maximum number of rows the server will send in response to a single fetch call. For example, if a query returns 5000 rows, the client now sends 5 requests of 1000 rows instead of 50 requests of 100. This should be adjusted once we know how large the average query is. |
| SQL_PREFETCH_BYTES | 16000 | 131072 (~128 KB) | Same as SQL_PREFETCH_ROWS above, but for bytes. |
| MaxQueryExecutionTime | 60 (seconds) | 900 | Maximum execution time for one query. Increased to 15 minutes since queries may be large. |
| MaxMemInUse | 0 (unlimited) | 1000000000 | 1 GB is an arbitrarily large but bounded cap on the size of result structures (e.g. intermediate hash tables or CONSTRUCT dictionaries). MaxQueryExecutionTime would likely kick in first, but this should be bounded just in case. |
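For reference, the DirsAllowed change above is what enables the standard bulk-loader workflow. A minimal ISQL sketch, assuming RDF files have been copied into /database/data (the file mask and graph IRI are illustrative only):

    -- Register all Turtle files in an allowed directory for bulk loading
    ld_dir('/database/data', '*.ttl', 'urn:mygraph');
    -- Run the loader (repeat in parallel sessions to speed up large loads)
    rdf_loader_run();
    -- Persist the loaded data
    checkpoint;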

Below is a list of parameters that were not modified from the default, but could be considered in the future.

| Parameter | Default Value | Reason |
| --- | --- | --- |
| MaxClientConnections | 10 | A maximum of 10 users can connect through SPARQL, HTTP, or SQL at once. |
| ServerThreads | 10 | Same as MaxClientConnections above. |
| O_DIRECT | 0 | Potential performance improvements. See #388. |
| IndexTreeMaps | 64 | Potential performance improvements. See #389. |
| ResultSetMaxRows | 10000 | Cuts off results at this value. Increase to allow users to make very large queries. |
| MaxConstructTriples | 10000 | Similar to the above; restricts the maximum size of a CONSTRUCT result. |
| MaxQueryCostEstimationTime | 400 (seconds) | Caps how long Virtuoso will spend estimating a query's cost before execution. Reduce to reject costly queries faster; raise if we have very complex federated/inference rules that legitimately take longer to plan. |
| SQL_QUERY_TIMEOUT | 0 (unlimited) | Same as MaxQueryExecutionTime above (adjusted from 60 to 900 seconds) but client-side. Left unlimited because MaxQueryExecutionTime should kick in first. Don't feel strongly either way about this one. |
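To check which values the running server actually picked up, the ini file can be read back from ISQL. A minimal sketch (assuming the cfg_item_value and virtuoso_ini_path functions are available, as in stock Virtuoso builds):

    -- Read a parameter from the server's virtuoso.ini
    SELECT cfg_item_value(virtuoso_ini_path(), 'Parameters', 'NumberOfBuffers');
    -- status() reports, among other things, live buffer usage
    status();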