# Virtuoso Setup Guide
The staging server (https://virtuoso.staging.simssa.ca) was set up according to the instructions below. For information on the server itself, see the DDMAL internal Wiki.
(See also the official Virtuoso Docker setup guide.)
- Pull the Docker image (line 1) and check the image version (optional, line 2):

  ```bash
  sudo docker pull openlink/virtuoso-opensource-7
  sudo docker run openlink/virtuoso-opensource-7 version
  ```
- Start a Docker container:

  ```bash
  sudo mkdir my_virtdb
  cd my_virtdb
  sudo docker run \
      --name my_virtdb \
      --interactive \
      --tty \
      --env DBA_PASSWORD=mysecret \
      --publish 1111:1111 \
      --publish 8890:8890 \
      --volume `pwd`:/database \
      openlink/virtuoso-opensource-7:latest
  ```
  This creates a new Virtuoso database in the `my_virtdb` subdirectory and starts a Virtuoso instance with the HTTP server listening on port 8890 and the ISQL data server listening on port 1111. Note that you should change `DBA_PASSWORD` to the desired password; this can be done before or after the configuration. A quick way to verify that the instance is up is sketched below.
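  Before moving on, it may be worth checking that the container is running and the HTTP server responds. A minimal sketch, assuming the container was started with the flags above:

  ```bash
  # Confirm the container is up and the ports are mapped
  sudo docker ps --filter name=my_virtdb

  # The HTTP server should answer on port 8890 once startup completes
  curl -I http://localhost:8890/
  ```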
- Get into the local instance:

  ```bash
  docker exec -it <docker id> bash
  ```

  The `<docker id>` can be retrieved by running `docker ps`.
- Download the data in compact JSON-LD to the local instance (a sanity check on the downloads is sketched after the script):

  ```bash
  #!/bin/bash
  # Download and rename files from different URLs
  wget -O simssadb.jsonld https://raw.githubusercontent.com/DDMAL/linkedmusic-datalake/main/simssadb/jsonld/compact.jsonld
  wget -O cantusdb.jsonld https://raw.githubusercontent.com/DDMAL/linkedmusic-datalake/main/cantusdb/jsonld/compact.jsonld
  ```
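  Before loading, it can help to confirm that both files arrived intact and parse as JSON (JSON-LD is plain JSON). A minimal sketch, assuming `python3` is available in the container:

  ```bash
  for f in simssadb.jsonld cantusdb.jsonld; do
      # File exists and is non-empty
      [ -s "$f" ] || { echo "$f missing or empty"; continue; }
      # Parses as JSON
      python3 -m json.tool "$f" > /dev/null && echo "$f OK"
  done
  ```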
- Upload the data.

  Open the ISQL CLI (use the correct username and password):

  ```bash
  isql -U dba -P mysecret
  ```

  Then load the JSON-LD files (see the Virtuoso documentation for the details of `rdf_load_json_ld()`; a quick verification of the loads is sketched below):

  ```sql
  rdf_load_json_ld (file_to_string('simssadb.jsonld'), '', 'urn:simssadb');
  rdf_load_json_ld (file_to_string('cantusdb.jsonld'), '', 'urn:cantusdb');
  ```
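  To confirm the loads succeeded, you can count the triples in each named graph from the same isql session. A minimal sketch; a non-zero count indicates the graph was populated:

  ```sql
  SPARQL SELECT (COUNT(*) AS ?triples) FROM <urn:simssadb> WHERE { ?s ?p ?o };
  SPARQL SELECT (COUNT(*) AS ?triples) FROM <urn:cantusdb> WHERE { ?s ?p ?o };
  ```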
- Go to the local server at http://localhost:8890/ and log into Conductor using:

  - username: `dba`
  - password: `mysecret`
- Go to System Admin > Packages. Download `conductor`, `fct`, `iSPARQL`, and `rdf_mappers` (download `rdf_mappers` [here](http://download3.openlinksw.com/uda/vad-vos-packages/7.2/rdf_mappers_dav.vad) and install it from upload). You can find the rest of the packages here if not previously installed.
- Check that faceted search works at http://localhost:8890/fct/. Try SPARQL at http://localhost:8890/sparql/ (a command-line check of the endpoint is sketched below).
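  The SPARQL endpoint can also be exercised from the shell. A minimal sketch using `curl`, assuming the port mapping from the container setup above:

  ```bash
  curl -sG http://localhost:8890/sparql \
      --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 5' \
      -H 'Accept: application/sparql-results+json'
  ```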
- Configure data and permissions.

  Open the ISQL CLI and run:

  ```sql
  -- Permission for sponging (optional)
  -- see https://github.com/openlink/virtuoso-opensource/issues/1180
  DB.DBA.RDF_DEFAULT_USER_PERMS_SET ('SPARQL', 7);
  DB.DBA.RDF_DEFAULT_USER_PERMS_SET ('nobody', 7);

  -- Post-installation setup for the Virtuoso Faceted Browser
  -- see: https://vos.openlinksw.com/owiki/wiki/VOS/VirtFacetBrowserInstallConfig#Post%20Installation
  RDF_OBJ_FT_RULE_ADD (null, null, 'All');
  VT_INC_INDEX_DB_DBA_RDF_OBJ ();
  urilbl_ac_init_db();
  s_rank();

  -- For federated SPARQL query search, see https://community.openlinksw.com/t/sparql-federated-query/4162/4
  grant execute on "DB.DBA.SPARQL_SINV_IMP" to "SPARQL";
  grant select on "DB.DBA.SPARQL_SINV_2" to "SPARQL";

  -- Grant privileges to user "SPARQL", might not be needed??
  -- TODO: See if this is actually needed
  grant SPARQL_SELECT to "SPARQL";
  grant SPARQL_UPDATE to "SPARQL";
  grant SPARQL_SPONGE to "SPARQL";
  ```
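  After granting the federated-query permissions, one way to confirm they took effect is to run a small SERVICE query against a public endpoint. A minimal sketch; the Wikidata endpoint and entity used here are only convenient examples:

  ```sql
  SPARQL
  SELECT ?label
  WHERE {
    SERVICE <https://query.wikidata.org/sparql> {
      <http://www.wikidata.org/entity/Q1339>
        <http://www.w3.org/2000/01/rdf-schema#label> ?label .
      FILTER (lang(?label) = "en")
    }
  };
  ```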
  Note: Make sure to rerun these two lines after loading a new JSON-LD file (to rebuild the full-text index and the entity label table):

  ```sql
  VT_INC_INDEX_DB_DBA_RDF_OBJ ();
  urilbl_ac_init_db();
  ```
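  Since these calls need to be rerun after every load, it may be convenient to script them. A sketch using isql's `exec=` argument (assuming your isql build accepts it, and the dba password from the setup above):

  ```bash
  # Rebuild the full-text index and entity label table non-interactively
  isql -U dba -P mysecret exec="VT_INC_INDEX_DB_DBA_RDF_OBJ(); urilbl_ac_init_db();"
  ```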
- From Conductor, navigate to Linked Data > Namespaces to view the list of configured prefixes and to add your own:

  - `wd:` http://www.wikidata.org/entity/
  - `wdt:` http://www.wikidata.org/prop/direct/
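  The same prefixes can also be registered from ISQL instead of through the Conductor UI, using Virtuoso's `XML_SET_NS_DECL` function. A sketch; the final argument `2` makes the declarations persistent across restarts:

  ```sql
  DB.DBA.XML_SET_NS_DECL ('wd',  'http://www.wikidata.org/entity/',      2);
  DB.DBA.XML_SET_NS_DECL ('wdt', 'http://www.wikidata.org/prop/direct/', 2);
  ```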
> **Note:** The current Virtuoso staging instance doesn't sponge external information. This documentation is kept here in case we decide to do so in the future.

Sponging retrieves external RDF data that can be reached from the loaded JSON-LD (i.e., Wikidata RDF). After discussing with Ich, this might or might not be what we want. (See the Virtuoso documentation for more about sponging.)
In interactive SQL (ISQL), run the following (change the grab-depth and grab-limit as needed):

```sql
SPARQL
define input:grab-all "yes"
define input:grab-depth 2
define input:grab-limit 100
SELECT *
FROM NAMED <urn:test>
WHERE { GRAPH ?g { ?s ?p ?o } };
```
What the query above does: upon execution, new named graphs appear in your local Virtuoso instance, named after instances from the `<urn:test>` graph. For every instance that is an accessible URL (i.e., a visitable web page), the sponger fetches the URLs linked from that page and converts them into RDF stored in the new named graphs.
To focus on sponging Wikidata fields:

```sql
SPARQL
define input:grab-all "yes"
define input:grab-depth 5
define input:grab-limit 20
SELECT ?s ?p ?o
FROM NAMED <urn:test>
WHERE {
  GRAPH ?g {
    ?s ?p ?o .
    FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/"))
  }
};
```
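To sponge a single known resource rather than crawling outward from a graph, Virtuoso also supports the `get:soft` pragma, which fetches and caches a remote document on demand. A minimal sketch; the Wikidata entity here is only an example:

```sql
SPARQL
define get:soft "soft"
SELECT *
FROM <http://www.wikidata.org/entity/Q1339>
WHERE { ?s ?p ?o }
LIMIT 10;
```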
Below is a complete list of modifications made to the default `virtuoso.ini` file on the production server. Staging is using default settings as of 18 June 2025 (with the exception of the `DirsAllowed` change). Referenced documentation was found on this page; however, it appears to be significantly out of date: the default `virtuoso.ini` file shown in the documentation is much smaller than the default that was on production, some of the parameters have been changed or removed, and many of the default values are different. A sketch of how these overrides might look in the file itself follows the table.
| Parameter | Default Value | New Value | Reason |
|---|---|---|---|
| `DirsAllowed` | `., ../vad, /usr/share/proj` | `., ../vad, /usr/share/proj, /database, /database/data` | In order to use the bulk loader, you must enable access to the database directory. Resolves the "access denied" issue when running `ld_dir`. See this Wiki page. |
| `NumberOfBuffers` | `10000` | `400000` | `virtuoso.ini` suggests the following: when running with large data sets, one should configure the Virtuoso process to use between 2/3 to 3/5 of free system memory. For the 6 GB we have available on production, this would be 510000 (linearly interpolating the suggested values for 4 GB and 8 GB in `virtuoso.ini`). Resolves #383. Further reduced to 400000 to reduce memory usage. See #392. |
| `MaxDirtyBuffers` | `6000` | `300000` | Same `virtuoso.ini` guidance as above. For the 6 GB we have available on production, this would be 375000 (linearly interpolating the suggested values for 4 GB and 8 GB in `virtuoso.ini`). Resolves #383. Further reduced to 300000 to reduce memory usage. See #392. |
| `FileExtend` | `100` | `5000` | Increased `FileExtend` to improve performance during database growth. This reduces the frequency of small I/O operations by allocating additional space in larger 40 MB chunks (8 KB per buffer), which is more efficient for large or growing RDF datasets. |
| `CheckpointAuditTrail` | `0` | `1` | Enabled `CheckpointAuditTrail` to ensure that each checkpoint generates a new transaction log file, preserving a complete history of database changes. This provides a reliable audit trail and improves recovery options in the event of system failure or data corruption. |
| `FreeTextBatchSize` | `100000` | `10000000` | `FreeTextBatchSize` controls how much text data (in bytes) is processed per indexing batch. Increased to allow larger chunks of text data to be indexed per batch during full-text indexing, reducing overhead and improving performance for large RDF loads and reindexing operations. Increase further to speed up indexing at the cost of RAM. ChatGPT 4o recommended 30 MB instead of 10 MB, but we're generally tight on RAM and don't care too much about speed, so this number was lowered. |
| `AdjustVectorSize` | `0` | `1` | Enabled `AdjustVectorSize` to allow Virtuoso to dynamically increase the number of rows processed per batch during query execution. This improves performance in large or fragmented queries by reducing random I/O and increasing cache and disk locality, even when using a single disk. It allows the engine to respond adaptively to the data access pattern without wasting resources on small queries. |
| `HTTPLogFile` | `logs/http.log` (commented out) | `logs/http.log` | This enables logging to `logs/http.log`. This is the default path, although it is commented out by default. |
| `HTTPLogFormat` | N/A (new variable) | `%h %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" "%{NetId}e"` | This logging format is the default suggested in the documentation. |
| `SQL_PREFETCH_ROWS` | `100` | `1000` | The maximum number of rows the server will send in response to a single fetch call. For example, if a query returns 5000 rows, the client will now send 5 requests of 1000 rows instead of 50 requests of 100. This should be adjusted once we know how large the average query is. |
| `SQL_PREFETCH_BYTES` | `16000` | `131072` (~128 KB) | Same as `SQL_PREFETCH_ROWS` above, but for bytes. |
| `MaxQueryExecutionTime` | `60` (seconds) | `900` | Maximum execution time for one query. Increased to 15 minutes since queries may be large. |
| `MaxMemInUse` | `0` (unlimited) | `1000000000` | 1 GB is an arbitrarily large but bounded number that caps the size of result structures (e.g. intermediate hash tables or CONSTRUCT dictionaries). `MaxQueryExecutionTime` would likely kick in first, but this should be bounded just in case. |
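Taken together, the overrides above would appear in `virtuoso.ini` roughly as follows. This is a sketch: section placement should be verified against the shipped file, since the documentation and the file have drifted apart.

```ini
; Sketch of the production overrides described in the table above
[Parameters]
DirsAllowed          = ., ../vad, /usr/share/proj, /database, /database/data
NumberOfBuffers      = 400000
MaxDirtyBuffers      = 300000
FileExtend           = 5000
CheckpointAuditTrail = 1
FreeTextBatchSize    = 10000000
AdjustVectorSize     = 1
MaxMemInUse          = 1000000000

[SPARQL]
MaxQueryExecutionTime = 900

[HTTPServer]
HTTPLogFile   = logs/http.log
HTTPLogFormat = %h %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" "%{NetId}e"

[Client]
SQL_PREFETCH_ROWS  = 1000
SQL_PREFETCH_BYTES = 131072
```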
Below is a list of parameters that were not modified from the default, but could be considered in the future.
| Parameter | Default Value | Reason |
|---|---|---|
| `MaxClientConnections` | `10` | A maximum of 10 users can connect through SPARQL, HTTP, or SQL at once. |
| `ServerThreads` | `10` | Same as `MaxClientConnections` above. |
| `O_DIRECT` | `0` | Potential performance improvements. See #388. |
| `IndexTreeMaps` | `64` | Potential performance improvements. See #389. |
| `ResultSetMaxRows` | `10000` | Cuts off results at this value. Increase to allow users to make very large queries. |
| `MaxConstructTriples` | `10000` | Similar to the above, restricts the maximum size of a CONSTRUCT result. |
| `MaxQueryCostEstimationTime` | `400` (seconds) | Caps how long Virtuoso will spend estimating a query's cost before execution. Reduce to reject costly queries faster; raise if we have very complex federated/inference rules that legitimately take longer to plan. |
| `SQL_QUERY_TIMEOUT` | `0` (unlimited) | Same as `MaxQueryExecutionTime` above (adjusted from 60 to 900 seconds), but client-side. Leaving it as unlimited because `MaxQueryExecutionTime` should kick in first. Don't feel strongly either way about this one. |