InstallTutorial - NatLibFi/Skosmos GitHub Wiki

In this tutorial we will install Skosmos on an Ubuntu or Rocky Linux Server machine. The goal is to get Skosmos 2.18 running together with an Apache Jena Fuseki 4.6.1 triple store on the same machine and serving two example vocabularies that are available as SKOS files: STW Thesaurus and UNESCO Thesaurus.

SSH access to virtual host

You can skip this if you're not using a VirtualBox (or similar) VM environment. For convenience (e.g. so that copy/paste works in a terminal window), install the SSH server package on the virtual machine:

$ sudo apt install openssh-server

Since the VM uses NAT networking by default, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="ssh", Protocol="TCP", Host Port="2222", Guest Port="22", other fields can be left blank. Confirm with OK. This can be done also while the VM is running.

After these steps you can SSH into the virtual machine from the host using ssh -p 2222 localhost.
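The same forwarding rule can also be added from the host's command line with VBoxManage instead of the settings dialog. The sketch below only assembles the rule string; natpf_rule is a hypothetical helper (not part of VirtualBox), the VM name "skosmos-vm" is a placeholder, and the exact option spelling may differ between VirtualBox versions.

```shell
# Build a VirtualBox NAT port-forwarding rule of the form
#   name,protocol,host-ip,host-port,guest-ip,guest-port
# Host IP and guest IP are left blank, as in the GUI instructions.
# natpf_rule is a hypothetical helper, not a VirtualBox command.
natpf_rule() {
    printf '%s,tcp,,%s,,%s\n' "$1" "$2" "$3"
}

# Usage on the host (the VM name is a placeholder):
#   VBoxManage modifyvm  "skosmos-vm" --natpf1 "$(natpf_rule ssh 2222 22)"   # VM powered off
#   VBoxManage controlvm "skosmos-vm" natpf1   "$(natpf_rule ssh 2222 22)"   # VM running
```

The helper works equally for the other forwarding rules used in this tutorial, e.g. natpf_rule fuseki 3030 3030 or natpf_rule apache 8000 80.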

Install Apache Jena Fuseki

The following installation is also bundled in an unofficial Debian package.

Install Java 11

First we will need a Java 11 environment (JRE is enough for Fuseki but you can also use a JDK).

$ sudo apt update
$ sudo apt install default-jre-headless

This will install a number of packages and take a while. Verify that Java is installed by running java -version. It should return information about the Java environment. Check that it's Java 11, i.e. version 11.0.something. If you get another version such as 1.8.0, it means you still have an older Java installed. Either remove the older Java packages, or set Java 11 as the default version:

# show the available Java versions
$ sudo update-java-alternatives -l
# set the version to use in Ubuntu
$ sudo update-java-alternatives --set java-1.11.0-openjdk-amd64
# set the version to use in Rocky Linux
$ sudo update-java-alternatives --set jre-11-openjdk
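The version check can also be scripted. The following is a minimal sketch, assuming the usual java -version format with the version number in double quotes; java_major_version is a hypothetical helper name.

```shell
# Print the major Java version found in a `java -version` line.
# $1: the version line, e.g.  openjdk version "11.0.17" 2022-10-18
# java_major_version is a hypothetical helper for illustration.
java_major_version() {
    printf '%s\n' "$1" | awk -F'"' '/version/ {
        split($2, parts, ".")
        # Legacy numbering: 1.8.0 means Java 8
        if (parts[1] == "1") print parts[2]; else print parts[1]
        exit
    }'
}

# Usage (java prints its version on stderr):
#   java_major_version "$(java -version 2>&1 | head -n 1)"
```

If the printed number is not 11, switch the default Java version as described above.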

Install Fuseki

Fuseki is distributed in a tar.gz archive containing everything. We will download it from apache.org to the user home directory and unpack it under /opt. We will also create a symbolic link (simply called /opt/fuseki) to the current version, which will make it easier to upgrade Fuseki in the future by simply changing the symlink to point to the new version.

$ cd ~
$ wget https://archive.apache.org/dist/jena/binaries/apache-jena-fuseki-4.6.1.tar.gz
$ cd /opt
$ sudo tar xzf ~/apache-jena-fuseki-4.6.1.tar.gz
$ sudo ln -s apache-jena-fuseki-4.6.1 fuseki

Now check that Fuseki starts up:

$ cd /opt/fuseki/
$ ./fuseki-server --help
$ ./fuseki-server --version

If everything works right, these commands should give information about supported command line options and version information.

Create a Fuseki system user

We want to run Fuseki as a non-root user for better security, so we create a system user called fuseki.

$ sudo adduser --system --home /opt/fuseki --no-create-home fuseki

Create directories for Fuseki configuration and databases

The default Fuseki file system layout is mainly aimed at standalone installs. However, for a server install, following the Filesystem Hierarchy Standard (FHS) layout makes sense as it makes e.g. system backups easier. So we will split the Fuseki files into separate system directories so that we get a layout that at least mostly resembles FHS:

  • Fuseki code (the server distribution) goes into /opt/fuseki, as above (actually a symlink)
  • databases go under /var/lib/fuseki
  • log files go under /var/log/fuseki
  • configuration files go under /etc/fuseki

This needs a bit of manual setting up but it's worth the effort in the long run.

# create the database directories
$ cd /var/lib
$ sudo mkdir -p fuseki/{backups,databases,system,system_files}
$ sudo chown -R fuseki fuseki

# create the log directories
$ cd /var/log
$ sudo mkdir fuseki
$ sudo chown fuseki fuseki

# create the configuration directories
$ cd /etc
$ sudo mkdir fuseki
$ sudo chown fuseki fuseki

# finally create symlinks for databases and logs within the configuration directory
$ cd /etc/fuseki
$ sudo ln -s /var/lib/fuseki/* .
$ sudo ln -s /var/log/fuseki logs
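Before moving on, it can help to confirm that every directory of the layout exists. This is a small sketch with a hypothetical helper, missing_fuseki_dirs, which lists the expected directories that are absent; the optional root prefix argument is an assumption added only to allow dry runs against a scratch directory.

```shell
# List the expected Fuseki directories that do not exist.
# $1: optional root prefix (hypothetical, for dry runs); defaults to empty, i.e. /
missing_fuseki_dirs() {
    root="${1:-}"
    for dir in /var/lib/fuseki/backups /var/lib/fuseki/databases \
               /var/lib/fuseki/system /var/lib/fuseki/system_files \
               /var/log/fuseki /etc/fuseki; do
        # -d follows symlinks, so symlinked directories count as present
        [ -d "$root$dir" ] || printf '%s\n' "$dir"
    done
}

# Usage: no output means the layout is complete
#   missing_fuseki_dirs
```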

Make Fuseki start automatically at boot

We want to have Fuseki always running. To do so, we will need to create and configure a systemd script.

Create the systemd script

To make Fuseki use the above directories, we will create a file /etc/systemd/system/fuseki.service with this content:

[Unit]
Description=Fuseki
[Service]
Environment=FUSEKI_HOME=/opt/fuseki
Environment=FUSEKI_BASE=/etc/fuseki
Environment=JVM_ARGS=-Xmx4G
User=fuseki
ExecStart=/opt/fuseki/fuseki-server
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target

The JVM_ARGS line with the -Xmx parameter sets the maximum heap size of the Java Virtual Machine in which Fuseki runs. The default is often too low. The right value depends on the amount of data you have, how you load it, and what else the server is doing, but giving Fuseki around half the available RAM is generally a good starting point. Here we have set it to 4 GB. STW Thesaurus and UNESCO Thesaurus are quite small, so the default amount would suffice in this case.
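As a rough illustration of the "half the available RAM" rule of thumb, the value can be derived from the MemTotal figure in /proc/meminfo. suggest_xmx is a hypothetical helper; treat the result as a starting point, not a tuned setting.

```shell
# Suggest an -Xmx setting of about half the given RAM.
# $1: total memory in kB, as reported by MemTotal in /proc/meminfo
# suggest_xmx is a hypothetical helper for illustration.
suggest_xmx() {
    mem_kb="$1"
    echo "-Xmx$(( mem_kb / 2 / 1024 ))M"
}

# Usage:
#   suggest_xmx "$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
```

The output can be used as the JVM_ARGS value in the service file.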

Check that Fuseki starts up using the systemd script

Now we test that we can start Fuseki using the above systemd script and configuration.

$ sudo systemctl start fuseki

If everything worked, the command sudo systemctl status fuseki will show that Fuseki has started and is running. If there are problems, check the log file /var/log/fuseki/stderrout.log or run sudo journalctl -xe for more details.
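Fuseki may need a few seconds after the service starts before it answers HTTP requests, so a scripted setup might poll it rather than assume it is ready. Below is a generic retry sketch; retry is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary. The /$/ping URL is Fuseki's ping endpoint.

```shell
# Run a command until it succeeds.
# $1: maximum number of attempts, $2: seconds to sleep between attempts,
# remaining arguments: the command to run.
# retry is a hypothetical helper for illustration.
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while ! "$@"; do
        # Give up once the last allowed attempt has failed
        [ "$n" -ge "$attempts" ] && return 1
        n=$(( n + 1 ))
        sleep "$delay"
    done
    return 0
}

# Usage: wait up to about 30 seconds for Fuseki to respond
#   retry 10 3 curl -sf -o /dev/null 'http://localhost:3030/$/ping'
```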

Add Fuseki as a system service

With the systemd script working, we can enable running Fuseki as a system service using the following command:

$ sudo systemctl enable fuseki

This hooks up the necessary symlinks. To make sure, you can verify that it works by rebooting the machine and checking that the Fuseki process exists after booting, for example using the command ps ax|grep fuseki which should list the Java process of Fuseki.

Create and load vocabularies database

Create and configure a database and text index

There are two ways of creating the Fuseki database: using the web interface, or from the command line.

A. Creating the database using the Fuseki web interface

We can open a browser on http://localhost:3030/ to access the Fuseki web user interface.

Note that if you are running Fuseki within a VirtualBox VM and want to use the browser from the host machine, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="fuseki", Protocol="TCP", Host Port="3030", Guest Port="3030", other fields can be left blank. Confirm with OK. This can also be done while the VM is running. You will also need to tell Fuseki to allow management operations for non-localhost access by commenting out the line /$/** = localhostFilter in the security configuration /etc/fuseki/shiro.ini and restarting Fuseki. Note that this is potentially dangerous if you open up Fuseki URLs to the world, since anyone will then be able to manage your datasets.

Use the user interface to create a dataset with these options:

  • name: skosmos
  • type: persistent (TDB2)

This creates a Jena TDB2 database under the directory /var/lib/fuseki/databases/skosmos and its configuration file /etc/fuseki/configuration/skosmos.ttl.

B. Creating the database from the command line

Fuseki2 has an administration protocol that we can use to create the dataset using e.g. the curl command line tool:

curl --data "dbName=skosmos&dbType=tdb2" 'http://localhost:3030/$/datasets'

If you get no error, the operation was successful. To verify, you can check that the directory /var/lib/fuseki/databases/skosmos/ exists.

Creating a text index

The newly created dataset doesn't have a text index. Before we load any data, we should create a text index.

First we need to shut down Fuseki temporarily:

$ sudo service fuseki stop

Then we edit the database configuration file /etc/fuseki/configuration/skosmos.ttl to look like this:

@prefix :      <http://base/#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix text:  <http://jena.apache.org/text#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .

ja:DatasetTxnMem  rdfs:subClassOf  ja:RDFDataset .
ja:MemoryDataset  rdfs:subClassOf  ja:RDFDataset .
ja:RDFDatasetOne  rdfs:subClassOf  ja:RDFDataset .
ja:RDFDatasetSink  rdfs:subClassOf  ja:RDFDataset .
ja:RDFDatasetZero  rdfs:subClassOf  ja:RDFDataset .

tdb2:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .

tdb2:GraphTDB  rdfs:subClassOf  ja:Model .
tdb2:GraphTDB2  rdfs:subClassOf  ja:Model .

<http://jena.hpl.hp.com/2008/tdb#DatasetTDB>
    rdfs:subClassOf  ja:RDFDataset .

<http://jena.hpl.hp.com/2008/tdb#GraphTDB>
    rdfs:subClassOf  ja:Model .

text:TextDataset
    rdfs:subClassOf  ja:RDFDataset .

:service_tdb_all  a               fuseki:Service ;
    rdfs:label                    "TDB2+text skosmos" ;
    fuseki:dataset                :text_dataset ;
    fuseki:name                   "skosmos" ;
    fuseki:serviceQuery           "query" , "" , "sparql" ;
    fuseki:serviceReadGraphStore  "get" ;
    fuseki:serviceReadQuads       "" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceReadWriteQuads  "" ;
    fuseki:serviceUpdate          "" , "update" ;
    fuseki:serviceUpload          "upload" .

:text_dataset a text:TextDataset ;
    text:dataset :tdb_dataset_readwrite ;
    text:index :index_lucene . 

:tdb_dataset_readwrite
    a tdb2:DatasetTDB2 ;
    # tdb2:unionDefaultGraph true ;
    tdb2:location  "/etc/fuseki/databases/skosmos" .

:index_lucene a text:TextIndexLucene ;
    text:directory <file:/etc/fuseki/databases/skosmos/text> ;
    text:entityMap :entity_map ;
    text:storeValues true .

# Text index configuration for Skosmos
:entity_map a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "pref" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:map (
         # skos:prefLabel
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
         # skos:altLabel
         [ text:field "alt" ;
           text:predicate skos:altLabel ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
         # skos:hiddenLabel
         [ text:field "hidden" ;
           text:predicate skos:hiddenLabel ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
         # skos:notation
         [ text:field "notation" ;
           text:predicate skos:notation ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
     ) . 

Now start Fuseki again:

$ sudo service fuseki start

If everything went well this will create a jena-text Lucene index under /var/lib/fuseki/databases/skosmos/text i.e. as a subdirectory of the TDB database to which it is linked.

Load data

With the database and text index now ready, we can load the vocabulary data. Again this can be done either using the Fuseki web interface, or via the command line.

First we need to download the example datasets, i.e. STW Thesaurus and UNESCO Thesaurus (these links are to Turtle downloads though Fuseki accepts also other RDF syntaxes). The STW Thesaurus additionally needs to be uncompressed: unzip stw.ttl.zip - you may need to install the unzip tool first using the command sudo apt install unzip .

A. Loading data using the Fuseki web interface

Go to the Fuseki web interface again, open the "Dataset" tab and click on "upload files".

  • For STW Thesaurus, enter the graph name http://zbw.eu/stw/, select the file stw.ttl and click on "upload now".
  • For UNESCO Thesaurus, enter the graph name http://skos.um.es/unescothes/, select the file unescothes.ttl and click on "upload now".

The graph names may be arbitrary URIs (here we use the URI namespaces as graph names) but they must match the Skosmos configuration later on.

To be sure that the uploads went well, you can open the "info" tab and click on "count triples in all graphs". It should show that the default graph is empty (0 triples) and the two other graphs should have around 109,000 and 75,000 triples, respectively.

B. Loading data using the command line

Instead of the web interface, we can use the command line tool s-put that comes with Fuseki to load data. However, this tool requires a Ruby interpreter, so you may need to install it first:

sudo apt install ruby

Then you can use s-put to load data like this:

/opt/fuseki/bin/s-put http://localhost:3030/skosmos/data http://zbw.eu/stw/ stw.ttl
/opt/fuseki/bin/s-put http://localhost:3030/skosmos/data http://skos.um.es/unescothes/ unescothes.ttl
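If you would rather not install Ruby, the same upload can be done with curl, since the data endpoint speaks the SPARQL Graph Store HTTP Protocol. The sketch below builds the request URL with the graph name percent-encoded; graph_store_url is a hypothetical helper and only encodes the characters that commonly occur in graph URIs.

```shell
# Build a Graph Store Protocol URL for a named graph.
# $1: data endpoint, $2: graph URI
# graph_store_url is a hypothetical helper; it percent-encodes only
# the characters % : / # ? & in the graph URI.
graph_store_url() {
    encoded=$(printf '%s' "$2" | sed -e 's/%/%25/g' -e 's/:/%3A/g' -e 's,/,%2F,g' \
                                     -e 's/#/%23/g' -e 's/?/%3F/g' -e 's/&/%26/g')
    printf '%s?graph=%s\n' "$1" "$encoded"
}

# Usage (PUT replaces the contents of the graph, as s-put does):
#   curl -f -X PUT -H 'Content-Type: text/turtle' --data-binary @stw.ttl \
#        "$(graph_store_url http://localhost:3030/skosmos/data http://zbw.eu/stw/)"
```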

If you get no error message, the operations were successful. You can verify by checking the size of the database: the command du -sh /var/lib/fuseki/databases/skosmos/ should show that the database is about 250 MB.
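A more precise check than the database size is to count the triples in each named graph with a SPARQL query. This is a sketch, assuming the query service is reachable at http://localhost:3030/skosmos/sparql as configured earlier; graph_count_query is a hypothetical helper that just prints the query text.

```shell
# Print a SPARQL query that counts the triples in each named graph.
# graph_count_query is a hypothetical helper for illustration.
graph_count_query() {
    printf '%s\n' 'SELECT ?g (COUNT(*) AS ?triples) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g'
}

# Usage: request the result as CSV, one line per graph
#   curl -s -H 'Accept: text/csv' --data-urlencode "query=$(graph_count_query)" \
#        http://localhost:3030/skosmos/sparql
```

The result should list the two graph names used when loading the data.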

Congratulations, now your database is ready!

Install Skosmos and its requirements

Install Apache and PHP

Start by installing Apache and PHP7.

$ sudo apt install apache2 libapache2-mod-php7.4 php7.4 php7.4-xsl php7.4-intl php7.4-mbstring php7.4-curl

After this you should verify that Apache is running by pointing your web browser at http://localhost/. It should show the Apache default page. If not, start it with sudo service apache2 start and set it to start at boot with sudo systemctl enable apache2. If you are using Windows Subsystem for Linux (WSL), you may get the warning Protocol not available: AH00076: Failed to enable APR_TCP_DEFER_ACCEPT, which may render http://localhost/ non-functional. To fix this, add the line AcceptFilter http none to the beginning of /etc/apache2/apache2.conf and restart Apache.

Additionally, before continuing, set (if not already set) the timezone declaration for php: Open /etc/php/7.4/apache2/conf.d/timezone.ini and add a line like date.timezone=$YOUR_TIMEZONE e.g., date.timezone="Europe/Helsinki" to the file. Remember to save the file. Now, you will have to restart the apache server for this setting to take effect.

Note that if you are running Apache within a VirtualBox VM and want to use the browser from the host machine, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="apache", Protocol="TCP", Host Port="8000", Guest Port="80", other fields can be left blank. Confirm with OK. This can be done also while the VM is running. Then you can open http://localhost:8000/ from the browser on the host machine.

Configure Apache for Skosmos

Then you'll need to allow setting options in directory-specific .htaccess files by editing the apache configuration file in '/etc/apache2/sites-enabled/000-default.conf'. Inside that file, you will find the <VirtualHost *:80> block on line 1. Inside that block, add the following block:

        <Directory /var/www/html>
                Options Indexes FollowSymLinks MultiViews
                AllowOverride All
                Order allow,deny
                allow from all
        </Directory>

You should also enable the Apache modules mod_rewrite and mod_expires since Skosmos requires those to work.

$ sudo a2enmod rewrite
$ sudo a2enmod expires

After these changes you can restart Apache and the installation should be ready for running Skosmos.

$ sudo service apache2 restart

Install Skosmos

Start by cloning the Skosmos repository to a directory on the machine. We will create the directory /srv/Skosmos for this purpose, owned by a regular (non-root) user; in the below command, whoami is used so that the directory will end up in the ownership of the user performing the operation.

$ cd /srv
$ sudo mkdir Skosmos
$ sudo chown `whoami` Skosmos

To be able to clone Skosmos we'll also need to install the git client:

$ sudo apt install git

Then we can clone the Skosmos 2.18 code from GitHub into /srv/Skosmos:

$ git clone -b v2.18-maintenance https://github.com/NatLibFi/Skosmos.git /srv/Skosmos

After git has finished cloning the repository enter it and download and install Composer for managing the library dependencies.

$ cd /srv/Skosmos/
$ curl -sS https://getcomposer.org/installer | php

After you have downloaded and installed Composer you can install the dependencies required to run Skosmos. If you wish to do some software development with your Skosmos installation you should omit the --no-dev part. Then you'll be able to run the unit tests and update the gettext translations. Please note that running composer.phar with root/superuser privileges is not recommended.

$ php composer.phar install --no-dev

To make Skosmos accessible via Apache, we will add a symlink under /var/www/html pointing to the directory /srv/Skosmos where it was installed:

$ sudo ln -s /srv/Skosmos /var/www/html/Skosmos

Configure Skosmos

After installing the dependencies you need to configure the Skosmos installation. You can start by copying the default configuration files and using those as a basis for building your own configuration file.

$ cp config.ttl.dist config.ttl

Let's start by enabling the fuseki text index we created earlier.

$ nano config.ttl

We'll make the following changes to the configuration:

  1. Set the default SPARQL endpoint to the local Fuseki and the skosmos dataset
  2. Set the default SPARQL dialect to "JenaText" to use the jena-text index
  3. Add German translation to the UI languages

Please note that the Turtle notation requires using ; instead of . whenever the shorthand syntax for predicate lists is used, as per the Turtle specification (provided that the triple is not the last one for the common subject).

Add triple
:config skosmos:sparqlEndpoint <http://localhost:3030/skosmos/sparql> .
and comment out the other skosmos:sparqlEndpoint declarations for :config.

Switch the following triple
:config skosmos:sparqlDialect "Generic" .
into
:config skosmos:sparqlDialect "JenaText" .

Set the UI languages as follows:
# interface languages available, and the corresponding system locales (you may remove Finnish and Swedish)
:config skosmos:languages (
    [ rdfs:label "en" ; rdf:value "en_GB.utf8" ]
    [ rdfs:label "de" ; rdf:value "de_DE.utf8" ]
  ) .

Your machine may not have English and/or German locales installed, which are necessary for the Skosmos UI translations to work. To generate the locales as well as to ensure that the preliminaries exist, run these commands:

sudo apt install gettext
sudo locale-gen en_GB.utf8
sudo locale-gen de_DE.utf8

Restart Apache for these changes to take effect.
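To confirm that the locales were generated, the list printed by locale -a can be searched for the names used in config.ttl. locale_available is a hypothetical helper that reads that list from standard input.

```shell
# Succeed if the locale name given as $1 appears as a whole line on stdin.
# Intended input is the output of `locale -a`.
# locale_available is a hypothetical helper for illustration.
locale_available() {
    # -x: whole-line match, -F: fixed string, -i: ignore case, -q: quiet
    grep -qixF -- "$1"
}

# Usage:
#   locale -a | locale_available en_GB.utf8 && echo "en_GB.utf8 is installed"
#   locale -a | locale_available de_DE.utf8 && echo "de_DE.utf8 is installed"
```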

Next we will add vocabulary definitions and configurations for STW and UNESCO Thesaurus so that Skosmos knows to look for the vocabularies from our Fuseki SPARQL endpoint. Add these blocks of code after the #Skosmos vocabularies line in the config.ttl file.

:unesco a skosmos:Vocabulary, void:Dataset ;
    dc:title "UNESCO Thesaurus"@en ;
    skosmos:shortName "UNESCO";
    dc:subject :cat_general ;
    void:uriSpace "http://skos.um.es/unescothes/";
    skosmos:language "en", "es", "fr", "ru";
    skosmos:defaultLanguage "en";
    skosmos:showTopConcepts true ;
    skosmos:fullAlphabeticalIndex true ;
    skosmos:groupClass isothes:ConceptGroup ;
    void:sparqlEndpoint <http://localhost:3030/skosmos/sparql> ;
    skosmos:sparqlGraph <http://skos.um.es/unescothes/> .
 
:stw a skosmos:Vocabulary, void:Dataset ;
    dc:title "STW Thesaurus for Economics"@en ;
    skosmos:shortName "STW";
    dc:subject :cat_general ;
    void:uriSpace "http://zbw.eu/stw/";
    skosmos:language "en", "de";
    skosmos:defaultLanguage "de";
    void:sparqlEndpoint <http://localhost:3030/skosmos/sparql> ;
    skosmos:sparqlGraph <http://zbw.eu/stw/> .

You can remove the ysa and yso example vocabulary definitions, even though they should point to a separate SPARQL endpoint and work out-of-the-box.

Now you should be able to see the STW and UNESCO thesauri on the Skosmos front page. Point your browser to http://localhost/Skosmos/ (or http://localhost:8000/Skosmos/ from the host machine) and verify that you can see and open the vocabulary front pages. Replace localhost with your server's IP address if you're not doing this locally.

Optimizing performance

Now that basic Skosmos functionality is working, we can try to make it faster. But first we need to benchmark how well it performs so that we know that we are making progress.

Measure response time

To measure response time, we will use the simple Apache benchmark tool ab, which needs to be installed first:

$ sudo apt install apache2-utils

Best practice would be to run the benchmarking tool from another machine, but since we are only interested in relative performance, and ab is very lightweight, we can also just run it from the same machine.

For simplicity's sake we will just measure two operations: 1) how long it takes to generate a web page for a single concept - we'll pick the concept "Culture" from the UNESCO Thesaurus - and 2) how long it takes to generate the front page of the STW Thesaurus with the alphabetical index. These commands will load those pages 100 times:

$ ab -n 100 http://localhost/Skosmos/unesco/en/page/C00926
$ ab -n 100 http://localhost/Skosmos/stw/en/index

ab will report many figures, but let's just concentrate on the "Requests per second" value. On my example virtual machine, after running this several times, the reported numbers stabilize around 16 and 4, respectively. Not too bad, but could be improved!

Install APC

The first optimization step is to install the APC cache for PHP. Skosmos uses APC for caching the vocabulary configuration file since the Turtle parsing can be quite slow when you have many vocabularies in your configuration file. APC is also used for caching queries made to external resources other than the Fuseki instance. This alone can considerably speed up your Skosmos page load times.

$ sudo apt install php-apcu
$ sudo service apache2 restart

Now we can measure the performance again using ab. On my machine, the requests per second increased to about 23 and 4.6, i.e. about 15-40% faster, not bad for just installing an additional package. With a larger number of vocabularies (and thus a larger config.ttl file), the improvement would have been even larger.
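The comparison can be scripted so that repeated runs are easy to compare. The sketch below pulls the "Requests per second" figure out of ab's report and computes the relative speedup between two measurements; ab_rps and speedup_percent are hypothetical helper names.

```shell
# Extract the "Requests per second" figure from ab output read on stdin.
# ab_rps is a hypothetical helper for illustration.
ab_rps() {
    awk '/^Requests per second:/ { print $4 }'
}

# Print the percentage speedup going from $1 to $2 requests per second,
# rounded to a whole number.
speedup_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after / before - 1) * 100 }'
}

# Usage:
#   ab -n 100 http://localhost/Skosmos/unesco/en/page/C00926 | ab_rps
#   speedup_percent 16 23     # concept page, before and after the cache
#   speedup_percent 4 4.6     # STW index page, before and after the cache
```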

Install Varnish

Another way to speed up Skosmos is to add an HTTP proxy cache in front of Fuseki. The cache will store answers to recurring SPARQL queries and answer them much faster than Fuseki could. Many of the SPARQL queries that Skosmos performs will be repeated many times, so this will speed up Skosmos. However, it doesn't improve worst-case response times.

We will first install the Varnish package. This will install Varnish 6.2:

$ sudo apt install varnish

By default in Ubuntu 20.04, Varnish will listen on TCP port 6081. It will use a non-persistent in-memory cache of 256MB to store HTTP responses. This is fine for the purposes of this example, but can be changed by editing the systemd configuration for Varnish. In particular, the amount of memory allocated to the cache could be increased to improve the cache hit rate if you have lots of vocabulary data.

Here is an example of how to increase the memory allocation to 1GB. You will first need to create a varnish.service.d directory to store configuration overrides:

$ sudo mkdir /etc/systemd/system/varnish.service.d/

Then create a file called varnish-commandline.conf in that directory, with this content:

# Override the Varnish command line

[Service]
# Clear existing ExecStart= (required)
ExecStart=
# Set a new ExecStart=
ExecStart=/usr/sbin/varnishd -j unix,user=vcache -F -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,1g

Then activate the new systemd configuration with:

$ sudo systemctl daemon-reload
$ sudo service varnish restart

The Varnish back-end configuration needs to be changed. It must be told to access Fuseki instead of some other web server. Additionally we will ask Varnish to store responses for up to one week (instead of the default 2 minutes) in a compressed form, which will allow many more responses to be stored in the cache, at the cost of some CPU time for compressing and uncompressing. Edit the /etc/varnish/default.vcl to look like this:

vcl 4.0;

backend default {
    .host = "127.0.0.1";
    .port = "3030";
}

sub vcl_backend_response {
    # store for a long time (1 week)
    set beresp.ttl = 1w;
    # always gzip before storing, to save space in the cache
    set beresp.do_gzip = true;
}

Then restart Varnish:

sudo service varnish restart

Note that since the cache is non-persistent, you can always clear the cache simply by restarting Varnish, for example if you update your vocabulary data.

Now we need to tell Skosmos to access the Fuseki SPARQL endpoint via Varnish instead of directly. To do this, we will change references to 3030 (the Fuseki port) to 6081 (the Varnish port).

In config.ttl:

For # Skosmos main configuration:

:config a skosmos:Configuration ;
    skosmos:sparqlEndpoint <http://localhost:6081/skosmos/sparql> ;

For both stw and unesco:

    void:sparqlEndpoint <http://localhost:6081/skosmos/sparql> ;

Now we can measure performance again using ab. This time the result is about 24 requests per second for the concept page and 4.6 requests per second for the STW index page. For the concept page, the speedup is about 4% over just using APC, while the index page did not change measurably. Compared with the original baseline of 16 and 4 requests per second, the total improvement was 50% for the concept page and 15% for the index page. Not bad!

Conclusion

In this tutorial we walked through installing Fuseki and Skosmos on an Ubuntu 20.04 server and also optimized its performance. After having set up a basic Skosmos installation this way, we could add more vocabularies, configure the text index to fine tune search behavior, or configure Skosmos to behave differently or look different.


Apache Jena and associated module names are trademarks of the Apache Software Foundation.
