Compiling and configuring Virtuoso - dkmfbk/knowledgestore GitHub Wiki

This page summarizes notes on compiling, configuring and using the Virtuoso triple store with the KnowledgeStore. The notes are based on Virtuoso version 7.2 (but also apply to version 7.1).

Compiling Virtuoso

Download the Virtuoso sources from GitHub: https://github.com/openlink/virtuoso-opensource (clone or download zip), then enter the top-level folder of the source tree and execute:

export CFLAGS="-O2 -m64"
./autogen.sh
./configure --with-readline \
            --prefix=PATH_TO_INSTALL_DIR \
            --with-jdk4_1=PATH_TO_JDK7_DIR \
            --program-transform-name="s/isql/isql-vt/"
make
make check # still broken
make install

Notes:

  • the --prefix configure flag is necessary to install Virtuoso in a specific directory (e.g., /opt/virtuoso-7.2); without it, Virtuoso is installed under the default system-wide prefix; depending on the chosen directory, it may be necessary to run make install as root;
  • the --program-transform-name flag is necessary to avoid a name clash between the isql tool shipped with Virtuoso and an isql tool that may already be installed on the machine;
  • the --with-readline flag is necessary to compile Virtuoso with readline support (note: the command line client is almost unusable without readline!);
  • expect a lot of warnings to be logged during make; they can be ignored;
  • the make check command is optional; it runs the compiled binaries against a test suite; a number of tests fail, but this is expected.

Configuring Virtuoso

See also http://docs.openlinksw.com/virtuoso/databaseadmsrv.html

Basic configuration (parameters must be set correctly for Virtuoso to start at all):

  • DatabaseFile, ErrorLogFile, LockFile, TransactionFile, xa_persistent_file in section [Database];
  • DatabaseFile and TransactionFile in section [TempDatabase];
  • ServerPort, DirsAllowed, VADInstallDir in section [Parameters];
  • ServerPort, ServerRoot, HTTPLogFile in section [HTTPServer];
  • LoadPath, LoadNNN in section [Plugins] (may disable some LoadNNN lines if corresponding plugins are not used)

Memory settings (section [Parameters]): to set the right amount of memory, run "status();" from isql-vt and compute the number of used pages as pages minus free. Set NumberOfBuffers to that result, slightly increased, and set MaxDirtyBuffers to 3/4 of NumberOfBuffers.
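The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name suggest_buffers is hypothetical, the two input figures must be read manually from the status() output, and the 10% headroom is an assumed interpretation of "slightly increased".

```shell
#!/bin/sh
# Hypothetical helper: derive buffer settings from the "pages" and "free"
# figures reported by status(). The default 10% headroom is an assumption.
suggest_buffers() {
    sb_pages=$1                 # total pages reported by status()
    sb_free=$2                  # free pages reported by status()
    sb_headroom=${3:-10}        # extra margin, in percent
    sb_used=$((sb_pages - sb_free))
    sb_buffers=$((sb_used + sb_used * sb_headroom / 100))
    sb_dirty=$((sb_buffers * 3 / 4))
    printf 'NumberOfBuffers = %d\n' "$sb_buffers"
    printf 'MaxDirtyBuffers = %d\n' "$sb_dirty"
}

# Example with made-up figures: 1000000 pages, 200000 of them free.
suggest_buffers 1000000 200000
```

The two printed lines can be pasted into the [Parameters] section; since each buffer is an 8 KB page, check that NumberOfBuffers fits in the available RAM before applying it.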

SPARQL configuration (section [SPARQL]):

  • MaxQueryCostEstimationTime (default 4000 seconds). It is better not to set this parameter, as the estimated execution times may be wrong and valid queries may be rejected as a result.
  • MaxQueryExecutionTime. Set it to a large value if analytical queries need to be run. The value set here is a hard limit that overrides any timeout passed by the client (including the KS).
  • ResultSetMaxRows. Set it to a large value if dump queries need to be run (keep in mind the hard 1M limit of the SPARQL HTTP endpoint).
  • DefaultQuery. It can be changed to something more informative, such as SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }.

Performance optimization (section [Parameters]):

  • VectorSize (default 1000). This roughly controls how many rows are processed together by the query processor. This parameter greatly affects performance, but on a per-query basis (i.e., some queries prefer a small value, others a large value). The default of 1000 is a good tradeoff; it may be slightly increased, but only based on experiments.
  • MaxQueryMem (default 2G) and HashJoinSpace. MaxQueryMem controls the amount of memory that is permanently allocated to the query processor (more memory can be used and then released if necessary, at the cost of some overhead). Increasing it yields a small performance gain, especially for slow queries. HashJoinSpace controls the fraction of MaxQueryMem that can be used for hash joins; it seems advisable to set it equal to MaxQueryMem.
  • AdjustVectorSize (default 0) and MaxVectorSize. If set to 1, AdjustVectorSize lets Virtuoso increase VectorSize adaptively for queries that need a higher value; in that case, MaxVectorSize is the maximum value that VectorSize can reach (1000000 is the suggested value). AdjustVectorSize = 1 should be beneficial in theory (as suggested in the Virtuoso docs), but we observed a noticeable decrease in performance when enabling it, so it seems better to leave it disabled, as in the default Virtuoso configuration.
  • ThreadsPerQuery (default 4) and AsyncQueueMaxThreads (default 10). They control, respectively, the maximum number of additional threads that can be allocated to a single query and the total across all queries (i.e., the pool size). They have little impact on fast queries, but it seems better to set both to the number of CPU threads, as this may improve the execution of slow, complex queries.
  • ServerThreads (default 20). It must be at least the number of CPU threads, preferably somewhat more to leave some margin. It is unclear whether these threads serve only client queries or also internal tasks; to stay on the safe side, we set this parameter to twice the number of CPU threads.
  • NumberOfBuffers and MaxDirtyBuffers. The first is the number of 8 KB pages used for storing (caching) the DB in memory. Memory permitting, it should be set to a number larger than the number of pages used by the DB (use the status() command in isql-vt to compute it). The second parameter is relevant only when data is modified; its suggested value is 3/4 of the first parameter.
  • DefaultIsolation. If the database is read-only, it seems natural to set it to 1 (=READ UNCOMMITTED, i.e. no transactional guarantee) to avoid any synchronization overhead.
  • MaxMemPoolSize (default 100000000). This is the maximum amount of memory used by the query planner. A larger value (200000000) has been suggested online, but we have not verified its benefit.
  • O_DIRECT (default 0). It controls whether OS file buffering is used or skipped when accessing the DB. The default value 0 (use buffering) seems to be faster.
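The thread-related suggestions in the list above reduce to simple arithmetic on the number of CPU threads. The following sketch prints the corresponding lines for the [Parameters] section; the function name suggest_threads is hypothetical, and the values simply restate the choices described above rather than any official recommendation.

```shell
#!/bin/sh
# Hypothetical helper: print thread settings for [Parameters] given the
# number of CPU threads, following the choices described in these notes.
suggest_threads() {
    st_cpu=$1
    printf 'ThreadsPerQuery = %d\n' "$st_cpu"        # one per CPU thread
    printf 'AsyncQueueMaxThreads = %d\n' "$st_cpu"   # pool size, same value
    printf 'ServerThreads = %d\n' "$((st_cpu * 2))"  # twice, as a safety margin
}

# Example for a machine with 8 CPU threads.
suggest_threads 8
```

On Linux, the argument can typically be obtained with getconf _NPROCESSORS_ONLN instead of being hard-coded.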

Configuring the KnowledgeStore for interfacing with Virtuoso

A template for the configuration of the TripleStore internal component is the following:

<obj:tripleStore>
    a <java:eu.fbk.knowledgestore.triplestore.SynchronizedTripleStore> ;
    :synchronizerSpec "NUM_CPU_THREADS_HERE:0" ;
    :delegate [
        a <java:eu.fbk.knowledgestore.triplestore.LoggingTripleStore> ;
        :delegate [
            a <java:eu.fbk.knowledgestore.triplestore.virtuoso.VirtuosoJdbcTripleStore> ;
            :host "VIRTUOSO_HOST_HERE" ;
            :port "VIRTUOSO_PORT_HERE" ;
            :username "VIRTUOSO_USERNAME" ;
            :password "VIRTUOSO_PASSWORD" ;
            :fetchSize 200 ;
        ]
    ] .

The important point is to use the VirtuosoJdbcTripleStore driver, which offers better performance.

It is also important to fine tune the KnowledgeStore thread pool, which should be larger than # CPU threads + 2 * (# HTTP server acceptors + # HTTP server selectors). Note that # acceptors and # selectors can be safely set to 1 unless a very large number of concurrent client connections is expected. The relevant configuration fragment is:

<obj:launcher>
    :threadCount NUM_OF_THREADS_IN_THE_POOL_HERE ;
    ...

<obj:httpServer>
    ...
    :acceptors NUM_HTTP_ACCEPTOR_THREADS_HERE ;
    :selectors NUM_HTTP_SELECTOR_THREADS_HERE ;
    ...
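The sizing rule for the KnowledgeStore thread pool can be sketched as follows. The function name min_thread_count is hypothetical; it returns the smallest integer strictly larger than # CPU threads + 2 * (# acceptors + # selectors), with acceptors and selectors defaulting to 1 as suggested above.

```shell
#!/bin/sh
# Hypothetical helper: smallest threadCount that is strictly larger than
# cpu_threads + 2 * (acceptors + selectors).
min_thread_count() {
    mt_cpu=$1
    mt_acceptors=${2:-1}   # defaults to 1, as suggested in these notes
    mt_selectors=${3:-1}
    echo $((mt_cpu + 2 * (mt_acceptors + mt_selectors) + 1))
}

# Example: 8 CPU threads, 1 acceptor, 1 selector.
min_thread_count 8
```

The result is a lower bound for :threadCount; a larger value leaves more margin.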