Links_Discovery - ALIADA/aliada-tool GitHub Wiki

Links Discovery

The Links Discovery module schedules several processes (subjobs) with the Linux crontab utility. Each subjob runs the links-discovery-task-runner.sh shell script, which executes the Java class eu.aliada.linksDiscovery.impl.LinkingProcess. This class:

  • either executes the SILK library functions, when the external dataset provides a public SPARQL endpoint. The SILK library functions have been modified for ALIADA (SILK modified version for ALIADA);
  • or calls the ad-hoc API provided by the external dataset to discover new links.

These subjobs search for links between the ALIADA dataset and other datasets in the Linked Open Data Cloud, such as DBpedia, GeoNames, Europeana, BNB, BNE, Freebase, NSZL, MARC Code Lists, Open Library, Lobid, VIAF and the Library of Congress Subject Headings. One subjob is scheduled for each external dataset to link with.
The `links-discovery-task-runner.sh` shell script and the `eu.aliada.linksDiscovery.impl.LinkingProcess` Java class are in the `aliada-links-discovery-application-client` software module.
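As a sketch, the kind of crontab entry the module installs for one subjob might look like the following. The installation path, schedule and the arguments passed to the script are assumptions for illustration, not taken from the actual module configuration:

```shell
#!/bin/sh
# Sketch of a crontab line for one subjob. The path, the hourly schedule
# and the job/subjob arguments are hypothetical examples.
CLIENT_APP_BIN_DIR=/opt/aliada/links-discovery   # value of client_app_bin_dir
JOB_ID=1
SUBJOB_ID=2

# Run the task runner once an hour for this job/subjob pair.
CRON_LINE="0 * * * * $CLIENT_APP_BIN_DIR/links-discovery-task-runner.sh $JOB_ID $SUBJOB_ID"
echo "$CRON_LINE"
```

The module would append such a line to the crontab of the user given in the client_app_user field described below.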

REST Interface

The Links Discovery module provides a RESTful interface. It offers the following services:

  • Create a new links discovery job. The identifier of the job to be initiated must be provided and must be a valid integer.

    • method: POST
    • URL: http://<host>:<port>/links-discovery/job
    • parameters sent inside a form (APPLICATION_FORM_URLENCODED):
      • jobid=<job identifier>
  • Get a links discovery job's state/info. The identifier of the job must be provided and must be a valid integer.

    • method: GET
    • URL: http://<host>:<port>/links-discovery/job/<job identifier>

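For illustration, the two services could be invoked with curl as follows. The host, port and job identifier are placeholders, and the use of an Accept header to select JSON is an assumption (the module returns XML or JSON, but the selection mechanism is not documented here). The script prints the commands rather than executing them:

```shell
#!/bin/sh
# Build the two REST invocations; HOST, PORT and JOBID are placeholders.
HOST=localhost
PORT=8080
JOBID=1

# 1) Create a new links discovery job (form-encoded POST).
echo "curl -X POST -d jobid=$JOBID http://$HOST:$PORT/links-discovery/job"

# 2) Get the job state/info. The Accept header to request JSON is an
#    assumption about how the response format is selected.
echo "curl -H 'Accept: application/json' http://$HOST:$PORT/links-discovery/job/$JOBID"
```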
Once the Links Discovery module receives either of these service invocations, it reads the input parameters of the job from the aliada.linksdiscovery_job_instances table of a relational DB. The parameters to connect to this DB are obtained from the "context.xml" file of the Links Discovery module. The services return an XML or JSON structure with the following information:

  • id: the job identifier.
  • startDate: the starting date of the job.
  • endDate: the end date of the job.
  • numLinks: the number of links generated by all the subjobs.
  • durationSeconds: how many seconds have been needed to discover the links.
  • status: the status of the job. Possible values:
    • idle: the job hasn't started yet. That is, the DB table row exists, but the job creation REST service hasn't been invoked yet.
    • running: the job is still running.
    • finished: the job has finished and the number of links generated is in the "numLinks" field.
  • subjobs: a list with the following information for each subjob initiated by the job:
    • id: the subjob identifier.
    • name: the name of the subjob.
    • startDate: the starting date of the subjob.
    • endDate: the end date of the subjob.
    • numLinks: the number of links generated by the subjob.
    • durationSeconds: how many seconds have been needed to discover the links.
    • status: the status of the subjob. Possible values:
      • idle: the subjob hasn't started yet. That is, the DB table row exists, but the subjob hasn't started executing yet.
      • running: the subjob is still running.
      • finished: the subjob has finished and the number of links generated is in the "numLinks" field.

Here is an example in JSON format:

    {
        "id": 1,
        "startDate": "2014-07-09T10:33:08",
        "endDate": "2014-07-09T10:33:43",
        "numLinks": 2,
        "durationSeconds": "2100",
        "status": "finished",
        "subjobs": [
            {"id": 1, "name": "LIDO_DBpedia", "startDate": "2014-07-09T10:22:15", "endDate": "2014-07-09T10:22:49", "numLinks": 1, "durationSeconds": "2040", "status": "finished"},
            {"id": 2, "name": "LIDO_Geonames", "startDate": "2014-07-09T10:33:21", "endDate": "2014-07-09T10:33:27", "numLinks": 1, "durationSeconds": "360", "status": "finished"}
        ]
    }
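A client polling the GET service until the job finishes could pull the job-level status out of such a response with plain shell tools. A minimal sketch, using a shortened inline response instead of a real HTTP call (a real client should use a proper JSON parser; sed works here only because the job's own status is the last one in the document):

```shell
#!/bin/sh
# Extract the job-level "status" field from a shortened sample response.
# sed's greedy leading ".*" matches up to the LAST "status" occurrence,
# which is the job's own status in the structure shown above.
RESPONSE='{"subjobs":[{"id":1,"status":"finished"},{"id":2,"status":"finished"}],"id":1,"numLinks":2,"status":"finished"}'
STATUS=$(printf '%s' "$RESPONSE" | sed 's/.*"status":"\([a-z]*\)".*/\1/')
echo "$STATUS"   # finished
```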

Relational DB tables used

The Links Discovery module uses two tables:

  • Table aliada.linksdiscovery_job_instances. This table saves the configuration parameters and the state of each job instance. The configuration parameters are set by the module that creates the job instance in the DB, that is, the IU module. The state-related fields are set by the job itself.
  • Table aliada.linksdiscovery_subjob_instances. This table saves the configuration parameters and the state of each subjob instance created by the corresponding job instance. The configuration parameters are set by the module that creates the subjob instance in the DB, that is, the Links Discovery module. The state-related fields are set by the subjob itself.

Table aliada.linksdiscovery_job_instances

This table contains the following fields, grouped into configuration fields and state-related fields:

  • job_id
  • Configuration fields:
    • input_uri: the URI of the SPARQL endpoint of Aliada.
    • input_login: the login of the SPARQL endpoint indicated by the input_uri field.
    • input_password: the password of the SPARQL endpoint indicated by the input_uri field.
    • input_graph: the URI of the dataset graph to be accessed through the SPARQL endpoint indicated by the input_uri field.
    • output_uri: the URI of the SPARQL endpoint where to store the generated links of Aliada. This field will have the same value as the input_uri field.
    • output_login: the login of the SPARQL endpoint indicated by the output_uri field.
    • output_password: the password of the SPARQL endpoint indicated by the output_uri field.
    • output_graph: the URI of the dataset graph to be accessed through the SPARQL endpoint indicated by the output_uri field.
    • tmp_dir: the name of the temporary folder used by the Links Discovery module to:
      • generate the subjob configuration file containing the parameters to connect to the relational DB;
      • generate the SILK configuration file;
      • save the files containing the triples of the generated links. These files are inserted in the Virtuoso graph.
    • client_app_bin_dir: the name of the folder where the links-discovery-task-runner.sh shell script has been installed. This script executes the Java class eu.aliada.linksDiscovery.impl.LinkingProcess, which in turn executes the SILK library functions to discover new links. This path is used to build the full path of the script in the Linux crontab. The Links Discovery module schedules several processes with the crontab to execute SILK at one-hour intervals to search for links between the ALIADA dataset and DBpedia, GeoNames, etc.
    • client_app_user: the machine's user for whom the crontab will be programmed to execute the links-discovery-task-runner.sh shell script.

  • State fields:
    • start_date
    • end_date
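As a sketch of how a job instance row might be created before the REST service is invoked: the field names come from the list above, but every value is an invented example and the exact column set is an assumption. The statement is printed rather than executed:

```shell
#!/bin/sh
# Print a hypothetical INSERT for a new job instance. Field names follow
# the table description above; all values are invented examples.
SQL_TEXT="INSERT INTO aliada.linksdiscovery_job_instances
  (job_id, input_uri, input_graph, output_uri, output_graph,
   tmp_dir, client_app_bin_dir, client_app_user)
VALUES
  (1, 'http://aliada.example.org/sparql', 'http://aliada.example.org/dataset',
   'http://aliada.example.org/sparql', 'http://aliada.example.org/links',
   '/tmp/aliada', '/opt/aliada/links-discovery', 'aliada');"
printf '%s\n' "$SQL_TEXT"
```

Note how output_uri repeats the input_uri value, matching the description of the output_uri field above.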

Table aliada.linksdiscovery_subjob_instances

This table contains the following fields, grouped into configuration fields and state-related fields:

  • job_id
  • subjob_id
  • Configuration fields:
    • name: the name of the subjob, which includes the name of the external dataset to link with (e.g. ALIADA_DBpedia).
    • config_file: the whole path of the configuration file for SILK.
    • num_threads: the number of threads to be passed as an input parameter to SILK. 1 by default.
    • reload_source: a true/false flag passed as an input parameter to SILK, indicating whether to reload the source dataset, i.e. the ALIADA-generated dataset. False by default.
    • reload_target: a true/false flag passed as an input parameter to SILK, indicating whether to reload the target dataset, i.e. the external dataset. False by default.
    • output_uri: the URI of the SPARQL endpoint where to upload the files containing the triples of the generated links.
    • output_login: the login of the SPARQL endpoint indicated by the output_uri field.
    • output_password: the password of the SPARQL endpoint indicated by the output_uri field.
    • output_graph: the URI of the dataset graph to be accessed through the SPARQL endpoint indicated by the output_uri field.
    • tmp_dir: the name of the temporary folder. This field is currently unused.

  • State fields:
    • num_links: the number of links generated.
    • start_date
    • end_date
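To watch the progress of a job's subjobs, one could query the state fields directly. A minimal sketch: the query is printed rather than executed, and a job with job_id 1 is assumed:

```shell
#!/bin/sh
# Print a hypothetical monitoring query over the state fields described
# above; job_id 1 is an example value.
QUERY="SELECT subjob_id, name, num_links, start_date, end_date
FROM aliada.linksdiscovery_subjob_instances
WHERE job_id = 1;"
printf '%s\n' "$QUERY"
```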