KG-pipeline - OpenCS-ontology/OpenCS GitHub Wiki

Description

KG-pipeline generates a comprehensive knowledge graph from CS documents within two specific archives, SCPE and CSIS. This list may be extended if someone wishes to implement a scraper compatible with a different archive and integrate it with our scraper module. Our system relies on Docker containers, each designed to execute a distinct data-processing step. These containers are orchestrated through a single shell script stored within this repository. The solution consists of the components covered in the Modules description section.

Environment configuration and requirements

Environment configuration:

  • (recommended) use a Linux system, or WSL if you are on Windows,
  • have at least 30 GB of free disk space.

Requirements:

  • Bash: 5.1.16(1)-release
  • Docker Engine: 24.0.6, download one of the following:
    • Docker Desktop from the official page, following the installation guide (recommended if you want a UI for containers),
    • Docker Engine from the official page, following the installation guide.
  • git: 2.34.1
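
You can check that compatible versions are installed from the terminal:

```bash
bash --version    # expect 5.1.16(1)-release or newer
docker --version  # expect 24.0.6 or newer
git --version     # expect 2.34.1 or newer
```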

Installation

  1. Open a Linux/WSL terminal,

  2. choose the directory where you want to store the main pipeline repository and cd to it,

  3. run git clone https://github.com/OpenCS-ontology/kg-pipeline to clone the main pipeline repository

  4. run cd kg-pipeline

  5. if you decided to install Docker Desktop, run it
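
Put together, a typical installation session looks like this:

```bash
# Clone the main pipeline repository and enter its folder
git clone https://github.com/OpenCS-ontology/kg-pipeline
cd kg-pipeline
```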

User guide

  1. Make sure your working directory is set to the kg-pipeline repository folder

  2. Make sure the Docker engine is running

  3. Run bash ./run_project.sh csis_volumes=volume_numbers scpe_issues=issue_numbers (an example invocation follows this list), where:

    • volume_numbers is a string of the CSIS archive volume numbers you want to process, separated by commas (e.g. "1,2,3"). If you do not specify this variable, every volume from the archive web page is included in the final graph.
    • issue_numbers is a string of the SCPE archive issue numbers you want to process, separated by commas (e.g. "1,2,3"). If you do not specify this variable, every issue from the archive web page is included in the final graph.
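
For example, to build the graph only from CSIS volumes 1-3 and SCPE issues 4 and 5 (the numbers here are purely illustrative), you would run:

```bash
# Process only the selected CSIS volumes and SCPE issues
bash ./run_project.sh csis_volumes="1,2,3" scpe_issues="4,5"

# Omit both arguments to process every available volume and issue
bash ./run_project.sh
```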

After invoking this script, the Docker images are downloaded and the containers are initialized. This process may take a dozen or so minutes on a slower machine. After initialization, the pipeline processes the selected papers, and the final knowledge graph appears on the host machine in the kg-pipeline/final_output folder. Additionally, after each module finishes, a directory holding its output is created in kg-pipeline. Due to file size limitations, we split the Turtle file containing the final graph into several sub-files describing information about:

  • articles,
  • authors,
  • bibliography,
  • conference papers,
  • papers,
  • organizations.
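
Once the run finishes, the generated sub-files can be inspected on the host machine, for example:

```bash
# List the Turtle sub-files that make up the final graph
ls -lh final_output/
```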

Solution schema

[solution_schema diagram]

Modules description

Developer guide

Module modification guide

To modify most of the KG-pipeline, it is best to refer to the developer guide of the specific module. However, if you need to customize the entire pipeline orchestration, an analysis of the following components is recommended:

  • run_project.sh, the script that activates Docker Compose and executes the main scripts within the containers,
  • compose.yaml, the Docker Compose configuration file responsible for setting up containers and volumes.

For instance, when incorporating a new module into the pipeline, follow these steps:

  • update compose.yaml: integrate the new module's Docker container into the compose.yaml file, specifying any required volumes,
  • modify run_project.sh: include a docker run line within run_project.sh to execute a script inside the newly added container; this script should encompass all the functionality desired for the new module. A sketch of both changes follows below.
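
A minimal sketch of such an addition, assuming a hypothetical module image called opencs/new-module and an output directory new_module_output (all names here are illustrative, not part of the actual pipeline):

```bash
# Hypothetical service entry added to compose.yaml (shown as YAML inside
# comments for reference; service, image, and volume names are made up):
#
#   new-module:
#     image: opencs/new-module:latest
#     volumes:
#       - ./new_module_output:/output

# Corresponding line added to run_project.sh, executing the module's
# main script inside the new container:
docker run --rm -v "$(pwd)/new_module_output:/output" \
    opencs/new-module:latest python3 /app/main.py
```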