KG-pipeline
KG-pipeline generates a comprehensive knowledge graph from CS documents in two specific archives, SCPE and CSIS. This list may be extended if someone wishes to implement a scraper compatible with a different archive and integrate it with our scraper module. The system relies on Docker containers, each designed to execute a distinct data-processing step. These containers are orchestrated through a single shell file stored in this repository. The solution consists of the components covered in the Modules section.
Environment configuration:
- (recommended) use a Linux system, or WSL when on Windows,
- have at least 30 GB of free disk space.
Requirements:
- Bash: 5.1.16(1)-release
- Docker engine: 24.0.6, installed as one of the following:
  - Docker Desktop from the official page, following the installation guide (recommended if you want a UI for containers),
  - Docker engine from the official page, following the installation guide.
- git: 2.34.1
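As a quick sanity check, you can verify that the installed versions match the requirements above:

```bash
bash --version    # expect 5.1.16(1)-release or compatible
docker --version  # expect 24.0.6 or compatible
git --version     # expect 2.34.1 or compatible
```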
Installation:
- Open a Linux/WSL terminal,
- choose the directory where you want to store the main pipeline repository and `cd` into it,
- run `git clone https://github.com/OpenCS-ontology/kg-pipeline` to clone the main pipeline repository,
- run `cd kg-pipeline`,
- if you decided to install Docker Desktop, run it.
Running the pipeline:
- Make sure your working directory is set to the `kg-pipeline` repository folder,
- make sure the Docker engine is running,
- run `bash ./run_project.sh csis_volumes=volume_numbers scpe_issues=issue_numbers` (an example invocation is shown below), where:
  - `volume_numbers` is a string of the CSIS archive volume numbers you want to process, separated by commas (e.g. "1,2,3"). If you don't specify this variable, every volume from the archive web page is taken for the final graph,
  - `issue_numbers` is a string of the SCPE archive issue numbers you want to process, separated by commas (e.g. "1,2,3"). If you don't specify this variable, every issue from the archive web page is taken for the final graph.
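For example, to process only CSIS volumes 1 and 2 together with SCPE issue 3 (the numbers here are purely illustrative):

```bash
bash ./run_project.sh csis_volumes="1,2" scpe_issues="3"
```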
After invoking this script, images will be downloaded and containers will be initialized; this may take a dozen minutes on a slower device. Following initialization, the pipeline processes the provided papers, and the final knowledge graph appears on the host machine in the folder `kg-pipeline/final_output`. Additionally, after the execution of each module, a directory is created in `kg-pipeline` to store its respective output. Due to file size limitations, we divide the Turtle file containing the final graph into a few sub-files, each describing information about one of the following (an illustrative layout of the output folder is shown after this list):
- articles,
- authors,
- bibliography,
- conference papers,
- papers,
- organizations.
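Assuming one Turtle file per category above, the output folder might look like this (the file names are illustrative, not the exact ones the pipeline produces):

```
final_output/
├── articles.ttl
├── authors.ttl
├── bibliography.ttl
├── conference_papers.ttl
├── papers.ttl
└── organizations.ttl
```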
Modules:
- Publication scraper
- Information extractor
- Publication embedder
- Topical classifier
- Publication recommender
To modify most parts of KG-pipeline, it is best to refer to the developer guides in the specific modules. However, if you need to customize the entire pipeline orchestration, an analysis of the following components is recommended:
- `run_project.sh`, the script which plays a crucial role in activating Docker Compose and executing the main scripts within the containers,
- `compose.yaml`, the Docker Compose configuration file responsible for setting up containers and volumes.
For instance, when incorporating a new module into the pipeline, it is vital to follow these steps (a sketch of both changes is given after this list):
- update `compose.yaml`: integrate the new module's Docker container into the `compose.yaml` file, specifying any required volumes,
- modify `run_project.sh`: include a `docker run` line within `run_project.sh` to execute a script inside the newly added container. This script should encompass all the functionalities desired for the new module.
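A minimal sketch of what these two changes might look like; the service name, image, volume paths, and entry-point script below are illustrative assumptions, not the actual module layout:

```yaml
# compose.yaml: hypothetical entry for the new module
services:
  new-module:                            # illustrative service name
    image: opencs/new-module:latest      # illustrative image name
    volumes:
      - ./new_module_output:/app/output  # share the module's output with the host
```

```bash
# run_project.sh: hypothetical addition
# Run the new module's main script inside its container
docker run --rm \
  -v "$(pwd)/new_module_output:/app/output" \
  opencs/new-module:latest \
  python /app/main.py  # illustrative entry point
```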