
Publication scraper

This component is responsible for scraping information about articles from SCPE and FedCSIS archives and creating Turtle files that include fundamental article details such as authors, title, abstract, publication date, etc. It also downloads the articles in PDF format for possible further processing.

Module overview

  1. Given CSIS volume and SCPE issue numbers, scrape the corresponding archive pages
  2. Save the articles' PDF files
  3. Save the scraped information in the form of a dataclass (one instance per article; see the sketch below)
  4. Convert each dataclass instance to a Turtle file
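As a rough illustration of step 3, a scraped article could be held in a dataclass along the lines of the sketch below. The class name and fields are assumptions for illustration only, not the module's actual PaperModel or PaperScraperResponse definition.

    from dataclasses import dataclass, field

    # Illustrative only: the real module defines its own PaperModel /
    # PaperScraperResponse dataclasses with their own field names.
    @dataclass
    class ScrapedArticle:
        title: str
        abstract: str
        authors: list[str] = field(default_factory=list)
        publication_date: str = ""
        pdf_url: str = ""

    # One instance is created per scraped article, e.g.:
    article = ScrapedArticle(
        title="Example title",
        abstract="Example abstract",
        authors=["Jane Doe", "John Smith"],
        publication_date="2023-01-01",
        pdf_url="https://example.org/paper.pdf",
    )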

Module output

Developer guide

Languages and technologies used

Python libraries used

The libraries used in this project and their respective versions are listed in the requirements.txt file.

Module modification guide

The following files are central to the module's functionality and are the most likely targets for modification:

  • scpe_scraper - This folder contains all the components for extracting information from SCPE archive pages. To extend the scraping, write a function that extracts the additional information from a paper's page using BeautifulSoup and call it from the existing scrape_paper function in scpe_scraper/paper_scraper.py (see the first sketch after this list). The PaperScraperResponse must also be adjusted to accommodate the new data.
  • csis_scraper - This directory contains all the components for extracting information from CSIS archive pages. As with scpe_scraper, the scope of the scraped information can be adjusted, but here the scraping logic lives in csis_scraper/scrape/scrape.py. To extract new data, call the function responsible for it inside the tranverse_papers function and save its result to the PaperModel dataclass.
  • paper_model_to_ttl.py - This Python script converts the information stored in a PaperModel dataclass into a Turtle file for each article. Any change to the scope of the scraped information must be reflected in this script, otherwise it will not appear in the resulting Turtle files (see the second sketch after this list).
  • main.py - This Python module orchestrates the overall functionality. Modifying the existing code for the SCPE and CSIS archives is generally not recommended, since changes to the scope of the scraped information can usually be made without touching this script. Custom scrapers for other archives, however, should be integrated into the main function in the same way as the existing ones. It is crucial to store the scraped data as a PaperModel dataclass, which keeps it compatible with the paper_model_to_ttl.py script used to convert the data into Turtle files.
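As an example of the kind of extension described for scpe_scraper above, a new field extractor could look roughly like the sketch below. The CSS selector and the idea of extracting keywords are assumptions about the archive's markup, not the module's actual code; the function would be called from scrape_paper on the already parsed page and its result stored in a new PaperScraperResponse field.

    from bs4 import BeautifulSoup

    def extract_keywords(soup: BeautifulSoup) -> list[str]:
        """Hypothetical extractor pulling keywords from an SCPE paper page.

        The selector below is a guess about the page markup and must be
        adjusted to the real HTML before wiring this into scrape_paper.
        """
        nodes = soup.select("div.article-keywords a")
        return [node.get_text(strip=True) for node in nodes]

    # Example usage on a downloaded page:
    # keywords = extract_keywords(BeautifulSoup(html, "html.parser"))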
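Similarly, any new field has to be mapped to triples in paper_model_to_ttl.py, roughly as in the sketch below. The namespace, property choices, and function signature are illustrative assumptions; the real script maps a full PaperModel instance.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    def example_paper_to_turtle(paper_uri: str, title: str, keywords: list[str], out_path: str) -> None:
        """Illustrative Turtle serialisation of a few scraped fields."""
        g = Graph()
        subject = URIRef(paper_uri)
        g.add((subject, DCTERMS.title, Literal(title)))
        # A newly scraped field (here: keywords) needs its own triples,
        # otherwise it will not show up in the generated Turtle files.
        for keyword in keywords:
            g.add((subject, DCTERMS.subject, Literal(keyword)))
        g.serialize(destination=out_path, format="turtle")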

Communication with other KG-pipeline modules

This module integrates with other components of KG-pipeline through Docker volumes. It uses the following volumes:

  • ttl_folder volume for storing the Turtle files produced by this module. The folder is cleared before every run so that previously processed data is not reprocessed (see the sketch after this list). Files from this folder are further processed by the Information extractor module.
  • pdf_folder volume for storing the downloaded PDF files. The folder is cleared before every run so that previously processed data is not reprocessed. Files from this folder are further processed by the Information extractor module.
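Inside the container, the clearing step could look roughly like the sketch below, assuming the volumes are mounted at /ttl_folder and /pdf_folder; the mount points are assumptions for illustration, not necessarily the paths used by the actual setup.

    import shutil
    from pathlib import Path

    # Assumed mount points of the shared Docker volumes inside the container.
    TTL_DIR = Path("/ttl_folder")
    PDF_DIR = Path("/pdf_folder")

    def clear_shared_volumes() -> None:
        """Empty both shared volumes so previously processed data is not reprocessed."""
        for directory in (TTL_DIR, PDF_DIR):
            directory.mkdir(parents=True, exist_ok=True)
            for entry in directory.iterdir():
                if entry.is_dir():
                    shutil.rmtree(entry)
                else:
                    entry.unlink()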

Module testing

Tests for this module are automated and integrated with CI through GitHub workflows. The scraper_test workflow is triggered after each push to the main branch and consists of the following steps:

container-test

This step is responsible for creating the test environment, running the test script, and cleaning up the environment after the tests are completed.

  • Set up job
  • Checkout code
  • Set up containers
  • Check if container runs properly
  • Clean up
  • Post checkout code
  • Complete job

The step labeled Check if container runs properly contains the primary test script. It executes the container on a test paper, verifying that pages can be scraped from the archives. It then stores the scraped data and performs the necessary data transformation. Finally, it checks whether the graph generated from the test paper is isomorphic to the reference graph prepared for that specific paper.
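The isomorphism check itself can be performed with rdflib's compare utilities; a minimal sketch, with placeholder file names, is shown below.

    from rdflib import Graph
    from rdflib.compare import isomorphic

    def graphs_match(generated_ttl: str, expected_ttl: str) -> bool:
        """Return True if the generated graph is isomorphic to the reference graph."""
        generated = Graph().parse(generated_ttl, format="turtle")
        expected = Graph().parse(expected_ttl, format="turtle")
        return isomorphic(generated, expected)

    # e.g. graphs_match("test_paper.ttl", "expected_test_paper.ttl")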

build-and-push-image

This step is responsible for building the Docker image for this module and pushing it to the GitHub Container Registry. This registry serves as the source from which the image is downloaded during a KG-pipeline run. The step only succeeds when the Dockerfile is specified correctly, so it also serves as a crucial test for this component. The following jobs are executed as part of this step:

  • Set up job
  • Checkout repository
  • Log in to the Container registry
  • Extract metadata (tags, labels) for Docker
  • Build and push Docker image
  • Post build and push Docker image
  • Post log in to the container registry
  • Post checkout repository
  • Complete job