Information extractor
This component extracts structured information, such as figures, labels, formulas, sections, and bibliographies, from a given research article using the GROBID Docker container. It takes the articles' PDFs and Turtle files as input and returns new Turtle files enriched with the extracted information.
Module overview
The solution performs the following steps for each paper (a minimal sketch of this flow is shown after the list):
- Take the PDF file of the given paper from the input folder
- Process it with the GROBID library, which extracts information from the file (input: PDF file, output: XML)
- Parse the XML and convert it into a dictionary using the xmltodict library
- Extract the needed information from the dictionaries
- Save the extracted information in the form of a Turtle file using the rdflib library
- Merge the basic Turtle file of the original article with the one created during processing in this module, again using the rdflib library
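The end-to-end flow could look roughly like the sketch below. This is a minimal illustration only: it assumes a GROBID service running locally on its default port (8070), and the `process_paper` helper, the file paths, and the use of the requests library are assumptions rather than the module's actual code.

```python
# Minimal sketch of the per-paper flow, assuming a GROBID service is already
# running locally on its default port (8070); the helper name, file paths and
# the use of the requests library are illustrative assumptions.
import requests
import xmltodict
from rdflib import Graph

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def process_paper(pdf_path: str, base_ttl_path: str, out_ttl_path: str) -> None:
    # 1. Send the PDF to GROBID and receive TEI XML back
    with open(pdf_path, "rb") as pdf:
        response = requests.post(GROBID_URL, files={"input": pdf})
    response.raise_for_status()

    # 2. Parse the TEI XML into a nested dictionary
    tei = xmltodict.parse(response.text)

    # 3. Extract the needed information and build an RDF graph
    extracted = Graph()
    # ... add triples for figures, formulas, sections, bibliography ...

    # 4. Merge with the paper's original Turtle file and save the result
    merged = Graph()
    merged.parse(base_ttl_path, format="turtle")
    merged += extracted
    merged.serialize(destination=out_ttl_path, format="turtle")
```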
Module output
Developer guide
Languages and technologies used
- Python: 3.10.9
- Bash: 5.1.16(1)-release
- jq: 1.6-2.1ubuntu3
- git: 2.34.1
- Docker engine: 24.0.6
- GROBID
Python libraries used
Module modification guide
The following files are central to the module's functionality and are the most likely targets for modification:
- fig_tab_ie.py - This Python file is responsible for extracting the key information from the input PDF files. To expand its scope by incorporating additional data from the GROBID output XML file, it is recommended to implement a custom function that takes the graph and the relevant XML file as inputs, integrates the additional information into the graph, and returns the updated graph. Place this function after the section that extracts XML data from the PDFs to keep the extraction flow consistent (a hedged sketch of such a function follows this list).
- merge_ttle_files.py - This Python file merges the input Turtle files of the scientific papers with the Turtle files generated within this module. You can modify either the merging method itself (the merge_ttl function) or the way corresponding input and output Turtle files are matched (the for loops inside the main function).
- container_run.sh - This shell script runs the module's main scripts in sequence, handling tasks such as data extraction and Turtle file merging. It is not expected to change unless the module's main functionality is modified; in that case, add a new Python file with the required functionality and call it from within the shell script.
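As an illustration of the recommended extension pattern for fig_tab_ie.py, the sketch below adds keyword information from the GROBID TEI XML to the graph. The namespace, the predicate name, the function name, and the exact TEI paths are assumptions for illustration, not the module's actual schema.

```python
# Hedged sketch of a custom extraction function; the namespace, the predicate
# name and the exact TEI structure are illustrative assumptions.
import xmltodict
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/opencs/")  # placeholder namespace

def add_keywords_from_xml(graph: Graph, xml_path: str, paper_uri: URIRef) -> Graph:
    """Read extra information from the GROBID TEI XML and add it to the graph."""
    with open(xml_path, "r", encoding="utf-8") as f:
        tei = xmltodict.parse(f.read())

    # Navigate the TEI header defensively; GROBID output varies between papers.
    keywords = (
        tei.get("TEI", {})
        .get("teiHeader", {})
        .get("profileDesc", {})
        .get("textClass", {})
        .get("keywords", {})
    )
    terms = keywords.get("term", []) if isinstance(keywords, dict) else []
    if isinstance(terms, str):
        terms = [terms]

    for term in terms:
        graph.add((paper_uri, EX.keyword, Literal(str(term))))

    return graph
```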
Communication with other KG-pipeline modules
This module integrates with the other components of the KG-pipeline through Docker volumes. It uses the following volumes (a sketch of how files might be paired across them is shown after the list):
- ttl_folder - stores the input Turtle files. The folder is cleared before every run, ensuring that previously processed data is not reprocessed. These Turtle files are taken from Publication scraper.
- pdf_folder - stores the input PDF files. The folder is cleared before every run, ensuring that previously processed data is not reprocessed. These PDF files are taken from Publication scraper.
- ttl_files_for_every_run - stores the output Turtle files. This folder is not cleared before each run, so every file processed across multiple runs is preserved, offering continuity and access to historical data. These Turtle files are then further processed by Publication embedder.
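For orientation, the sketch below shows how a script inside the container might pair files across these volumes. The mount points and the convention of pairing files by file name are assumptions, not the module's documented behavior.

```python
# Illustrative only: the mount points and the pairing of files by their stem
# (file name without extension) are assumptions.
from pathlib import Path

TTL_IN = Path("/ttl_folder")                # input Turtle files
PDF_IN = Path("/pdf_folder")                # input PDF files
TTL_OUT = Path("/ttl_files_for_every_run")  # accumulated output Turtle files

for pdf_path in sorted(PDF_IN.glob("*.pdf")):
    base_ttl = TTL_IN / f"{pdf_path.stem}.ttl"
    out_ttl = TTL_OUT / f"{pdf_path.stem}.ttl"
    if base_ttl.exists():
        # process_paper(pdf_path, base_ttl, out_ttl) would run the extraction
        # and merging steps described in the module overview
        print(f"Would process {pdf_path.name} -> {out_ttl}")
```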
Module testing
Tests for this module are automated and integrated with CI through GitHub Workflows. The workflow is triggered after each push to the main branch and consists of the following steps:
build-and-push-image
This step is responsible for building the image for this module and pushing it to the GitHub image registry. This registry serves as the source from which the image is downloaded during the KG-pipeline run. The step only succeeds when the Dockerfile is specified correctly, so it also serves as a crucial test for this component. The following jobs are executed as part of this step:
- Set up job
- Checkout repository
- Log in to the Container registry
- Extract metadata (tags, labels) for Docker
- Build and push Docker image
- Post build and push Docker image
- Post log in to the container registry
- Post checkout repository
- Complete job
container-test
This step is responsible for creating the test environment, running the test script, and cleaning up the environment after the tests are completed.
- Set up job
- Checkout code
- Set up containers
- Check if container runs properly
- Clean up
- Post checkout code
- Complete job
The step labeled Check if container runs properly contains the primary test script. It runs the container on 14 test papers, verifying that information can be scraped from the PDF files. It then stores the scraped data and performs the necessary data transformation for each paper. Finally, it checks whether the graph generated for each test paper is isomorphic to a reference solution prepared for that specific paper.
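A minimal sketch of such an isomorphism check with rdflib is shown below; the file paths are illustrative, not the test script's actual layout.

```python
# Minimal sketch of the graph-isomorphism check; the file paths are illustrative.
from rdflib import Graph
from rdflib.compare import isomorphic

generated = Graph().parse("output/test_paper.ttl", format="turtle")
expected = Graph().parse("expected/test_paper.ttl", format="turtle")

assert isomorphic(generated, expected), "Generated graph differs from the expected solution"
```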