Information extractor
This component extracts structured information, such as figures, labels, formulas, sections, and bibliographies, from a given research article using the GROBID Docker container. It takes the articles' PDFs and Turtle files as input and returns new Turtle files enriched with the extracted information.
Module overview
The solution performs the following steps for each paper (a minimal sketch of this flow is shown after the list):
- Take the PDF file of the given paper from the input folder
- Process it with the GROBID library, which extracts information from the file (input: PDF file, output: XML)
- Parse the XML and convert it into a dictionary using the xmltodict library
- Extract the needed information from the dictionaries
- Save the extracted information in the form of a Turtle file using the rdflib library
- Merge the basic Turtle file of the original article with the one created during processing in this module, again using the rdflib library
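The end-to-end flow could look roughly like the sketch below. This is a minimal illustration only: it assumes a GROBID service running locally on its default port (8070), and the `process_paper` helper, the file paths, and the use of the requests library are assumptions rather than the module's actual code.

```python
# Minimal sketch of the per-paper flow, assuming a GROBID service is already
# running locally on its default port (8070); the helper name, file paths and
# the use of the requests library are illustrative assumptions.
import requests
import xmltodict
from rdflib import Graph

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def process_paper(pdf_path: str, base_ttl_path: str, out_ttl_path: str) -> None:
    # 1. Send the PDF to GROBID and receive TEI XML back
    with open(pdf_path, "rb") as pdf:
        response = requests.post(GROBID_URL, files={"input": pdf})
    response.raise_for_status()

    # 2. Parse the TEI XML into a nested dictionary
    tei = xmltodict.parse(response.text)

    # 3. Extract the needed information and build an RDF graph
    extracted = Graph()
    # ... add triples for figures, formulas, sections, bibliography ...

    # 4. Merge with the paper's original Turtle file and save the result
    merged = Graph()
    merged.parse(base_ttl_path, format="turtle")
    merged += extracted
    merged.serialize(destination=out_ttl_path, format="turtle")
```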
Module output
Developer guide
Languages and technologies used
- Python: 3.10.9
- Bash: 5.1.16(1)-release
- jq: 1.6-2.1ubuntu3
- git: 2.34.1
- Docker engine: 24.0.6
- GROBID
Python libraries used
Module modification guide
The following files are central to the module's functionality and are the most likely targets for modification:
- fig_tab_ie.py - This Python file is responsible for extracting the key information from the input PDF files. To expand its scope by incorporating additional data from the GROBID output XML file, it is recommended to implement a custom function that takes the graph and the relevant XML file as inputs, integrates the additional information into the graph, and returns the updated graph. Place this function after the section that extracts XML data from the PDFs to keep the extraction flow consistent (a hedged sketch of such a function follows this list).
- merge_ttle_files.py - This Python file merges the input Turtle files of the scientific papers with the Turtle files generated within this module. You can modify either the merging method itself (the merge_ttl function) or the way corresponding input and output Turtle files are matched (the for loops inside the main function).
- container_run.sh - This shell script runs the module's main scripts in sequence, handling tasks such as data extraction and Turtle file merging. It is not expected to change unless the module's main functionality is modified; in that case, add a new Python file with the required functionality and call it from within the shell script.
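As an illustration of the recommended extension pattern for fig_tab_ie.py, the sketch below adds keyword information from the GROBID TEI XML to the graph. The namespace, the predicate name, the function name, and the exact TEI paths are assumptions for illustration, not the module's actual schema.

```python
# Hedged sketch of a custom extraction function; the namespace, the predicate
# name and the exact TEI structure are illustrative assumptions.
import xmltodict
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/opencs/")  # placeholder namespace

def add_keywords_from_xml(graph: Graph, xml_path: str, paper_uri: URIRef) -> Graph:
    """Read extra information from the GROBID TEI XML and add it to the graph."""
    with open(xml_path, "r", encoding="utf-8") as f:
        tei = xmltodict.parse(f.read())

    # Navigate the TEI header defensively; GROBID output varies between papers.
    keywords = (
        tei.get("TEI", {})
        .get("teiHeader", {})
        .get("profileDesc", {})
        .get("textClass", {})
        .get("keywords", {})
    )
    terms = keywords.get("term", []) if isinstance(keywords, dict) else []
    if isinstance(terms, str):
        terms = [terms]

    for term in terms:
        graph.add((paper_uri, EX.keyword, Literal(str(term))))

    return graph
```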
Communication with other KG-pipeline modules
This module integrates with the other components of the KG-pipeline through Docker volumes. It uses the following volumes (a sketch of how files might be paired across them is shown after the list):
- ttl_folder - stores the input Turtle files. The folder is cleared before every run, ensuring that previously processed data is not reprocessed. These Turtle files are taken from Publication scraper.
- pdf_folder - stores the input PDF files. The folder is cleared before every run, ensuring that previously processed data is not reprocessed. These PDF files are taken from Publication scraper.
- ttl_files_for_every_run - stores the output Turtle files. This folder is not cleared before each run, so every file processed across multiple runs is preserved, offering continuity and access to historical data. These Turtle files are then further processed by Publication embedder.
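For orientation, the sketch below shows how a script inside the container might pair files across these volumes. The mount points and the convention of pairing files by file name are assumptions, not the module's documented behavior.

```python
# Illustrative only: the mount points and the pairing of files by their stem
# (file name without extension) are assumptions.
from pathlib import Path

TTL_IN = Path("/ttl_folder")                # input Turtle files
PDF_IN = Path("/pdf_folder")                # input PDF files
TTL_OUT = Path("/ttl_files_for_every_run")  # accumulated output Turtle files

for pdf_path in sorted(PDF_IN.glob("*.pdf")):
    base_ttl = TTL_IN / f"{pdf_path.stem}.ttl"
    out_ttl = TTL_OUT / f"{pdf_path.stem}.ttl"
    if base_ttl.exists():
        # process_paper(pdf_path, base_ttl, out_ttl) would run the extraction
        # and merging steps described in the module overview
        print(f"Would process {pdf_path.name} -> {out_ttl}")
```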
Module testing
Tests for this module are automated and integrated with CI through GitHub Workflows. The workflow is triggered after each push to the main branch and consists of the following steps:
build-and-push-image
This step is responsible for building the image for this module and pushing it to the GitHub image registry. This registry serves as the source from which the image is downloaded during the KG-pipeline run. The step only succeeds when the Dockerfile is specified correctly, so it also serves as a crucial test for this component. The following jobs are executed as part of this step:
- Set up job
- Checkout repository
- Log in to the Container registry
- Extract metadata (tags, labels) for Docker
- Build and push Docker image
- Post build and push Docker image
- Post log in to the container registry
- Post checkout repository
- Complete job
container-test
This step is responsible for creating the test environment, running the test script, and cleaning up the environment after the tests are completed.
- Set up job
- Checkout code
- Set up containers
- Check if container runs properly
- Clean up
- Post checkout code
- Complete job
The step labeled Check if container runs properly contains the primary test script. It runs the container on 14 test papers, verifying that information can be scraped from the PDF files. It then stores the scraped data and performs the necessary data transformation for each paper. Finally, it checks whether the graph generated for each test paper is isomorphic to a reference solution prepared for that specific paper.
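A minimal sketch of such an isomorphism check with rdflib is shown below; the file paths are illustrative, not the test script's actual layout.

```python
# Minimal sketch of the graph-isomorphism check; the file paths are illustrative.
from rdflib import Graph
from rdflib.compare import isomorphic

generated = Graph().parse("output/test_paper.ttl", format="turtle")
expected = Graph().parse("expected/test_paper.ttl", format="turtle")

assert isomorphic(generated, expected), "Generated graph differs from the expected solution"
```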