Publication embedder - OpenCS-ontology/OpenCS GitHub Wiki
This component creates embeddings for abstracts and titles of academic research papers. It takes the articles' Turtle files and enriches them with embedding vectors. Additionally, this module generates embeddings for keywords extracted from the OpenCS ontology and saves them in JSON format.
Module overview
For each input article, the following operations are executed:
- Extract the article title and abstract from its Turtle file
- Create a single embedding vector from the title and abstract, using the allenai specter2 model for longer sequences
- Add the created vector to the input Turtle file
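The per-article flow above can be sketched as follows. This is a minimal illustration, not the module's actual code: the regex-based extraction, the `urn:example:hasEmbedding` property IRI, and the `embed` function (a deterministic stand-in for the specter2 model call) are all assumptions made for the example.

```python
import re

def extract_abstract_title(ttl_text):
    # Naive extraction of dcterms:title and dcterms:abstract literals from
    # a Turtle document; the real module's parsing may differ.
    title = re.search(r'dcterms:title\s+"([^"]+)"', ttl_text).group(1)
    abstract = re.search(r'dcterms:abstract\s+"([^"]+)"', ttl_text).group(1)
    return title, abstract

def embed(text):
    # Stand-in for the specter2 model: returns a small dummy vector.
    # The real module tokenizes `text` and pools the model's hidden states.
    return [float(len(word)) for word in text.split()[:4]]

def enrich_turtle(ttl_text):
    title, abstract = extract_abstract_title(ttl_text)
    # Title and abstract are joined with a separator before embedding.
    vector = embed(title + " [SEP] " + abstract)
    # The vector is appended as a custom property in string form, with
    # coordinates separated by commas (property name is illustrative).
    value = ",".join(str(x) for x in vector)
    return ttl_text + f'\n<#paper> <urn:example:hasEmbedding> "{value}" .\n'
```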
After that, the module splits the concepts from the OpenCS ontology into batches and, for every batch, creates a JSON file containing the concepts and information about them in the following format:
```json
{
  "OpenCS_ID_1": {
    "prefLabel": "string",
    "related": "string or list of strings",
    "broader": "string or list of strings",
    "embedding": "list with embedding vector"
  }
}
```
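The batching step can be sketched as follows. The function name, file naming scheme, and batch layout are illustrative assumptions; only the JSON format mirrors the one shown above.

```python
import json

def write_concept_batches(concepts, batch_size, out_dir="."):
    """Split a {opencs_id: info} mapping into JSON files of at most
    `batch_size` concepts each, using the format shown above.
    Function and file names are illustrative, not the module's own."""
    ids = sorted(concepts)
    paths = []
    for i in range(0, len(ids), batch_size):
        batch = {cid: concepts[cid] for cid in ids[i:i + batch_size]}
        path = f"{out_dir}/concepts_batch_{i // batch_size}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(batch, f, indent=2)
        paths.append(path)
    return paths
```

Adjusting `batch_size` corresponds to the batch-size modification discussed in the developer guide below.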
Embedding vectors are created using the allenai specter2 model for short sequences.
Module output
Developer guide
Languages and technologies used
- Python: 3.10.9
- Bash: 5.1.16(1)-release
- Docker engine: 24.0.6
- allenai specter2 aug 2023 refresh base model
Python libraries used
The libraries and their respective versions used in this project are outlined in the requirements.txt file.
Module modification guide
The following files are central to the module's functionality and are the most likely targets for modification:
embed_concepts.py
- This Python module downloads the current OpenCS ontology, then uses the preferred label of each concept to generate embedding vectors, and compiles JSON files containing fundamental concept information (such as labels, broader concepts, etc.) together with the associated vectors. There are limited opportunities for modification within this script, though one may wish to adjust the size of the batches of concepts stored within a single JSON file. There is also flexibility in selecting the model for creating embeddings, with the current choice being allenai/specter2_aug2023refresh_base. Those interested can experiment with other models by updating the tokenizer variable within the main function.
embed_papers.py
- This Python module generates an embedding vector for each article based on its title and abstract, then appends the vector to the article's Turtle file. It can be modified by adjusting the content embedded for each article within the innermost for loop in the main function. Presently, the embedding is constructed from the article's title followed by a separator and the abstract. There is also flexibility in selecting the model for embeddings: the current choice is allenai/specter2_aug2023refresh_base, but experimentation with alternative models is encouraged; the model can be changed by modifying the tokenizer variable within the main function. The method for adding vectors to Turtle files is straightforward: the vector is included as a custom property in string form, with coordinates listed sequentially and separated by commas. For those seeking optimization, exploring a more efficient way to store these vectors is recommended, starting from the extract_abstract_title function.
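The comma-separated string representation described above round-trips as sketched below; the function names are illustrative, and the real module writes the resulting literal into the Turtle file rather than returning it.

```python
def vector_to_literal(vector):
    # Store coordinates sequentially, separated by commas, as described.
    return ",".join(repr(float(x)) for x in vector)

def literal_to_vector(literal):
    # Recover the vector from the comma-separated string form.
    return [float(x) for x in literal.split(",")]
```

A plain string literal is simple but verbose; as the text suggests, a more compact encoding could be explored if storage or parsing cost becomes a concern.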
Communication with other KG-pipeline modules
This module integrates with other components of KG-pipeline through Docker volumes. It uses the following volumes:
ttl_files_for_every_run
Volume for storing input Turtle files. This folder is not cleared before each run, so that every file processed by specific users is preserved across multiple runs and historical data remains accessible for matching against data from the newest runs. These Turtle files are produced by the Information extractor module.
concepts_embeddings
Volume for storing JSON files containing information (such as the preferred label or embedding vector) about OpenCS ontology concepts. These concepts are then further processed by the Topical classifier module.
embedded_ttls
Volume for storing output Turtle files containing embedding vectors. These files are then further processed by the Topical classifier module.
Module testing
Tests for this module are automated and integrated with CI through GitHub Workflows. The workflow is triggered after each push to the main branch and consists of the following steps:
build-and-push-image
This step builds the image for this module and pushes it to the GitHub Container Registry. This registry serves as the source from which the image is downloaded during a KG-pipeline run. The step will only succeed when the Dockerfile is specified correctly, so it also serves as a crucial test for this component. The following jobs are executed as part of this step:
- Set up job
- Checkout repository
- Log in to the Container registry
- Extract metadata (tags, labels) for Docker
- Build and push Docker image
- Post build and push Docker image
- Post log in to the container registry
- Post checkout repository
- Complete job
container-test
This step creates the test environment, runs the test scripts, and cleans up the environment after the tests are completed.
- Set up job
- Checkout code
- Set up containers
- Check if paper embedding runs properly
- Check if concept embedding runs properly
- Clean up
- Post checkout code
- Complete job
The container test directory in this module contains an input Turtle file from the previous module, a true output file generated with a validated version of this module, a compare script for the embeddings, and shell scripts that organize the work. These scripts run (in the Check if paper embedding runs properly step) the embed papers script on the input file and the embed concepts script on one batch. For the embeddings, the compare script then checks whether the output generated after changes to the module is similar (identical up to 3 decimal places) to the true output; the output cannot always be identical because of the underlying model, which is why this tolerance is implemented. For the concepts, assert statements check whether the most important objects are valid (these tests run in the Check if concept embedding runs properly step).
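The tolerance check ("identical up to 3 decimal places") can be expressed as a minimal sketch; the actual compare script may differ in structure and name:

```python
def embeddings_match(expected, actual, decimals=3):
    """Return True if two embedding vectors agree when rounded to
    `decimals` decimal places; exact equality is not required because
    the underlying model's output can vary slightly between builds."""
    if len(expected) != len(actual):
        return False
    return all(round(a, decimals) == round(b, decimals)
               for a, b in zip(expected, actual))
```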