Publication embedder - OpenCS-ontology/OpenCS GitHub Wiki
This component creates embeddings for abstracts and titles of academic research papers. It takes the articles' Turtle files and enriches them with embedding vectors. Additionally, this module generates embeddings for keywords extracted from the OpenCS ontology and saves them in JSON format.
Module overview
For each input article, the following operations are executed:
- Extract the article title and abstract from its Turtle file
- Create a single embedding vector from the title and abstract, using the allenai specter2 model for longer sequences
- Add the created vector to the input Turtle file
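The per-article flow above can be sketched as follows. This is a minimal illustration, not the module's actual code: the regex-based extraction, the `urn:example:hasEmbedding` property IRI, and the `embed` function (a deterministic stand-in for the specter2 model call) are all assumptions made for the example.

```python
import re

def extract_abstract_title(ttl_text):
    # Naive extraction of dcterms:title and dcterms:abstract literals from
    # a Turtle document; the real module's parsing may differ.
    title = re.search(r'dcterms:title\s+"([^"]+)"', ttl_text).group(1)
    abstract = re.search(r'dcterms:abstract\s+"([^"]+)"', ttl_text).group(1)
    return title, abstract

def embed(text):
    # Stand-in for the specter2 model: returns a small dummy vector.
    # The real module tokenizes `text` and pools the model's hidden states.
    return [float(len(word)) for word in text.split()[:4]]

def enrich_turtle(ttl_text):
    title, abstract = extract_abstract_title(ttl_text)
    # Title and abstract are joined with a separator before embedding.
    vector = embed(title + " [SEP] " + abstract)
    # The vector is appended as a custom property in string form, with
    # coordinates separated by commas (property name is illustrative).
    value = ",".join(str(x) for x in vector)
    return ttl_text + f'\n<#paper> <urn:example:hasEmbedding> "{value}" .\n'
```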
After that, the module splits the concepts from the OpenCS ontology into batches and, for every batch, creates a JSON file containing the concepts and information about them in the following format:
```json
{
  "OpenCS_ID_1": {
    "prefLabel": "string",
    "related": "string or list of strings",
    "broader": "string or list of strings",
    "embedding": "list with embedding vector"
  }
}
```
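The batching step can be sketched as follows. The function name, file naming scheme, and batch layout are illustrative assumptions; only the JSON format mirrors the one shown above.

```python
import json

def write_concept_batches(concepts, batch_size, out_dir="."):
    """Split a {opencs_id: info} mapping into JSON files of at most
    `batch_size` concepts each, using the format shown above.
    Function and file names are illustrative, not the module's own."""
    ids = sorted(concepts)
    paths = []
    for i in range(0, len(ids), batch_size):
        batch = {cid: concepts[cid] for cid in ids[i:i + batch_size]}
        path = f"{out_dir}/concepts_batch_{i // batch_size}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(batch, f, indent=2)
        paths.append(path)
    return paths
```

Adjusting `batch_size` corresponds to the batch-size modification discussed in the developer guide below.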
Embedding vectors are created using the allenai specter2 model for short sequences.
Module output
Developer guide
Languages and technologies used
- Python: 3.10.9
- Bash: 5.1.16(1)-release
- Docker engine: 24.0.6
- allenai specter2 aug 2023 refresh base model
Python libraries used
The libraries and their respective versions used in this project are outlined in the requirements.txt file.
Module modification guide
The following files are central to the module's functionality and are the most likely targets for modification:
embed_concepts.py
- This Python module downloads the current OpenCS ontology, then uses the preferred label of each concept to generate embedding vectors, and compiles JSON files containing fundamental concept information (such as labels, broader concepts, etc.) together with the associated vectors. There are limited opportunities for modification within this script, though one may wish to adjust the size of the batches of concepts stored within a single JSON file. There is also flexibility in selecting the model for creating embeddings, with the current choice being allenai/specter2_aug2023refresh_base. Those interested can experiment with other models by updating the tokenizer variable within the main function.
embed_papers.py
- This Python module generates an embedding vector for each article based on its title and abstract, then appends the vector to the article's Turtle file. It can be modified by adjusting the content embedded for each article within the innermost for loop in the main function. Presently, the embedding is constructed from the article's title followed by a separator and the abstract. There is also flexibility in selecting the model for embeddings: the current choice is allenai/specter2_aug2023refresh_base, but experimentation with alternative models is encouraged; the model can be changed by modifying the tokenizer variable within the main function. The method for adding vectors to Turtle files is straightforward: the vector is included as a custom property in string form, with coordinates listed sequentially and separated by commas. For those seeking optimization, exploring a more efficient way to store these vectors is recommended, starting from the extract_abstract_title function.
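The comma-separated string representation described above round-trips as sketched below; the function names are illustrative, and the real module writes the resulting literal into the Turtle file rather than returning it.

```python
def vector_to_literal(vector):
    # Store coordinates sequentially, separated by commas, as described.
    return ",".join(repr(float(x)) for x in vector)

def literal_to_vector(literal):
    # Recover the vector from the comma-separated string form.
    return [float(x) for x in literal.split(",")]
```

A plain string literal is simple but verbose; as the text suggests, a more compact encoding could be explored if storage or parsing cost becomes a concern.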
Communication with other KG-pipeline modules
This module integrates with other components of KG-pipeline through Docker volumes. It uses the following volumes:
ttl_files_for_every_run
Volume for storing input Turtle files. This folder is not cleared before each run, so that every file processed by specific users is preserved across multiple runs and historical data remains accessible for matching against data from the newest runs. These Turtle files are produced by the Information extractor module.
concepts_embeddings
Volume for storing JSON files containing information (such as the preferred label or embedding vector) about OpenCS ontology concepts. These concepts are then further processed by the Topical classifier module.
embedded_ttls
Volume for storing output Turtle files containing embedding vectors. These files are then further processed by the Topical classifier module.
Module testing
Tests for this module are automated and integrated with CI through GitHub Workflows. The workflow is triggered after each push to the main branch and consists of the following steps:
build-and-push-image
This step builds the image for this module and pushes it to the GitHub Container Registry. This registry serves as the source from which the image is downloaded during a KG-pipeline run. The step will only succeed when the Dockerfile is specified correctly, so it also serves as a crucial test for this component. The following jobs are executed as part of this step:
- Set up job
- Checkout repository
- Log in to the Container registry
- Extract metadata (tags, labels) for Docker
- Build and push Docker image
- Post build and push Docker image
- Post log in to the container registry
- Post checkout repository
- Complete job
container-test
This step creates the test environment, runs the test scripts, and cleans up the environment after the tests are completed.
- Set up job
- Checkout code
- Set up containers
- Check if paper embedding runs properly
- Check if concept embedding runs properly
- Clean up
- Post checkout code
- Complete job
The container test directory in this module contains an input Turtle file from the previous module, a true output file generated with a validated version of this module, a compare script for the embeddings, and shell scripts that organize the work. These scripts run (in the Check if paper embedding runs properly step) the embed papers script on the input file and the embed concepts script on one batch. For the embeddings, the compare script then checks whether the output generated after changes to the module is similar (identical up to 3 decimal places) to the true output; the output cannot always be identical because of the underlying model, which is why this tolerance is implemented. For the concepts, assert statements check whether the most important objects are valid (these tests run in the Check if concept embedding runs properly step).
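The tolerance check ("identical up to 3 decimal places") can be expressed as a minimal sketch; the actual compare script may differ in structure and name:

```python
def embeddings_match(expected, actual, decimals=3):
    """Return True if two embedding vectors agree when rounded to
    `decimals` decimal places; exact equality is not required because
    the underlying model's output can vary slightly between builds."""
    if len(expected) != len(actual):
        return False
    return all(round(a, decimals) == round(b, decimals)
               for a, b in zip(expected, actual))
```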