Topical classifier - OpenCS-ontology/OpenCS GitHub Wiki

Topical classifier

This component assigns the best-matching concepts from the OpenCS ontology to given scientific articles based on their titles, abstracts, and embeddings. Although the name contains the word 'classifier', the task is unsupervised and the solution employs the Elasticsearch search engine. The tool takes articles' Turtle files as input and builds an Elasticsearch index that combines text and vector search to identify the most relevant concepts for each paper, associating papers with the pertinent ontology terms.

Module overview

  1. The solution assumes a running Elasticsearch container on port 9200, which is started by the KG-pipeline orchestration. It tries to connect to the container with a GET request to localhost:9200; if the attempt fails, it waits a few seconds and retries. This delay accounts for the time the ES container needs to finish its setup.
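The connect-and-retry step can be sketched as follows. This is a minimal standard-library sketch: the function name, timeout, delay, and retry cap are illustrative assumptions, and the real module may use a different HTTP client. The `probe` parameter is injectable so the wait loop can be exercised without a live container.

```python
import time
import urllib.request
import urllib.error


def wait_for_elasticsearch(url="http://localhost:9200", delay=5,
                           max_retries=60, probe=None):
    """Block until the Elasticsearch container answers a GET request.

    `probe` is injectable for testing; by default it issues a real GET
    against `url`. Returns the number of failed attempts before success.
    """
    if probe is None:
        def probe():
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
    for attempt in range(max_retries):
        try:
            if probe():
                return attempt
        except (urllib.error.URLError, OSError):
            pass  # container not up yet; fall through to the sleep
        time.sleep(delay)
    raise RuntimeError(f"Elasticsearch at {url} did not come up")
```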
  2. It creates an Elasticsearch index for the concepts taken from the OpenCS ontology repository using the following mapping:
{
    "mappings": {
        "properties": {
            "prefLabel": {"type": "text"},
            "broader": {"type": "text"},
            "related": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 768
            },
            "opencs_uid": {"type": "text"}
        }
    }
}

where

  • prefLabel: the skos:prefLabel property, i.e. the preferred lexical label for a resource in a given language (English)
  • broader: the skos:broader property, used for storing concepts that are more general than the subject
  • related: the skos:related property, used for storing concepts related to the subject
  • embedding: the embedding vector for a concept, created in the Publication embedder module
  • opencs_uid: the ID of a concept taken from the OpenCS ontology repository
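In Python, the mapping above can be built and applied with the elasticsearch client roughly like this (a sketch; the index name `concepts` and the function names are assumptions, not the module's actual identifiers):

```python
def concept_index_mapping(dims=768):
    """Return the index mapping shown above as a Python dict."""
    return {
        "mappings": {
            "properties": {
                "prefLabel": {"type": "text"},
                "broader": {"type": "text"},
                "related": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": dims},
                "opencs_uid": {"type": "text"},
            }
        }
    }


def create_concept_index(es, name="concepts"):
    """Create the concept index; `es` is an elasticsearch.Elasticsearch client."""
    es.indices.create(index=name, body=concept_index_mapping())
```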
  3. Now each input paper can be queried against the created index to find its most related concepts. We use the following query:
{
    "size": 20,
    "query": {
        "bool": {
            "should": [
                {
                    "function_score": {
                        "query": {
                            "dis_max": {
                                "queries": [
                                    {
                                        "multi_match": {
                                            "query": "title",
                                            "type": "most_fields",
                                            "analyzer": "standard",
                                            "fields": [
                                                "prefLabel^3",
                                                "related",
                                                "broader"
                                            ],
                                            "tie_breaker": 0.5
                                        }
                                    },
                                    {
                                        "multi_match": {
                                            "query": "abstract",
                                            "type": "most_fields",
                                            "analyzer": "standard",
                                            "fields": [
                                                "prefLabel^3",
                                                "related",
                                                "broader"
                                            ],
                                            "tie_breaker": 0.5
                                        }
                                    }
                                ]
                            }
                        },
                        "boost": 0.1
                    }
                },
                {
                    "function_score": {
                        "query": {
                            "script_score": {
                                "query": {"match_all": {}},
                                "script": {
                                    "source": "cosineSimilarity(params.query_vector, 'embedding')*500",
                                    "params": {"query_vector": "embedding"}
                                }
                            }
                        },
                        "boost": 1
                    }
                }
            ]
        }
    }
}

where

  • title: queried paper's title
  • abstract: queried paper's abstract
  • embedding: queried paper's embedding vector created in Publication embedder module

To put it more simply: for every paper and concept, this query runs a multi_match query over both the paper's title and its abstract, using the most_fields type against the concept's prefLabel (boosted by a factor of 3), related, and broader fields with a tie_breaker of 0.5. It also computes the cosine similarity between the paper's embedding vector and the concept's embedding.
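Putting the pieces together, the query can be assembled per paper roughly like this (a sketch of what a get_query-style helper might produce; the parameter names are assumptions):

```python
def get_query(title, abstract, embedding, size=20):
    """Build the bool/should query above for one paper, substituting the
    paper's actual title, abstract, and embedding vector."""
    def text_clause(text):
        return {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "analyzer": "standard",
                "fields": ["prefLabel^3", "related", "broader"],
                "tie_breaker": 0.5,
            }
        }

    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {
                        # text branch: best of title vs abstract, down-weighted
                        "function_score": {
                            "query": {"dis_max": {"queries": [
                                text_clause(title), text_clause(abstract)]}},
                            "boost": 0.1,
                        }
                    },
                    {
                        # vector branch: rescaled cosine similarity
                        "function_score": {
                            "query": {
                                "script_score": {
                                    "query": {"match_all": {}},
                                    "script": {
                                        "source": "cosineSimilarity(params.query_vector, 'embedding')*500",
                                        "params": {"query_vector": embedding},
                                    },
                                }
                            },
                            "boost": 1,
                        }
                    },
                ]
            }
        },
    }
```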

  4. The final score for each concept is calculated using the formula 500*cosine_similarity + 0.1*MAX(multi_match_title_score, multi_match_abstract_score)
  5. The solution saves the most related concepts for each paper, together with their relation scores, to the corresponding paper's Turtle file.
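The scoring formula is straightforward to express in code (Elasticsearch applies it internally via the boosts in the query; the function below only restates the arithmetic):

```python
def final_score(cosine_similarity, title_score, abstract_score):
    """Combined relevance score for one (paper, concept) pair, mirroring
    the weighting baked into the Elasticsearch query."""
    return 500 * cosine_similarity + 0.1 * max(title_score, abstract_score)
```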

Note that:

  • Multiplying the cosine similarity by 500 recalibrates this score, which normally ranges between 0 and 1, to the same order of magnitude as the scores produced by the 'multi_match' query, which typically fall between 300 and 500. Note that the scoring algorithm used by 'multi_match' makes it hard to establish precise upper and lower bounds for these scores.
  • The maximum of the 'multi_match' scores is scaled down by a factor of 0.1 because raw text-similarity scores are, as empirical evidence shows, not an optimal way to identify the best-fitting concepts. They are still needed because embeddings can be inaccurate for some concepts, particularly unique ones that lack broader or related counterparts. In such cases, 'cosine_similarity' may yield an erroneously low score, while 'multi_match' can still score the concept highly if it is actually mentioned in the title or abstract.
  • To determine how many concepts to save as related for each paper, the solution uses KneeLocator from the kneed library. It identifies the "knee point" in the sorted relation scores, which determines the number of related concepts included in the output file.
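The knee-point cutoff can be illustrated with a hand-rolled knee finder. The module itself uses kneed.KneeLocator; the dependency-free sketch below implements the same idea (the point of maximum distance from the chord joining the endpoints of the sorted score curve), so the function name and return convention are illustrative assumptions.

```python
def find_n_best(scores):
    """Pick how many top concepts to keep by locating the knee of the
    sorted score curve (illustrative stand-in for kneed.KneeLocator)."""
    ys = sorted(scores, reverse=True)
    n = len(ys)
    if n < 3:
        return n  # too few points to have a knee; keep everything

    x1, y0, y1 = n - 1, ys[0], ys[-1]

    def dist(i):
        # distance (up to a constant factor) of point (i, ys[i]) from the
        # straight line through (0, ys[0]) and (n-1, ys[-1])
        return abs((y1 - y0) * i - x1 * ys[i] + x1 * y0)

    knee = max(range(n), key=dist)
    return knee + 1  # keep concepts up to and including the knee
```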

Module output

Developer guide

Languages and technologies used

Python libraries used

The libraries and their respective versions used in this project are outlined in the requirements.txt file.

Module modification guide

The following files are central to the module's functionality and are the most likely targets for modification:

  • pipeline.py - This Python module is the core of the component. It extracts embedding vectors for papers and concepts, builds an Elasticsearch index of concepts for querying papers, and then matches the most relevant concepts with articles. Altering core functionality such as the Elasticsearch usage is not recommended, but you can modify certain parameters: for example, adjust the query itself, defined in the get_query function, or change the number of related concepts added to the final output, as defined in the find_n_best function.

Communication with other KG-pipeline modules

This module integrates with other components of KG-pipeline through Docker volumes. It efficiently utilizes the following volumes:

  • embedded_ttls volume, used for storing input Turtle files containing the embedding vectors produced by the Publication embedder module. After this module modifies these files, they are used as input by the Publication recommender
  • concepts_embeddings volume, used for storing JSON files containing information (such as the preferred label or embedding vector) about OpenCS ontology concepts. These files are produced by the Publication embedder

This module utilizes Elasticsearch for conducting concept matching with papers, and for this purpose, it collaborates with:

  • elasticsearch container (docker.elastic.co/elasticsearch/elasticsearch:7.12.1)

It connects to the ES container using the elasticsearch Python client.

Module testing

Tests for this module are automated and integrated with CI via GitHub Actions workflows. The topical_tests workflow is triggered after each push to the main branch and consists of the following steps:

container-test

This step is responsible for creating the test environment, running the test script, and cleaning up the environment after the tests are completed.

  • Set up job
  • Checkout code
  • Set up containers
  • Check if container runs properly
  • Clean up
  • Post checkout code
  • Complete job

The critical stage in this process is Check if container runs properly. It executes the module on a set of test concepts and a test Turtle file, verifying that functions return values in the proper format and that literals fall within the expected range.

build-and-push-image

This step builds the image for this module and pushes it to the GitHub image registry, from which the image is downloaded during a KG-pipeline run. The step only succeeds when the Dockerfile is specified correctly, so it also serves as a crucial test for this component. The following jobs are executed as part of this step:

  • Set up job
  • Checkout repository
  • Log in to the Container registry
  • Extract metadata (tags, labels) for Docker
  • Build and push Docker image
  • Post build and push Docker image
  • Post log in to the container registry
  • Post checkout repository
  • Complete job