Publication recommender
This component is responsible for finding similar articles based on their embeddings. It also merges the articles' Turtle files into one knowledge graph. The component takes articles' Turtle files with embeddings as input and creates a Turtle file linking similar papers and a folder with a merged knowledge graph.
Module overview
- The solution establishes a connection to the milvus-standalone container, assuming that the milvus-standalone, milvus-etcd, and milvus-minio containers are already running. The KG-pipeline starts these containers, so the solution can rely on them being available.
- It creates a Milvus collection within a database with the following schema:

from pymilvus import CollectionSchema, FieldSchema, DataType

paper_id = FieldSchema(
    name="paper_id",
    dtype=DataType.VARCHAR,
    is_primary=True,
    max_length=200
)
paper_title = FieldSchema(
    name="paper_title",
    dtype=DataType.VARCHAR,
    max_length=200
)
embedding_field = FieldSchema(
    name="embedding_field",
    dtype=DataType.FLOAT_VECTOR,
    dim=768
)

schema = CollectionSchema(
    fields=[paper_id, embedding_field, paper_title],
    description="Paper similarity finder",
    enable_dynamic_field=True
)
- For each scraped article, it extracts its id (uri), title, and abstract and inserts them into the created collection.
- The solution creates an index for this collection with the following parameters:
{
"metric_type":"COSINE",
"index_type":"FLAT"
}
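
As a rough sketch of how the collection might then be created, populated, and indexed with pymilvus (the collection name and the example row are placeholders, not taken from the module's code):

from pymilvus import Collection

# Create the collection from the schema shown above
# (the collection name "papers" is a placeholder).
collection = Collection(name="papers", schema=schema)

# Rows extracted from the Turtle files: one list per schema field,
# in the same order as the schema (paper_id, embedding_field, paper_title).
ids = ["https://example.org/paper/1"]   # article URIs (placeholder)
embeddings = [[0.0] * 768]              # 768-dimensional embeddings (placeholder)
titles = ["Example paper title"]        # article titles (placeholder)
collection.insert([ids, embeddings, titles])
collection.flush()

# Build the index with the parameters listed above and load it for search.
index_params = {"metric_type": "COSINE", "index_type": "FLAT", "params": {}}
collection.create_index(field_name="embedding_field", index_params=index_params)
collection.load()
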
- For each article, the solution queries the index to find the papers most related to it, using the following query:
results = collection.search(
    data=[embedding],
    anns_field="embedding_field",
    param=search_params,
    limit=16,
    expr=None,
    output_fields=['paper_title', 'distance']
)
where
search_params = {
    "metric_type": "COSINE",
    "offset": 1,
    "ignore_growing": False
}
We use offset: 1 to intentionally skip the single closest match, because the closest match to each query embedding is the queried article itself.
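
For reference, the hits returned by this call could be read along these lines (a minimal sketch assuming the results variable and collection defined above):

# "results" is the object returned by collection.search(...) above;
# results[0] holds the hits for the single query embedding.
for hit in results[0]:
    # hit.id is the paper_id primary key, hit.distance the cosine score.
    print(hit.id, hit.distance, hit.entity.get("paper_title"))
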
- The solution generates an output file that lists, for each scraped article, its related papers together with their relation scores.
Note that:
- To determine the number of articles to be saved as related papers for each paper, the solution leverages the KneeLocator from the kneed library (a minimal sketch follows after this list).
- During the KG-pipeline run, every generated Turtle file is used to construct a unified knowledge graph. To keep the data manageable and avoid creating excessively large files, the final graph is divided into separate files, each covering one category of information:
- articles
- authors
- bibliography
- conference papers
- papers
- organizations
- other information
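
A minimal sketch of the knee detection mentioned above, using made-up similarity scores; the curve and direction settings are illustrative assumptions, not necessarily the module's actual configuration:

from kneed import KneeLocator

# Hypothetical similarity scores for one article, sorted from the
# closest candidate to the least similar one.
scores = [0.95, 0.72, 0.58, 0.47, 0.44, 0.42, 0.41, 0.40]
ranks = list(range(1, len(scores) + 1))

# The knee marks where similarity drops off; only candidates up to that
# point are kept as related papers.
knee = KneeLocator(ranks, scores, curve="convex", direction="decreasing").knee
related_count = knee if knee is not None else len(scores)
related_scores = scores[:related_count]
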
Module output
Developer guide
Languages and technologies used
- Python: 3.10.9
- Bash: 5.1.16(1)-release
- Docker engine: 24.0.6
- Milvus: vector database
- kneed: 0.8.5
Python libraries used
The libraries and their respective versions used in this project are outlined in the requirements.txt file.
Module modification guide
The following files are central to the module's functionality and are the most likely targets for modification:
similar_papers.py
- This Python script finds and saves similar papers for each article using the Milvus vector database. It compares the articles' embedding vectors, using cosine similarity to identify the most related papers, and saves the results in a separate file. Several modifications can be made, such as changing the database (which requires redefining the vector storage), modifying the vector comparison metric (adjusting the index_params variable), altering how results are saved, or customizing the displayed results in the loop iterating over input Turtle files.
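
For example, switching the comparison metric from cosine similarity to Euclidean distance would amount to changes of roughly this shape (a sketch, not the module's actual configuration):

# Switch both the index and the search to L2 (Euclidean) distance.
index_params = {"metric_type": "L2", "index_type": "FLAT"}
search_params = {"metric_type": "L2", "offset": 1, "ignore_growing": False}
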
merge_graphs.py
- This Python script is responsible for the final graph creation. It collects the Turtle files from each processed paper and consolidates them into the files that define the knowledge graph. The final graph is partitioned into multiple files, each dedicated to a distinct area such as authors, bibliography, etc. For a different segmentation of the final graph, modifications can be introduced within the loop responsible for dividing files and storing information for a specified subject. It is crucial to save the modified file with a new subject and an appropriate name to reflect these changes.
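
A rough sketch of the merging step using rdflib; the directory and file names are placeholders, and the real script additionally splits the result into the per-category files listed above:

from pathlib import Path
from rdflib import Graph

# Parse every per-paper Turtle file into one in-memory graph
# (the input directory name is a placeholder).
merged = Graph()
for ttl_file in Path("embedded_ttls").glob("*.ttl"):
    merged.parse(str(ttl_file), format="turtle")

# Serialize the combined graph; the actual script writes several files,
# one per information category, instead of a single output.
merged.serialize(destination="merged_graph.ttl", format="turtle")
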
Communication with other KG-pipeline modules
This module integrates with other components of KG-pipeline through Docker volumes. It efficiently utilizes the following volumes:
- embedded_ttls: volume used for storing input Turtle files containing the embedding vectors produced by the Topical classifier module.
This module is dependent on the Milvus database, requiring the simultaneous operation of the following containers within a shared network:
- milvus-etcd (quay.io/coreos/etcd:v3.5.5)
- milvus-minio (minio/minio:RELEASE.2023-03-20T20-16-18Z)
- milvus-standalone (milvusdb/milvus:v2.3.3)
This module connects to this network using the pymilvus Python client.
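
The connection setup could look roughly like this; the alias, host, and port below are Milvus defaults assumed for illustration, not values confirmed by the module:

from pymilvus import connections

# Connect to the milvus-standalone service on the shared Docker network;
# host and port are Milvus defaults and may differ in the pipeline.
connections.connect(alias="default", host="milvus-standalone", port="19530")
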
Module testing
Tests for this module are automated and integrated with CI thanks to GitHub Workflows. The workflow is triggered after each push to the main branch and consists of the following steps:
container-test
This step is responsible for creating the test environment, running the test script, and cleaning up the environment after the tests are completed.
- Set up job
- Checkout repository
- Set up Docker Compose
- Build and run containers
- Wait for services to start
- Check if container runs properly
- Stop and remove containers
- Post checkout repository
- Complete job
The most important step here is Check if container runs properly. It runs the merge_graphs.py and similar_papers.py scripts on 6 sample papers: 3 from SCPE and 3 from CSIS archives. In both scripts, assert statements verify that key functions return values and that these values adhere to the expected format.
build-and-push-image
This step is responsible for building the image for this module and pushing it to the GitHub Container Registry. This registry serves as the source from which the image is downloaded during the KG-pipeline run. The step will only succeed when the Dockerfile is specified correctly, so it is also a crucial test for this component. The following jobs are executed as part of this step:
- Set up job
- Checkout repository
- Log in to the Container registry
- Extract metadata (tags, labels) for Docker
- Build and push Docker image
- Post build and push Docker image
- Post log in to the container registry
- Post checkout repository
- Complete job