Publication recommender
This component is responsible for finding similar articles based on their embeddings. It also merges the articles' Turtle files into one knowledge graph. The component takes articles' Turtle files with embeddings as input and creates a Turtle file linking similar papers and a folder with a merged knowledge graph.
Module overview
- The solution establishes a connection to the milvus-standalone container, assuming that the milvus-standalone, milvus-etcd, and milvus-minio containers are already running. The KG-pipeline starts these containers, so the solution can rely on them being available.
- It creates a Milvus collection within a database with the following schema:

from pymilvus import CollectionSchema, FieldSchema, DataType

paper_id = FieldSchema(
    name="paper_id",
    dtype=DataType.VARCHAR,
    is_primary=True,
    max_length=200
)
paper_title = FieldSchema(
    name="paper_title",
    dtype=DataType.VARCHAR,
    max_length=200
)
embedding_field = FieldSchema(
    name="embedding_field",
    dtype=DataType.FLOAT_VECTOR,
    dim=768
)

schema = CollectionSchema(
    fields=[paper_id, embedding_field, paper_title],
    description="Paper similarity finder",
    enable_dynamic_field=True
)
- For each scraped article, it extracts its id (uri), title, and abstract and inserts them into the created collection.
- The solution creates an index for this collection with the following parameters:
{
"metric_type":"COSINE",
"index_type":"FLAT"
}
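
As a rough sketch of how the collection might then be created, populated, and indexed with pymilvus (the collection name and the example row are placeholders, not taken from the module's code):

from pymilvus import Collection

# Create the collection from the schema shown above
# (the collection name "papers" is a placeholder).
collection = Collection(name="papers", schema=schema)

# Rows extracted from the Turtle files: one list per schema field,
# in the same order as the schema (paper_id, embedding_field, paper_title).
ids = ["https://example.org/paper/1"]   # article URIs (placeholder)
embeddings = [[0.0] * 768]              # 768-dimensional embeddings (placeholder)
titles = ["Example paper title"]        # article titles (placeholder)
collection.insert([ids, embeddings, titles])
collection.flush()

# Build the index with the parameters listed above and load it for search.
index_params = {"metric_type": "COSINE", "index_type": "FLAT", "params": {}}
collection.create_index(field_name="embedding_field", index_params=index_params)
collection.load()
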
- For each article, the solution queries the index to find the papers most related to it, using the following query:
results = collection.search(
    data=[embedding],
    anns_field="embedding_field",
    param=search_params,
    limit=16,
    expr=None,
    output_fields=['paper_title', 'distance']
)
where
search_params = {
    "metric_type": "COSINE",
    "offset": 1,
    "ignore_growing": False
}
We use offset: 1 to intentionally skip the single closest match, because the closest match to each query embedding is the queried article itself.
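
For reference, the hits returned by this call could be read along these lines (a minimal sketch assuming the results variable and collection defined above):

# "results" is the object returned by collection.search(...) above;
# results[0] holds the hits for the single query embedding.
for hit in results[0]:
    # hit.id is the paper_id primary key, hit.distance the cosine score.
    print(hit.id, hit.distance, hit.entity.get("paper_title"))
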
- The solution generates an output file that lists, for each scraped article, its related papers together with their relation scores.
Note that:
- To determine the number of articles to be saved as related papers for each paper, the solution leverages the KneeLocator from the kneed library (a minimal sketch follows after this list).
- During the KG-pipeline run, every generated Turtle file is used to construct a unified knowledge graph. To keep the data manageable and avoid creating excessively large files, the final graph is divided into separate files, each covering one category of information:
- articles
- authors
- bibliography
- conference papers
- papers
- organizations
- other information
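
A minimal sketch of the knee detection mentioned above, using made-up similarity scores; the curve and direction settings are illustrative assumptions, not necessarily the module's actual configuration:

from kneed import KneeLocator

# Hypothetical similarity scores for one article, sorted from the
# closest candidate to the least similar one.
scores = [0.95, 0.72, 0.58, 0.47, 0.44, 0.42, 0.41, 0.40]
ranks = list(range(1, len(scores) + 1))

# The knee marks where similarity drops off; only candidates up to that
# point are kept as related papers.
knee = KneeLocator(ranks, scores, curve="convex", direction="decreasing").knee
related_count = knee if knee is not None else len(scores)
related_scores = scores[:related_count]
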
Module output
Developer guide
Languages and technologies used
- Python: 3.10.9
- Bash: 5.1.16(1)-release
- Docker engine: 24.0.6
- Milvus: vector database
- kneed: 0.8.5
Python libraries used
The libraries and their respective versions used in this project are outlined in the requirements.txt file.
Module modification guide
The following files are central to the module's functionality and are the most likely targets for modification:
similar_papers.py
- This Python script finds and saves similar papers for each article using the Milvus vector database. It compares the articles' embedding vectors, using cosine similarity to identify the most related papers, and saves the results in a separate file. Several modifications can be made, such as changing the database (which requires redefining the vector storage), modifying the vector comparison metric (adjusting the index_params variable), altering how results are saved, or customizing the displayed results in the loop iterating over input Turtle files.
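
For example, switching the comparison metric from cosine similarity to Euclidean distance would amount to changes of roughly this shape (a sketch, not the module's actual configuration):

# Switch both the index and the search to L2 (Euclidean) distance.
index_params = {"metric_type": "L2", "index_type": "FLAT"}
search_params = {"metric_type": "L2", "offset": 1, "ignore_growing": False}
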
merge_graphs.py
- This Python script is responsible for the final graph creation. It collects the Turtle files from each processed paper and consolidates them into the files that define the knowledge graph. The final graph is partitioned into multiple files, each dedicated to a distinct area such as authors, bibliography, etc. For a different segmentation of the final graph, modifications can be introduced within the loop responsible for dividing files and storing information for a specified subject. It is crucial to save the modified file with a new subject and an appropriate name to reflect these changes.
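
A rough sketch of the merging step using rdflib; the directory and file names are placeholders, and the real script additionally splits the result into the per-category files listed above:

from pathlib import Path
from rdflib import Graph

# Parse every per-paper Turtle file into one in-memory graph
# (the input directory name is a placeholder).
merged = Graph()
for ttl_file in Path("embedded_ttls").glob("*.ttl"):
    merged.parse(str(ttl_file), format="turtle")

# Serialize the combined graph; the actual script writes several files,
# one per information category, instead of a single output.
merged.serialize(destination="merged_graph.ttl", format="turtle")
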
Communication with other KG-pipeline modules
This module integrates with other components of KG-pipeline through Docker volumes. It efficiently utilizes the following volumes:
- embedded_ttls: volume used for storing input Turtle files containing the embedding vectors produced by the Topical classifier module.
This module is dependent on the Milvus database, requiring the simultaneous operation of the following containers within a shared network:
- milvus-etcd (quay.io/coreos/etcd:v3.5.5)
- milvus-minio (minio/minio:RELEASE.2023-03-20T20-16-18Z)
- milvus-standalone (milvusdb/milvus:v2.3.3)
This module connects to this network using the pymilvus Python client.
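
The connection setup could look roughly like this; the alias, host, and port below are Milvus defaults assumed for illustration, not values confirmed by the module:

from pymilvus import connections

# Connect to the milvus-standalone service on the shared Docker network;
# host and port are Milvus defaults and may differ in the pipeline.
connections.connect(alias="default", host="milvus-standalone", port="19530")
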
Module testing
Tests for this module are automated and integrated with CI thanks to GitHub Workflows. The workflow is triggered after each push to the main branch and consists of the following steps:
container-test
This step is responsible for creating the test environment, running the test script, and cleaning up the environment after the tests are completed.
- Set up job
- Checkout repository
- Set up Docker Compose
- Build and run containers
- Wait for services to start
- Check if container runs properly
- Stop and remove containers
- Post checkout repository
- Complete job
The most important step here is Check if container runs properly. It runs the merge_graphs.py and similar_papers.py scripts on 6 sample papers: 3 from SCPE and 3 from CSIS archives. In both scripts, assert statements verify that key functions return values and that these values adhere to the expected format.
build-and-push-image
This step is responsible for building the image for this module and pushing it to the GitHub Container Registry. This registry serves as the source from which the image is downloaded during the KG-pipeline run. The step will only succeed when the Dockerfile is specified correctly, so it is also a crucial test for this component. The following jobs are executed as part of this step:
- Set up job
- Checkout repository
- Log in to the Container registry
- Extract metadata (tags, labels) for Docker
- Build and push Docker image
- Post build and push Docker image
- Post log in to the container registry
- Post checkout repository
- Complete job