GraphRAG - serratus-bio/open-virome GitHub Wiki

RAG Review

Retrieval-augmented generation (RAG) augments LLM-generated text with data retrieved from a knowledge base, in order to provide specialized responses and reduce hallucinations. Initial approaches to RAG relied on vector embeddings to encode documents and search for those most relevant to a user's query. This approach is fast, but a substantial amount of relational and structural content is lost when documents are reduced to vector representations. To mitigate this, GraphRAG encodes documents into graph structures, which can be clustered into communities and summarized with an LLM. At query time, community summaries are used to produce intermediate answers to a user's prompt. The intermediate answers are then ranked and aggregated into a final global response, which was shown to outperform naive RAG systems in the examined benchmarks. Although it is more computationally expensive, the GraphRAG algorithm follows a MapReduce pattern and can take advantage of parallelized calls to lightweight distilled LLMs (like GPT-4o mini).
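The query-time MapReduce pattern described above can be sketched as follows. This is a minimal illustration, not the Open Virome or Microsoft implementation: `answer_from_summary`, `global_answer`, and the word-overlap scoring are hypothetical stand-ins for the LLM calls that a real system would make in the map and reduce steps.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PartialAnswer:
    community_id: int
    text: str
    score: float  # helpfulness score, higher is better


def answer_from_summary(community_id, summary, query):
    """Map step stand-in (assumption): a real system would call an LLM here.

    This toy scorer counts query words that also appear in the summary.
    """
    query_words = set(query.lower().split())
    summary_words = set(summary.lower().split())
    overlap = len(query_words & summary_words)
    return PartialAnswer(community_id, summary, float(overlap))


def global_answer(summaries, query, top_n=2, max_workers=4):
    # Map: one independent call per community summary, run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(
            lambda item: answer_from_summary(item[0], item[1], query),
            summaries.items(),
        ))
    # Rank: drop unhelpful answers, keep the top_n highest scoring ones.
    ranked = sorted(
        (p for p in partials if p.score > 0),
        key=lambda p: (-p.score, p.community_id),
    )[:top_n]
    # Reduce: a real system would ask an LLM to synthesize these answers;
    # here they are simply concatenated.
    return " ".join(p.text for p in ranked)


summaries = {
    0: "coronavirus reads in bat gut samples",
    1: "marine sediment metagenomes with few viruses",
    2: "bat lung tissue with paramyxovirus reads",
}
print(global_answer(summaries, "which viruses infect bat tissue"))
```

Because each map call depends only on a single community summary, the calls can be issued concurrently, which is what makes small, inexpensive models practical for this step.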

For a more detailed overview of GraphRAG, you can refer to Microsoft's release notes and this popular Medium blog.

Ontology ETL

The Open Virome GraphRAG system modifies the original indexing and extraction steps to take advantage of properties of our data. Our knowledge graph was constructed from several well-maintained ontologies (NCBI Taxonomy, sOTU Phylogeny, BRENDA Tissue Ontology, Human Disease Ontology). Instead of using an LLM to identify entities and relationships, we extract them from these curated biological ontologies; they form the "principal components" of our virome networks.

This approach reduces the number of false positives that an LLM may produce. It also provides a strategy for entity resolution, since the ontologies include synonym mappings for entities as well as entity-entity similarity relationships, which are used in the community clustering process.
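As a rough illustration of synonym-based entity resolution, the sketch below maps free-text metadata values to canonical ontology terms. The ontology excerpt, the `EX:` identifiers, and the function names are all made up for the example; the actual ETL may normalize and match values differently.

```python
# Hypothetical ontology excerpt: canonical ID -> preferred label and synonyms.
# The "EX:" identifiers are placeholders, not real ontology accessions.
ONTOLOGY = {
    "EX:0001": {"label": "liver", "synonyms": ["hepatic tissue", "hepar"]},
    "EX:0002": {"label": "lung", "synonyms": ["pulmonary tissue"]},
}


def normalize(text):
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def build_synonym_index(ontology):
    """Map every normalized label and synonym to its canonical term ID."""
    index = {}
    for term_id, term in ontology.items():
        for name in [term["label"], *term["synonyms"]]:
            index[normalize(name)] = term_id
    return index


def resolve(raw_value, index):
    """Return the canonical term ID for a raw metadata value, or None."""
    return index.get(normalize(raw_value))


index = build_synonym_index(ONTOLOGY)
print(resolve("Hepatic_Tissue", index))
print(resolve("spleen", index))
```

Values that do not match any label or synonym resolve to `None` instead of being guessed, which is one way an ontology-driven pipeline avoids the false positives mentioned above.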

Grounding with sequence data

Rather than relying only on user-submitted metadata, we use sequence data to detect viruses with palmscan and other organisms with STAT. Additionally, instead of using an LLM to generate claims and covariates as is done in the original GraphRAG paper, we use MWAS to identify statistically significant associations between viral palmprint reads and metadata terms. These associations provide supporting evidence for generated responses.
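To make the idea of a statistically significant association concrete, the sketch below tests whether read counts differ between runs that carry a metadata term and runs that do not. The statistical test used by MWAS is not described here, so the exact two-sided permutation test on the difference in means is an assumption chosen for illustration, and the read counts are invented.

```python
from itertools import combinations


def permutation_p_value(with_term, without_term):
    """Exact two-sided permutation test on the difference in group means.

    Enumerates every way to relabel the pooled values, so it is only
    practical for small groups.
    """
    values = list(with_term) + list(without_term)
    n1, n2 = len(with_term), len(without_term)
    total = sum(values)

    def mean_diff(group_sum):
        return group_sum / n1 - (total - group_sum) / n2

    observed = abs(mean_diff(sum(with_term)))
    extreme = 0
    splits = 0
    for idx in combinations(range(len(values)), n1):
        splits += 1
        diff = abs(mean_diff(sum(values[i] for i in idx)))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / splits


# Invented palmprint read counts per run for one virus.
reads_with_term = [50, 60, 55, 70, 65]    # runs annotated with the term
reads_without_term = [1, 2, 0, 3, 1]      # runs without the term
p = permutation_p_value(reads_with_term, reads_without_term)
print(f"p = {p:.4f}")
```

In a real analysis many virus-term pairs are tested at once, so the resulting p-values would also need a multiple-testing correction before being used as evidence.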

Virome clustering

We use the weighted Leiden algorithm on the full heterogeneous network to create communities. The Leiden algorithm provides parameters that control the number of resulting clusters and their size distribution. This introduces a trade-off between speed and accuracy: more, smaller communities preserve more detail but take longer to query, whereas fewer, larger communities are faster to query but lose information about outliers during summarization. Overall, the clustering process and the summarization prompts are the main contributors to performance (both latency and accuracy).
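The effect of such a parameter on community granularity can be illustrated without running Leiden itself. The sketch below assumes the quality function is weighted modularity with a resolution parameter (one of the objectives Leiden can optimize; the objective and settings actually used are not stated here) and simply scores three hand-picked partitions of a toy graph.

```python
def modularity(edges, partition, resolution=1.0):
    """Weighted modularity with a resolution parameter (undirected graph).

    edges: list of (u, v, weight); partition: list of sets of nodes.
    """
    two_m = 2.0 * sum(w for _, _, w in edges)
    community_of = {n: i for i, nodes in enumerate(partition) for n in nodes}
    internal = [0.0] * len(partition)  # twice the internal edge weight
    degree = [0.0] * len(partition)    # total weighted degree
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += 2.0 * w
    return sum(
        internal[c] / two_m - resolution * (degree[c] / two_m) ** 2
        for c in range(len(partition))
    )


def best_partition(edges, candidates, resolution):
    """Return the name of the candidate partition with the highest score."""
    return max(
        candidates,
        key=lambda name: modularity(edges, candidates[name], resolution),
    )


# Toy graph: two triangles joined by a single bridge edge (c-d).
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
candidates = {
    "one_community": [set("abcdef")],
    "two_triangles": [set("abc"), set("def")],
    "singletons": [{n} for n in "abcdef"],
}
for gamma in (0.1, 1.0, 3.0):
    print(gamma, best_partition(edges, candidates, gamma))
```

On this toy graph, low resolution values favor a single large community and high values favor many small ones, which mirrors the speed versus detail trade-off described above.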

Virome LLM summarizations

Clusters are summarized using BioProject titles and descriptions as well as counts of the top occurring metadata values (Host label, STAT organisms, Tissues, Diseases, Sex, Biome, Country). In the future, the prompts we develop to summarize figures can be used to improve cluster summarizations, which would feed back into an improved GraphRAG system.
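A minimal sketch of how the inputs to a cluster summary might be assembled is shown below. The record fields (`host`, `tissue`, `country`), the prompt wording, and the function names are assumptions for illustration; the actual prompts and metadata schema are not reproduced here.

```python
from collections import Counter


def top_metadata_counts(runs, fields, top_n=3):
    """Count the most common non-empty value of each metadata field."""
    counts = {}
    for field in fields:
        counter = Counter(run[field] for run in runs if run.get(field))
        counts[field] = counter.most_common(top_n)
    return counts


def build_summary_prompt(titles, counts):
    """Render titles and metadata counts into a plain-text prompt."""
    lines = ["Summarize the following virome cluster.", "BioProject titles:"]
    lines += [f"- {title}" for title in titles]
    lines.append("Top metadata counts:")
    for field, pairs in counts.items():
        rendered = ", ".join(f"{value} ({n})" for value, n in pairs)
        lines.append(f"- {field}: {rendered}")
    return "\n".join(lines)


# Invented run-level metadata for one cluster.
runs = [
    {"host": "Homo sapiens", "tissue": "gut", "country": "USA"},
    {"host": "Homo sapiens", "tissue": "lung", "country": "USA"},
    {"host": "Mus musculus", "tissue": "gut", "country": None},
]
counts = top_metadata_counts(runs, ["host", "tissue", "country"])
prompt = build_summary_prompt(["Gut virome survey"], counts)
print(prompt)
```

Passing counts instead of raw records keeps the prompt size bounded regardless of how many runs a cluster contains.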

Evaluations

[placeholder]

Retrievals: benchmark with various types of questions and expected top-k filters using Precision@k, Recall@k, MAP@k
Grounding: detect hallucinated filters, invalid filters, inaccurate filters (compare all possible valid filters to the actual ones)
Safety: benchmark of medium- to extreme-risk biology and virus-related questions
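The retrieval metrics named above can be computed as in the sketch below. The filter strings are invented, the functions assume each retrieved list has no duplicates, and the normalization of average precision by min(number of relevant items, k) is one common convention among several.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in retrieved[:k] if item in relevant) / len(relevant)


def average_precision_at_k(retrieved, relevant, k):
    """Mean of precision values at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denominator = min(len(relevant), k)
    return total / denominator if denominator else 0.0


def map_at_k(queries, k):
    """Mean average precision over (retrieved, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(
        average_precision_at_k(retrieved, relevant, k)
        for retrieved, relevant in queries
    ) / len(queries)


# Invented example: filters returned for a question vs. expected filters.
retrieved = ["host:bat", "tissue:lung", "country:CN", "disease:flu"]
relevant = {"host:bat", "disease:flu", "biome:cave"}
print(precision_at_k(retrieved, relevant, 4))
print(recall_at_k(retrieved, relevant, 4))
print(map_at_k([(retrieved, relevant), (["a", "b"], {"b"})], 4))
```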

Future research

GNN Clustering: We can train a GNN to learn more specific virus embeddings and KNN clusters for viromes. sOTUs are assigned features using the metadata ontologies that they are associated with in the SRA dataset. The GNN uses these features to predict sOTU co-occurrence relationships in a semi-supervised link prediction task.

STAT GraphRAG: It's possible to expand this architecture to include the entire SRA (not just the runs with viruses) and use STAT organisms + metadata as the edges that are clustered on. Removing PalmDB edges and context would steer responses less toward viruses, though we can still use palmprints and STAT organisms for safety-related checks. MWAS results could be computed using k-mer abundance instead of palmprint reads. Some consideration would be needed when selecting a taxonomic rank and k-mer threshold for STAT; otherwise, storing the graph and computing the clusters may become intractable. Fast GraphRAG may resolve some of these issues.