LLM Integration

> [!WARNING]
> This feature is still under development and may contain incorrect information. Please report any bugs or issues you run into on our issues page.

Responsible AI

Safety

By developing safety mechanisms early, while large language model (LLM) capabilities are still growing, we can safeguard general applications of LLM research assistants in the biological sciences.

The baseline safety mechanisms in place are content filters that annotate and block harmful inputs to and outputs from the LLM, covering the categories "Hate and Fairness", "Violence", "Sexual", "Self-Harm", "User Prompt Attacks", "Indirect Attacks", and "Profanity".
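As a rough illustration, this policy can be thought of as a map from category to blocking threshold. The type names and severity values in this sketch are illustrative, not the provider's actual configuration schema:

```typescript
// Illustrative sketch of the content filter policy described above.
// Category names follow the wiki; thresholds are hypothetical.
type FilterCategory =
    | 'HateAndFairness'
    | 'Violence'
    | 'Sexual'
    | 'SelfHarm'
    | 'UserPromptAttacks'
    | 'IndirectAttacks'
    | 'Profanity';

type Severity = 'low' | 'medium' | 'high';

interface FilterPolicy {
    blockAtOrAbove: Severity; // annotate-and-block threshold
    appliesTo: ('input' | 'output')[];
}

const contentFilters: Record<FilterCategory, FilterPolicy> = {
    HateAndFairness: { blockAtOrAbove: 'medium', appliesTo: ['input', 'output'] },
    Violence: { blockAtOrAbove: 'medium', appliesTo: ['input', 'output'] },
    Sexual: { blockAtOrAbove: 'medium', appliesTo: ['input', 'output'] },
    SelfHarm: { blockAtOrAbove: 'medium', appliesTo: ['input', 'output'] },
    UserPromptAttacks: { blockAtOrAbove: 'low', appliesTo: ['input'] },
    IndirectAttacks: { blockAtOrAbove: 'low', appliesTo: ['input'] },
    Profanity: { blockAtOrAbove: 'medium', appliesTo: ['input', 'output'] },
};
```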

Biosafety levels were downloaded from ePATHogen and are associated with NCBI taxonomy nodes in our knowledge graph. This allows us to increase content filter sensitivity thresholds for entities classified as high-risk (BSL-4/Risk Group 4), with the aim of preventing the model from generating information hazards. We associate SRA runs with biological entities using detected viral palmprints, SRA run organism metadata labels, and k-mer-detected STAT organisms. The default settings for low- and medium-risk entities use content filters with high sensitivity on user inputs and medium sensitivity on LLM-generated outputs.
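A minimal sketch of this escalation logic, assuming a hypothetical `filterSensitivity` helper and illustrative sensitivity labels:

```typescript
// Hypothetical sketch of BSL-based filter escalation; the TaxonNode shape
// and sensitivity labels are illustrative, not the actual implementation.
type RiskGroup = 1 | 2 | 3 | 4;

interface TaxonNode {
    taxId: number;
    name: string;
    riskGroup: RiskGroup; // from ePATHogen, attached in the knowledge graph
}

type Sensitivity = 'medium' | 'high' | 'maximum';

// Default: high sensitivity on user inputs, medium on model outputs;
// escalate both when a BSL-4 / Risk Group 4 entity appears in the context.
function filterSensitivity(entities: TaxonNode[]): { input: Sensitivity; output: Sensitivity } {
    const highRisk = entities.some((e) => e.riskGroup === 4);
    return highRisk
        ? { input: 'maximum', output: 'maximum' }
        : { input: 'high', output: 'medium' };
}
```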

Groundedness detection

Grounding an LLM involves enriching a base language model with relevant, specific knowledge so that it maintains context. The primary function of a RAG system is to constrain a pretrained LLM to a knowledge base. Along with including grounding instructions in our prompts, we apply a groundedness detection content filter policy that annotates and blocks LLM responses containing information not provided by our SRA-virus knowledge base context, which helps prevent hallucinations.
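The gate can be pictured as a post-processing step on each response. The endpoint URL, payload shape, and field names below are hypothetical placeholders for whatever detection service is configured:

```typescript
// A minimal sketch of the groundedness gate, assuming a detection service
// with a detectGroundedness endpoint; all names here are hypothetical.
interface GroundednessResult {
    ungroundedDetected: boolean;
    ungroundedPercentage: number;
}

async function checkGroundedness(
    answer: string,
    knowledgeBaseContext: string[],
): Promise<GroundednessResult> {
    const res = await fetch('https://example-safety-service/text:detectGroundedness', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text: answer, groundingSources: knowledgeBaseContext }),
    });
    return (await res.json()) as GroundednessResult;
}

// Block responses containing claims not supported by the SRA-virus context.
async function gateResponse(answer: string, context: string[]): Promise<string> {
    const result = await checkGroundedness(answer, context);
    return result.ungroundedDetected
        ? 'Response withheld: it contained information not found in the knowledge base.'
        : answer;
}
```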

Although RAG systems reduce hallucinations, they don't necessarily eliminate them altogether. Moreover, for certain queries we may actually want to allow 'ungrounded' LLM responses that are steered by genetic or metadata information, as is done in MWAS hypothesis generation. For these cases, we prompt the model to annotate ungrounded sentences with the disclaimer tag [LLM: verify].
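An illustrative prompt fragment for this annotation behavior (the production wording is not documented here):

```typescript
// Hypothetical system-prompt fragment for the ungrounded-annotation mode.
const UNGROUNDED_DISCLAIMER_INSTRUCTION = `
You may draw on background knowledge beyond the provided context when
proposing hypotheses. Any sentence that is not directly supported by the
provided SRA-virus context MUST end with the tag [LLM: verify].
`.trim();
```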

LLM Applications

Global virome query (GraphRAG)

Retrieval-augmented generation (RAG) is an approach that augments LLM-generated text with data retrieved from a knowledge base, providing specialized responses and reducing hallucinations. In Open Virome, we use our heterogeneous knowledge graph to form natural clusters of similar virome communities, which are then summarized and made queryable with an LLM. Generated responses are prompted to include relevant filters, which users can then apply with a click to investigate the local virome data. More details can be found in the GraphRAG wiki.
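At a high level, the flow resembles a map-reduce over cluster summaries. The helper functions in this sketch (`summarizeWithLLM`, `consolidateWithLLM`) are hypothetical stand-ins; see the GraphRAG wiki for the actual pipeline:

```typescript
// Simplified sketch of the global-query flow over precomputed clusters.
interface ClusterSummary {
    clusterId: string;
    summary: string; // precomputed offline with the cheaper model
}

async function globalViromeQuery(
    question: string,
    clusters: ClusterSummary[],
    summarizeWithLLM: (prompt: string) => Promise<string>, // e.g. gpt-4o-mini
    consolidateWithLLM: (prompt: string) => Promise<string>, // e.g. gpt-4o
): Promise<string> {
    // Map step: answer the question against each cluster summary in parallel.
    const partials = await Promise.all(
        clusters.map((c) =>
            summarizeWithLLM(`Context:\n${c.summary}\n\nQuestion: ${question}`),
        ),
    );
    // Reduce step: consolidate per-cluster answers into one response that
    // includes suggested filters the user can click to apply.
    return consolidateWithLLM(
        `Combine these partial answers and suggest relevant filters:\n${partials.join('\n---\n')}`,
    );
}
```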

Local virome query

After filtering into a virome, users can query the local data with text input to develop a deeper understanding of the data, which may include thousands of runs and BioProjects. The conversation includes context derived from the currently queried virome, including all metadata terms, counts, top MWAS results, and BioProject abstracts. It will also be possible to toggle LLM responses to allow inclusion of external knowledge acquired during pre-training.
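A sketch of how such a conversation context might be assembled; the `ViromeContext` shape and `buildLocalContext` helper are illustrative, not the app's actual code:

```typescript
// Hypothetical context payload for a local virome query.
interface ViromeContext {
    metadataTerms: Record<string, number>; // term -> run count
    runCount: number;
    bioProjectAbstracts: string[];
    topMwasResults: string[]; // pre-rendered significant associations
}

function buildLocalContext(v: ViromeContext, allowExternalKnowledge: boolean): string {
    return [
        allowExternalKnowledge
            ? 'You may supplement answers with pre-training knowledge; tag such sentences [LLM: verify].'
            : 'Answer ONLY from the context below.',
        `Runs in virome: ${v.runCount}`,
        `Metadata terms: ${JSON.stringify(v.metadataTerms)}`,
        `Top MWAS results:\n${v.topMwasResults.join('\n')}`,
        `BioProject abstracts:\n${v.bioProjectAbstracts.join('\n\n')}`,
    ].join('\n\n');
}
```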

Figure and BioProjects summarization

An LLM is used to summarize the aggregated metadata and research project abstracts associated with a virome being queried. These summaries are displayed in the app as wiki-like captions, associated with a module or with the plots within the module. The summaries are also used to provide context for local virome queries, and the prompts are used for generating summaries of the unsupervised clusters in global virome queries.
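As a rough sketch, a caption prompt might be built along these lines (the `buildCaptionPrompt` helper and its wording are hypothetical):

```typescript
// Illustrative prompt builder for a module caption; the production prompt
// is not documented here.
function buildCaptionPrompt(
    moduleName: string,
    aggregatedMetadata: string,
    abstracts: string[],
): string {
    return `Write a short, wiki-style caption for the "${moduleName}" module.
Summarize the aggregated metadata and project abstracts below without
adding outside information.

Metadata:
${aggregatedMetadata}

Abstracts:
${abstracts.join('\n\n')}`;
}
```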

MWAS hypothesis generation

Given an association between viral palmprint read abundance and a metadata term from an MWAS significance test, an LLM is prompted to use contextual BioProjects to propose a hypothesis that explains the fold change. The MWAS data and top hypotheses are also used to provide context for local virome queries as well as in summaries for GraphRAG global search.
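A hedged sketch of such a hypothesis prompt; the `MwasAssociation` record and its field names are illustrative:

```typescript
// Hypothetical MWAS association record and prompt builder.
interface MwasAssociation {
    palmprintId: string; // viral palmprint with differential read abundance
    metadataTerm: string; // e.g. a BioSample attribute value
    foldChange: number;
    pValue: number;
}

function buildHypothesisPrompt(a: MwasAssociation, bioProjectContext: string[]): string {
    return `A significance test found that runs labeled "${a.metadataTerm}"
show a ${a.foldChange}x change in read abundance for viral palmprint
${a.palmprintId} (p = ${a.pValue}). Using the BioProject context below,
propose a biological hypothesis that could explain this fold change.
Tag sentences not supported by the context with [LLM: verify].

${bioProjectContext.join('\n\n')}`;
}
```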

LLM configurations

Models

  1. The GraphRAG system uses gpt-4o-mini both to generate cluster summaries offline and to query those summaries as user questions are submitted, taking advantage of the distilled model's low cost and speed.
  2. The final step of aggregating and consolidating responses from the various clusters is done by the gpt-4o model. This model is slightly slower and more expensive, but interprets bulk context well (better than chain-of-thought models like o1).
  3. gpt-4o is also used for generating figure and BioProject summaries, which likewise require condensing large amounts of context data.
  4. o1 is used for generating MWAS hypotheses; these prompts include little context, and the task benefits from the model's chain-of-thought (CoT) reasoning capabilities. A sketch of this task-to-model routing appears after this list.
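The routing above, expressed as a simple configuration map. The task names and config shape are illustrative, not the actual code:

```typescript
// Hypothetical task-to-model routing table mirroring the list above.
type Task =
    | 'graphragClusterSummary'
    | 'graphragClusterQuery'
    | 'graphragConsolidation'
    | 'viromeSummarization'
    | 'mwasHypothesis';

const modelForTask: Record<Task, string> = {
    graphragClusterSummary: 'gpt-4o-mini', // cheap/fast, run offline
    graphragClusterQuery: 'gpt-4o-mini',
    graphragConsolidation: 'gpt-4o', // bulk-context aggregation
    viromeSummarization: 'gpt-4o',
    mwasHypothesis: 'o1', // little context, benefits from CoT reasoning
};
```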

Temperature

Currently, all models use the default temperature value of 1.

Rate limits

We use fairly low rate limits on our models to prevent excess usage and expense. If you encounter an intermittent error, it is likely caused by rate limiting and can be resolved by waiting and trying again later, as sketched below. Additionally, our app has bot detection, and the API sits behind a WAF (web application firewall) to block DDoS attacks and web crawlers.
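A simple retry with exponential backoff usually resolves such errors. This client-side sketch assumes a generic `sendQuery` function; it is not part of the app's API:

```typescript
// Retry a query with exponential backoff (2s, 4s, 8s, ...) on failure.
async function withBackoff<T>(
    sendQuery: () => Promise<T>,
    maxRetries = 3,
    baseDelayMs = 2000,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await sendQuery();
        } catch (err) {
            if (attempt >= maxRetries) throw err;
            // Wait before retrying; likely a transient rate-limit error.
            await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
        }
    }
}
```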