Glossary - serratus-bio/open-virome GitHub Wiki

Open Virome Project Specific Terminology

A

  • Accession: Unique and persistent identifiers used in various tables/databases to refer to a specific dataset or object.

B

  • BioProject: A project-level summary and metadata record composed of one or more child BioSamples and/or runs. Metadata includes project objectives, materials and methods, submitter, and publications. See, BioProject Database
  • BioSample: A distinct "biological sample" entity as defined in the BioSample Metadata database. BioSamples are assigned membership to one or more parent BioProjects. Each BioSample can have one or more sequences (GenBank) or runs (Sequence Read Archive) associated with it. See, BioSample Database

NCBI Ecosystem

  • Bipartite network: A graph where nodes are split into two distinct sets (U and V), and edges only link nodes from one set to the other. In other words, every edge connects a node in set U to a node in set V, with no connections within the same set.
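The bipartite definition above can be checked mechanically with a two-coloring of the graph. The following is a minimal sketch, not project code: the function name `is_bipartite` and the run-to-palmprint edges are hypothetical and chosen only for illustration.

```python
from collections import deque


def is_bipartite(edges):
    """Try to split an undirected graph into two sets U and V.

    Returns (True, side) where side maps each node to 0 or 1,
    or (False, {}) if some edge links two nodes of the same set.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    side = {}
    for start in adjacency:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in side:
                    # Put each neighbour in the opposite set.
                    side[neighbour] = 1 - side[node]
                    queue.append(neighbour)
                elif side[neighbour] == side[node]:
                    # An edge inside one set: not bipartite.
                    return False, {}
    return True, side


# Hypothetical edges linking runs (set U) to palmprints (set V).
run_palmprint_edges = [
    ("SRR000001", "u1"),
    ("SRR000001", "u2"),
    ("SRR000002", "u2"),
]
ok, side = is_bipartite(run_palmprint_edges)
```

A triangle such as `[("a", "b"), ("b", "c"), ("c", "a")]` fails the check, because one of its edges necessarily connects two nodes in the same set.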

C

  • Contig: A contiguous representation of a genomic region, usually derived by assembling smaller, partially overlapping sequence segments. See more details on Logan Contigs here. A palmprint can appear more than once in a single run after assembly, which could mean that multiple copies of the same viral sequence were present; the contig is the proper organizational level at which to count these copies and distinguish them from each other.

  • Community: A cluster of data points within a database that are similar to each other, usually formed algorithmically (e.g., a group of nodes within a graph database that share a property).

E

  • Experiment (NCBI Ecosystem "Experiment Accession", Not currently used)

H

  • Heterogeneous network: A graph with more than one type of node and/or edge. This means that nodes and edges have different sets of features, likely with different encodings and dimensions.
  • Hit: This term is almost never an appropriate choice, as less ambiguous ones exist in the glossary. If that is not the case, feel free to add a new entry here.
  • Homogeneous network: A graph where all nodes and edges are of the same type. In particular, all nodes and edges have the same feature encoding and dimension.

L

  • Library: Synonym for Run. Used to describe a specific "Sequencing Library", or dataset.

  • LLM: Short for Large Language Model, a machine learning model focused on NLP that is trained on massive text datasets and generates "human-like" responses.

  • Location: A point in space, specifically on the globe (we use WGS84). While this definition may seem self-evident, it's important to state that this is the term that should be used over others (e.g. place, site, spot), which have a different meaning in the context of this project.

  • Logan: An expansive genetics dataset consisting of the genome assemblies of all 27M DNA/RNA-seq datasets (44+ petabases) in the Sequence Read Archive.

M

  • Monopartite network: A graph where all nodes belong to a single set or type, and edges can connect any pair of nodes within that set.

N

  • NLP: Short for Natural Language Processing, this term denotes the subfield of AI that focuses on processing human language, with the main goal of enabling machines to understand and output regular "human-like" expressions.

P

  • Palmprint: The protein molecular barcode sequence of an RNA virus RdRp spanning three well-conserved sequence motifs A, B and C, including intervening variable regions. Each unique palmprint sequence has its own numeric identifier (e.g., u00001), and each palmprint is a member of one "species representative" sOTU. See, Defining palmprint in palmDB

  • Phylogenetic Network or pnet: A type of graph used to represent an all-vs-all sequence alignment between RNA virus palmprints. In contrast to a phylogenetic tree, a pnet is highly interconnected.

  • Precision: Proportion of relevant documents retrieved out of total retrieved documents.
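Precision, together with Recall (defined under R in this glossary), can be computed from two sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document IDs.

    precision = relevant retrieved / total retrieved
    recall    = relevant retrieved / total relevant
    An empty denominator yields 0.0 (a convention assumed here).
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    relevant_retrieved = retrieved & relevant
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 documents truly relevant.
retrieved_docs = ["doc1", "doc2", "doc3", "doc4"]
relevant_docs = ["doc2", "doc4", "doc5"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
# Two of the four retrieved documents are relevant: precision is 2/4.
# Two of the three relevant documents were retrieved: recall is 2/3.
```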

R

  • RAG: Short for Retrieval-Augmented Generation, an LLM architecture that dynamically retrieves data relevant to a user query and generates outputs based on the retrieved data. This allows real-time database updates to be reflected in the model's responses. See, Wikipedia.

  • RdRp: RNA-dependent RNA polymerase, a gene and protein universally shared among RNA viruses. We use a sub-sequence of the RdRp protein sequence, called the palmprint, as a molecular barcode to organize/categorize all RNA viruses. See, What is RdRP?

  • Recall: Proportion of relevant documents retrieved out of total relevant documents.

  • Run: The SRA term for a specific DNA/RNA sequencing dataset, and the atomic unit by which Open Virome is organized. Note: One biological sample may be sequenced many times (i.e., have multiple runs). Runs are identified by their Accession, which is of the form "SRRnnnnnn", "ERRnnnnnn", or "DRRnnnnnn".
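The run accession forms listed in the Run entry can be checked with a simple pattern. The following is a minimal sketch; the helper name `is_run_accession` is hypothetical, and accepting any number of digits after the prefix is an assumption, since the entry only gives the general "SRRnnnnnn" shape.

```python
import re

# Prefix S, E or D followed by "RR" and digits.
# Assumption: the digit count is not fixed, so one or more digits are accepted.
RUN_ACCESSION_PATTERN = re.compile(r"[SED]RR\d+")


def is_run_accession(text):
    """Return True if text looks like an SRA run accession (SRR/ERR/DRR + digits)."""
    return RUN_ACCESSION_PATTERN.fullmatch(text) is not None


candidates = ["SRR123456", "ERR000001", "DRR1234567", "SRX123456", "PRJNA12345"]
run_accessions = [c for c in candidates if is_run_accession(c)]
```

Here only the first three candidates pass; the last two use other accession prefixes and are not run accessions.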

S

  • sOTU: species-like Operational Taxonomic Unit, inferred from clustering all palmprints into groups of sequences with >90% identity. A single palmprint is chosen as representative for each of these groups, which is defined to be the sOTU.

  • Spot: Operationally synonymous with a sequencing read. The term originates from the fact that "reads" are physical spots when generating high-throughput sequencing data. See, Bridge Amplification Sequencing

  • SRA: NCBI's Sequence Read Archive, see, https://www.ncbi.nlm.nih.gov/sra/docs/.
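The sOTU entry above describes grouping palmprints at >90% identity with one representative per group. The following toy sketch illustrates that idea under loud assumptions: `identity` is a naive position-by-position comparison of equal-length sequences (a real pipeline would use a sequence aligner), the first palmprint seen in each group becomes its representative, and the sequences and IDs are made up. It is not the clustering procedure used to build palmDB.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions.

    Assumption: both sequences have the same length and are already aligned.
    """
    if len(a) != len(b):
        raise ValueError("toy identity requires equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(palmprints, threshold=0.90):
    """Map each palmprint ID to the ID of its group representative.

    A palmprint joins the first representative with identity > threshold;
    otherwise it starts a new group and represents it (an assumption here).
    """
    representatives = []
    membership = {}
    for palmprint_id, sequence in palmprints.items():
        for rep_id in representatives:
            if identity(sequence, palmprints[rep_id]) > threshold:
                membership[palmprint_id] = rep_id
                break
        else:
            representatives.append(palmprint_id)
            membership[palmprint_id] = palmprint_id
    return membership


# Made-up 20-residue sequences, for illustration only.
toy_palmprints = {
    "u1": "ACDEFGHIKLMNPQRSTVWY",
    "u2": "ACDEFGHIKLMNPQRSTVWA",  # 19/20 positions match u1
    "u3": "YWVTSRQPNMLKIHGFEDCA",  # no position matches u1
}
membership = greedy_cluster(toy_palmprints)
```

In this toy input, u2 joins the group represented by u1, while u3 starts its own group.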

U

  • Unitig: Represents a non-branching sequence in a de Bruijn assembly graph. In Logan unitigs, k-mers of size k=31 are used. See more details on Logan Unitigs here.
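To make "non-branching sequence in a de Bruijn assembly graph" concrete, the following toy sketch merges k-mers into unitigs. It is an illustration under stated assumptions, not how Logan is built: it uses k=3 instead of k=31 for readability, ignores reverse complements, and the function names and input sequences are hypothetical.

```python
def kmers_of(sequence, k):
    """List every k-mer (substring of length k) of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def successors(kmer, kmers):
    """K-mers in the set that overlap this k-mer's last k-1 characters."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in kmers]


def predecessors(kmer, kmers):
    """K-mers in the set that overlap this k-mer's first k-1 characters."""
    return [c + kmer[:-1] for c in "ACGT" if c + kmer[:-1] in kmers]


def unitigs(kmers):
    """Merge k-mers along maximal non-branching paths of the de Bruijn graph."""
    kmers = set(kmers)
    visited = set()
    result = []
    for start in sorted(kmers):
        if start in visited:
            continue
        # Walk left to the first k-mer of the non-branching path.
        node = start
        seen_on_path = {node}
        while True:
            preds = predecessors(node, kmers)
            if len(preds) != 1:
                break
            pred = preds[0]
            if len(successors(pred, kmers)) != 1 or pred in seen_on_path:
                break
            node = pred
            seen_on_path.add(node)
        # Walk right, appending one character per merged k-mer.
        sequence = node
        visited.add(node)
        while True:
            succs = successors(node, kmers)
            if len(succs) != 1:
                break
            succ = succs[0]
            if len(predecessors(succ, kmers)) != 1 or succ in visited:
                break
            sequence += succ[-1]
            node = succ
            visited.add(node)
        result.append(sequence)
    return result


# Two made-up sequences that share "ACGT" and then diverge (a branch).
all_kmers = kmers_of("ACGTA", 3) + kmers_of("ACGTC", 3)
toy_unitigs = unitigs(all_kmers)
```

In this toy input, the k-mer CGT has two possible successors (GTA and GTC), so the shared prefix ACGT forms one unitig and each branch forms its own.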

V

  • Virome: The collection of all viruses associated with a set of sequencing runs. Can also be denoted with single curly brackets as {Virome} or, for a specific virome such as that of Eimeria species, as {Eimeria}.