Virome - serratus-bio/open-virome GitHub Wiki
In the Open Virome, a {Virome} is defined as the collection of all viruses associated with a set of sequencing datasets (called runs).
Any property unifying a set of runs constitutes that specific <property> virome, and runs can belong to multiple viromes. For example:
(1) All runs which are labelled as Eimeria sp. make the Eimeria Virome
(2) All runs originating from Lake Garibaldi make the Lake Garibaldi Virome
(3) All runs taken from freshwater lakes make the Freshwater Lake Virome
One run may be a member of (1), (2), and (3); it is then part of the intersection of these three viromes.
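Virome membership can be treated as plain set membership over run identifiers. The sketch below is only an illustration: the run identifiers, the metadata columns (`organism`, `location`, `biome`) and their values are invented and are not the Open Virome schema.

```r
# Hypothetical run metadata; identifiers and labels are invented for illustration.
runs <- data.frame(
  run      = c("RUN_A", "RUN_B", "RUN_C", "RUN_D"),
  organism = c("Eimeria sp.", "Eimeria sp.", "Homo sapiens", "Eimeria sp."),
  location = c("Lake Garibaldi", "Unknown", "Lake Garibaldi", "Lake Garibaldi"),
  biome    = c("freshwater lake", "soil", "freshwater lake", "freshwater lake"),
  stringsAsFactors = FALSE
)

# Each virome is the set of runs sharing one property.
eimeria_virome   <- runs$run[runs$organism == "Eimeria sp."]
garibaldi_virome <- runs$run[runs$location == "Lake Garibaldi"]
lake_virome      <- runs$run[runs$biome == "freshwater lake"]

# Runs that belong to all three viromes at once.
shared_runs <- Reduce(intersect,
                      list(eimeria_virome, garibaldi_virome, lake_virome))
print(shared_runs)
```

A run that appears in `shared_runs` contributes its viruses to all three viromes.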
The {Virome} is represented as a weighted, undirected, bipartite graph where:
- Virus Node (hexagon): an abstract unit of virus, defined here as species-like Operational Taxonomic Units (sOTU) of RNA viruses (see: palmDB)
- Run Node (circle): sequencing runs from the Sequence Read Archive
- Edge (solid line): a contig within the run, with an sOTU identified on it
- Edge (weight): line thickness is scaled by contig 'read coverage'/expression
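One way to hold such a graph in R is an edge table with one row per contig, from which a weighted biadjacency matrix (runs by sOTUs) can be derived. This is a minimal sketch with invented identifiers and coverage values; summing coverage when a run carries several contigs of the same sOTU is an assumption made here, not something the page states.

```r
# Hypothetical detections: one row per contig (edge); all values are invented.
edges <- data.frame(
  run      = c("RUN_A", "RUN_A", "RUN_B", "RUN_C"),
  sotu     = c("u1", "u2", "u1", "u3"),
  coverage = c(12.5, 3.0, 40.2, 1.1),  # edge weight: contig read coverage
  stringsAsFactors = FALSE
)

run_nodes   <- unique(edges$run)   # circles
virus_nodes <- unique(edges$sotu)  # hexagons

# Weighted biadjacency matrix: rows are runs, columns are sOTUs.
# Assumption: coverage is summed over repeated (run, sOTU) pairs.
biadj <- xtabs(coverage ~ run + sotu, data = edges)
print(biadj)
```

Because edges only ever join a run to an sOTU, the matrix has no run-run or virus-virus entries, which is what makes the graph bipartite.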
The label on the Virome Graph is the GenBank "taxonomic species" obtained when the sOTU (for example u26089) is aligned to GenBank nr. Here, u26089 aligns to Eimeria stiedai RNA virus 1 (YP_009551684.1) with 100% amino acid identity.
Each {Virome} can be divided into connected components, which are communities of virus/run nodes joined by at least one edge.
The 45 run nodes contain 30 virus nodes with 107 detection edges. These can be grouped into 6 components, shown below as disjoint graphs.
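Connected components can be recovered from the edge list alone. The sketch below uses a simple label-propagation loop in base R on an invented edge list; it illustrates the idea and is not necessarily how Open Virome computes components (a graph library would be the usual choice for large viromes).

```r
# Hypothetical edge list (run, sOTU); identifiers are invented.
edges <- data.frame(
  run  = c("RUN_A", "RUN_A", "RUN_B", "RUN_C", "RUN_D"),
  sotu = c("u1",    "u2",    "u2",    "u3",    "u3"),
  stringsAsFactors = FALSE
)

nodes <- unique(c(edges$run, edges$sotu))
label <- setNames(seq_along(nodes), nodes)  # every node starts alone

# Label propagation: give both ends of each edge the smaller label and
# repeat until nothing changes. Simple, not optimized for large graphs.
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(edges))) {
    a <- edges$run[i]
    b <- edges$sotu[i]
    m <- min(label[a], label[b])
    if (label[a] != m || label[b] != m) {
      label[c(a, b)] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Nodes sharing a final label form one connected component.
components <- split(names(label), label)
print(components)
```

In this toy input, RUN_A and RUN_B are linked through u2, while RUN_C and RUN_D share only u3, so two disjoint components result.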
Consider an example with 10 nodes: a component can have either (left, virus-rich) 8 viruses in 2 runs, or (right, run-rich) 2 viruses in 8 runs, with varying relationship density (contigs, edges) and expression (coverage, edge weight). Component figures summarize these relationships.
The component count figure shows the number of nodes (distinct viruses + runs) and edges (contigs) per component, which indicates how internally interconnected each component is.
The component degree figure shows the average number of viruses per run (run degree) or runs per virus (sOTU degree) for each component. A component is virus rich when sOTU degree < run degree, or run rich when sOTU degree > run degree. In the Eimeria Virome, all the components are run rich, meaning that on average each virus is represented by multiple runs.
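The degree comparison above can be sketched as follows. The edge list, the `component` column and the helper `summarise_component()` are hypothetical, and the "balanced" label for ties is an addition made here for completeness; the page only defines the virus rich and run rich cases.

```r
# Hypothetical edges already labelled with their component; values invented.
edges <- data.frame(
  component = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  run  = c("RUN_A", "RUN_B", "RUN_C", "RUN_C", "RUN_D",
           "RUN_E", "RUN_E", "RUN_E", "RUN_F"),
  sotu = c("u1", "u1", "u1", "u2", "u2",
           "u3", "u4", "u5", "u5"),
  stringsAsFactors = FALSE
)

summarise_component <- function(df) {
  n_edges     <- nrow(unique(df[, c("run", "sotu")]))
  run_degree  <- n_edges / length(unique(df$run))   # mean viruses per run
  sotu_degree <- n_edges / length(unique(df$sotu))  # mean runs per virus
  class <- if (sotu_degree > run_degree) {
    "run rich"
  } else if (sotu_degree < run_degree) {
    "virus rich"
  } else {
    "balanced"  # tie case, not defined in the text above
  }
  data.frame(component = df$component[1], run_degree = run_degree,
             sotu_degree = sotu_degree, class = class,
             stringsAsFactors = FALSE)
}

component_summary <- do.call(
  rbind, lapply(split(edges, edges$component), summarise_component))
print(component_summary)
```

Component 1 has 4 runs and 2 viruses, so each virus is seen in more runs than each run has viruses; component 2 is the reverse.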
For each sOTU, the Virome Enrichment (Vrich) is the fraction of all observations of that sOTU that are contained in the {Virome}. Values range from 0.0 to 1.0.
Vrich = [ Number of times Virus occurs in {Virome} ] / [ Number of times Virus occurs in all Datasets ]
The size of each virus node is scaled by Vrich. For example, Eimeria stiedai RNA virus (5 of its 6 observations fall in the virome) and Red Mite associated Cystovirus (2 of its 1250 observations) yield Vrich values of 0.833 and 0.0016, respectively.
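The two quoted values follow directly from the ratio. The helper name `vrich()` below is hypothetical; the counts are the ones given in the text.

```r
# Vrich = observations inside the virome / observations in all datasets.
vrich <- function(n_in_virome, n_in_all) {
  stopifnot(n_in_all > 0, n_in_virome <= n_in_all)
  n_in_virome / n_in_all
}

vrich_values <- c(
  "Eimeria stiedai RNA virus"      = vrich(5, 6),
  "Red Mite associated Cystovirus" = vrich(2, 1250)
)
print(round(vrich_values, 4))
```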
The Vexact score is the -log10( p.value ) of a Fisher's Exact Test with Bonferroni multiple-testing correction, capped at 10, with a score of exactly 0 raised to 0.1 (nominal range [0.1, 10]). The terms are defined as follows:
{Virome} is the sum of observations of all i sOTU in the given virome
{Serratus} is the sum of observations of all sOTU across all of Serratus
N : Count of all sOTU observations across {Virome}
n_vir : The observed ith sOTU count in {Virome}
n_out : The observed ith sOTU count in {Serratus}, outside {Virome}
n_total : The total count of ith sOTU observations (n_vir + n_out)
M : Count of all sOTU observations across {Serratus}
m_vir : The sum of all non-ith sOTU observations in {Virome}
m_out : The sum of all non-ith sOTU observations in {Serratus}, outside {Virome}
m_total : The total count of all non-ith sOTU observations (m_vir + m_out)
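# Fisher's Exact Test (one-sided: is the ith sOTU over-represented in {Virome}?)
# Rows: inside / outside {Virome}; columns: ith sOTU / all other sOTU
FT <- fisher.test( rbind( c( n_vir, m_vir ),
                          c( n_out, m_out ) ),
                   alternative = 'greater' )
# Virome Exact Score
# n.tests is the number of tests performed (Bonferroni correction factor)
v.exact <- -log10( min( 1, FT$p.value * n.tests ) )
# If Virome Exact is > 10 or Inf, set it to 10
if ( v.exact > 10 ) {
  v.exact <- 10
}
# If Virome Exact is == 0, set it to 0.1
if ( v.exact == 0 ) {
  v.exact <- 0.1
}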
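As a usage sketch, the same steps can be wrapped in a function and run on made-up counts. The wrapper name `virome_exact()` and every count below are hypothetical (only the 5-in, 1-out split echoes the Eimeria stiedai RNA virus example; the background counts and the number of tests are invented).

```r
# Sketch only: virome_exact() and all counts below are hypothetical.
virome_exact <- function(n_vir, m_vir, n_out, m_out, n_tests) {
  contingency <- rbind(c(n_vir, m_vir),  # inside {Virome}
                       c(n_out, m_out))  # outside {Virome}
  ft <- fisher.test(contingency, alternative = "greater")
  score <- -log10(min(1, ft$p.value * n_tests))  # Bonferroni-adjusted
  if (score > 10) score <- 10   # also catches Inf
  if (score == 0) score <- 0.1
  score
}

# An sOTU seen mostly inside the virome versus one never seen inside it.
enriched     <- virome_exact(n_vir = 5, m_vir = 100,
                             n_out = 1, m_out = 100000, n_tests = 30)
not_enriched <- virome_exact(n_vir = 0, m_vir = 100,
                             n_out = 6, m_out = 100000, n_tests = 30)
print(c(enriched = enriched, not_enriched = not_enriched))
```

The first call describes a strongly virome-specific sOTU and scores far higher than the second, whose adjusted p-value is 1.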
The Virome Rank, or Vrank, is a heuristic score that combines a measure of the centrality of a virus within a Virome, computed with the Google PageRank algorithm, with its Virome Enrichment (Vrich) and Vexact scores. This can be thought of as a way of identifying viruses which are "core" or abundant within the virome, supported by multiple datasets.
To calculate Vrank:
ViromeRank_sOTU = PageRank_sOTU * ViromeEnrichment_sOTU * VExact_sOTU
PageRank and Vrank are related to each other via Vrich. Note that while the u653854 Rabbit hemorrhagic disease virus node is central, it is not virome-specific and is thus down-weighted in importance by Vrank.
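The product above and the down-weighting effect can be sketched in base R. Several things here are assumptions rather than the project's implementation: the power-iteration `pagerank()` helper, the damping factor of 0.85, the use of unweighted edges, and all identifiers and Vrich/Vexact values, which are invented.

```r
# Hypothetical virome: u1 is central (3 runs), u2 is peripheral (1 run).
edges <- data.frame(
  run  = c("RUN_A", "RUN_B", "RUN_C", "RUN_C"),
  sotu = c("u1",    "u1",    "u1",    "u2"),
  stringsAsFactors = FALSE
)
nodes <- unique(c(edges$run, edges$sotu))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[cbind(edges$run, edges$sotu)] <- 1  # undirected: fill both directions
adj[cbind(edges$sotu, edges$run)] <- 1

# Power-iteration PageRank (assumed damping 0.85; every node has an edge).
pagerank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  transition <- sweep(adj, 2, colSums(adj), "/")  # column-stochastic
  rank <- rep(1 / n, n)
  for (iter in seq_len(max_iter)) {
    new_rank <- (1 - damping) / n + damping * as.vector(transition %*% rank)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  setNames(rank, rownames(adj))
}
pr <- pagerank(adj)

# Invented scores: u1 is common everywhere, u2 is virome-specific.
scores <- data.frame(sotu = c("u1", "u2"),
                     vrich = c(0.002, 0.8), vexact = c(0.1, 10),
                     stringsAsFactors = FALSE)
scores$pagerank <- unname(pr[scores$sotu])
scores$vrank <- scores$pagerank * scores$vrich * scores$vexact
print(scores)
```

Here u1 has the higher PageRank but the lower Vrank, mirroring how a central yet non-specific virus is down-weighted.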