Find_novel_viruses_E_viromic - ababaian/serratus GitHub Wiki

E. Virome Analaysis - Toxoplasma gondii Case Study

[~1 day] A Virome is a set of viruses which share a common host, environmental, or ecological niche. For example one can describe the "Human Virome", a "Antarctic Soil Virome" or the "Rainforest Virome". As such it is useful to query for all Serratus database viruses which associate with common factor. This tutorial explains how to generate and execute such a search query efficiently.

As an illustrative example, this tutorial will aim to describe the Serratus-associated virome of the unicellular parasite Toxoplasma gondii(NCBI taxid: 5811).

Upfront: This is not a "solved" tutorial, in fact there's a lot it leaves to be desired but it's a current work through of a complex problem. Follow along, let me know your thoughts, where this is good and where it is bad. Let's make it better together.

Prerequisites

Internet web browser
Unix-based command-line
pgAdmin (postgreSQL client)
Google Cloud Account [Optional]
R (RStudio)

Download Tutorial E Files

1. Defining "Virome Search Criteria"

There is not a perfect means to define a Virome Inclusion and Exclusion criteria. What is important is to define the criteria in a systematic, unbiased and reproducible manner.

A) SRA Taxonomy Search

Every SRA sequencing run is associated with a limited set of "run-specific meta-data" called the SRARunInfo. This contains a column called scientific_name which is a human-written label for the sequencing run which captures what taxonomic "organism" the dataset is from. To search for Toxoplasma gondii (txid: 5811) we will select its taxonomic parent, the genus Toxoplasma (txid: 5810) to include Toxoplasma sp. YC-2010a, a Toxo which is not categorized under T. gondii.

To search the SRA for such organisms, we can search the SRA:

[`txid5810[Organism:exp]`](https://www.ncbi.nlm.nih.gov/sra/?term=txid5810[Organism:exp])

which on 23/06/28 returned 2,813 datasets. Download this SraRunInfo file (txid5810_SraRunInfo.csv in tutorial package). Downloaded file has 1,856 matching 'runs'.

Figure 1. Taxonomy Search

B) SRA Meta-data Search

Sometimes datasets are not labelled based on the parasite T. gondii, instead the host organism is choosen (Homo sapiens, Mus musculus, ...). To search if T. gondii appears in any meta-data we can search for different spellings/synonyms. Search is exact here.

[`"Toxoplasma gondii" OR "Toxoplasma" OR "T. gondii" OR "T gondii"`](https://www.ncbi.nlm.nih.gov/sra/?term=%22Toxoplasma+gondii%22+OR+%22Toxoplasma%22+OR+%22T.+gondii%22)

which on 23/06/28 returned 6,394 datasets. Download this SraRunInfo file (toxo_SraRunInfo.csv). Downloaded file has 23,531 matching 'runs'.

NOTE: This type of search is more likely to yield False Positive matches for T. gondii, or places where your search returns synonym results. For example this Neospora caninum sample mentions "T. gondii".

Figure 2. Meta-data Search

C) Sequence Content Search (optional)

What if the data-generators were unaware that their sample contained T. gondii? It's possible to search for known-organism genomes using a precomputed hashed-kmer database called NCBI STAT. To use STAT, you must register for a GCP account and follow these instructions.

STAT BigQuery for libraries containing great than 50 Toxoplasma-mapping reads:

    SELECT
      acc,
      tax_id,
      name,
      total_count,
      self_count
    FROM
      nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
    WHERE
      tax_id in (5810)
      AND total_count>50

which on 23/06/28 returned 20,198 datasets. Download this statbigquery file (txid5810_statbigquery.csv). The number of T.gondii+ reads detected by STAT in each of the 20,198 returned runs. There are 12,863 runs which contain >1000 reads.

T gondii STAT

Set Comparison

Set Overlaps. Made with https://bioinformatics.psb.ugent.be/cgi-bin/liste/Venn/calculate_venn.htpl

Metadata Only Dataset: SRR11160759 : Described as 10T1/2 cells treated with parasite-free lysate (mock) 2 hpi . A control dataset from a T.gondii infection study.
BigQuery Only Dataset: SRR21616415 : Described as HFFs (human cells) infected with T.gondii in Sample description only. Meta-data not captured by SRA metadata search
BigQuery Only Dataset: SRR870621 : Only described as infected with type II (Pru A7) for 8 hrs, which is a strain of T. gondii. Meta-data false negative, big-query true postive.

To err on the side of inclusion for now, we will take the Union of these sets, so 32,601 runs as representing the total T. gondii sequencing set.

D) Set Union and Virome-adjacent controls (optional)

We, unfortunately, now have two different csv tables with slightly varying meta-data. For simplicity we will take the Union of these sets and generate a uniform collection of meta-data so that the data can be processed uniformly.

Depending on your experimental design it may be beneficial to include a Virome adjacent or Virome control datasets in downstream analysis. Briefly, every SRA 'run' belongs to a greater bioproject. One bioproject can group together say 20 runs, and based on meta-data our queries could return only a sub-set of a given bioproject. For example by searching for T. gondii with STAT, in a study containing 20 samples of cat muscle biopsies, only 2/20 may be positive for T gondii, and therefore we would like to include the remaining 18/20 cat biopsies of that study. This will prove useful as a negative control to deplete for non-specific viruses with respect to our query of T. gondii.

#!/bin/bash
# Generate list of all UNIQUE SRA run id from all query sets above
cut -f1 -d',' toxo_SraRunInfo.csv        > tg_run.list.tmp
cut -f1 -d',' txid5810_SraRunInfo.csv   >> tg_run.list.tmp
cut -f1 -d',' txid5810_statbigquery.csv >> tg_run.list.tmp

# Remove header lines
grep -v "^Run" tg_run.list.tmp \
  | grep -v "^acc" - \
  | sort -du - > tg.runlist 

wc -l tg.runlist
# 32601 tg.runlist

This list can now be used as an input to query for all SRA meta-data in big query

#GCP Big Query
    SELECT
      acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent 
    FROM
      `nih-sra-datastore.sra.metadata` 
    WHERE
      acc in ('DRR001705', 'DRR001706', 'DRR002461', ..., SRR9997107', 'SRR9997108')

Saved as tg.SraMetadata.csv returned 32,474 "tg-positive" SRA runs (99.6%) of the search query. From this we will create a list of all 2,586 unique bioproject identifiers (column 4) and re-query for all related runs.

#!/bin/bash
cut -f4 -d', tg_virome.SraMetadata.csv | sort -u - > tg_bioproject.list

#GCP Big Query
    SELECT
      acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent 
    FROM
      `nih-sra-datastore.sra.metadata` 
    WHERE
      bioproject in ('PRJDA33425', 'PRJDA33429', ..., 'PRJNA985929', 'PRJNA988114')

Saved as tg_adjacent.SraMetadata.csv Which yields 668,195 total, or 635,721 "tg-negative" SRA runs. This will serve as a "negative control" for non-specific virus-associations.

2. Generating the "Target Virome" and "Off-target Virome"

Using the "Tg-postive" and "Tg-negative" runs, we will query the Serratus SQL tables to retrieve sOTU which were found within those runs. This molecular barcode approximates RNA virus species and can be used to meaningfully group related sequences when looking for an association.

# Serratus postgres SQL Query
SELECT * FROM public.palm_sra2
WHERE qc_pass = 'true' AND
run_id in ('DRR001701', 
'DRR001702', 
'DRR001703', 
'DRR001704',
...
'SRR9997116', 
'SRR9997117', 
'SRR9997118')
ORDER BY run_id ASC

Saved as tg_adj_U_palmsra.csv which returned 121,907 virus-run observations in 15,256 SRA runs, containing 30,828 distinct virus sOTU.

3. Virome summary statistics

The following section is all coded in R for analysis of the above generated datasets. The R-markdown file tg_virome_analysis.Rmd is provided.

Initialize workspace and imports

#{r}
# Library imports
library(ggplot2)
library(reshape2)
library(dplyr)

#{r}
# Import SRA Meta-data
srameta <- read.csv2('tg_adjacent.SraMetadata.csv', sep = ",")
  tg.list <- read.csv2('tg.runlist', header = FALSE, col.names = 'acc')
  srameta$tg.status <- 'tg.negative'
  srameta$tg.status[ srameta$acc %in% tg.list$acc ] <- 'tg.positive'

# Import Serratus sOTU table
palmsra <- read.csv2('tg_adj_U_palmsra.csv', sep = ",")

Basic SRA run metadata by Tg-status

#{r}
# Basic summary statistic plots - Labelled Organism
srameta.organism         <- dcast( srameta, organism ~ tg.status )
  srameta.organism$total   <- apply(srameta.organism[ ,2:3], 1, sum) 
  srameta.organism$tg.perc <- 100 * srameta.organism$tg.positive / srameta.organism$total 
  srameta.organism <- srameta.organism[ order(srameta.organism$tg.positive, decreasing = T), ]
  
# Look at top-20 matches
org20 <- srameta.organism[1:20, ]
print( org20 )

# Basic summary statistic plots - Summary Run and BioProject Counts
org20.melt <- melt(org20[, 1:3],
                   id.vars = 'organism',
                   variable.name = 'tg.status')

# Re-order factor levels to retain Tg abundance levels
org20.melt$organism <- factor(x = org20.melt$organism,
                              levels = rev(as.character(org20$organism)) )

# Create bar plot of top-20 occuring organisms
org.bar <- ggplot( org20.melt, aes( x = organism, y = value, fill = tg.status)) +
  geom_bar(stat = 'identity') +
  theme_bw()+ theme(legend.position = "none") +
  coord_flip() + facet_grid(~tg.status, scales = "free") +
  xlab('Organism label') + ylab('Count SRA runs')
org.bar

Output:

                            organism tg.negative tg.positive  total     tg.perc
13247           Toxoplasma gondii RH           0       11503  11503 100.0000000
13212              Toxoplasma gondii           0        8002   8002 100.0000000
8375                    Mus musculus       29930        3368  33298  10.1147216
6228                    Homo sapiens      311760        2731 314491   0.8683873
7696               marine metagenome        5966         841   6807  12.3549287
13471              Triticum aestivum        4778         644   5422  11.8775360
10012          Plasmodium falciparum       96617         399  97016   0.4112724
6255                 Hordeum vulgare        1225         181   1406  12.8733997
12688                     Sus scrofa         936         128   1064  12.0300752
11830            seawater metagenome         842         104    946  10.9936575
10893              Rattus norvegicus         468         103    571  18.0385289
5016                Escherichia coli         836         102    938  10.8742004
5777                  gut metagenome        3631          94   3725   2.5234899
8341            mouse gut metagenome          57          80    137  58.3941606
11353      Saccharum hybrid cultivar          41          72    113  63.7168142
6257  Hordeum vulgare subsp. vulgare        1713          58   1771   3.2749859
13234          Toxoplasma gondii GT1           0          57     57 100.0000000
10889             rat gut metagenome           0          54     54 100.0000000
13470                       Triticum         121          53    174  30.4597701
3824                     Danio rerio        1091          50   1141   4.3821209

SRA Tg organisms

#{r}
# Basic summary statistic plots - Assay Types
sra.assay <- ggplot( srameta, aes(x=assay_type, fill = assay_type)) +
  geom_bar() +
  ggtitle("Run Assay Type by T. gondii status") +
  xlab('') + ylab('Count SRA Runs (log)') + 
  scale_y_log10() + coord_polar("x", start = 0) +
  theme_bw() + theme(legend.position = "none") + 
  facet_grid(~tg.status)
sra.assay


# Basic summary statistic plots - library source
sra.source <- ggplot( srameta, aes(x=librarysource, fill = librarysource)) +
  geom_bar() +
  ggtitle("Library source by T. gondii status") +
  xlab('') + ylab('Count SRA Runs (log)') + 
  scale_y_log10() + coord_polar("x", start = 0) +
  theme_bw() + theme(legend.position = "none") + 
  facet_grid(~tg.status)
sra.source

SRA Tg assay type

SRA Tg Library source

Serratus palmsra (sOTU) statistics

Layer in RNA virus sOTU and Tg Status, establish a basic statistical rank-ordering method (Fisher Score) to prioritize inspection of data.

#{r}
# Merge SRA Metadata and palmSRA
palmsra   <- merge(   x = palmsra,     y =  srameta,
                   by.x = 'run_id', all.x = TRUE,
                   by.y = 'acc'   , all.y = FALSE)

# u717319 -- 52 runs -- 5 bioprojects


# Filter for only TRANSCRIPTOMIC / VIRAL RNA libraries
# to simplify analysis
palmsra_total <- palmsra
palmsra <- palmsra[ ( palmsra$librarysource %in% c('TRANSCRIPTOMIC', 'VIRAL RNA')), ]

collapseUnique <- function( key.vec, collapse.vec, status.factor = NULL ){
  # Input two vectors, key and collapse
  # returns a count of each key vector after removing
  # duplicated collapse vector entries
  # Input three vectors, same as above except report 
  # output frequency based on status factor
  # key and collapse vector must be the same length
  # only makes sense if status is assigned based on collapse.vec
  # otherwise status is randomly chosen
  if ( length(key.vec) == length(collapse.vec) ){
    merge.vec <- paste0( key.vec, "-", collapse.vec )
  } else {
    stop()
  }
 
  if ( is.null(status.factor) ){
    # No Status Factor provided
    # Sort and remove duplicates of "key-collapse" vector 
    collapse.df <- data.frame( key.vec, collapse.vec, merge.vec )
    collapse.df <- collapse.df[ order(collapse.df$merge.vec), ]
    collapse.df <- collapse.df[ !duplicated( collapse.df$merge.vec), ]
    
    # Create output data.frame of Key , Frequency post-collapse
    count.df <- data.frame(table(collapse.df$key.vec))
    names(count.df) <- c('key', 'freq')
    count.df <- count.df[ order( count.df$freq, decreasing = TRUE ), ]
    
  } else {
    # Status Factor provided
    collapse.df <- data.frame( key.vec, collapse.vec, merge.vec, status.factor )
    collapse.df <- collapse.df[ order(collapse.df$merge.vec), ]
    collapse.df <- collapse.df[ !duplicated( collapse.df$merge.vec), ]
    
    # Create output data.frame of Key , Frequency post-collapse
    count.df <- table( collapse.df[ , c('key.vec', 'status.factor')] )
    count.df <- data.frame (count.df)
    count.df <- dcast(count.df, key.vec ~ status.factor, value.var = "Freq")
    count.df <- count.df[ order(count.df[, 2], decreasing = TRUE), ]
  }

  return(count.df)
  
}

# For each sOTU, count the number of unique BioProjects for the sOTU
# Requires collapsing sOTU which are observed in multiple RUN within
# a single BioProject
sotu.bp <- collapseUnique( palmsra$sotu, palmsra$bioproject)
  names(sotu.bp) <- c('sotu', 'bp_freq')

# For each sOTU, count the unique number of RUNS it occurs in
# in tg.positive and tg.negative libraries
sotu.run <- collapseUnique( palmsra$sotu, palmsra$run_id, palmsra$tg.status)
  names(sotu.run) <- c('sotu', 'tg.negative.run_freq', 'tg.positive.run_freq')                          

# For each sOTU, count the total occurrence of contigs bearing that sOTUs
# in tg.positive and tg.negative libraries
sotu.contig <- collapseUnique( palmsra$sotu, palmsra$row_id, palmsra$tg.status)
  names(sotu.contig) <- c('sotu', 'tg.negative.contig_freq', 'tg.positive.contig_freq') 

  
# Merge sOTU counting metrics
  tg.sotu.df <- merge(sotu.contig, sotu.run, by = 'sotu')
  tg.sotu.df <- merge(tg.sotu.df , sotu.bp,  by = 'sotu')
  tg.sotu.df <- tg.sotu.df[ order( tg.sotu.df$bp_freq, decreasing = TRUE), ]
  
  # Sort by tg.negative, then tg.positive
  tg.sotu.df <- tg.sotu.df[ order(tg.sotu.df$tg.negative.run_freq, decreasing = T) ,]
  tg.sotu.df <- tg.sotu.df[ order(tg.sotu.df$tg.positive.run_freq, decreasing = T) ,]

# Calculate Fisher's Exact Test (score)
# for each sOTU (Vx), calculate a 2x2 contingency table if 
# Vx is enriched in Tg.positive datasets relative to Tg.negative datasets
# compared against the global occurances of all other sOTU (Vy, where
# Vy = V_total - Vx) in Tg.p and Tg.n datasets.

Tg.p_Vt <- sum(tg.sotu.df$tg.positive.run_freq)
Tg.n_Vt <- sum(tg.sotu.df$tg.negative.run_freq)

# For example, "u173627" which is 8 Tg- and 14 Tg+
V_name <- 'u173627'

Tg.p_Vx <- tg.sotu.df$tg.positive.run_freq[ tg.sotu.df$sotu == V_name ]
Tg.n_Vx <- tg.sotu.df$tg.negative.run_freq[ tg.sotu.df$sotu == V_name ]

test.table <- data.frame(
  "tg.positive" = c( Tg.p_Vx, Tg.p_Vt - Tg.p_Vx ),
  "tg.negative" = c( Tg.n_Vx, Tg.n_Vt - Tg.n_Vx ),
  row.names = c("Test_sOTU", "Other_sOTUs"),
  stringsAsFactors = FALSE
)

print( paste0( "For example sOTU: ", V_name ))
print('')
print( as.matrix( test.table) )

# Fisher's Exact Test
fisher.result <- fisher.test(test.table, alternative = "greater")
print(fisher.result)

# Apply Multiple Testing Correction
mt.correction <- length(tg.sotu.df$sotu) 
print( paste0( "with Bonferonni testing correction factor: ", mt.correction ) )
print( paste0( "  corrected p-value : ", min(1, fisher.result$p.value * mt.correction ) ) )
print( paste0( "  corrected p-score : ", -log( min(1, fisher.result$p.value * mt.correction ) ) ) )

Output:

[1] "For example sOTU: u173627"
[1] ""
            tg.positive tg.negative
Test_sOTU            27          36
Other_sOTUs        1897       20168

  Fisher's Exact Test for Count Data

data:  test.table
p-value = 4.25e-13
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 5.047827      Inf
sample estimates:
odds ratio 
  7.972087 

[1] "with Bonferonni testing correction factor: 8051"
[1] "  corrected p-value : 3.42134144248517e-09"
[1] "  corrected p-score : 19.493233128131"

#{r}
# Wrap the above as a function to return corrected Fisher Score
# Fisher Score = -log( corrected p-value )

fisher.score <- function( tg.sotu.row, Tg.p_Vt, Tg.n_Vt, mt.correction ){

  # col1 <- postive_runs
  # col2 <- negative_runs  
  Tg.p_Vx <- tg.sotu.row[1]
  Tg.n_Vx <- tg.sotu.row[2]
  
  test.table <- data.frame(
  "tg.positive" = c( Tg.p_Vx, Tg.p_Vt - Tg.p_Vx ),
  "tg.negative" = c( Tg.n_Vx, Tg.n_Vt - Tg.n_Vx ),
  row.names = c("Test_sOTU", "Other_sOTUs"),
  stringsAsFactors = FALSE
  )
  
  # Fisher's Exact Test
 fisher.result <- fisher.test(test.table, alternative = "greater")
 
 fisher.pscore <- -log( min(1, fisher.result$p.value * mt.correction ) )
 
 return(fisher.pscore)
 
}

# Calculate Fisher Score for each sOTU 
tg.sotu.df$fisher.score <- apply( tg.sotu.df[, c('tg.positive.run_freq', 'tg.negative.run_freq') ], 1,
                                  fisher.score, Tg.p_Vt, Tg.n_Vt, mt.correction)
# Rank order the output based on fisher.score
tg.sotu.df      <- tg.sotu.df[ order(tg.sotu.df$fisher.score, decreasing = TRUE), ]
tg.sotu.df$rank <- 1:length(tg.sotu.df$sotu)

# Calculate Tg+ : Tg- runs
tg.sotu.df$tg.ratio <- tg.sotu.df$tg.positive.run_freq / tg.sotu.df$tg.negative.run_freq
tg.sotu.df$tg.ratio[ tg.sotu.df$tg.ratio == Inf ] <- 25


# Top 10 hits
print(tg.sotu.df[1:10, c("rank", "sotu", "tg.positive.run_freq", "tg.negative.run_freq", "fisher.score", "tg.ratio")])

Looking at the Top 10 hits, focus on the sOTU where it is present in 11 Tg+ libraries and zero Tg- libraries.

Output:

     rank    sotu tg.positive.run_freq tg.negative.run_freq fisher.score  tg.ratio
1352    1 u173627                   27                   36    19.493233  0.750000
5673    2  u72795                   11                    0    17.899413 25.000000 *** INSEPCTION ***
6909    3 u874078                   14                    8    13.240015  1.750000
1681    4     u20                   10                    1    13.136257 10.000000
2557    5   u2879                    9                    0    13.005497 25.000000
1049    6  u14499                    8                    0    10.559254 25.000000
1260    7  u16867                   11                    6     8.970801  1.833333
1897    8  u22027                   10                    5     7.853949  2.000000
2592    9 u291054                   12                   12     6.531971  1.000000
4342   10  u54378                    6                    0     5.668199 25.000000

#{r}
# Scatterplot of Tg+/- runs and their BioProject frequency and Fisher Score
run.scatter <- ggplot( tg.sotu.df, aes( tg.negative.run_freq, tg.positive.run_freq,
                                        color = fisher.score, size = bp_freq) ) +
  geom_point(alpha = 0.4) + 
  geom_abline( slope = 1, intercept = 0, color = 'red') +
  theme_bw() + theme(aspect.ratio = 1) +
  scale_color_gradient(
    low = "gray30",
    high = "red",
    limits = c(0,3)) +
  scale_x_log10(limits = c(0.8, 1500)) + scale_y_log10(limits = c(0.8, 1500))
run.scatter

Tg postive vs negative scatter

#{r}
# Fisher Score vs. the ratio of Tg+ : Tg-
# 
run.rank <- ggplot( tg.sotu.df, aes( fisher.score, tg.ratio,
                                      color = rank,
                                      size = bp_freq) ) +
  geom_point(alpha = 0.2) + 
  geom_vline(xintercept = 3, color = 'red') +
  theme_bw() + theme(aspect.ratio = 1) +
  scale_color_gradient2(
    low = "firebrick",
    mid = "red",
    high = "gray80",
    limits = c(1,1000))
  run.rank

Tg postive vs negative scatter

sOTU Inpsection

Inspection of u72795

We have now essentially created a rank-ordered list of virus-hits to inspect further. Let's consider u72795 with 11 Tg+ and 0 Tg- observations in our dataset across 3 BioProjects.

# Inspect whhere contigs matched this sOTU

print(palmsra[ palmsra$sotu == 'u72795',
               c('run_id', 'organism', 'bioproject', 'sotu', 'coverage', 'evalue', 'q_sequence' ,'tg.status') ])

            run_id     organism  bioproject   sotu coverage  evalue
32550  SRR10593837 Mus musculus PRJNA593831 u72795        2 1.8e-69
32551  SRR10593838 Mus musculus PRJNA593831 u72795        2 1.2e-69
32554  SRR10593841 Mus musculus PRJNA593831 u72795        1   5e-70
39169  SRR11792738 Mus musculus PRJNA632742 u72795        2 1.2e-69
39170  SRR11792740 Mus musculus PRJNA632742 u72795        2 3.2e-69
39171  SRR11792741 Mus musculus PRJNA632742 u72795        2 1.8e-69
120458  SRR9639534 Mus musculus PRJNA552423 u72795        2 3.2e-69
120459  SRR9639536 Mus musculus PRJNA552423 u72795        2 1.8e-69
120460  SRR9639537 Mus musculus PRJNA552423 u72795        2 3.2e-69
120461  SRR9639538 Mus musculus PRJNA552423 u72795        2 1.2e-69
120462  SRR9639539 Mus musculus PRJNA552423 u72795        2 1.8e-69
                                                                                                       q_sequence   tg.status
32550  FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
32551  FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
32554  FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
39169  FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
39170  FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
39171  FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120458 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120459 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120460 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120461 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120462 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive

BLASTP of this sequence identifies it as [Crab-eating macaque hepacivirus] CAI5757786.1. In addition to the above libraries, this virus sOTU is found in several Macaca fascicularis datasets as identified via palmID. As you can see, this list is able to prioritize hits to analyze, but is not definitive on it's own and requires additional analysis.

Happy hunting!

Find_novel_viruses_E_viromic - ababaian/serratus GitHub Wiki

E. Virome Analaysis - Toxoplasma gondii Case Study

Prerequisites

Download Tutorial E Files

1. Defining "Virome Search Criteria"

A) SRA Taxonomy Search

B) SRA Meta-data Search

C) Sequence Content Search (optional)

Set Comparison

D) Set Union and Virome-adjacent controls (optional)

2. Generating the "Target Virome" and "Off-target Virome"

3. Virome summary statistics

Initialize workspace and imports

Basic SRA run metadata by Tg-status

Serratus palmsra (sOTU) statistics

sOTU Inpsection

See also