Find_novel_viruses_E_viromic - ababaian/serratus GitHub Wiki
E. Virome Analaysis - Toxoplasma gondii Case Study
[~1 day] A Virome is a set of viruses which share a common host, environmental, or ecological niche. For example one can describe the "Human Virome", a "Antarctic Soil Virome" or the "Rainforest Virome". As such it is useful to query for all Serratus database viruses which associate with common factor. This tutorial explains how to generate and execute such a search query efficiently.
As an illustrative example, this tutorial will aim to describe the Serratus-associated virome of the unicellular parasite Toxoplasma gondii(NCBI taxid: 5811).
Upfront: This is not a "solved" tutorial, in fact there's a lot it leaves to be desired but it's a current work through of a complex problem. Follow along, let me know your thoughts, where this is good and where it is bad. Let's make it better together.
Prerequisites
- Internet web browser
- Unix-based command-line
- pgAdmin (postgreSQL client)
- Google Cloud Account [Optional]
- R (RStudio)
Download Tutorial E Files
1. Defining "Virome Search Criteria"
There is not a perfect means to define a Virome Inclusion
and Exclusion
criteria. What is important is to define the criteria in a systematic, unbiased and reproducible manner.
A) SRA Taxonomy Search
Every SRA sequencing run is associated with a limited set of "run-specific meta-data" called the SRARunInfo
. This contains a column called scientific_name
which is a human-written label for the sequencing run which captures what taxonomic "organism" the dataset is from. To search for Toxoplasma gondii
(txid: 5811) we will select its taxonomic parent, the genus Toxoplasma
(txid: 5810) to include Toxoplasma sp. YC-2010a, a Toxo which is not categorized under T. gondii.
To search the SRA for such organisms, we can search the SRA:
[`txid5810[Organism:exp]`](https://www.ncbi.nlm.nih.gov/sra/?term=txid5810[Organism:exp])
which on 23/06/28
returned 2,813
datasets. Download this SraRunInfo
file (txid5810_SraRunInfo.csv
in tutorial package). Downloaded file has 1,856
matching 'runs'.
B) SRA Meta-data Search
Sometimes datasets are not labelled based on the parasite T. gondii
, instead the host organism is choosen (Homo sapiens
, Mus musculus
, ...). To search if T. gondii appears in any meta-data we can search for different spellings/synonyms. Search is exact here.
[`"Toxoplasma gondii" OR "Toxoplasma" OR "T. gondii" OR "T gondii"`](https://www.ncbi.nlm.nih.gov/sra/?term=%22Toxoplasma+gondii%22+OR+%22Toxoplasma%22+OR+%22T.+gondii%22)
which on 23/06/28
returned 6,394
datasets. Download this SraRunInfo
file (toxo_SraRunInfo.csv
). Downloaded file has 23,531
matching 'runs'.
NOTE: This type of search is more likely to yield False Positive
matches for T. gondii
, or places where your search returns synonym results. For example this Neospora caninum sample mentions "T. gondii".
C) Sequence Content Search (optional)
What if the data-generators were unaware that their sample contained T. gondii
? It's possible to search for known-organism genomes using a precomputed hashed-kmer database called NCBI STAT
. To use STAT
, you must register for a GCP
account and follow these instructions.
STAT BigQuery for libraries containing great than 50 Toxoplasma-mapping reads:
SELECT
acc,
tax_id,
name,
total_count,
self_count
FROM
nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
WHERE
tax_id in (5810)
AND total_count>50
which on 23/06/28
returned 20,198
datasets. Download this statbigquery
file (txid5810_statbigquery.csv
). The number of T.gondii
+ reads detected by STAT in each of the 20,198
returned runs. There are 12,863
runs which contain >1000 reads.
Set Comparison
- Metadata Only Dataset: SRR11160759 : Described as
10T1/2 cells treated with parasite-free lysate (mock) 2 hpi
. A control dataset from aT.gondii
infection study. - BigQuery Only Dataset: SRR21616415 : Described as HFFs (human cells) infected with T.gondii in
Sample
description only. Meta-data not captured by SRA metadata search - BigQuery Only Dataset: SRR870621 : Only described as
infected with type II (Pru A7) for 8 hrs
, which is a strain of T. gondii. Meta-data false negative, big-query true postive.
To err on the side of inclusion for now, we will take the Union of these sets, so 32,601
runs as representing the total T. gondii sequencing set.
D) Set Union and Virome-adjacent controls (optional)
We, unfortunately, now have two different csv
tables with slightly varying meta-data. For simplicity we will take the Union
of these sets and generate a uniform collection of meta-data so that the data can be processed uniformly.
Depending on your experimental design it may be beneficial to include a Virome adjacent
or Virome control
datasets in downstream analysis. Briefly, every SRA 'run' belongs to a greater bioproject
. One bioproject
can group together say 20 runs
, and based on meta-data our queries could return only a sub-set of a given bioproject. For example by searching for T. gondii
with STAT
, in a study containing 20 samples of cat muscle biopsies, only 2/20 may be positive for T gondii
, and therefore we would like to include the remaining 18/20 cat biopsies of that study. This will prove useful as a negative control
to deplete for non-specific viruses with respect to our query of T. gondii
.
#!/bin/bash
# Generate list of all UNIQUE SRA run id from all query sets above
cut -f1 -d',' toxo_SraRunInfo.csv > tg_run.list.tmp
cut -f1 -d',' txid5810_SraRunInfo.csv >> tg_run.list.tmp
cut -f1 -d',' txid5810_statbigquery.csv >> tg_run.list.tmp
# Remove header lines
grep -v "^Run" tg_run.list.tmp \
| grep -v "^acc" - \
| sort -du - > tg.runlist
wc -l tg.runlist
# 32601 tg.runlist
This list can now be used as an input to query for all SRA meta-data in big query
#GCP Big Query
SELECT
acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent
FROM
`nih-sra-datastore.sra.metadata`
WHERE
acc in ('DRR001705', 'DRR001706', 'DRR002461', ..., SRR9997107', 'SRR9997108')
Saved as tg.SraMetadata.csv
returned 32,474
"tg-positive" SRA runs (99.6%) of the search query. From this we will create a list of all 2,586
unique bioproject
identifiers (column 4) and re-query for all related runs.
#!/bin/bash
cut -f4 -d', tg_virome.SraMetadata.csv | sort -u - > tg_bioproject.list
#GCP Big Query
SELECT
acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent
FROM
`nih-sra-datastore.sra.metadata`
WHERE
bioproject in ('PRJDA33425', 'PRJDA33429', ..., 'PRJNA985929', 'PRJNA988114')
Saved as tg_adjacent.SraMetadata.csv
Which yields 668,195
total, or 635,721
"tg-negative" SRA runs. This will serve as a "negative control" for non-specific virus-associations.
2. Generating the "Target Virome" and "Off-target Virome"
Using the "Tg-postive" and "Tg-negative" runs, we will query the Serratus
SQL tables to retrieve sOTU
which were found within those runs. This molecular barcode approximates RNA virus species and can be used to meaningfully group related sequences when looking for an association.
# Serratus postgres SQL Query
SELECT * FROM public.palm_sra2
WHERE qc_pass = 'true' AND
run_id in ('DRR001701',
'DRR001702',
'DRR001703',
'DRR001704',
...
'SRR9997116',
'SRR9997117',
'SRR9997118')
ORDER BY run_id ASC
Saved as tg_adj_U_palmsra.csv
which returned 121,907
virus-run observations in 15,256
SRA runs, containing 30,828
distinct virus sOTU.
3. Virome summary statistics
The following section is all coded in R
for analysis of the above generated datasets. The R-markdown file tg_virome_analysis.Rmd
is provided.
Initialize workspace and imports
#{r}
# Library imports
library(ggplot2)
library(reshape2)
library(dplyr)
#{r}
# Import SRA Meta-data
srameta <- read.csv2('tg_adjacent.SraMetadata.csv', sep = ",")
tg.list <- read.csv2('tg.runlist', header = FALSE, col.names = 'acc')
srameta$tg.status <- 'tg.negative'
srameta$tg.status[ srameta$acc %in% tg.list$acc ] <- 'tg.positive'
# Import Serratus sOTU table
palmsra <- read.csv2('tg_adj_U_palmsra.csv', sep = ",")
Basic SRA run metadata by Tg-status
#{r}
# Basic summary statistic plots - Labelled Organism
srameta.organism <- dcast( srameta, organism ~ tg.status )
srameta.organism$total <- apply(srameta.organism[ ,2:3], 1, sum)
srameta.organism$tg.perc <- 100 * srameta.organism$tg.positive / srameta.organism$total
srameta.organism <- srameta.organism[ order(srameta.organism$tg.positive, decreasing = T), ]
# Look at top-20 matches
org20 <- srameta.organism[1:20, ]
print( org20 )
# Basic summary statistic plots - Summary Run and BioProject Counts
org20.melt <- melt(org20[, 1:3],
id.vars = 'organism',
variable.name = 'tg.status')
# Re-order factor levels to retain Tg abundance levels
org20.melt$organism <- factor(x = org20.melt$organism,
levels = rev(as.character(org20$organism)) )
# Create bar plot of top-20 occuring organisms
org.bar <- ggplot( org20.melt, aes( x = organism, y = value, fill = tg.status)) +
geom_bar(stat = 'identity') +
theme_bw()+ theme(legend.position = "none") +
coord_flip() + facet_grid(~tg.status, scales = "free") +
xlab('Organism label') + ylab('Count SRA runs')
org.bar
Output:
organism tg.negative tg.positive total tg.perc
13247 Toxoplasma gondii RH 0 11503 11503 100.0000000
13212 Toxoplasma gondii 0 8002 8002 100.0000000
8375 Mus musculus 29930 3368 33298 10.1147216
6228 Homo sapiens 311760 2731 314491 0.8683873
7696 marine metagenome 5966 841 6807 12.3549287
13471 Triticum aestivum 4778 644 5422 11.8775360
10012 Plasmodium falciparum 96617 399 97016 0.4112724
6255 Hordeum vulgare 1225 181 1406 12.8733997
12688 Sus scrofa 936 128 1064 12.0300752
11830 seawater metagenome 842 104 946 10.9936575
10893 Rattus norvegicus 468 103 571 18.0385289
5016 Escherichia coli 836 102 938 10.8742004
5777 gut metagenome 3631 94 3725 2.5234899
8341 mouse gut metagenome 57 80 137 58.3941606
11353 Saccharum hybrid cultivar 41 72 113 63.7168142
6257 Hordeum vulgare subsp. vulgare 1713 58 1771 3.2749859
13234 Toxoplasma gondii GT1 0 57 57 100.0000000
10889 rat gut metagenome 0 54 54 100.0000000
13470 Triticum 121 53 174 30.4597701
3824 Danio rerio 1091 50 1141 4.3821209
#{r}
# Basic summary statistic plots - Assay Types
sra.assay <- ggplot( srameta, aes(x=assay_type, fill = assay_type)) +
geom_bar() +
ggtitle("Run Assay Type by T. gondii status") +
xlab('') + ylab('Count SRA Runs (log)') +
scale_y_log10() + coord_polar("x", start = 0) +
theme_bw() + theme(legend.position = "none") +
facet_grid(~tg.status)
sra.assay
# Basic summary statistic plots - library source
sra.source <- ggplot( srameta, aes(x=librarysource, fill = librarysource)) +
geom_bar() +
ggtitle("Library source by T. gondii status") +
xlab('') + ylab('Count SRA Runs (log)') +
scale_y_log10() + coord_polar("x", start = 0) +
theme_bw() + theme(legend.position = "none") +
facet_grid(~tg.status)
sra.source
Serratus palmsra (sOTU) statistics
Layer in RNA virus sOTU and Tg Status, establish a basic statistical rank-ordering method (Fisher Score) to prioritize inspection of data.
#{r}
# Merge SRA Metadata and palmSRA
palmsra <- merge( x = palmsra, y = srameta,
by.x = 'run_id', all.x = TRUE,
by.y = 'acc' , all.y = FALSE)
# u717319 -- 52 runs -- 5 bioprojects
# Filter for only TRANSCRIPTOMIC / VIRAL RNA libraries
# to simplify analysis
palmsra_total <- palmsra
palmsra <- palmsra[ ( palmsra$librarysource %in% c('TRANSCRIPTOMIC', 'VIRAL RNA')), ]
collapseUnique <- function( key.vec, collapse.vec, status.factor = NULL ){
# Input two vectors, key and collapse
# returns a count of each key vector after removing
# duplicated collapse vector entries
# Input three vectors, same as above except report
# output frequency based on status factor
# key and collapse vector must be the same length
# only makes sense if status is assigned based on collapse.vec
# otherwise status is randomly chosen
if ( length(key.vec) == length(collapse.vec) ){
merge.vec <- paste0( key.vec, "-", collapse.vec )
} else {
stop()
}
if ( is.null(status.factor) ){
# No Status Factor provided
# Sort and remove duplicates of "key-collapse" vector
collapse.df <- data.frame( key.vec, collapse.vec, merge.vec )
collapse.df <- collapse.df[ order(collapse.df$merge.vec), ]
collapse.df <- collapse.df[ !duplicated( collapse.df$merge.vec), ]
# Create output data.frame of Key , Frequency post-collapse
count.df <- data.frame(table(collapse.df$key.vec))
names(count.df) <- c('key', 'freq')
count.df <- count.df[ order( count.df$freq, decreasing = TRUE ), ]
} else {
# Status Factor provided
collapse.df <- data.frame( key.vec, collapse.vec, merge.vec, status.factor )
collapse.df <- collapse.df[ order(collapse.df$merge.vec), ]
collapse.df <- collapse.df[ !duplicated( collapse.df$merge.vec), ]
# Create output data.frame of Key , Frequency post-collapse
count.df <- table( collapse.df[ , c('key.vec', 'status.factor')] )
count.df <- data.frame (count.df)
count.df <- dcast(count.df, key.vec ~ status.factor, value.var = "Freq")
count.df <- count.df[ order(count.df[, 2], decreasing = TRUE), ]
}
return(count.df)
}
# For each sOTU, count the number of unique BioProjects for the sOTU
# Requires collapsing sOTU which are observed in multiple RUN within
# a single BioProject
sotu.bp <- collapseUnique( palmsra$sotu, palmsra$bioproject)
names(sotu.bp) <- c('sotu', 'bp_freq')
# For each sOTU, count the unique number of RUNS it occurs in
# in tg.positive and tg.negative libraries
sotu.run <- collapseUnique( palmsra$sotu, palmsra$run_id, palmsra$tg.status)
names(sotu.run) <- c('sotu', 'tg.negative.run_freq', 'tg.positive.run_freq')
# For each sOTU, count the total occurrence of contigs bearing that sOTUs
# in tg.positive and tg.negative libraries
sotu.contig <- collapseUnique( palmsra$sotu, palmsra$row_id, palmsra$tg.status)
names(sotu.contig) <- c('sotu', 'tg.negative.contig_freq', 'tg.positive.contig_freq')
# Merge sOTU counting metrics
tg.sotu.df <- merge(sotu.contig, sotu.run, by = 'sotu')
tg.sotu.df <- merge(tg.sotu.df , sotu.bp, by = 'sotu')
tg.sotu.df <- tg.sotu.df[ order( tg.sotu.df$bp_freq, decreasing = TRUE), ]
# Sort by tg.negative, then tg.positive
tg.sotu.df <- tg.sotu.df[ order(tg.sotu.df$tg.negative.run_freq, decreasing = T) ,]
tg.sotu.df <- tg.sotu.df[ order(tg.sotu.df$tg.positive.run_freq, decreasing = T) ,]
# Calculate Fisher's Exact Test (score)
# for each sOTU (Vx), calculate a 2x2 contingency table if
# Vx is enriched in Tg.positive datasets relative to Tg.negative datasets
# compared against the global occurances of all other sOTU (Vy, where
# Vy = V_total - Vx) in Tg.p and Tg.n datasets.
Tg.p_Vt <- sum(tg.sotu.df$tg.positive.run_freq)
Tg.n_Vt <- sum(tg.sotu.df$tg.negative.run_freq)
# For example, "u173627" which is 8 Tg- and 14 Tg+
V_name <- 'u173627'
Tg.p_Vx <- tg.sotu.df$tg.positive.run_freq[ tg.sotu.df$sotu == V_name ]
Tg.n_Vx <- tg.sotu.df$tg.negative.run_freq[ tg.sotu.df$sotu == V_name ]
test.table <- data.frame(
"tg.positive" = c( Tg.p_Vx, Tg.p_Vt - Tg.p_Vx ),
"tg.negative" = c( Tg.n_Vx, Tg.n_Vt - Tg.n_Vx ),
row.names = c("Test_sOTU", "Other_sOTUs"),
stringsAsFactors = FALSE
)
print( paste0( "For example sOTU: ", V_name ))
print('')
print( as.matrix( test.table) )
# Fisher's Exact Test
fisher.result <- fisher.test(test.table, alternative = "greater")
print(fisher.result)
# Apply Multiple Testing Correction
mt.correction <- length(tg.sotu.df$sotu)
print( paste0( "with Bonferonni testing correction factor: ", mt.correction ) )
print( paste0( " corrected p-value : ", min(1, fisher.result$p.value * mt.correction ) ) )
print( paste0( " corrected p-score : ", -log( min(1, fisher.result$p.value * mt.correction ) ) ) )
Output:
[1] "For example sOTU: u173627"
[1] ""
tg.positive tg.negative
Test_sOTU 27 36
Other_sOTUs 1897 20168
Fisher's Exact Test for Count Data
data: test.table
p-value = 4.25e-13
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
5.047827 Inf
sample estimates:
odds ratio
7.972087
[1] "with Bonferonni testing correction factor: 8051"
[1] " corrected p-value : 3.42134144248517e-09"
[1] " corrected p-score : 19.493233128131"
#{r}
# Wrap the above as a function to return corrected Fisher Score
# Fisher Score = -log( corrected p-value )
fisher.score <- function( tg.sotu.row, Tg.p_Vt, Tg.n_Vt, mt.correction ){
# col1 <- postive_runs
# col2 <- negative_runs
Tg.p_Vx <- tg.sotu.row[1]
Tg.n_Vx <- tg.sotu.row[2]
test.table <- data.frame(
"tg.positive" = c( Tg.p_Vx, Tg.p_Vt - Tg.p_Vx ),
"tg.negative" = c( Tg.n_Vx, Tg.n_Vt - Tg.n_Vx ),
row.names = c("Test_sOTU", "Other_sOTUs"),
stringsAsFactors = FALSE
)
# Fisher's Exact Test
fisher.result <- fisher.test(test.table, alternative = "greater")
fisher.pscore <- -log( min(1, fisher.result$p.value * mt.correction ) )
return(fisher.pscore)
}
# Calculate Fisher Score for each sOTU
tg.sotu.df$fisher.score <- apply( tg.sotu.df[, c('tg.positive.run_freq', 'tg.negative.run_freq') ], 1,
fisher.score, Tg.p_Vt, Tg.n_Vt, mt.correction)
# Rank order the output based on fisher.score
tg.sotu.df <- tg.sotu.df[ order(tg.sotu.df$fisher.score, decreasing = TRUE), ]
tg.sotu.df$rank <- 1:length(tg.sotu.df$sotu)
# Calculate Tg+ : Tg- runs
tg.sotu.df$tg.ratio <- tg.sotu.df$tg.positive.run_freq / tg.sotu.df$tg.negative.run_freq
tg.sotu.df$tg.ratio[ tg.sotu.df$tg.ratio == Inf ] <- 25
# Top 10 hits
print(tg.sotu.df[1:10, c("rank", "sotu", "tg.positive.run_freq", "tg.negative.run_freq", "fisher.score", "tg.ratio")])
Looking at the Top 10 hits, focus on the sOTU where it is present in 11 Tg+ libraries and zero Tg- libraries.
Output:
rank sotu tg.positive.run_freq tg.negative.run_freq fisher.score tg.ratio
1352 1 u173627 27 36 19.493233 0.750000
5673 2 u72795 11 0 17.899413 25.000000 *** INSEPCTION ***
6909 3 u874078 14 8 13.240015 1.750000
1681 4 u20 10 1 13.136257 10.000000
2557 5 u2879 9 0 13.005497 25.000000
1049 6 u14499 8 0 10.559254 25.000000
1260 7 u16867 11 6 8.970801 1.833333
1897 8 u22027 10 5 7.853949 2.000000
2592 9 u291054 12 12 6.531971 1.000000
4342 10 u54378 6 0 5.668199 25.000000
#{r}
# Scatterplot of Tg+/- runs and their BioProject frequency and Fisher Score
run.scatter <- ggplot( tg.sotu.df, aes( tg.negative.run_freq, tg.positive.run_freq,
color = fisher.score, size = bp_freq) ) +
geom_point(alpha = 0.4) +
geom_abline( slope = 1, intercept = 0, color = 'red') +
theme_bw() + theme(aspect.ratio = 1) +
scale_color_gradient(
low = "gray30",
high = "red",
limits = c(0,3)) +
scale_x_log10(limits = c(0.8, 1500)) + scale_y_log10(limits = c(0.8, 1500))
run.scatter
#{r}
# Fisher Score vs. the ratio of Tg+ : Tg-
#
run.rank <- ggplot( tg.sotu.df, aes( fisher.score, tg.ratio,
color = rank,
size = bp_freq) ) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = 3, color = 'red') +
theme_bw() + theme(aspect.ratio = 1) +
scale_color_gradient2(
low = "firebrick",
mid = "red",
high = "gray80",
limits = c(1,1000))
run.rank
sOTU Inpsection
Inspection of u72795
We have now essentially created a rank-ordered list of virus-hits to inspect further. Let's consider u72795
with 11 Tg+ and 0 Tg- observations in our dataset across 3 BioProjects.
# Inspect whhere contigs matched this sOTU
print(palmsra[ palmsra$sotu == 'u72795',
c('run_id', 'organism', 'bioproject', 'sotu', 'coverage', 'evalue', 'q_sequence' ,'tg.status') ])
run_id organism bioproject sotu coverage evalue
32550 SRR10593837 Mus musculus PRJNA593831 u72795 2 1.8e-69
32551 SRR10593838 Mus musculus PRJNA593831 u72795 2 1.2e-69
32554 SRR10593841 Mus musculus PRJNA593831 u72795 1 5e-70
39169 SRR11792738 Mus musculus PRJNA632742 u72795 2 1.2e-69
39170 SRR11792740 Mus musculus PRJNA632742 u72795 2 3.2e-69
39171 SRR11792741 Mus musculus PRJNA632742 u72795 2 1.8e-69
120458 SRR9639534 Mus musculus PRJNA552423 u72795 2 3.2e-69
120459 SRR9639536 Mus musculus PRJNA552423 u72795 2 1.8e-69
120460 SRR9639537 Mus musculus PRJNA552423 u72795 2 3.2e-69
120461 SRR9639538 Mus musculus PRJNA552423 u72795 2 1.2e-69
120462 SRR9639539 Mus musculus PRJNA552423 u72795 2 1.8e-69
q_sequence tg.status
32550 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
32551 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
32554 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
39169 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
39170 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
39171 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120458 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120459 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120460 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120461 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
120462 FACDVVCFDSTITPQDVDFERQLYRAAATEDETKLAIDTLHRELYAGGPMVSPDGEDLGVRRCRASGVFTTSTSNTLTAWMKVHAACDMAGITSPTLLVNGDDVVC tg.positive
BLASTP of this sequence identifies it as [Crab-eating macaque hepacivirus] CAI5757786.1
. In addition to the above libraries, this virus sOTU is found in several Macaca fascicularis
datasets as identified via palmID
. As you can see, this list is able to prioritize hits to analyze, but is not definitive on it's own and requires additional analysis.
Happy hunting!
See also
- Tutorial A: Serratus.io Lookup
- Tutorial B: RdRP palmprint sequence-search
- Tutorial C: RdRP microassembly sequence-search
- Tutorial D: Running DIAMOND nr search