Motifator, a new tool for classifying RdRPs and close homologs - ababaian/serratus GitHub Wiki
Binary: s3://serratus-public/rce/motifator/bin/motifator1.1.1114
motifator -search_rdrp input.fasta [options]
The query may be amino acid or nucleotide sequence; the type is detected automatically. Output files are:
-report report.txt
-fevout output.fev
-trim_fastaout trim.faa
-trim_fastaout_nt trim.fna
-bedout hits.bed
-motifs_fastaout motifs.fa
The trim_fastaout_nt file reports the nt palm sequence for translated searches; it is not supported if the query is aa.
All output files are optional. Fev is "field equals value" format, which is tab-separated text with fields such as qlen=10. Trimmed output is the segment from the beginning of motif A to the end of motif C (or C..A if the domain is permuted). Motifs output is the three motifs in canonical A, B, C order separated by xxx. By default, FASTA output is written if the query is predicted to be RdRP.
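As a minimal sketch of how these outputs might be consumed downstream, the helpers below pull one field out of FEV records and split a motifs record into its three parts. The function names `fev_get` and `split_motifs` are hypothetical, the sample FEV fields other than `qlen` are invented for illustration, and the separator is assumed to be the literal lowercase string `xxx`.

```shell
# Print the value of one named field from FEV records read on stdin.
# FEV is tab-separated text where each column is written as field=value.
fev_get() {
  awk -F'\t' -v key="$1" '{
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")
      if (eq > 0 && substr($i, 1, eq - 1) == key) print substr($i, eq + 1)
    }
  }'
}

# Split a motifs record into its A, B and C parts, one per line.
# Assumption: the separator is the literal lowercase string "xxx".
split_motifs() {
  printf '%s\n' "$1" | awk -F'xxx' '{ print "A=" $1; print "B=" $2; print "C=" $3 }'
}

# Hypothetical FEV records: only qlen is a field name taken from the text.
printf 'query=seqA\tqlen=10\tscore=52.3\nquery=seqB\tqlen=385\n' | fev_get qlen

# Motif sequences taken from the report example on this page.
split_motifs 'VAGDFKNFDKRVxxxSGCFFTSIVNNIVNxxxVLGDDHIY'
```

Against real output, the same helper would be used as `fev_get qlen < output.fev`.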
If -hionly is specified, only high-confidence predictions are written to the trim_fastaout, trim_fastaout_nt and motifs_fastaout files.
By default, 10 threads are started, or one thread per CPU core, whichever is smaller. The -threads option can be used to specify the number of threads, e.g. -threads 8.
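The default described above amounts to taking the smaller of 10 and the core count. The sketch below reproduces that rule so you can see what a given machine would get. The helper name `default_threads` is hypothetical, and the core count is read with `nproc` or `getconf`, which may not match how motifator itself detects cores.

```shell
# Return the documented default: 10 threads or one per core, whichever is smaller.
default_threads() {
  if [ "$1" -lt 10 ]; then echo "$1"; else echo 10; fi
}

# Core count: nproc is GNU coreutils; getconf is a fallback for other systems.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "cores=$cores default_threads=$(default_threads "$cores")"

# To override the default, pass -threads explicitly, for example:
#   ./motifator -search_rdrp input.fasta -threads 8 -report report.txt
```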
This is typical report output for a valid RdRP, which illustrates what motifator is designed to do.
>A0A1L3KJH1_9VIRU/1426-1810
Length 385aa
ABC 173-282(110)
A:173-184(17.3)  B:237-250(22.6)  C:275-282(12.3)
VAGDFKNFDKRV     SGCFFTSIVNNIVN   VLGDDHIY
+||||+|||+++     |||++|||.|.|||   |.|||.|+
iagDySkFDssl     SGsplTsidNSivN   vyGDDnii
Score 52.3, high-confidence-RdRP: good-ABC-order.good-motif-spacing.high-PSSM-score.
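The motif coordinates in this record can be checked against the trimming rule (start of motif A to end of motif C). The sketch below parses the three coordinate tokens and recomputes the span. The helper name `motif_range` is hypothetical, the token layout is assumed to be exactly as shown in the example, and the permuted (C..A) case is not handled.

```shell
# Motif coordinate tokens copied from the sample report record.
coords='A:173-184(17.3) B:237-250(22.6) C:275-282(12.3)'

# Print "start end" for one motif letter (A, B or C).
motif_range() {
  printf '%s\n' "$2" | tr ' ' '\n' |
    awk -F'[:()-]' -v m="$1" '$1 == m { print $2, $3 }'
}

# Trimmed segment for canonical A, B, C order: start of A to end of C.
a_start=$(motif_range A "$coords" | cut -d' ' -f1)
c_end=$(motif_range C "$coords" | cut -d' ' -f2)
echo "trim $a_start-$c_end length $(( c_end - a_start + 1 ))"
```

For the example record this gives the span 173-282 with length 110, which agrees with the `ABC 173-282(110)` entry in the report.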
Motifator looks for the characteristic functional motifs called A, B and C in the catalytic "palm" of the RdRP domain. Position Specific Scoring Matrices (PSSMs) are used to search for the motifs. Additional evidence comes from the distances between the motifs. There are PSSMs for RdRP and for RT (reverse transcriptase). RT is an RdRP homolog which also has A, B and C motifs.
Motifator reports a score which is the sum of the PSSM log-odds scores, minus a penalty if the A-B-C spacings are outside the typical range. The query is assigned to a category such as high-confidence-RdRP based on this score and other heuristics.
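The arithmetic can be sketched as follows, using the per-motif values from the report example (17.3, 22.6, 12.3). The helper name `total_score` is hypothetical, and so is the penalty value in the second call, since the text does not say how the penalty is computed. This illustrates the stated formula only, not motifator's actual implementation.

```shell
# Sum three per-motif PSSM scores and subtract a spacing penalty.
# Arguments: score_A score_B score_C penalty
total_score() {
  awk -v a="$1" -v b="$2" -v c="$3" -v pen="$4" \
    'BEGIN { printf "%.1f\n", a + b + c - pen }'
}

# Per-motif scores from the report example, with no spacing penalty.
total_score 17.3 22.6 12.3 0

# Same motif scores with a hypothetical penalty of 5 for atypical spacing.
total_score 17.3 22.6 12.3 5
```

The displayed per-motif values sum to 52.2, while the report shows 52.3. The 0.1 difference is presumably due to rounding of the displayed per-motif scores, but that is an assumption.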
Results updated for v1.1.1114
Name                  N        Nhi     Nlo   Desc
uniprot               838      785     8     UniProt RdRP
PF00680_RdRP_1        795      757     6     PFAM RdRP_1
PF00978_RdRP_2        397      379     1     PFAM RdRP_2
PF00998_RdRP_3        205      194     2     PFAM RdRP_3
PF02123_RdRP_4        216      191     1     PFAM RdRP_4
PF04197_Birn_RdRP     11       1       0     RdRP Birna_RdRP
PF05919_MitoVir_RdRP  181      55      2     PF05919_MitoVir_RdRP
PF17501_Viral_RdRP_C  4        0       0     PFAM Viral_RdRP_C
PF00972_Flavi_NS5     14       8       0     PFAM Flavi RdRP
quenya.protref        50       6       17    Quenya proteins
permuted              117      75      28    Curated permuted RdRP
rdrp1                 14680    12455   198   Serratus RdRP query
complete              826      817     0     Complete Cov nt genomes
decoy                 296536   4       17    Curated non-RdRP
PF00078_RT1           46876    0       0     PFAM RVT_1 (RT)
PF07727_RT2           12037    0       0     PFAM RVT_2 (RT)
gb241_orf             360114   218687  1358  GB241 viral ORFs
vgb241                3261824  228655  1592  GB241 viral nt
N = number of sequences; Nhi, Nlo = number classified as high- and low-confidence RdRP, respectively, by motifator.
# Download binary (x86)
wget https://serratus-public.s3.amazonaws.com/rce/motifator/bin/motifator1.1.1109
mv motifator* motifator; chmod 755 motifator
# Run Motifator with outputs
INPUT='ERR2756788.cs.fa'
OUTNAME='frank'
./motifator -search_rdrp "$INPUT" -hionly \
  -report "$OUTNAME.txt" \
  -fevout "$OUTNAME.fev" \
  -trim_fastaout "$OUTNAME.trim.fa" \
  -motifs_fastaout "$OUTNAME.motifs.fa"