barapost prober - masikol/barapost GitHub Wiki
Description
barapost-prober.py -- This script is designed for taxonomic classification of nucleotide sequences by finding the most similar sequence in NCBI Nucleotide database using BLAST web service.
Processing entire data set with "barapost-prober.py" is not expedient, since servers are in public use and handling a request can linger for a long time.
Therefore the main goal of this script is to submit a relatively small probing batch (see -b option) of sequences to NCBI BLAST service and discover, what Genbank records can be downloaded and used as reference sequences in a database stored on local machine. Further classification will be performed by "barapost-local.py" on local machine.
This script processes FASTQ and FASTA (as well as .fastq.gz and .fasta.gz) files.
If you have your own FASTA files that can be used as reference sequences database to align against, you can omit "barapost-prober.py" step and go to "barapost-local.py" (see -l option in "barapost-local.py" description.
Input files
- Input FASTA and FASTQ files should be specified as positional arguments (see examples below).
- Input files must have different names. I.e. files
.../dir1/reads.fastqand.../dir2/reads.fastqare not allowed.
Default parameters
- if no input files are specified, "barapost-prober.py" processes all FASTQ and FASTA files in working directory;
- algorithm (see
-aoption):0(zero, i.e. megaBlast); - packet forming mode (see
-coption):0(zero); - packet size: (see
-poption): 100 sequences for packet forming mode0, 20,000 base pairs for packet forming mode1; - probing batch size (see
-boption): 200 sequences; - database slices (see
-goption): wholenr/ntdatabase, i.e. no slices; - output directory (
-ooption): directory namedbarapost_resultnested in working directory; - prober sends sequences intact
(i.e. does not prune them before submission (see
-xoption));
Options
-h (--help) --- show help message.
'-h' -- brief, '--help' -- full;
-v (--version) --- show version;
-d (--indir) --- directory which contains FASTQ or FASTA
files meant to be processed.
I.e. all FASTQ and FASTA files in this direcory will be processed;
-o (--outdir) --- output directory.
Default value: `barapost_result`;
-c (--packet-mode) --- indicates how prober calculates size of packet.
See section "Example of how mode 1 works" below for details.
Values:
0 -- packet size equals number of sequences in it [default mode];
1 -- packet size equals sum of lengths of sequences in it;
Mode 1 performs better if input sequences are binned by length within input file;
-p (--packet-size) --- size of the packet (see option '-c' above for definition).
It implies that "barapost-prober.py" can perform
multiple requests during single run.
Value: positive integer number.
Default value:
100 (sequences) for mode 0;
20,000 (base pairs) for mode 1;
-a (--algorithm) --- BLASTn algorithm to use for alignment.
Available values: 0 for megaBlast, 1 for discoMegablast, 2 for blastn.
Default is 0 (megaBlast);
* -g (--organisms) --- TaxIDs of organisms to align your sequences against.
I.e. with this options you can specify 'nr/nt' database slices.
Format of value (see Examples #3 and #4 below.):
<organism1_taxid>,<organism2_taxid>...
Spaces are NOT allowed.
Default value is full 'nr/nt' database, i.e. no slices.
You can find your Taxonomy IDs here:
'https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi';
-b (--probing-batch-size) --- total number of sequences that
will be submitted to BLAST server during 'barapost-prober.py' run.
You can specify '-b all' to process all your sequeces by 'barapost-prober.py'.
Value: positive integer number.
Default value is 200;
-x (--max-seq-len) --- maximum length of a sequence that
prober submits to NCBI BLAST service.
If specified, prober will prune your sequences before
submission in order to spare NCBI servers.
This feature is disabled by default;
* - Functionality of -g option is totally equal to "Organism" text boxes on this BLASTn page:
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome.
Explanation of output files
-
"hits_to_download.tsv"file in root of output directory. In this file you can find: accessions and names of Genbank records to be further downloaded by "barapost-local.py" in order to create a database on your local machine. -
"classification.tsv"file in directory corresponding to a particular input file (for example, for input filesomething.fastqresult directory name will besomething). In this file you can find:- IDs of classified sequences;
- classification (usually lineage);
- accession(s) of best hit(s) divided with
&&; - some alignment statistics;
- quality information (for FASTQ files).
Example of how mode 1 works.
If packet forming mode 1 is set, prober will add sequences to each packet until sum of ther length reaches (or exceeds) specified size of packet.
prober takes preventive pruning (see -x option) into account when calculating packet size in mode 1: lengths of pruned sequences are considered.
Example:
- Input fasta file:
>sequence_1_length_6_bp
ACTGAA
>sequence_2_length_5_bp
GTGTA
>sequence_3_length_4_bp
TAGC
>sequence_4_length_4_bp
GGAC
-
Packet size is
5(-p 5), mode is1(-c 1),-xis not enabled. -
Packets for submission:
- Packet #1. Length of
sequence_1is greater than packet size -- packet contains this sequence only:
>sequence_1_length_101_bp ACTGAA- Packet #2. Length of
sequence_2equals packet size -- packet contains this sequence only:
>sequence_2_length_100_bp GTGTA- Packet #3. Length of
sequence_3is less than packet size, so prober will addsequence_4to this packet, and then sum of their lengths will exceed packet size, making it ready for submission:
>sequence_3_length_99_bp TAGC >sequence_4_length_99_bp GGAC - Packet #1. Length of
Examples
Note for Windows users: run py -3 barapost-prober.py in Windows console. barapost-prober.py won't work.
- Process all FASTA and FASTQ files in working directory with default settings: barapost-prober.py
barapost-prober.py
- Process one file with default settings:
barapost-prober.py reads.fastq
- Process a FASTQ file and a FASTA file with discoMegablast, packet size of 100 sequences. Search only among Erwinia sequences (551 is Erwinia taxID):
barapost-prober.py reads_1.fastq.gz some_sequences.fasta -a discoMegablast -p 100 -g 551
- Process all FASTQ and FASTA files in directory named
some_dir. Process 300 sequences, packet size is 100 sequnces (3 packets will be sent). Search only among Escherichia (taxID 561) and viral (taxID 10239) sequences:
barapost-prober.py -d some_dir -g 561,10239 -o outdir -b 300 -p 100