barapost prober - masikol/barapost GitHub Wiki

Description

barapost-prober.py -- This script is designed for taxonomic classification of nucleotide sequences by finding the most similar sequence in NCBI Nucleotide database using BLAST web service.

Processing entire data set with "barapost-prober.py" is not expedient, since servers are in public use and handling a request can linger for a long time.

Therefore the main goal of this script is to submit a relatively small probing batch (see -b option) of sequences to NCBI BLAST service and discover, what Genbank records can be downloaded and used as reference sequences in a database stored on local machine. Further classification will be performed by "barapost-local.py" on local machine.

This script processes FASTQ and FASTA (as well as .fastq.gz and .fasta.gz) files.

If you have your own FASTA files that can be used as reference sequences database to align against, you can omit "barapost-prober.py" step and go to "barapost-local.py" (see -l option in "barapost-local.py" description.

Input files

Input FASTA and FASTQ files should be specified as positional arguments (see examples below).
Input files must have different names. I.e. files .../dir1/reads.fastq and .../dir2/reads.fastq are not allowed.

Default parameters

if no input files are specified, "barapost-prober.py" processes all FASTQ and FASTA files in working directory;
algorithm (see -a option): 0 (zero, i.e. megaBlast);
packet forming mode (see -c option): 0 (zero);
packet size: (see -p option): 100 sequences for packet forming mode 0, 20,000 base pairs for packet forming mode 1;
probing batch size (see -b option): 200 sequences;
database slices (see -g option): whole nr/nt database, i.e. no slices;
output directory (-o option): directory named barapost_result nested in working directory;
prober sends sequences intact (i.e. does not prune them before submission (see -x option));

Options

    -h (--help) --- show help message.
       '-h' -- brief, '--help' -- full;

    -v (--version) --- show version;

    -d (--indir) --- directory which contains FASTQ or FASTA
       files meant to be processed.
       I.e. all FASTQ and FASTA files in this direcory will be processed;

    -o (--outdir) --- output directory.
       Default value: `barapost_result`;

    -c (--packet-mode) --- indicates how prober calculates size of packet.
       See section "Example of how mode 1 works" below for details.
       Values:
       0 -- packet size equals number of sequences in it [default mode];
       1 -- packet size equals sum of lengths of sequences in it;
       Mode 1 performs better if input sequences are binned by length within input file;

    -p (--packet-size) --- size of the packet (see option '-c' above for definition).
       It implies that "barapost-prober.py" can perform
       multiple requests during single run.
       Value: positive integer number.
       Default value:
       100 (sequences) for mode 0;
       20,000 (base pairs) for mode 1;

    -a (--algorithm) --- BLASTn algorithm to use for alignment.
       Available values: 0 for megaBlast, 1 for discoMegablast, 2 for blastn.
       Default is 0 (megaBlast);

 *  -g (--organisms) --- TaxIDs of organisms to align your sequences against.
       I.e. with this options you can specify 'nr/nt' database slices.
       Format of value (see Examples #3 and #4 below.):
         <organism1_taxid>,<organism2_taxid>...
       Spaces are NOT allowed.
       Default value is full 'nr/nt' database, i.e. no slices.
       You can find your Taxonomy IDs here:
       'https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi';

    -b (--probing-batch-size) --- total number of sequences that
       will be submitted to BLAST server during 'barapost-prober.py' run.
       You can specify '-b all' to process all your sequeces by 'barapost-prober.py'.
       Value: positive integer number.
       Default value is 200;

    -x (--max-seq-len) --- maximum length of a sequence that
       prober submits to NCBI BLAST service.
       If specified, prober will prune your sequences before
       submission in order to spare NCBI servers.
       This feature is disabled by default;

* - Functionality of -g option is totally equal to "Organism" text boxes on this BLASTn page: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome.

Explanation of output files

"hits_to_download.tsv" file in root of output directory. In this file you can find: accessions and names of Genbank records to be further downloaded by "barapost-local.py" in order to create a database on your local machine.
"classification.tsv" file in directory corresponding to a particular input file (for example, for input file something.fastq result directory name will be something). In this file you can find:
- IDs of classified sequences;
- classification (usually lineage);
- accession(s) of best hit(s) divided with &&;
- some alignment statistics;
- quality information (for FASTQ files).

Example of how mode `1` works.

If packet forming mode 1 is set, prober will add sequences to each packet until sum of ther length reaches (or exceeds) specified size of packet.

prober takes preventive pruning (see -x option) into account when calculating packet size in mode 1: lengths of pruned sequences are considered.

Example:

Input fasta file:

>sequence_1_length_6_bp
ACTGAA
>sequence_2_length_5_bp
GTGTA
>sequence_3_length_4_bp
TAGC
>sequence_4_length_4_bp
GGAC

Packet size is 5 (-p 5), mode is 1 (-c 1), -x is not enabled.
Packets for submission:
1. Packet #1. Length of sequence_1 is greater than packet size -- packet contains this sequence only:
```
>sequence_1_length_101_bp
ACTGAA
```
1. Packet #2. Length of sequence_2 equals packet size -- packet contains this sequence only:
```
>sequence_2_length_100_bp
GTGTA
```
1. Packet #3. Length of sequence_3 is less than packet size, so prober will add sequence_4 to this packet, and then sum of their lengths will exceed packet size, making it ready for submission:
```
>sequence_3_length_99_bp
TAGC
>sequence_4_length_99_bp
GGAC
```

Examples

Note for Windows users: run py -3 barapost-prober.py in Windows console. barapost-prober.py won't work.

Process all FASTA and FASTQ files in working directory with default settings: barapost-prober.py

barapost-prober.py

Process one file with default settings:

barapost-prober.py reads.fastq

Process a FASTQ file and a FASTA file with discoMegablast, packet size of 100 sequences. Search only among Erwinia sequences (551 is Erwinia taxID):

barapost-prober.py reads_1.fastq.gz some_sequences.fasta -a discoMegablast -p 100 -g 551

Process all FASTQ and FASTA files in directory named some_dir. Process 300 sequences, packet size is 100 sequnces (3 packets will be sent). Search only among Escherichia (taxID 561) and viral (taxID 10239) sequences:

barapost-prober.py -d some_dir -g 561,10239 -o outdir -b 300 -p 100