Usages - labgem/CAESAR GitHub Wiki
The CAESAR workflow consists of five consecutive steps:
- Search step (blastp or hmmsearch)
- Filter step
- Clustering step
- Candidate selection step
- Phylogenetic step
CAESAR has five subcommands, each corresponding to an entry point:
- blastp
- hmmsearch
- filter
- clustering
- selection
NB: It is not possible to run only the phylogenetic step.
Complete Workflow
To perform a complete workflow, just start at the search step with a blastp or a hmmsearch.
Start with a Blastp
Quick Usage
python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml
Options
usage: set_caesar.py blastp [-h] [-o] [-t] [-m] -c -q [--id] [--cov] [--min-len] [--max-len] [--tax] [--cluster-id] [--cluster-cov] [-n | --cov-per-cluster ] [--gc] [-p] [-r] [-u]
options:
-h, --help show this help message and exit
-o , --outdir output directory [default: ./]
-t , --threads number of cpu threads [default: 6]
-m , --mem memory limit for the clustering step [default: 4G]
Mandatory inputs:
-c , --config the yaml config file
-q , --query set of reference sequences
Blastp options:
--id retains only sequences above the specified percentage of sequence identity [default: 30.0]
--cov retains only sequences above the specified percentage of query cover [default: 80.0]
--min-len retains only sequences above the specified sequence length [default: 200]
--max-len retains only sequences below the specified sequence length [default: 200]
--tax Superkingdom filter, A: Archaea, B: Bacteria and E: Eukaryota [default: 'ABE']
Clustering options:
--cluster-id identity cutoff for the clustering [default: 80.0]
--cluster-cov minimum coverage of cluster member sequence
Candidates selection options:
-n , --nb-cand maximum number of candidate per cluster [default: 1]
--cov-per-cluster uses a percentage of each cluster as maximum number of candidates rather than a given number
--gc target GC percentage to decide between candidates [default: 50.0]
Phylogeny options:
-p , --phylo 1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
-r , --reduce 1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]
Exclude some protein ids:
Can be used to re-run pipeline and try to get other candidates
-u , --update File containing the list of proteins ids not to be selected as candidates
Example of modifying blastp options
python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml --id 55 --cov 90 --min-len 150 --max-len 400 --tax AB
Start with a Hmmsearch
Quick Usage
python ./CAESAR/set_caesar.py hmmsearch -q reference_profile.hmm -c config.yml
Options
usage: set_caesar.py hmmsearch [-h] [-o] [-t] [-m] -c -q [--score] [--cov] [--min-len] [--max-len] [--tax] [--cluster-id] [--cluster-cov] [-n | --cov-per-cluster ] [--gc] [-p] [-r] [-u]
options:
-h, --help show this help message and exit
-o , --outdir output directory [default: ./]
-t , --threads number of cpu threads [default: 6]
-m , --mem memory limit for the clustering step [default: 4G]
Mandatory inputs:
-c , --config the yaml config file
-q , --query hmm file
Hmmsearch options:
--score retains only sequences above the specified full sequence score [default: 0.0]
--cov retains only sequences above the specified percentage of hmm profile cover [default: 80.0]
--min-len retains only sequences above the specified sequence length [default: 200]
--max-len retains only sequences below the specified sequence length [default: 200]
--tax Superkingdom filter, A: Archaea, B: Bacteria and E: Eukaryota [default: 'ABE']
Clustering options:
--cluster-id identity cutoff for the clustering [default: 80.0]
--cluster-cov minimum coverage of cluster member sequence
Candidates selection options:
-n , --nb-cand maximum number of candidate per cluster [default: 1]
--cov-per-cluster uses a percentage of each cluster as maximum number of candidates rather than a given number
--gc target GC percentage to decide between candidates [default: 50.0]
Phylogeny options:
-p , --phylo 1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
-r , --reduce 1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]
Exclude some protein ids:
Can be used to re-run pipeline and try to get other candidates
-u , --update File containing the list of proteins ids not to be selected as candidates
Example of modifying hmmsearch options
python ./CAESAR/set_caesar.py hmmsearch -q reference_profile.hmm -c config.yml --score 125 --cov 90 --min-len 150 --max-len 400 --tax AB
Start at the Filter step
The input directory must contain either blastp output or .domtbl files, not both: mixing the two generates an error.
The blastp output must follow the tabular format produced by the diamond software with this option:
--outfmt 6 qseqid qlen sseqid slen length pident qcovhsp positive mismatch gaps evalue
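To illustrate this column layout, the following Python sketch reads such a tab-separated file and keeps the hits that reach identity and query cover thresholds. The function name `read_diamond_hits`, the `>=` comparison and the two example hits are assumptions made for illustration; this is not CAESAR's own filtering code. The thresholds simply mirror the --id and --cov defaults.

```python
import csv
import io

# Column order passed to diamond through --outfmt 6 (taken from the text above).
COLUMNS = ["qseqid", "qlen", "sseqid", "slen", "length", "pident",
           "qcovhsp", "positive", "mismatch", "gaps", "evalue"]


def read_diamond_hits(handle, min_id=30.0, min_cov=80.0):
    """Yield hits (as dicts) whose identity and query cover reach the thresholds.

    Assumption: a hit is kept when pident >= min_id and qcovhsp >= min_cov;
    CAESAR's own comparison may differ.
    """
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["pident"]) >= min_id and float(row["qcovhsp"]) >= min_cov:
            yield row


# Two made-up hits, for demonstration only.
example = io.StringIO(
    "ref1\t300\tprotA\t310\t295\t62.5\t98.0\t210\t100\t5\t1e-80\n"
    "ref1\t300\tprotB\t120\t90\t25.0\t30.0\t40\t50\t0\t1e-3\n"
)
kept = [hit["sseqid"] for hit in read_diamond_hits(example)]
print(kept)
```

Reading the file with named columns makes it easy to spot a diamond run that used a different --outfmt field list, since the numeric conversions would then fail early.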
The filter options are the same as for blastp and hmmsearch, as they are used at this step. The --id and --score options are mutually exclusive.
usage: set_caesar.py filter [-h] [-o] [-t] -c -q -d [--id | --score ] [--cov] [--min-len] [--max-len] [--tax] [--cluster-id] [--cluster-cov] [-n | --cov-per-cluster ] [--gc] [-p] [-r] [-u]
options:
-h, --help show this help message and exit
-o , --outdir output directory [default: ./]
-t , --threads number of cpu threads [default: 6]
-m , --mem memory limit for the clustering step [default: 4G]
Mandatory inputs:
-c , --config the yaml config file
-q , --query set of reference sequences or hmm file
-d , --data directory containing tsv file from diamond blastp or .domtbl file from hmmsearch
Filter options:
--id retains only sequences above the specified percentage of sequence identity [default: 30.0]
--score retains only sequences above the specified full sequence score [default: 0.0]
--cov retains only sequences above the specified percentage of query cover or hmm profile cover [default: 80.0]
--min-len retains only sequences above the specified sequence length [default: 200]
--max-len retains only sequences below the specified sequence length [default: 200]
--tax Superkingdom filter, A: Archaea, B: Bacteria and E: Eukaryota [default: 'ABE']
Clustering options:
--cluster-id identity cutoff for the clustering [default: 80.0]
--cluster-cov minimum coverage of cluster member sequence
Candidates selection options:
-n , --nb-cand maximum number of candidate per cluster [default: 1]
--cov-per-cluster uses a percentage of each cluster as maximum number of candidates rather than a given number
--gc target GC percentage to decide between candidates [default: 50.0]
Phylogeny options:
-p , --phylo 1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
-r , --reduce 1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]
Exclude some protein ids:
Can be used to re-run pipeline and try to get other candidates
-u , --update File containing the list of proteins ids not to be selected as candidates
Example of Usage:
python ./CAESAR/set_caesar.py filter -q references_sequences.fasta -c config.yml -d data_directory
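Before launching the filter step, it can help to check that the data directory holds only one kind of search result. The sketch below is a hypothetical pre-flight check, not part of CAESAR: the function name `detect_search_type` is invented, and it assumes that blastp results carry the .tsv extension and hmmsearch results the .domtbl extension, as the description of the -d option suggests.

```python
import tempfile
from pathlib import Path


def detect_search_type(data_dir):
    """Return 'blastp' or 'hmmsearch' depending on the files found in data_dir.

    Assumption: blastp results end in .tsv and hmmsearch results in .domtbl.
    """
    directory = Path(data_dir)
    has_tsv = any(directory.glob("*.tsv"))
    has_domtbl = any(directory.glob("*.domtbl"))
    if has_tsv and has_domtbl:
        raise ValueError(f"{data_dir} mixes .tsv and .domtbl files")
    if not (has_tsv or has_domtbl):
        raise ValueError(f"{data_dir} contains no .tsv or .domtbl file")
    return "blastp" if has_tsv else "hmmsearch"


# Demonstration on a temporary directory holding a single (empty) .domtbl file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hits.domtbl").touch()
    detected = detect_search_type(tmp)
print(detected)
```

The detected type also indicates which of the mutually exclusive --id and --score options is relevant for the run.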
Start at the Clustering step
To start at this step, the user must specify 2 files (these are usually written by the previous step):
- a fasta file with the -f option
- a text file with the --sources option
A third, optional file can be given:
- a tsv file with the -d option
Check the File format page to find out more about these files.
Quick usage
python ./CAESAR/set_caesar.py clustering -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file
Options
usage: set_caesar.py clustering [-h] [-o] [-t] -c -q -f --sources [-d] [--cluster-id] [--cluster-cov] [-n | --cov-per-cluster ] [--gc] [-p] [-r] [-u]
options:
-h, --help show this help message and exit
-o , --outdir output directory [default: ./]
-t , --threads number of cpu threads [default: 6]
-m , --mem memory limit for the clustering step [default: 4G]
Mandatory inputs:
-c , --config the yaml config file
-q , --query set of reference sequences or hmm file
-f , --fasta-cand multi fasta file
--sources file indicating the sources database of each sequences
Optional data:
-d , --data file formatted as filetered_data.tsv returned by the filter step
Clustering options:
--cluster-id identity cutoff for the clustering [default: 80.0]
--cluster-cov minimum coverage of cluster member sequence
Candidates selection options:
-n , --nb-cand maximum number of candidate per cluster [default: 1]
--cov-per-cluster uses a percentage of each cluster as maximum number of candidates rather than a given number
--gc target GC percentage to decide between candidates [default: 50.0]
Phylogeny options:
-p , --phylo 1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
-r , --reduce 1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]
Exclude some protein ids:
Can be used to re-run pipeline and try to get other candidates
-u , --update File containing the list of proteins ids not to be selected as candidates
Example of modifying clustering options
python ./CAESAR/set_caesar.py clustering -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file --cluster-id 70 --cluster-cov 70
Start at the Candidate Selection step
In addition to the -f and --sources options, another input is required:
- a clusters tsv file with the --clusters option
Quick usage
python ./CAESAR/set_caesar.py selection -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file --clusters clusters_file
Options
usage: set_caesar.py selection [-h] [-o] [-t] -c -q -f --sources --clusters [-d] [-n | --cov-per-cluster ] [--gc] [-p] [-r] [-u]
options:
-h, --help show this help message and exit
-o , --outdir output directory [default: ./]
-t , --threads number of cpu threads [default: 6]
Mandatory inputs:
-c , --config the yaml config file
-q , --query set of reference sequences or hmm file
-f , --fasta-cand multi fasta file
--sources file indicating the sources database of each sequences
--clusters clusters tsv file
Optional data:
-d , --data file formatted as filetered_data.tsv returned by the filter step
Candidates selection options:
-n , --nb-cand maximum number of candidate per cluster [default: 1]
--cov-per-cluster uses a percentage of each cluster as maximum number of candidates rather than a given number
--gc target GC percentage to decide between candidates [default: 50.0]
Phylogeny options:
-p , --phylo 1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
-r , --reduce 1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]
Exclude some protein ids:
Can be used to re-run pipeline and try to get other candidates
-u , --update File containing the list of proteins ids not to be selected as candidates
Example of modifying candidates selection options
python ./CAESAR/set_caesar.py selection -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file --clusters clusters_file -n 2 --gc 55
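To give an intuition of how -n and --gc could interact, here is a minimal sketch that ranks the members of one cluster by the distance between their GC percentage and the target, then keeps at most `nb_cand` of them. This is only one plausible reading of "target GC percentage to decide between candidates": the function `pick_candidates`, the tie-break on the sequence id and the GC values are all hypothetical, and CAESAR's real selection may use additional criteria.

```python
def pick_candidates(gc_by_id, target_gc=50.0, nb_cand=1):
    """Return up to nb_cand ids whose GC percentage is closest to target_gc.

    Assumption: candidates are ranked only by |GC - target|, with ties
    broken alphabetically on the id.
    """
    ranked = sorted(
        gc_by_id,
        key=lambda seq_id: (abs(gc_by_id[seq_id] - target_gc), seq_id),
    )
    return ranked[:nb_cand]


# Made-up GC percentages for the members of a single cluster.
cluster = {"protA": 41.2, "protB": 55.3, "protC": 67.8}
selected = pick_candidates(cluster, target_gc=55.0, nb_cand=2)
print(selected)
```

The call uses the same values as the example command above (-n 2 --gc 55).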
Other options
Phylogeny options
For each subcommand two options are available for the phylogeny step:
Phylogeny options:
-p , --phylo 1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
-r , --reduce 1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]
If you don't want to perform this step, simply set -p to 0. If you want to use all the filtered sequences to build the tree, simply set -r to 0 and keep -p at 1.
NB: If -r is set to 0, the multiple alignment and the phylogenetic tree may be built from a very large number of sequences. This can produce a tree that is difficult to read, or even cause the build to fail.
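The two settings described above can be written as the following command lines. The sketch builds them as Python argument lists, which is convenient when CAESAR is launched from a script; the blastp subcommand and the file names are reused from the earlier examples, and the variable names are illustrative.

```python
import shlex

# Base command reused from the blastp quick usage.
BASE = ["python", "./CAESAR/set_caesar.py", "blastp",
        "-q", "references_sequences.fasta", "-c", "config.yml"]

# Skip the phylogenetic step entirely.
skip_phylogeny = BASE + ["-p", "0"]

# Build the tree from all the filtered sequences instead of the
# cluster representatives (may be slow on large data sets).
full_tree = BASE + ["-p", "1", "-r", "0"]

for command in (skip_phylogeny, full_tree):
    print(shlex.join(command))
```

Each list can be passed as-is to `subprocess.run` to actually start the workflow.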
Update
This option can be used to re-run a workflow and retrieve new candidates, or simply to exclude some sequences. It is available with all subcommands.
python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml -u update_file
The file must contain one sequence id per line, like the all_candidates.txt file written by CAESAR.
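When the ids to exclude come from several sources, a small helper can produce a file in this one-id-per-line layout. The sketch below is an assumption-laden illustration: `write_update_file` is a hypothetical helper, the protein ids are made up, and dropping blanks and duplicates is a convenience that CAESAR itself may not require.

```python
import tempfile
from pathlib import Path


def write_update_file(path, protein_ids):
    """Write one protein id per line, dropping blanks and duplicates."""
    unique_ids = []
    for protein_id in protein_ids:
        protein_id = protein_id.strip()
        if protein_id and protein_id not in unique_ids:
            unique_ids.append(protein_id)
    Path(path).write_text("\n".join(unique_ids) + "\n")
    return unique_ids


# Made-up protein ids, for demonstration only.
with tempfile.TemporaryDirectory() as tmp:
    update_path = Path(tmp) / "update_file"
    write_update_file(update_path, ["protA", "protB ", "", "protA"])
    content = update_path.read_text()
print(content, end="")
```

The resulting file can then be given to any subcommand with the -u option.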