Usages - labgem/CAESAR GitHub Wiki

The CAESAR workflow consists of five consecutive steps:

Search step (blastp or hmmsearch)
Filter step
Clustering step
Candidate selection step
Phylogenetic step

CAESAR has five subcommands, each corresponding to an entry point:

blastp
hmmsearch
filter
clustering
selection

NB: The phylogenetic step cannot be run on its own.

Complete Workflow

To run the complete workflow, start at the search step with either blastp or hmmsearch.

Start with a Blastp

Quick Usage

python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml

Options

usage: set_caesar.py blastp [-h] [-o] [-t] [-m] -c  -q  [--id] [--cov] [--min-len] [--max-len] [--tax] [--cluster-id] [--cluster-cov] [-n  | --cov-per-cluster ] [--gc] [-p] [-r] [-u]

options:
  -h, --help          show this help message and exit
  -o , --outdir       output directory [default: ./]
  -t , --threads      number of cpu threads [default: 6]
  -m , --mem          memory limit for the clustering step [default: 4G]

Mandatory inputs:
  -c , --config       the yaml config file
  -q , --query        set of reference sequences

Blastp options:
  --id                retains only sequences above the specified percentage of sequence identity [default: 30.0]
  --cov               retains only sequences above the specified percentage of query cover [default: 80.0]
  --min-len           retains only sequences above the specified sequence length [default: 200]
  --max-len           retains only sequences below the specified sequence length [default: 200]
  --tax               Superkingdom filter, A: Archaea, B: Bacteria and E: Eukaryota [default: 'ABE']

Clustering options:
  --cluster-id        identity cutoff for the clustering [default: 80.0]
  --cluster-cov       minimum coverage of cluster member sequence

Candidates selection options:
  -n , --nb-cand      maximum number of candidate per cluster [default: 1]
  --cov-per-cluster   uses a percentage of each cluster as maximum number of candidates rather than a given number
  --gc                target GC percentage to decide between candidates [default: 50.0]

Phylogeny options:
  -p , --phylo        1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
  -r , --reduce       1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]

Exclude some protein ids:
  Can be used to re-run pipeline and try to get other candidates

  -u , --update       File containing the list of proteins ids not to be selected as candidates

Example of modifying blastp options

python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml --id 55 --cov 90 --min-len 150 --max-len 400 --tax AB

Start with a Hmmsearch

Quick Usage

python ./CAESAR/set_caesar.py hmmsearch -q reference_profile.hmm -c config.yml

Options

usage: set_caesar.py hmmsearch [-h] [-o] [-t] [-m] -c  -q  [--score] [--cov] [--min-len] [--max-len] [--tax] [--cluster-id] [--cluster-cov] [-n  | --cov-per-cluster ] [--gc] [-p] [-r] [-u]

options:
  -h, --help          show this help message and exit
  -o , --outdir       output directory [default: ./]
  -t , --threads      number of cpu threads [default: 6]
  -m , --mem          memory limit for the clustering step [default: 4G]

Mandatory inputs:
  -c , --config       the yaml config file
  -q , --query        hmm file

Hmmsearch options:
  --score             retains only sequences above the specified full sequence score [default: 0.0]
  --cov               retains only sequences above the specified percentage of hmm profile cover [default: 80.0]
  --min-len           retains only sequences above the specified sequence length [default: 200]
  --max-len           retains only sequences below the specified sequence length [default: 200]
  --tax               Superkingdom filter, A: Archaea, B: Bacteria and E: Eukaryota [default: 'ABE']

Clustering options:
  --cluster-id        identity cutoff for the clustering [default: 80.0]
  --cluster-cov       minimum coverage of cluster member sequence

Candidates selection options:
  -n , --nb-cand      maximum number of candidate per cluster [default: 1]
  --cov-per-cluster   uses a percentage of each cluster as maximum number of candidates rather than a given number
  --gc                target GC percentage to decide between candidates [default: 50.0]

Phylogeny options:
  -p , --phylo        1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
  -r , --reduce       1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]

Exclude some protein ids:
  Can be used to re-run pipeline and try to get other candidates

  -u , --update       File containing the list of proteins ids not to be selected as candidates

Example of modifying hmmsearch options

python ./CAESAR/set_caesar.py hmmsearch -q reference_profile.hmm -c config.yml --score 125 --cov 90 --min-len 150 --max-len 400 --tax AB

Start at the Filter step

The input directory must contain either blastp output or .domtbl files, not both: a directory mixing the two kinds of files raises an error.
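
Before launching the filter step, the content of the data directory can be checked with a few lines of Python. This is only a minimal sketch, not part of CAESAR: the function name detect_search_type is hypothetical, and it assumes that blastp results end in .tsv and hmmsearch results end in .domtbl.

```python
from pathlib import Path


def detect_search_type(data_dir):
    """Return 'blastp' or 'hmmsearch' depending on the files found in data_dir.

    Assumption: blastp results end in .tsv and hmmsearch results in .domtbl.
    """
    files = [p for p in Path(data_dir).iterdir() if p.is_file()]
    has_tsv = any(p.suffix == ".tsv" for p in files)
    has_domtbl = any(p.suffix == ".domtbl" for p in files)
    if has_tsv and has_domtbl:
        # This is the situation the filter subcommand rejects.
        raise ValueError(f"{data_dir} mixes .tsv and .domtbl files; keep only one kind")
    if not (has_tsv or has_domtbl):
        raise ValueError(f"{data_dir} contains no .tsv or .domtbl file")
    return "blastp" if has_tsv else "hmmsearch"
```

Calling detect_search_type("data_directory") before running the filter subcommand reports a mixed directory early, before any CAESAR step has started.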

The blastp output must follow the tabular format produced by DIAMOND with the option: --outfmt 6 qseqid qlen sseqid slen length pident qcovhsp positive mismatch gaps evalue.
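
To make the expected column layout concrete, the sketch below reads one line of such a file into a dictionary. It is an illustration only, not the parser used by CAESAR: the names DIAMOND_COLUMNS and parse_diamond_line are hypothetical, and the numeric types chosen for each column are an assumption.

```python
# Column order taken from the --outfmt 6 field list given above.
DIAMOND_COLUMNS = [
    "qseqid", "qlen", "sseqid", "slen", "length", "pident",
    "qcovhsp", "positive", "mismatch", "gaps", "evalue",
]
# Assumed types: counts and lengths as int, percentages and e-value as float.
INT_COLUMNS = {"qlen", "slen", "length", "positive", "mismatch", "gaps"}
FLOAT_COLUMNS = {"pident", "qcovhsp", "evalue"}


def parse_diamond_line(line):
    """Split one tab-separated result line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(DIAMOND_COLUMNS):
        raise ValueError(
            f"expected {len(DIAMOND_COLUMNS)} columns, got {len(fields)}"
        )
    row = {}
    for name, value in zip(DIAMOND_COLUMNS, fields):
        if name in INT_COLUMNS:
            row[name] = int(value)
        elif name in FLOAT_COLUMNS:
            row[name] = float(value)
        else:
            row[name] = value
    return row
```

A line with a different number of columns raises a ValueError, which is a quick way to spot a file produced with another --outfmt field list.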

The filter options are the same as those of blastp and hmmsearch, because they are applied at this step. The --id and --score options are mutually exclusive.

usage: set_caesar.py filter [-h] [-o] [-t] -c  -q  -d  [--id  | --score ] [--cov] [--min-len] [--max-len] [--tax] [--cluster-id] [--cluster-cov] [-n  | --cov-per-cluster ] [--gc] [-p] [-r] [-u]

options:
  -h, --help          show this help message and exit
  -o , --outdir       output directory [default: ./]
  -t , --threads      number of cpu threads [default: 6]
  -m , --mem          memory limit for the clustering step [default: 4G]

Mandatory inputs:
  -c , --config       the yaml config file
  -q , --query        set of reference sequences or hmm file
  -d , --data         directory containing tsv file from diamond blastp or .domtbl file from hmmsearch

Filter options:
  --id                retains only sequences above the specified percentage of sequence identity [default: 30.0]
  --score             retains only sequences above the specified full sequence score [default: 0.0]
  --cov               retains only sequences above the specified percentage of query cover or hmm profile cover [default: 80.0]
  --min-len           retains only sequences above the specified sequence length [default: 200]
  --max-len           retains only sequences below the specified sequence length [default: 200]
  --tax               Superkingdom filter, A: Archaea, B: Bacteria and E: Eukaryota [default: 'ABE']

Clustering options:
  --cluster-id        identity cutoff for the clustering [default: 80.0]
  --cluster-cov       minimum coverage of cluster member sequence

Candidates selection options:
  -n , --nb-cand      maximum number of candidate per cluster [default: 1]
  --cov-per-cluster   uses a percentage of each cluster as maximum number of candidates rather than a given number
  --gc                target GC percentage to decide between candidates [default: 50.0]

Phylogeny options:
  -p , --phylo        1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
  -r , --reduce       1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]

Exclude some protein ids:
  Can be used to re-run pipeline and try to get other candidates

  -u , --update       File containing the list of proteins ids not to be selected as candidates

Example of Usage:

python ./CAESAR/set_caesar.py filter -q references_sequences.fasta -c config.yml -d data_directory
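
The `[--id | --score]` notation in the usage line means that only one of the two options can be given. The toy parser below reproduces that behaviour with Python's argparse module. It is a sketch for illustration, not CAESAR's actual parser: build_filter_parser and the prog name are hypothetical, and only the two options and their documented defaults are modelled.

```python
import argparse


def build_filter_parser():
    """Toy parser reproducing only the --id / --score exclusivity."""
    parser = argparse.ArgumentParser(prog="filter-sketch")
    group = parser.add_mutually_exclusive_group()
    # Defaults copied from the help text above.
    group.add_argument("--id", type=float, default=30.0, dest="identity")
    group.add_argument("--score", type=float, default=0.0)
    return parser


if __name__ == "__main__":
    args = build_filter_parser().parse_args(["--id", "55"])
    print(args.identity, args.score)
```

Passing both --id and --score to this parser makes argparse print an error and exit, which is the kind of failure to expect when both options are combined on the command line.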

Start at the Clustering step

To start at this step, the user must specify 2 files (these are usually written by the previous step):

a fasta file with the -f option
a text file with the --sources option

A third, optional file can be given:

a tsv file with the -d option

See the File format page to find out more about these files.

Quick Usage

python ./CAESAR/set_caesar.py clustering -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file

Options

usage: set_caesar.py clustering [-h] [-o] [-t] -c  -q  -f  --sources  [-d] [--cluster-id] [--cluster-cov] [-n  | --cov-per-cluster ] [--gc] [-p] [-r] [-u]

options:
  -h, --help          show this help message and exit
  -o , --outdir       output directory [default: ./]
  -t , --threads      number of cpu threads [default: 6]
  -m , --mem          memory limit for the clustering step [default: 4G]

Mandatory inputs:
  -c , --config       the yaml config file
  -q , --query        set of reference sequences or hmm file
  -f , --fasta-cand   multi fasta file
  --sources           file indicating the sources database of each sequences

Optional data:
  -d , --data         file formatted as filetered_data.tsv returned by the filter step

Clustering options:
  --cluster-id        identity cutoff for the clustering [default: 80.0]
  --cluster-cov       minimum coverage of cluster member sequence

Candidates selection options:
  -n , --nb-cand      maximum number of candidate per cluster [default: 1]
  --cov-per-cluster   uses a percentage of each cluster as maximum number of candidates rather than a given number
  --gc                target GC percentage to decide between candidates [default: 50.0]

Phylogeny options:
  -p , --phylo        1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
  -r , --reduce       1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]

Exclude some protein ids:
  Can be used to re-run pipeline and try to get other candidates

  -u , --update       File containing the list of proteins ids not to be selected as candidates

Example of modifying clustering options

python ./CAESAR/set_caesar.py clustering -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file --cluster-id 70 --cluster-cov 70

Start at the Candidate Selection step

In addition to the -f and --sources files, one more input is required:

a clusters tsv file with the --clusters option

Quick Usage

python ./CAESAR/set_caesar.py selection -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file --clusters clusters_file

Options

usage: set_caesar.py selection [-h] [-o] [-t] -c  -q  -f  --sources  --clusters  [-d] [-n  | --cov-per-cluster ] [--gc] [-p] [-r] [-u]

options:
  -h, --help          show this help message and exit
  -o , --outdir       output directory [default: ./]
  -t , --threads      number of cpu threads [default: 6]

Mandatory inputs:
  -c , --config       the yaml config file
  -q , --query        set of reference sequences or hmm file
  -f , --fasta-cand   multi fasta file
  --sources           file indicating the sources database of each sequences
  --clusters          clusters tsv file

Optional data:
  -d , --data         file formatted as filetered_data.tsv returned by the filter step

Candidates selection options:
  -n , --nb-cand      maximum number of candidate per cluster [default: 1]
  --cov-per-cluster   uses a percentage of each cluster as maximum number of candidates rather than a given number
  --gc                target GC percentage to decide between candidates [default: 50.0]

Phylogeny options:
  -p , --phylo        1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
  -r , --reduce       1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]

Exclude some protein ids:
  Can be used to re-run pipeline and try to get other candidates

  -u , --update       File containing the list of proteins ids not to be selected as candidates

Example of modifying candidates selection options

python ./CAESAR/set_caesar.py selection -q references_sequences.fasta -c config.yml -d data_file -f fasta_file --sources sources_file --clusters clusters_file -n 2 --gc 55
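
The -n and --cov-per-cluster options are two alternative ways of capping the number of candidates taken from a cluster: a fixed number, or a share of the cluster size. The sketch below shows one possible reading of that rule. It is not taken from CAESAR's code: the function max_candidates is hypothetical, and it assumes that the percentage is expressed on a 0-100 scale, that the result is rounded up, and that at least one candidate is allowed per cluster.

```python
import math


def max_candidates(cluster_size, nb_cand=1, cov_per_cluster=None):
    """Upper bound on the number of candidates for one cluster.

    Assumptions (not confirmed by CAESAR's source): cov_per_cluster is a
    percentage between 0 and 100, the result is rounded up, and at least
    one candidate is allowed per cluster.
    """
    if cov_per_cluster is None:
        # Fixed cap, as with -n / --nb-cand.
        limit = nb_cand
    else:
        # Cap proportional to the cluster size, as with --cov-per-cluster.
        limit = max(1, math.ceil(cluster_size * cov_per_cluster / 100))
    # A cluster cannot yield more candidates than it has members.
    return min(limit, cluster_size)
```

Under these assumptions, a cluster of 40 sequences allows 2 candidates with -n 2 and 4 candidates with a 10 percent share, while a cluster of 3 sequences still allows 1 candidate.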

Other options

Phylogeny options

For each subcommand two options are available for the phylogeny step:

Phylogeny options:
  -p , --phylo        1: generate a msa and a phylogenetic tree, 0: does not perform the step [default: 1]
  -r , --reduce       1: builds the tree using only the representative sequences of each cluster, 0: uses all the filtered sequences [default: 1]

If you don't want to perform this step, simply set -p to 0. If you want to use all the filtered sequences to build the tree, set -r to 0 and keep -p at 1.

NB: Setting -r to 0 can lead to building a multiple alignment and a phylogenetic tree from a very large number of sequences. The resulting tree may be difficult to read, or its build may even fail.

Update

This option can be used to re-run a workflow and retrieve new candidates, or simply to exclude some sequences. It is available with all subcommands.

python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml -u update_file

The file must contain one sequence id per line, like the all_candidates.txt file written by CAESAR.
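
When several runs have already been made, their candidate lists can be merged into a single update file. The helper below is a minimal sketch of that idea, not a CAESAR feature: write_update_file is a hypothetical name, and the only format assumed is the one described above, one sequence id per line.

```python
from pathlib import Path


def write_update_file(candidate_files, update_path):
    """Merge several id lists into one update file, one unique id per line."""
    seen = set()
    ids = []
    for path in candidate_files:
        for line in Path(path).read_text().splitlines():
            seq_id = line.strip()
            # Skip blank lines and ids already collected from a previous file.
            if seq_id and seq_id not in seen:
                seen.add(seq_id)
                ids.append(seq_id)
    Path(update_path).write_text("".join(f"{seq_id}\n" for seq_id in ids))
    return ids
```

The resulting file can then be passed to any subcommand with the -u option.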