01 Setting Project Configurations - NBChub/bgcflow GitHub Wiki
Like many other Snakemake workflows, BGCFlow reads its input and workflow settings from the config/ folder. This folder contains a .yaml file with the workflow configuration and metadata, which also points to the samples table containing the input paths.
To load an example configuration, run this wrapper command:
bgcflow init
The above command will create a new file at config/config.yaml.
Note: Run this command inside the BGCFlow directory, or pass the argument --bgcflow_dir <my BGCFlow folder>.
More about the init command:
Usage: bgcflow init [OPTIONS]

  Create projects or initiate BGCFlow config from template. Use --project to
  create a new BGCFlow project.

  Usage: bgcflow init --> check current directory for existing config dir. If
  not found, generate from template. bgcflow init --project <TEXT> -->
  generate a new BGCFlow project in the config directory.

Options:
  --bgcflow_dir TEXT      Location of BGCFlow directory. (DEFAULT: Current
                          working directory)
  --project TEXT          Initiate a new BGCFlow project. Insert project name:
                          `bgcflow init --project <TEXT>`
  --use_project_pipeline  Generate pipeline selection template in PEP file
                          instead of using Global pipelines. Use with
                          `--project` option.
  --prokka_db TEXT        Path to custom reference file. Use with `--project`
                          option.
  --gtdb_tax TEXT         Path to custom taxonomy file. Use with `--project`
                          option.
  --samples_csv TEXT      Path to samples file. Use with `--project` option.
  -h, --help              Show this message and exit.
BGCFlow has two configuration levels: global and project-specific. Both are defined in .yaml files and organized as shown below:
config/
├── config.yaml # --> GLOBAL CONFIGURATION FILE
└── project_1
    ├── project_config.yaml # --> PROJECT CONFIGURATION FILE
    └── samples.csv
The global configuration is defined in config.yaml under the config folder. Its function is to:
- List projects that should be run in the main workflow and subworkflows
- Set up default pipelines/rules that will be run for all projects
- Locate the resources path
- Manage other settings that apply to all projects
Configure the workflow according to your needs by editing the files in the config/ folder. An example of the configuration files is provided in the .examples folder.
Projects can be added under the projects section of the global config file, config/config.yaml. Each project is added as one line containing the path to its project-specific configuration file (the PEP file). Each line starts with "- " followed by the variable pep, which points to a PEP config file.
projects:
- pep: .examples/_pep_example/project_config.yaml
In the global config file, you can choose which analyses to run by setting each parameter in the pipelines section to TRUE or FALSE:
pipelines:
  bigscape: TRUE
  mlst: TRUE
  refseq_masher: TRUE
  seqfu: TRUE
  eggnog: FALSE
Note that this only applies to the pipelines available in the main workflow.
TIPS - Find available rules from the main workflow with bgcflow_wrapper:
$ bgcflow pipelines --bgcflow_dir bgcflow
Printing available rules:
- eggnog
- mash
- fastani
- automlst-wrapper
- roary
- eggnog-roary
- seqfu
- bigslice
- query-bigslice
- checkm
- gtdbtk
- prokka-gbk
- antismash
- arts
- deeptfactor
- deeptfactor-roary
- cblaster-genome
- cblaster-bgc
- bigscape
TIPS - Look up a rule's description and citation with bgcflow_wrapper:
$ bgcflow pipelines --describe bigscape
Description for bigscape:
- Cluster BGCs using BiG-SCAPE
$ bgcflow pipelines --cite bigscape
Citations for bigscape:
- Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. [Nat Chem Biol 16, 60–68 (2020)](https://doi.org/10.1038/s41589-019-0400-9)
More about the command:
$ bgcflow pipelines --help
Usage: bgcflow pipelines [OPTIONS]

  Get description of available pipelines from BGCFlow.

Options:
  --bgcflow_dir TEXT  Location of BGCFlow directory. (DEFAULT: Current working
                      directory)
  --describe TEXT     Get description of a given pipeline.
  --cite TEXT         Get citation of a given pipeline.
  -h, --help          Show this message and exit.
By default, BGCFlow will download and install the necessary software and databases in the resources/ folder. The location of each resource can be changed by editing its path in the resources_path section. This is especially useful if you already have the databases and software locally: instead of creating the resource folder, BGCFlow will generate a symlink to the existing resources.
resources_path:
  antismash_db: resources/antismash_db
  eggnog_db: resources/eggnog_db
  BiG-SCAPE: resources/BiG-SCAPE
  bigslice: resources/bigslice
  checkm: resources/checkm
  gtdbtk: <custom gtdbtk database path>
Other configuration is described in the Advanced Configuration page.
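The symlink behaviour described for resources_path can be illustrated with a minimal Python sketch. This is not BGCFlow's own code: the function name link_resource and every path in the demonstration are hypothetical, and the sketch only assumes that an existing database folder should be linked into place rather than downloaded again.

```python
import tempfile
from pathlib import Path


def link_resource(existing: Path, target: Path) -> Path:
    """Make `target` a symlink to `existing` unless something is already there."""
    if not existing.is_dir():
        raise FileNotFoundError(f"no such resource directory: {existing}")
    if target.is_symlink() or target.exists():
        return target  # never overwrite an existing folder or link
    target.parent.mkdir(parents=True, exist_ok=True)
    target.symlink_to(existing.resolve(), target_is_directory=True)
    return target


# Demonstration in a throwaway directory (all paths here are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    shared_db = Path(tmp) / "shared" / "gtdbtk"
    shared_db.mkdir(parents=True)
    link = link_resource(shared_db, Path(tmp) / "resources" / "gtdbtk")
    print(link.is_symlink(), link.resolve() == shared_db.resolve())
```

Linking instead of copying keeps a single copy of a large database on disk, which is presumably why the workflow takes this route when a custom path is given.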
As of BGCFlow version 0.4.0, projects are configured as a Portable Encapsulated Project (PEP). The project-specific configuration is a .yaml file that can be placed inside each project folder. Its function is to:
- Define a project id
- Give project description and metadata
- Locate the samples table containing a list of the inputs for each project
- Add additional information for the workflow run, such as custom taxonomic assignment or custom reference gene annotation
- Define pipelines/rules to run for a particular project. This will override and ignore the pipelines/rules defined in the global configuration.
See project_config.yaml for an example of a PEP formatted project.
Each project requires a name and a description. An example project PEP configuration looks like this:
name: Lactobacillus_delbrueckii
pep_version: 2.1.0
description: "Lactobacillus delbrueckii 27 01 2023"
sample_table: samples.csv
#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  mash: TRUE
  fastani: TRUE
  checkm: FALSE
The name will be used as the project id and should be unique for each project. The description should provide context about the project, such as sample size or date of experiment. The variable pep_version tells BGCFlow which version of PEP is being used. Additional configuration is described in the Advanced Configuration section.
The variable sample_table (in a PEP) or samples denotes the location of your .csv file, which specifies the genomes to analyze. You can name the file anything as long as its path is defined in the configuration file.
Example: samples.csv
genome_id | source | organism | genus | species | strain | closest_placement_reference |
---|---|---|---|---|---|---|
GCF_000359525.1 | ncbi | J1074 | | | | |
1223307.4 | patric | Streptomyces sp. PVA 94-07 | Streptomyces | sp. | PVA 94-07 | GCF_000495755.1 |
P8-2B-3.1 | custom | Streptomyces sp. P8-2B-3 | Streptomyces | sp. | P8-2B-3 | |
Columns description:
- genome_id [required]: The genome accession ids (with genome version for ncbi and patric genomes). For custom fasta files provided by users, it should refer to the fasta file names stored in the data/raw/fasta/ directory with the .fna extension. Example: genome id P8-2B-3.1 refers to the file data/raw/fasta/P8-2B-3.1.fna.
- source [required]: Source of the genome to be analyzed. Choose one of the following: custom, ncbi, patric. Where:
  - custom: for user-provided genomes (.fna) in the data/raw/fasta directory with genome ids as filenames
  - ncbi: for a list of public genome accession IDs that will be downloaded from the NCBI refseq (GCF...) or genbank (GCA...) database
  - patric: for a list of public genome accession IDs that will be downloaded from the PATRIC database
- organism [optional]: name of the organism that is the same as in the fasta header
- genus [optional]: genus of the organism. Ideally identified with GTDBtk.
- species [optional]: species epithet (the second word in a species name) of the organism. Ideally identified with GTDBtk.
- strain [optional]: strain id of the organism
- closest_placement_reference [optional]: if known, the closest NCBI genome to the organism. Ideally identified with GTDBtk.
Further formatting rules are defined in the workflow/schemas/
folder.
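The column rules above can be checked before starting a run. The following Python sketch is an illustration only: it is not the schema in workflow/schemas/, and the helper names check_samples and expected_fasta are hypothetical. It assumes a comma-separated file with a header row and checks only what the column descriptions state: genome_id and source are required, and source must be custom, ncbi, or patric.

```python
import csv
import io
from pathlib import PurePosixPath

REQUIRED_COLUMNS = ("genome_id", "source")
ALLOWED_SOURCES = {"custom", "ncbi", "patric"}
FASTA_DIR = PurePosixPath("data/raw/fasta")


def expected_fasta(genome_id: str) -> PurePosixPath:
    """Path where a `custom` genome is expected: data/raw/fasta/<id>.fna."""
    return FASTA_DIR / f"{genome_id}.fna"


def check_samples(handle) -> list:
    """Return a list of problems found in a samples table (empty if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]
    problems = []
    for row in reader:
        line_no = reader.line_num  # line of the file just consumed
        genome_id = (row["genome_id"] or "").strip()
        source = (row["source"] or "").strip()
        if not genome_id:
            problems.append(f"line {line_no}: genome_id is empty")
        if source not in ALLOWED_SOURCES:
            problems.append(f"line {line_no}: unknown source {source!r}")
    return problems


# The example table from this page, written out as CSV.
EXAMPLE = """\
genome_id,source,organism,genus,species,strain,closest_placement_reference
GCF_000359525.1,ncbi,J1074,,,,
1223307.4,patric,Streptomyces sp. PVA 94-07,Streptomyces,sp.,PVA 94-07,GCF_000495755.1
P8-2B-3.1,custom,Streptomyces sp. P8-2B-3,Streptomyces,sp.,P8-2B-3,
"""

print(check_samples(io.StringIO(EXAMPLE)))
print(expected_fasta("P8-2B-3.1"))
```

The file name is built with string formatting rather than a suffix replacement because genome ids such as P8-2B-3.1 already contain a dot.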
In each project, you can choose which analyses to run by setting each parameter under rules in project_config.yaml to TRUE or FALSE:
rules:
  bigscape: TRUE
  mlst: TRUE
  refseq_masher: TRUE
  seqfu: TRUE
  eggnog: FALSE
These project-level rules replace the pipelines section of the global configuration, which is then ignored for that project.
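The precedence between the two levels can be summarized in a short Python sketch. It illustrates the behaviour described on this page and is not BGCFlow's actual implementation; the function name rules_to_run is hypothetical, and the two dictionaries simply mirror the example YAML sections shown earlier.

```python
from typing import Dict, List, Optional


def rules_to_run(global_pipelines: Dict[str, bool],
                 project_rules: Optional[Dict[str, bool]] = None) -> List[str]:
    """Pick the rules for one project: project-level rules win outright."""
    # No merging: when a project defines its own rules, the global
    # pipelines section is ignored entirely for that project.
    chosen = project_rules if project_rules is not None else global_pipelines
    return sorted(name for name, enabled in chosen.items() if enabled)


global_pipelines = {"bigscape": True, "mlst": True, "refseq_masher": True,
                    "seqfu": True, "eggnog": False}
project_rules = {"seqfu": True, "mash": True, "fastani": True, "checkm": False}

print(rules_to_run(global_pipelines))                 # project without its own rules
print(rules_to_run(global_pipelines, project_rules))  # project with its own rules
```

In this sketch a rule enabled globally (for example bigscape) does not run for a project that defines its own rules section without it, which matches the "override and ignore" wording above.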