05 Output Files - NBChub/bgcflow GitHub Wiki
Pipeline Output
Data Structure
The output of BGCFlow is a processed folder that contains the following subdirectories and files:
.
├── antismash
├── automlst_wrapper
├── bgcflow_wrapper.log
├── bigscape
│   ├── for_cytoscape_antismash_7.0.0
│   ├── Lactobacillus_delbrueckii_bigscape_as_7.0.0_mapping.csv
│   └── result_as7.0.0
├── bigslice
│   ├── cluster_as_7.0.0
│   └── query_as_7.0.0
├── cblaster
├── data_warehouse
├── dbt
│   └── antiSMASH_7.0.0
│       └── dbt_bgcflow.duckdb
├── docs
├── fastani
├── genbank
├── log_changes
├── main.py
├── mash
├── metadata
├── mkdocs.yml
├── README.md
├── roary
└── tables
    ├── df_antismash_7.0.0_summary.csv
    ├── df_arts_as-7.0.0.csv
    ├── df_deeptfactor.csv
    ├── df_gtdb_meta.csv
    ├── df_ncbi_meta.csv
    ├── df_regions_antismash_7.0.0.csv
    └── df_seqfu_stats.csv
This processed folder combines the MkDocs report, the data build tool (dbt) project, and the results of the bioinformatic pipelines run in the BGCFlow workflow. Each entry is described below:
File / Directory | Description |
---|---|
antismash | A directory containing the antiSMASH results. antiSMASH predicts and annotates secondary metabolite biosynthetic gene clusters (BGCs) in bacterial and fungal genomes. |
automlst_wrapper | A directory containing the genome tree built with the simplified autoMLST wrapper. The *.newick file can be used for further tree visualization. |
bgcflow_wrapper.log | A log file generated when serving the MkDocs report. |
bigscape | A directory containing the results of the BiG-SCAPE tool, which clusters BGCs into families based on their biosynthetic gene content. |
bigslice | A directory containing the results of the BiG-SLiCE tool, which clusters BGCs using the BIRCH algorithm. |
cblaster | A directory containing the DIAMOND database of the dataset, generated by cblaster, which can be used for cblaster searches. |
data_warehouse | A directory containing Parquet tables generated by the different tools in the workflow. |
dbt | A directory containing the dbt SQL schema for transforming BGCFlow results into a DuckDB database. Inspired by: https://github.com/dbt-labs/jaffle_shop_duckdb |
docs | A directory containing the Jupyter notebooks and Markdown reports that are served in the report. |
fastani | A directory containing the results of the FastANI tool, which calculates pairwise Average Nucleotide Identity (ANI) between genomes. |
genbank | A directory containing the GenBank files for the bacterial genomes used in the BGCFlow workflow. |
log_changes | A directory recording the BGC id changes made by antiSMASH and BGCFlow. |
main.py | A Python script generated by the BGCFlow wrapper to serve the Markdown report. |
mash | A directory containing the results of the Mash tool, which estimates pairwise genome distances using MinHash. |
metadata | A directory containing the metadata and dependency versions used in the project. |
mkdocs.yml | The configuration file for the MkDocs tool, which generates the documentation for the BGCFlow report. |
overrides | A directory containing the overrides for the MkDocs tool. |
`__pycache__` | A directory containing the compiled Python bytecode files. |
README.md | The README file for the BGCFlow project result. |
roary | A directory containing the results of the Roary tool, which performs pan-genome analysis on bacterial genomes. |
tables | A directory containing the tables generated by the BGCFlow workflow. |
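To get a quick overview of what the `tables` directory holds, the CSV files can be inspected with the Python standard library alone. The sketch below is a minimal illustration, not part of BGCFlow: the helper name `summarize_tables` is hypothetical, and it only assumes that each CSV file has a header row.

```python
import csv
from pathlib import Path


def summarize_tables(processed_dir):
    """Return {file name: (row count, column names)} for each CSV in tables/.

    Hypothetical helper for illustration; it is not part of BGCFlow.
    """
    tables_dir = Path(processed_dir) / "tables"
    summary = {}
    for csv_path in sorted(tables_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])        # first row holds the column names
            n_rows = sum(1 for _ in reader)  # remaining rows are data records
        summary[csv_path.name] = (n_rows, header)
    return summary
```

Calling `summarize_tables("path/to/processed")` (the path is a placeholder for your own project output) returns a dictionary keyed by file name, such as `df_seqfu_stats.csv`, which makes it easy to check that the expected tables were produced before loading them into a larger analysis.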
Summary of Available Pipelines (Main Workflow)
The table below lists the pipeline keywords that you can run using the main Snakefile of BGCFlow.
Index | Keyword | Description | Links |
---|---|---|---|
0 | eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper |
1 | mash | Calculate distance estimation for all samples using MinHash. | Mash |
2 | fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI |
3 | automlst-wrapper | Simplified tree building using autoMLST. | automlst-simplified-wrapper |
4 | roary | Build pangenome using Roary. | Roary |
5 | eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper |
6 | seqfu | Calculate sequence statistics using SeqFu. | seqfu2 |
7 | bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice |
8 | query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl |
9 | checkm | Assess genome quality with CheckM. | CheckM |
10 | gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk |
11 | prokka-gbk | Copy annotated GenBank results. | prokka |
12 | antismash | Summarize antiSMASH results. | antismash |
13 | arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts |
14 | deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor |
15 | deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary |
16 | cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster |
17 | cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster |
18 | bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE |
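Because the keywords must be spelled exactly as listed (for example `bigscape`, not `big-scape`), it can help to validate a selection before launching a run. The following sketch is only an illustration: `select_pipelines` is a hypothetical helper, and the returned keyword-to-boolean mapping is an assumed representation, not BGCFlow's documented configuration format.

```python
# Keywords copied from the table above.
PIPELINE_KEYWORDS = (
    "eggnog", "mash", "fastani", "automlst-wrapper", "roary", "eggnog-roary",
    "seqfu", "bigslice", "query-bigslice", "checkm", "gtdbtk", "prokka-gbk",
    "antismash", "arts", "deeptfactor", "deeptfactor-roary",
    "cblaster-genome", "cblaster-bgc", "bigscape",
)


def select_pipelines(requested):
    """Map every known keyword to True/False and reject unknown keywords.

    Hypothetical helper for illustration; it is not part of BGCFlow.
    """
    requested = set(requested)
    unknown = sorted(requested - set(PIPELINE_KEYWORDS))
    if unknown:
        raise ValueError(f"Unknown pipeline keyword(s): {', '.join(unknown)}")
    return {keyword: keyword in requested for keyword in PIPELINE_KEYWORDS}
```

For instance, `select_pipelines(["seqfu", "bigscape"])` marks only those two keywords as enabled, while a misspelled keyword raises a `ValueError` that names the offending entry.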
Network Analysis with Cytoscape
A GraphML file containing the annotated BiG-SCAPE network is generated by the automated report and can be explored with Cytoscape. Guidance on network analysis with Cytoscape can be found in the Cytoscape documentation: https://manual.cytoscape.org/en/stable/
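Before opening the network in Cytoscape, it can be useful to check its size. The sketch below counts nodes and edges with the Python standard library. It assumes the file declares the standard GraphML namespace (`http://graphml.graphdrawing.org/xmlns`); the helper name `count_graphml_elements` is hypothetical, and the location and name of the GraphML file depend on your project.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace; assumed to be declared by the file being read.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def count_graphml_elements(graphml_path):
    """Return (number of nodes, number of edges) found in a GraphML file.

    Hypothetical helper for illustration; it is not part of BGCFlow.
    """
    root = ET.parse(graphml_path).getroot()
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)
```

A very large edge count is a hint that filtering the network in Cytoscape, for example by edge attributes, will make it easier to explore.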