05 Output Files - NBChub/bgcflow GitHub Wiki

Pipeline Output

Data Structure

The output of BGCFlow is a processed folder that contains the following subdirectories and files:

.
├── antismash
├── automlst_wrapper
├── bgcflow_wrapper.log
├── bigscape
│   ├── for_cytoscape_antismash_7.0.0
│   ├── Lactobacillus_delbrueckii_bigscape_as_7.0.0_mapping.csv
│   └── result_as7.0.0
├── bigslice
│   ├── cluster_as_7.0.0
│   └── query_as_7.0.0
├── cblaster
├── data_warehouse
├── dbt
│   └── antiSMASH_7.0.0
│       └── dbt_bgcflow.duckdb
├── docs
├── fastani
├── genbank
├── log_changes
├── main.py
├── mash
├── metadata
├── mkdocs.yml
├── README.md
├── roary
└── tables
    ├── df_antismash_7.0.0_summary.csv
    ├── df_arts_as-7.0.0.csv
    ├── df_deeptfactor.csv
    ├── df_gtdb_meta.csv
    ├── df_ncbi_meta.csv
    ├── df_regions_antismash_7.0.0.csv
    └── df_seqfu_stats.csv
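
The CSV files listed under `tables` can be inspected with the Python standard library alone. The following is a minimal sketch, not part of BGCFlow itself: the function name `summarize_tables` and the `processed` path in the usage note are placeholders, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def summarize_tables(processed_dir):
    """Return {file name: (column names, number of data rows)} for each CSV in tables/."""
    tables_dir = Path(processed_dir) / "tables"
    summary = {}
    for csv_path in sorted(tables_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # an empty file yields no columns
            n_rows = sum(1 for _ in reader)  # remaining lines are data rows
        summary[csv_path.name] = (header, n_rows)
    return summary
```

Calling `summarize_tables("processed")` on your own output folder returns one entry per CSV file, which is a quick way to see which tables a given run produced before loading them into a data analysis tool.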

This processed folder combines the MkDocs report, the dbt (data build tool) project, and the results from the bioinformatics pipelines in the BGCFlow workflow. Each entry is described below:

File / Directory	Description
antismash	A directory containing the antiSMASH results. antiSMASH predicts and annotates secondary metabolite biosynthetic gene clusters (BGCs) in bacterial and fungal genomes.
automlst_wrapper	A directory containing the genome tree built using the simplified autoMLST wrapper. The `*.newick` file can be used for further tree visualization.
bgcflow_wrapper.log	A log file generated upon serving the `mkdocs` report.
bigscape	A directory containing the results of the BiG-SCAPE tool, which clusters BGCs into families based on their biosynthetic gene content.
bigslice	A directory containing the results of the BiG-SLiCE tool, which clusters BGCs using the BIRCH algorithm.
cblaster	A directory containing the DIAMOND database of the dataset, generated by cblaster, which can be used for BLAST searches.
data_warehouse	A directory containing Parquet tables generated by the different tools in the workflow.
dbt	A directory containing the dbt SQL schema for transforming BGCFlow results into a DuckDB database. Inspired by: https://github.com/dbt-labs/jaffle_shop_duckdb
docs	A directory containing the Jupyter notebooks and Markdown reports that are served in the report.
fastani	A directory containing the results of the FastANI tool, which performs pairwise genome comparisons.
genbank	A directory containing the GenBank files for the bacterial genomes used in the BGCFlow workflow.
log_changes	A directory recording the BGC ID changes made by antiSMASH and BGCFlow.
main.py	A Python script generated by the BGCFlow wrapper to serve the Markdown report.
mash	A directory containing the results of the Mash tool, which performs pairwise genome comparisons.
metadata	A directory containing the metadata and dependency versions used in the project.
mkdocs.yml	The configuration file for the MkDocs tool, which generates the documentation for the BGCFlow report.
overrides	A directory containing the overrides for the MkDocs tool.
__pycache__	A directory containing the compiled Python bytecode files.
README.md	The README file for the BGCFlow project result.
roary	A directory containing the results of the Roary tool, which performs pan-genome analysis on bacterial genomes.
tables	A directory containing the tables generated by the BGCFlow workflow.
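
To see which of the entries described above a particular run actually produced, the folder can be compared against the documented names. This is a rough sketch under stated assumptions: `check_output_layout` is a hypothetical helper, the list is copied from the table above (without `__pycache__`), and it is assumed, not confirmed by the table, that a run only creates the entries for the pipelines that were enabled.

```python
from pathlib import Path

# Names copied from the description table above (bytecode cache omitted).
DOCUMENTED_ENTRIES = [
    "antismash", "automlst_wrapper", "bgcflow_wrapper.log", "bigscape",
    "bigslice", "cblaster", "data_warehouse", "dbt", "docs", "fastani",
    "genbank", "log_changes", "main.py", "mash", "metadata", "mkdocs.yml",
    "overrides", "README.md", "roary", "tables",
]


def check_output_layout(processed_dir):
    """Return (present, missing) lists of documented entries, in table order."""
    root = Path(processed_dir)
    present = [name for name in DOCUMENTED_ENTRIES if (root / name).exists()]
    missing = [name for name in DOCUMENTED_ENTRIES if name not in present]
    return present, missing
```

A missing entry is not necessarily an error; under the assumption above it may simply mean that the corresponding pipeline was not part of the run.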

Summary of Available Pipelines (Main Workflow)

The table below lists the pipeline keywords that you can run using the main Snakefile of BGCFlow.

	Keyword	Description	Links
0	eggnog	Annotate samples with eggNOG database (http://eggnog5.embl.de)	eggnog-mapper
1	mash	Calculate distance estimation for all samples using MinHash.	Mash
2	fastani	Do pairwise Average Nucleotide Identity (ANI) calculation across all samples.	FastANI
3	automlst-wrapper	Build a simplified tree using autoMLST.	automlst-simplified-wrapper
4	roary	Build pangenome using Roary.	Roary
5	eggnog-roary	Annotate Roary output using eggNOG mapper	eggnog-mapper
6	seqfu	Calculate sequence statistics using SeqFu.	seqfu2
7	bigslice	Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice)	bigslice
8	query-bigslice	Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/)	bigfam.bioinformatics.nl
9	checkm	Assess genome quality with CheckM.	CheckM
10	gtdbtk	Taxonomic placement with GTDB-Tk	GTDBTk
11	prokka-gbk	Copy annotated GenBank results.	prokka
12	antismash	Summarize antiSMASH results.	antismash
13	arts	Run Antibiotic Resistant Target Seeker (ARTS) on samples.	arts
14	deeptfactor	Use deep learning to find Transcription Factors.	deeptfactor
15	deeptfactor-roary	Use DeepTFactor on Roary outputs.	Roary
16	cblaster-genome	Build diamond database of genomes for cblaster search.	cblaster
17	cblaster-bgc	Build diamond database of BGCs for cblaster search.	cblaster
18	bigscape	Cluster BGCs using BiG-SCAPE	BiG-SCAPE
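
Because the keywords are plain strings, a typo is easy to miss. The sketch below checks a list of requested keywords against the ones in the table above. `split_keywords` is a hypothetical helper for illustration and is not part of BGCFlow; how the keywords are passed to the workflow is not shown here.

```python
# Keywords copied from the table above.
KNOWN_KEYWORDS = frozenset({
    "eggnog", "mash", "fastani", "automlst-wrapper", "roary", "eggnog-roary",
    "seqfu", "bigslice", "query-bigslice", "checkm", "gtdbtk", "prokka-gbk",
    "antismash", "arts", "deeptfactor", "deeptfactor-roary",
    "cblaster-genome", "cblaster-bgc", "bigscape",
})


def split_keywords(requested):
    """Split requested keywords into (known, unknown), preserving input order."""
    known = [kw for kw in requested if kw in KNOWN_KEYWORDS]
    unknown = [kw for kw in requested if kw not in KNOWN_KEYWORDS]
    return known, unknown
```

For example, `split_keywords(["seqfu", "antismash", "bigscap"])` reports `bigscap` as unknown, which points to the misspelled `bigscape`.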

Network Analysis with Cytoscape

A GraphML file containing the annotated BiG-SCAPE network is generated by the automated report and can be explored with Cytoscape. Guidance on network analysis with Cytoscape is available in the Cytoscape documentation: https://manual.cytoscape.org/en/stable/
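
Before opening the network in Cytoscape, it can help to check its size and which node annotations it carries. GraphML is XML, so the standard library is enough for a first look. This is a minimal sketch: `summarize_graphml` is a hypothetical helper, and the file path is whatever GraphML file your report produced (its exact location is not specified here).

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graphml(path):
    """Count nodes and edges and list the declared node attribute names."""
    root = ET.parse(path).getroot()
    node_attributes = sorted(
        key.get("attr.name")
        for key in root.findall("g:key", GRAPHML_NS)
        if key.get("for") == "node" and key.get("attr.name")
    )
    n_nodes = len(root.findall(".//g:node", GRAPHML_NS))
    n_edges = len(root.findall(".//g:edge", GRAPHML_NS))
    return {"nodes": n_nodes, "edges": n_edges, "node_attributes": node_attributes}
```

The returned attribute names are the columns that Cytoscape will show in its node table after import, so they indicate which annotations are available for styling and filtering.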