Output - magwenelab/WeavePop GitHub Wiki
Below is a description of each module's output. The paths are relative to <project_directory>/results_<run_id>, as specified in the config file (results/ by default).
If you want to create all the output files, you only need to activate the database and the plotting modules. If you don't want to create the database, you can activate only the intermediate modules you want.
The files in bold are the ones that are integrated into the final database.
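The paths listed on this page contain placeholders such as `{sample}` and `{lineage}`. As a minimal sketch of how such a path can be resolved, the hypothetical helper below (`output_path` is not part of WeavePop) fills the placeholders and joins the result to the results directory. The sample name `sample_A` is made up for illustration.

```python
from pathlib import Path


def output_path(results_dir, template, **wildcards):
    """Fill the {sample}/{lineage} placeholders of a documented path
    and join it to the results directory."""
    return Path(results_dir) / template.format(**wildcards)


# "results" is the default directory mentioned above;
# "sample_A" is a made-up sample name.
vcf = output_path("results", "01.Samples/snippy/{sample}/snps.vcf", sample="sample_A")
print(vcf.as_posix())
```

The same helper works for lineage-level paths by passing `lineage=...` instead of `sample=...`.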
The files in this module are produced as needed, according to the activation of the subsequent modules.
In the config file, you can activate the annotation of the reference genomes by a main reference. If you activate it, all reference genomes will be annotated.
Path | Description |
---|---|
03.References/{lineage}/{lineage}_repeats.bed |
BED file of regions with repetitive sequences identified by RepeatMasker. Each region is the intersection of different types of repetitive sequences identified. Columns are Accession, Start, End, Types (comma-separated list of types in the region). Positions are 0-Based. |
03.References/{lineage}/{lineage}.gff |
Standardized GFF file of the reference genome with added introns, intergenic regions, and repetitive sequences. If the reference annotation was activated, it is the processed result of Liftoff annotation using the main reference. Positions are 1-Based. |
03.References/{lineage}/{lineage}.gff.tsv |
Tabular version of the previous file. Positions are 1-Based. Column names that differ from the standard GFF format: accession ('seq_id'), feature_id ('ID'), gene_name ('Name'), gene_id ('locus'), old_feature_id (original ID before fixing), lineage, identical_to_main_ref ('matches_ref_protein', added by Liftoff if used), and start_stop_mutation (union of the columns 'missing_start_codon', 'missing_stop_codon', and 'inframe_stop_codon', added by Liftoff if used). |
03.References/all_lineages.gff.tsv |
Concatenation of the previous tables from all lineages. Positions are 1-Based. |
03.References/refs_unmapped_features.tsv |
Table with the genes that were not mapped in each reference when Liftoff from the main reference was done. Along with the information of the genes, there is one column per reference genome with the value unmapped or blank. |
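The tables above mix 0-Based files (BED) and 1-Based files (GFF, TSV), so coordinates need converting before they are compared. The sketch below parses one line of `{lineage}_repeats.bed` using the four documented columns and converts the interval to 1-Based. It assumes the file is tab-separated and that the BED intervals are half-open, as in the standard BED convention. The example line and its repeat types are made up.

```python
def parse_repeat_bed_line(line):
    """Parse one line of {lineage}_repeats.bed (Accession, Start, End, Types)."""
    accession, start, end, types = line.rstrip("\n").split("\t")
    return {
        "accession": accession,
        "start": int(start),
        "end": int(end),
        "types": types.split(","),  # comma-separated list of repeat types
    }


def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval to a 1-based, inclusive one.

    Assumes standard BED semantics: only the start shifts by one.
    """
    return start + 1, end


# Made-up example line, not taken from a real output file.
example = "chr1\t99\t250\tSimple_repeat,LTR\n"
record = parse_repeat_bed_line(example)
gff_start, gff_end = bed_to_one_based(record["start"], record["end"])
print(record["accession"], gff_start, gff_end, record["types"])
```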
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/03.References/{lineage}/intermediate_liftoff/ |
See Liftoff output |
04.Intermediate_files/03.References/{lineage}/repeats/01_simple/{lineage}.bed |
BED file of simple repetitive sequences. Positions are 0-Based. |
04.Intermediate_files/03.References/{lineage}/repeats/01_simple/ |
See RepeatMasker output |
04.Intermediate_files/03.References/{lineage}/repeats/02_complex/{lineage}.bed |
BED file of complex repetitive sequences. Positions are 0-Based. |
04.Intermediate_files/03.References/{lineage}/repeats/02_complex/ |
See RepeatMasker output |
04.Intermediate_files/03.References/{lineage}/repeats/03_known/{lineage}.bed |
BED file of known repetitive sequences. Positions are 0-Based. |
04.Intermediate_files/03.References/{lineage}/repeats/03_known/ |
See RepeatMasker output |
04.Intermediate_files/03.References/{lineage}/repeats/04_unknown/{lineage}.bed |
BED file of unknown repetitive sequences. Positions are 0-Based. |
04.Intermediate_files/03.References/{lineage}/repeats/04_unknown/ |
See RepeatMasker output |
04.Intermediate_files/03.References/{lineage}/repeats/db_rmodeler/ |
Database created with RepeatModeler's BuildDatabase |
04.Intermediate_files/03.References/{lineage}/repeats/known.fa |
FASTA file of known families of repetitive sequences identified by RepeatModeler |
04.Intermediate_files/03.References/{lineage}/repeats/unknown.fa |
FASTA file of unknown families of repetitive sequences identified by RepeatModeler |
04.Intermediate_files/03.References/{lineage}/main_ref.fasta |
Symlink to original FASTA |
04.Intermediate_files/03.References/{lineage}/main_ref.fasta.fai |
FASTA index created by Liftoff |
04.Intermediate_files/03.References/{lineage}/main_ref.gff |
Symlink to fixed GFF |
04.Intermediate_files/03.References/{lineage}/main_ref.gff_db |
DB of GFF created by Liftoff |
04.Intermediate_files/03.References/{lineage}/liftoff.gff |
GFF from Liftoff before polishing |
04.Intermediate_files/03.References/{lineage}/{lineage}_annotated.gff |
GFF from Liftoff, polished |
04.Intermediate_files/03.References/{lineage}/unmapped_features.txt |
List of features not lifted over to the reference genome |
04.Intermediate_files/03.References/{lineage}/{lineage}_interg_introns.gff |
{lineage}_annotated.gff plus intergenic regions and introns |
04.Intermediate_files/03.References/{lineage}/{lineage}_intergenic.gff |
{lineage}_annotated.gff plus intergenic regions |
04.Intermediate_files/03.References/{lineage}/{lineage}_repeats.gff |
{lineage}_annotated.gff plus intergenic regions, introns, and fraction of repetitive sequences |
04.Intermediate_files/03.References/{lineage}/{lineage}_repeats.gff.tsv |
Tabular version of the previous file |
04.Intermediate_files/03.References/{lineage}/{lineage}.fasta |
Symlink to original FASTA |
04.Intermediate_files/03.References/{lineage}/{lineage}.fasta.fai |
FASTA index of the previous file. |
04.Intermediate_files/03.References/{lineage}/{lineage}.fasta.mmi |
Minimap2 index of the reference FASTA. |
04.Intermediate_files/03.References/{lineage}/{lineage}.cds.fa |
Nucleotide sequences of all transcripts in reference genome. |
04.Intermediate_files/03.References/{lineage}/{lineage}.cds.csv |
Tabular version of previous file. |
04.Intermediate_files/03.References/{lineage}/{lineage}.prots.fa |
Protein sequences of all isoforms in reference genome. |
04.Intermediate_files/03.References/{lineage}/{lineage}.prots.csv |
Tabular version of previous file. |
04.Intermediate_files/03.References/{lineage}/chromosomes.csv |
Table of chromosome names and lengths of the lineage. |
04.Intermediate_files/03.References/{lineage}/chromosome_lengths.csv |
Table of chromosome lengths of the lineage. |
04.Intermediate_files/03.References/all_refs_sequences.csv |
Concatenation of all CSV files of sequences from all references. |
04.Intermediate_files/03.References/chromosomes.csv |
Table of chromosome names and lengths of all the lineages. |
04.Intermediate_files/03.References/agat_config.yaml |
Config file for AGAT |
04.Intermediate_files/03.References/fake_repeats.fasta |
Fake database for RepeatMasker. Produced only if the use of a fake database was selected. |
04.Intermediate_files/03.References/main_ref_fixed_description.gff |
GFF with description tag instead of product tag |
04.Intermediate_files/03.References/main_ref_fixed_ID.gff |
GFF with fixed IDs |
04.Intermediate_files/03.References/main_ref_fixed_locus.gff |
GFF with locus tag added |
04.Intermediate_files/03.References/main_ref_fixed.tsv |
Table version of fixed_description GFF |
04.Intermediate_files/03.References/main_ref.gff |
Final fixed GFF with new IDs in the form <locus>-<level2 tag and number>-<level3 tag and number> |
04.Intermediate_files/03.References/main_ref.tsv |
TSV version of fixed GFF |
Always produced.
Path | Description |
---|---|
01.Samples/snippy/{sample}/snps.bam |
BAM file of the alignment of the sample's short reads to the corresponding reference genome. |
01.Samples/snippy/{sample}/snps.consensus.fa |
FASTA file of the reference genome with all variants instantiated. |
01.Samples/snippy/{sample}/snps.vcf |
Called variants in VCF format. Positions are 1-Based. |
01.Samples/snippy/{sample}/* |
Other files from the Snippy output. |
Always produced.
Path | Description |
---|---|
01.Samples/depth_quality/{sample}/mapping_stats.tsv |
Mapping quality and depth statistics. |
02.Dataset/depth_quality/mapping_stats.tsv |
Concatenation of mapping_stats.tsv files of all samples with a quality warning dependent on user-defined thresholds. |
02.Dataset/metadata.csv |
Metadata table with samples that survived the quality filter. |
02.Dataset/chromosomes.csv |
Table of chromosome names and lengths of the lineages that survived the quality filter. |
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/01.Samples/depth_quality/{sample}/depth_distribution.tsv |
Distribution of read depth of good quality mappings and all mappings. |
04.Intermediate_files/01.Samples/depth_quality/{sample}/depth_summary.tsv |
Mean and median depth of each chromosome and whole genome from good quality mappings and all mappings. |
04.Intermediate_files/01.Samples/depth_quality/{sample}/snps_good.bam |
Filtered BAM file with good quality mappings. |
04.Intermediate_files/01.Samples/depth_quality/{sample}/snps_good.bam.bai |
Index of previous file. |
04.Intermediate_files/01.Samples/filtered_samples/{sample}.txt |
Empty file for surviving samples after the quality filter. |
04.Intermediate_files/03.References/filtered_lineages/{lineage}.txt |
Empty file for surviving lineages after the sample filtering. |
04.Intermediate_files/01.Samples/mosdepth/{sample}/* |
See Mosdepth output. |
Path | Description |
---|---|
01.Samples/annotation/{sample}/annotation.gff |
Standardized GFF file of annotation by Liftoff. Positions are 1-Based. |
01.Samples/annotation/{sample}/annotation.gff.tsv |
Tabular version of the previous file. Positions are 1-Based. Column names that differ from the standard GFF format: accession ('seq_id'), feature_id ('ID'), gene_name ('Name'), gene_id ('locus'), old_feature_id (original ID before fixing), identical_to_main_ref ('matches_ref_protein'), and start_stop_mutation (union of the columns 'missing_start_codon', 'missing_stop_codon', and 'inframe_stop_codon'). |
01.Samples/annotation/{sample}/cds.fa |
Nucleotide sequences of all transcripts of the sample. |
01.Samples/annotation/{sample}/proteins.fa |
Protein sequences of all isoforms of the sample. |
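The column renaming described for `annotation.gff.tsv` can be summarized as a lookup table. This is only an illustrative sketch of the documented correspondence, not the pipeline's own code. The dictionary name, the helper, and the example record are hypothetical.

```python
# GFF name -> column name in the .gff.tsv tables, as documented above.
GFF_TO_TABLE = {
    "seq_id": "accession",
    "ID": "feature_id",
    "Name": "gene_name",
    "locus": "gene_id",
    "matches_ref_protein": "identical_to_main_ref",
}


def to_table_names(record):
    """Rename the keys of a GFF-style record; unknown keys are kept as-is."""
    return {GFF_TO_TABLE.get(key, key): value for key, value in record.items()}


# Made-up record for illustration only.
gff_record = {"seq_id": "chr1", "ID": "gene_0001", "Name": "ABC1", "start": 100}
print(to_table_names(gff_record))
```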
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/01.Samples/annotation/{sample}/liftoff/ |
See Liftoff output |
04.Intermediate_files/01.Samples/annotation/{sample}/intergenic.gff |
Polished GFF annotated by Liftoff with added intergenic regions. |
04.Intermediate_files/01.Samples/annotation/{sample}/interg_introns.gff |
Previous file with added introns. |
04.Intermediate_files/01.Samples/annotation/{sample}/annotation.gff.tsv |
Tabular version of previous file. |
04.Intermediate_files/01.Samples/annotation/{sample}/cds.csv |
Tabular version of corresponding FASTA file. Generated only if the database is produced. |
04.Intermediate_files/01.Samples/annotation/{sample}/proteins.csv |
Tabular version of corresponding FASTA file. Generated only if the database is produced. |
04.Intermediate_files/02.Dataset/sequences.csv |
Concatenation of all cds.csv and proteins.csv files. Generated only if the database is produced. |
These files are produced if you activate this module or the database module.
Path | Description |
---|---|
01.Samples/depth_quality/{sample}/mapq_depth_by_feature.tsv |
MAPQ and mean depth of the windows in each feature. |
02.Dataset/depth_quality/mapq_depth_by_feature.tsv |
Concatenation of all mapq_depth_by_feature.tsv files. |
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/01.Samples/depth_quality/{sample}/mapq_depth_by_window.bed |
MAPQ and mean depth of each window. Positions are 0-Based. Columns are: accession, start, end, mean MAPQ, and mean depth. |
04.Intermediate_files/01.Samples/depth_quality/{sample}/mapq.bed |
Mean MAPQ of each position. Positions are 0-Based. Columns are: accession, start, end, and mean MAPQ. |
04.Intermediate_files/01.Samples/depth_quality/{sample}/mapq_by_window.bed |
Mean MAPQ of each window. Positions are 0-Based. Columns are: accession, start, end, mean MAPQ. |
These files are produced if you activate this module or the database module.
Path | Description |
---|---|
01.Samples/cnv/{sample}/cnv_calls.tsv |
Table of deleted and duplicated regions in each sample and their overlap with repetitive sequences and genes. Columns are accession, start, end, cnv (deletion or duplication), region_size, depth (median of the mean depth of the windows in the CNV region), norm_depth (median of the normalized mean depth of the windows in the CNV region), smooth_depth (median of the smoothed normalized mean depth of the windows in the CNV region), repeat_fraction (overlap_bp/region_size), overlap_bp (sum of base pairs of all windows in the region that overlap with repetitive sequences), and feature_id (comma-separated list of the gene_ids that overlap with the region totally or partially). Positions are 1-Based. |
01.Samples/cnv/{sample}/cnv_chromosomes.tsv |
Summary metrics per chromosome of the features described in the previous table. |
02.Dataset/cnv/cnv_calls.tsv |
Concatenation of all cnv_calls.tsv files. |
02.Dataset/cnv/cnv_chromosomes.tsv |
Concatenation of all cnv_chromosomes.tsv files. |
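As a rough illustration of how the `cnv_calls.tsv` columns relate, the sketch below recomputes `repeat_fraction` as `overlap_bp/region_size`, splits `feature_id` into a list, and drops calls that are mostly repetitive. It assumes the file is tab-separated with a header using the documented column names. The rows (a subset of the columns), the function name, and the 0.5 threshold are made up for the example.

```python
import csv
import io

# Made-up rows with a subset of the documented columns.
SAMPLE_TSV = (
    "accession\tstart\tend\tcnv\tregion_size\toverlap_bp\tfeature_id\n"
    "chr1\t1001\t3000\tdeletion\t2000\t500\tgeneA,geneB\n"
    "chr2\t501\t1500\tduplication\t1000\t900\t\n"
)


def load_cnv_calls(handle, max_repeat_fraction=0.5):
    """Read CNV calls and keep those at or below a repeat-fraction threshold.

    The threshold is a hypothetical filter, not a WeavePop parameter.
    """
    calls = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # repeat_fraction = overlap_bp / region_size, as documented.
        fraction = int(row["overlap_bp"]) / int(row["region_size"])
        if fraction > max_repeat_fraction:
            continue
        row["repeat_fraction"] = fraction
        # feature_id is a comma-separated list and may be empty.
        row["genes"] = [gene for gene in row["feature_id"].split(",") if gene]
        calls.append(row)
    return calls


calls = load_cnv_calls(io.StringIO(SAMPLE_TSV))
for call in calls:
    print(call["accession"], call["cnv"], call["repeat_fraction"], call["genes"])
```

To read a real file, replace the `io.StringIO` object with an open file handle.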
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/01.Samples/depth_quality/{sample}/depth_by_windows.tsv |
Mean depth of each window. Positions are 0-Based. Columns are: accession, start, end, mean depth, normalized mean depth, and smoothed normalized mean depth. |
These files are produced if you activate this module or the database module.
Path | Description |
---|---|
02.Dataset/snpeff/effects.tsv |
Concatenation of the effect tables of all lineages. |
02.Dataset/snpeff/lofs.tsv |
Concatenation of the loss of function tables of all lineages. |
02.Dataset/snpeff/nmds.tsv |
Concatenation of the nonsense-mediated decay tables of all lineages. |
02.Dataset/snpeff/presence.tsv |
Concatenation of the variant presence tables of all lineages. |
02.Dataset/snpeff/variants.tsv |
Concatenation of the variant description tables of all lineages. Positions are 1-Based. |
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/02.Dataset/snpeff/{lineage}_snpeff.vcf |
Version of the {lineage}_intersection.vcf annotated by SnpEff. Positions are 1-Based. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_intersection.vcf |
Modified VCF file with the description of all the possible variants in the lineage. The MAT field in INFO is a matrix showing the presence/absence of the variant in the samples of the lineage. Positions are 1-Based. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_variants.tsv |
Tabular version of the {lineage}_intersection.vcf. Positions are 1-Based. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_effects.tsv |
Table with the effects of the possible variants of the lineage. Identified against the annotation of the reference genome of the lineage. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_lofs.tsv |
Loss of function output table of SnpEff. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_nmds.tsv |
Nonsense-mediated decay output table of SnpEff. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_presence.tsv |
Table with the variant IDs and the samples they are present in. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_snpeff.genes.txt |
See SnpEff output. |
04.Intermediate_files/02.Dataset/snpeff/{lineage}_snpeff.html |
See SnpEff output. |
04.Intermediate_files/03.References/snpeff_data/Species_name_{lineage}/ |
Directory with the annotation database created by SnpEff build for each lineage. |
These plots are produced if you activate this module. The ones about depth distribution are made by default; the rest are produced only if the CNV module is also executed.
Path | Description |
---|---|
01.Samples/plots/{sample}/depth_chrom_distribution.png |
Depth distribution by chromosome. |
01.Samples/plots/{sample}/depth_global_distribution.png |
Genome-wide depth distribution. |
01.Samples/plots/{sample}/depth_boxplot.png |
Plot of the distribution of the raw, normalized, and normalized-smoothed depth of the windows along the chromosomes. |
01.Samples/plots/{sample}/depth_by_windows.png |
Plot of normalized depth of windows along each chromosome, with specified genetic features, called CNVs, and repetitive sequences of the corresponding reference. |
01.Samples/plots/{sample}/mapq.png |
Plot of MAPQ of windows along each chromosome, with specified genetic features, called CNVs, and repetitive sequences of the corresponding reference. |
01.Samples/plots/{sample}/depth_vs_cnvs.png |
Plot of relationship between normalized depth of each chromosome and the percentage of it covered by called deletions and duplications. |
02.Dataset/plots/dataset_depth_by_chrom.png |
Normalized mean depth of each chromosome in the samples that survived the quality filter. |
02.Dataset/plots/dataset_summary.png |
Genome-wide depth and mapping quality metrics of the samples that survived the quality filter. |
Intermediate files
Path | Description |
---|---|
04.Intermediate_files/03.References/loci_to_plot.tsv |
Positions are 1-Based. |
Activating this module will automatically run everything (except for the plots) and join the results (files marked in bold above) into the file 02.Dataset/database.db.
This is an SQL Database created with DuckDB. DuckDB does not require primary keys to be declared. In the schema below, when there is one variable in bold, it is a unique variable, and when there are more, their combination is unique.
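Because the uniqueness of these variables is a convention rather than a declared constraint, it can be useful to verify it on exported rows. The following is a minimal sketch of such a check in plain Python. The function name, column names, and rows are hypothetical and do not come from the actual schema.

```python
from collections import Counter


def duplicated_keys(rows, key_columns):
    """Return the key combinations that occur more than once."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


# Made-up rows: the pair (sample, accession) is meant to be unique.
rows = [
    {"sample": "s1", "accession": "chr1", "value": 10},
    {"sample": "s1", "accession": "chr2", "value": 12},
    {"sample": "s2", "accession": "chr1", "value": 9},
]
print(duplicated_keys(rows, ["sample", "accession"]))  # the combination is unique
print(duplicated_keys(rows, ["sample"]))  # one column alone is not
```

An empty list means the chosen combination of columns identifies each row.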
