Output - magwenelab/WeavePop GitHub Wiki

Below is a description of each module's output. The paths are relative to <project_directory>/results_<run_id> as specified in the config file (results/ by default).
If you want to create all the output files, you only need to activate the database and the plotting modules. If you don't want to create the database, you can activate only the intermediate modules you want. The files in bold are the ones that are integrated into the final database.

Processing of reference genomes

The files in this module are produced as needed, according to the activation of the subsequent modules.
In the config file, you can activate the annotation of the reference genomes by a main reference. If you activate it, all reference genomes will be annotated.

Path Description
03.References/{lineage}/{lineage}_repeats.bed BED file of regions with repetitive sequences identified by RepeatMasker. Each region is the intersection of different types of repetitive sequences identified. Columns are Accession, Start, End, Types (comma-separated list of types in the region). Positions are 0-Based.
03.References/{lineage}/{lineage}.gff Standardized GFF file of the reference genome with added introns, intergenic regions, and repetitive sequences. If the reference annotation was activated, it is the processed result of Liftoff annotation using the main reference. Positions are 1-Based.
03.References/{lineage}/{lineage}.gff.tsv Tabular version of the previous file. Positions are 1-Based. Column names that differ from the standard GFF format: accession ('seq_id'), feature_id ('ID'), gene_name ('Name'), gene_id ('locus'), old_feature_id (original ID before fixing), lineage, identical_to_main_ref ('matches_ref_protein', added by Liftoff if used), and start_stop_mutation (union of the columns 'missing_start_codon', 'missing_stop_codon', and 'inframe_stop_codon', added by Liftoff if used).
03.References/all_lineages.gff.tsv Concatenation of the previous table of all lineages. Positions are 1-Based.
03.References/refs_unmapped_features.tsv Table of the genes that Liftoff could not map from the main reference to each reference genome. Alongside the gene information, there is one column per reference genome with the value unmapped or blank.
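Because the BED files are 0-Based and the GFF files are 1-Based, coordinates need converting before the two are compared. The following is a minimal sketch, assuming the repeats BED file follows the standard half-open BED convention and has the four tab-separated columns listed above; the function name and the example record are hypothetical.

```python
def parse_repeats_bed_line(line):
    """Parse one line of a 4-column repeats BED file into 1-based coordinates."""
    accession, start, end, types = line.rstrip("\n").split("\t")
    return {
        "accession": accession,
        # BED starts are 0-based, so adding 1 gives the 1-based start used in GFF.
        "start": int(start) + 1,
        # Assumption: BED intervals are half-open, so the end value is unchanged
        # when expressed as a 1-based closed interval.
        "end": int(end),
        # The Types column is a comma-separated list of repeat types.
        "types": types.split(","),
    }


# Hypothetical record, for illustration only.
record = parse_repeats_bed_line("chr1\t99\t250\tSimple_repeat,LTR\n")
print(record)
```

With this conversion, the region can be compared directly against the start and end columns of {lineage}.gff.tsv.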
Intermediate files
Path Description
04.Intermediate_files/03.References/{lineage}/intermediate_liftoff/ See Liftoff output
04.Intermediate_files/03.References/{lineage}/repeats/01_simple/{lineage}.bed BED file of simple repetitive sequences. Positions are 0-Based.
04.Intermediate_files/03.References/{lineage}/repeats/01_simple/ See RepeatMasker output
04.Intermediate_files/03.References/{lineage}/repeats/02_complex/{lineage}.bed BED file of complex repetitive sequences. Positions are 0-Based.
04.Intermediate_files/03.References/{lineage}/repeats/02_complex/ See RepeatMasker output
04.Intermediate_files/03.References/{lineage}/repeats/03_known/{lineage}.bed BED file of known repetitive sequences. Positions are 0-Based.
04.Intermediate_files/03.References/{lineage}/repeats/03_known/ See RepeatMasker output
04.Intermediate_files/03.References/{lineage}/repeats/04_unknown/{lineage}.bed BED file of unknown repetitive sequences. Positions are 0-Based.
04.Intermediate_files/03.References/{lineage}/repeats/04_unknown/ See RepeatMasker output
04.Intermediate_files/03.References/{lineage}/repeats/db_rmodeler/ Database created with RepeatModeler's BuildDatabase
04.Intermediate_files/03.References/{lineage}/repeats/known.fa FASTA file of known families of repetitive sequences identified by RepeatModeler
04.Intermediate_files/03.References/{lineage}/repeats/unknown.fa FASTA file of unknown families of repetitive sequences identified by RepeatModeler
04.Intermediate_files/03.References/{lineage}/main_ref.fasta Symlink to original FASTA
04.Intermediate_files/03.References/{lineage}/main_ref.fasta.fai FASTA index created by Liftoff
04.Intermediate_files/03.References/{lineage}/main_ref.gff Symlink to fixed GFF
04.Intermediate_files/03.References/{lineage}/main_ref.gff_db DB of GFF created by Liftoff
04.Intermediate_files/03.References/{lineage}/liftoff.gff GFF from Liftoff before polishing
04.Intermediate_files/03.References/{lineage}/{lineage}_annotated.gff GFF from Liftoff, polished
04.Intermediate_files/03.References/{lineage}/unmapped_features.txt List of features not lifted over to the reference genome
04.Intermediate_files/03.References/{lineage}/{lineage}_interg_introns.gff {lineage}_annotated.gff plus intergenic regions and introns
04.Intermediate_files/03.References/{lineage}/{lineage}_intergenic.gff {lineage}_annotated.gff plus intergenic regions
04.Intermediate_files/03.References/{lineage}/{lineage}_repeats.gff {lineage}_annotated.gff plus intergenic regions, introns, and fraction of repetitive sequences
04.Intermediate_files/03.References/{lineage}/{lineage}_repeats.gff.tsv Tabular version of the previous file
04.Intermediate_files/03.References/{lineage}/{lineage}.fasta Symlink to original FASTA
04.Intermediate_files/03.References/{lineage}/{lineage}.fasta.fai FASTA index of the previous file.
04.Intermediate_files/03.References/{lineage}/{lineage}.fasta.mmi Minimap2 index of the FASTA file.
04.Intermediate_files/03.References/{lineage}/{lineage}.cds.fa Nucleotide sequences of all transcripts in reference genome.
04.Intermediate_files/03.References/{lineage}/{lineage}.cds.csv Tabular version of previous file.
04.Intermediate_files/03.References/{lineage}/{lineage}.prots.fa Protein sequences of all isoforms in reference genome.
04.Intermediate_files/03.References/{lineage}/{lineage}.prots.csv Tabular version of previous file.
04.Intermediate_files/03.References/{lineage}/chromosomes.csv Table of chromosome names and lengths of the lineage.
04.Intermediate_files/03.References/{lineage}/chromosome_lengths.csv Table of chromosome lengths of the lineage.
04.Intermediate_files/03.References/all_refs_sequences.csv Concatenation of all CSV files of sequences from all references.
04.Intermediate_files/03.References/chromosomes.csv Table of chromosome names and lengths of all the lineages.
04.Intermediate_files/03.References/agat_config.yaml Config file for AGAT
04.Intermediate_files/03.References/fake_repeats.fasta Fake database for RepeatMasker. Created only if the use of a fake database is selected.
04.Intermediate_files/03.References/main_ref_fixed_description.gff GFF with description tag instead of product tag
04.Intermediate_files/03.References/main_ref_fixed_ID.gff GFF with fixed IDs
04.Intermediate_files/03.References/main_ref_fixed_locus.gff GFF with locus tag added
04.Intermediate_files/03.References/main_ref_fixed.tsv Table version of fixed_description GFF
04.Intermediate_files/03.References/main_ref.gff Final fixed GFF with new IDs in the shape of <locus>-<level2 tag and number>-<level3 tag and number>
04.Intermediate_files/03.References/main_ref.tsv TSV version of fixed GFF

Snippy

Always produced.

Path Description
01.Samples/snippy/{sample}/snps.bam BAM file of alignment between short reads of the sample with the corresponding reference genome.
01.Samples/snippy/{sample}/snps.consensus.fa FASTA file of the reference genome with all variants instantiated.
01.Samples/snippy/{sample}/snps.vcf Called variants in VCF format. Positions are 1-Based.
01.Samples/snippy/{sample}/* Other files from the Snippy output.
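As a quick check of a sample's snps.vcf, the variant records can be counted per chromosome with the standard library alone. This is a minimal sketch that relies only on the generic VCF layout (header lines start with `#`, and the first tab-separated column is the chromosome); the function name and the example records are hypothetical.

```python
import io
from collections import Counter


def count_variants_per_chromosome(handle):
    """Count VCF data records per chromosome, skipping header lines."""
    counts = Counter()
    for line in handle:
        # Meta-information and column-header lines start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts


# Hypothetical VCF content, for illustration only.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\t.\t.\n"
    "chr1\t250\t.\tC\tT\t60\t.\t.\n"
    "chr2\t75\t.\tG\tGA\t40\t.\t.\n"
)
variant_counts = count_variants_per_chromosome(example_vcf)
print(dict(variant_counts))
```

For a real run, the `io.StringIO` object would be replaced by an open file handle for 01.Samples/snippy/{sample}/snps.vcf.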

Depth and quality

Always produced.

Path Description
01.Samples/depth_quality/{sample}/mapping_stats.tsv Mapping quality and depth statistics.
02.Dataset/depth_quality/mapping_stats.tsv Concatenation of mapping_stats.tsv files of all samples with a quality warning dependent on user-defined thresholds.
02.Dataset/metadata.csv Metadata table with samples that survived the quality filter.
02.Dataset/chromosomes.csv Table of chromosome names and lengths of the lineages that survived the quality filter.
Intermediate files
Path Description
04.Intermediate_files/01.Samples/depth_quality/{sample}/depth_distribution.tsv Distribution of read depth of good quality mappings and all mappings.
04.Intermediate_files/01.Samples/depth_quality/{sample}/depth_summary.tsv Mean and median depth of each chromosome and whole genome from good quality mappings and all mappings.
04.Intermediate_files/01.Samples/depth_quality/{sample}/snps_good.bam Filtered BAM file with good quality mappings.
04.Intermediate_files/01.Samples/depth_quality/{sample}/snps_good.bam.bai Index of previous file.
04.Intermediate_files/01.Samples/filtered_samples/{sample}.txt Empty file for surviving samples after the quality filter.
04.Intermediate_files/03.References/filtered_lineages/{lineage}.txt Empty file for surviving lineages after the sample filtering.
04.Intermediate_files/01.Samples/mosdepth/{sample}/* See Mosdepth output.

Annotation

Path Description
01.Samples/annotation/{sample}/annotation.gff Standardized GFF file of annotation by Liftoff. Positions are 1-Based.
01.Samples/annotation/{sample}/annotation.gff.tsv Tabular version of the previous file. Positions are 1-Based. Column names that differ from the standard GFF format: accession ('seq_id'), feature_id ('ID'), gene_name ('Name'), gene_id ('locus'), old_feature_id (original ID before fixing), identical_to_main_ref ('matches_ref_protein'), and start_stop_mutation (union of the columns 'missing_start_codon', 'missing_stop_codon', and 'inframe_stop_codon').
01.Samples/annotation/{sample}/cds.fa Nucleotide sequences of all transcripts of the sample.
01.Samples/annotation/{sample}/proteins.fa Protein sequences of all isoforms of the sample.
Intermediate files
Path Description
04.Intermediate_files/01.Samples/annotation/{sample}/liftoff/ See Liftoff output
04.Intermediate_files/01.Samples/annotation/{sample}/intergenic.gff Polished GFF annotated by Liftoff with added intergenic regions.
04.Intermediate_files/01.Samples/annotation/{sample}/interg_introns.gff Previous file with added introns.
04.Intermediate_files/01.Samples/annotation/{sample}/annotation.gff.tsv Tabular version of previous file.
04.Intermediate_files/01.Samples/annotation/{sample}/cds.csv Tabular version of corresponding FASTA file. Generated only if the database is produced.
04.Intermediate_files/01.Samples/annotation/{sample}/proteins.csv Tabular version of corresponding FASTA file. Generated only if the database is produced.
04.Intermediate_files/02.Dataset/sequences.csv Concatenation of all cds.csv and proteins.csv files. Generated only if the database is produced.

Depth and quality of genetic features

These files are produced if you activate this module or the database module.

Path Description
01.Samples/depth_quality/{sample}/mapq_depth_by_feature.tsv MAPQ and mean depth of the windows in each feature.
02.Dataset/depth_quality/mapq_depth_by_feature.tsv Concatenation of all mapq_depth_by_feature.tsv files.
Intermediate files
Path Description
04.Intermediate_files/01.Samples/depth_quality/{sample}/mapq_depth_by_window.bed MAPQ and mean depth of each window. Positions are 0-Based. Columns are: accession, start, end, mean MAPQ, and mean depth.
04.Intermediate_files/01.Samples/depth_quality/{sample}/mapq.bed Mean MAPQ of each position. Positions are 0-Based. Columns are: accession, start, end, and mean MAPQ.
04.Intermediate_files/01.Samples/depth_quality/{sample}/mapq_by_window.bed Mean MAPQ of each window. Positions are 0-Based. Columns are: accession, start, end, mean MAPQ.

CNV calling

These files are produced if you activate this module or the database module.

Path Description
01.Samples/cnv/{sample}/cnv_calls.tsv Table of deleted and duplicated regions in each sample and their overlap with repetitive sequences and genes. Columns are accession, start, end, cnv (deletion or duplication), region_size, depth (median of the mean depth of the windows in the CNV region), norm_depth (median of the normalized mean depth of the windows in the CNV region), smooth_depth (median of the smoothed normalized mean depth of the windows in the CNV region), repeat_fraction (overlap_bp/region_size), overlap_bp (sum of base pairs of all windows in the region that overlap with repetitive sequences), and feature_id (comma-separated list of gene_ids that overlap with the region totally or partially). Positions are 1-Based.
01.Samples/cnv/{sample}/cnv_chromosomes.tsv Summary metrics per chromosome of the features described in the previous table.
02.Dataset/cnv/cnv_calls.tsv Concatenation of all cnv_calls.tsv files.
02.Dataset/cnv/cnv_chromosomes.tsv Concatenation of all cnv_chromosomes.tsv files.
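The relationship between the columns of cnv_calls.tsv can be illustrated by recomputing repeat_fraction from overlap_bp and region_size and by splitting feature_id into a list. This is a minimal sketch, assuming the file is tab-separated with a header row that uses the column names listed above; the function name and the example rows are hypothetical, and only a subset of the columns is shown.

```python
import csv
import io


def load_cnv_calls(handle):
    """Read CNV calls and derive repeat_fraction and a list of gene IDs."""
    calls = []
    for row in csv.DictReader(handle, delimiter="\t"):
        region_size = int(row["region_size"])
        overlap_bp = int(row["overlap_bp"])
        calls.append({
            "accession": row["accession"],
            "cnv": row["cnv"],
            "region_size": region_size,
            # repeat_fraction is described as overlap_bp / region_size.
            "repeat_fraction": overlap_bp / region_size if region_size else 0.0,
            # feature_id is a comma-separated list; an empty field gives an empty list.
            "gene_ids": [g for g in row["feature_id"].split(",") if g],
        })
    return calls


# Hypothetical rows, for illustration only.
example_tsv = io.StringIO(
    "accession\tstart\tend\tcnv\tregion_size\toverlap_bp\tfeature_id\n"
    "chr1\t1001\t3000\tduplication\t2000\t500\tgeneA,geneB\n"
    "chr2\t5001\t6000\tdeletion\t1000\t0\t\n"
)
calls = load_cnv_calls(example_tsv)
for call in calls:
    print(call)
```

A filter on the derived repeat_fraction is one way to set aside calls that mostly overlap repetitive sequences before interpreting them.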
Intermediate files
Path Description
04.Intermediate_files/01.Samples/depth_quality/{sample}/depth_by_windows.tsv Mean depth of each window. Positions are 0-Based. Columns are: accession, start, end, mean depth, normalized mean depth, and smoothed normalized mean depth.

Annotation of SNP effects

These files are produced if you activate this module or the database module.

Path Description
02.Dataset/snpeff/effects.tsv Concatenation of the effect tables of all lineages.
02.Dataset/snpeff/lofs.tsv Concatenation of the loss of function tables of all lineages.
02.Dataset/snpeff/nmds.tsv Concatenation of the nonsense-mediated decay tables of all lineages.
02.Dataset/snpeff/presence.tsv Concatenation of the variant presence tables of all lineages.
02.Dataset/snpeff/variants.tsv Concatenation of the variant description tables of all lineages. Positions are 1-Based.
Intermediate files
Path Description
04.Intermediate_files/02.Dataset/snpeff/{lineage}_snpeff.vcf Version of the {lineage}_intersection.vcf annotated by SnpEff. Positions are 1-Based.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_intersection.vcf Modified VCF file with the description of all the possible variants in the lineage. The MAT field in INFO is a matrix showing the presence/absence of the variant in the samples of the lineage. Positions are 1-Based.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_variants.tsv Tabular version of the {lineage}_intersection.vcf. Positions are 1-Based.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_effects.tsv Table with the effects of the possible variants of the lineage. Identified against the annotation of the reference genome of the lineage.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_lofs.tsv Loss of function output table of SnpEff.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_nmds.tsv Nonsense-mediated decay output table of SnpEff.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_presence.tsv Table with the variant IDs and the samples they are present in.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_snpeff.genes.txt See SnpEff output.
04.Intermediate_files/02.Dataset/snpeff/{lineage}_snpeff.html See SnpEff output.
04.Intermediate_files/03.References/snpeff_data/Species_name_{lineage}/ Directory with the annotation database created by SnpEff build for each lineage.

Plotting

These plots are produced if you activate this module. The depth distribution plots are made by default; the rest are produced only if the CNV module is also executed.

Path Description
01.Samples/plots/{sample}/depth_chrom_distribution.png Depth distribution by chromosome.
01.Samples/plots/{sample}/depth_global_distribution.png Genome-wide depth distribution.
01.Samples/plots/{sample}/depth_boxplot.png Plot of the distribution of the raw, normalized, and normalized-smoothed depth of the windows along the chromosomes.
01.Samples/plots/{sample}/depth_by_windows.png Plot of normalized depth of windows along each chromosome, with specified genetic features, called CNVs, and repetitive sequences of the corresponding reference.
01.Samples/plots/{sample}/mapq.png Plot of MAPQ of windows along each chromosome, with specified genetic features, called CNVs, and repetitive sequences of the corresponding reference.
01.Samples/plots/{sample}/depth_vs_cnvs.png Plot of relationship between normalized depth of each chromosome and the percentage of it covered by called deletions and duplications.
02.Dataset/plots/dataset_depth_by_chrom.png Normalized mean depth of each chromosome in the samples that survived the quality filter.
02.Dataset/plots/dataset_summary.png Genome-wide depth and mapping quality metrics of the samples that survived the quality filter.
Intermediate files
Path Description
04.Intermediate_files/03.References/loci_to_plot.tsv Positions are 1-Based.

Database

Activating this module will automatically run everything (except for the plots) and join the results (files marked in bold above) into the file 02.Dataset/database.db. This is an SQL Database created with DuckDB. DuckDB does not require primary keys to be declared. In the schema below, when there is one variable in bold, it is a unique variable, and when there are more, their combination is unique.
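Since DuckDB does not enforce the uniqueness described here, it can be useful to inspect the database and check a key combination directly. This is a minimal sketch using the `duckdb` Python package (a third-party dependency, assumed to be installed); the function names are hypothetical, and the table and column names must be taken from the actual schema.

```python
def list_tables(db_path):
    """Return the names of the tables in a DuckDB database file."""
    import duckdb  # third-party package, assumed to be installed

    con = duckdb.connect(db_path, read_only=True)
    try:
        return [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    finally:
        con.close()


def count_duplicate_keys(db_path, table, key_columns):
    """Count key combinations that appear more than once in a table."""
    import duckdb  # third-party package, assumed to be installed

    # Quote identifiers so names with special characters are handled.
    cols = ", ".join(f'"{col}"' for col in key_columns)
    query = (
        f'SELECT COUNT(*) FROM (SELECT {cols} FROM "{table}" '
        f"GROUP BY {cols} HAVING COUNT(*) > 1) AS dup"
    )
    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(query).fetchone()[0]
    finally:
        con.close()
```

For example, `list_tables("02.Dataset/database.db")` lists the available tables, and a result of 0 from `count_duplicate_keys` for a given table and key columns indicates that the combination is unique.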

Schema of database