Running the pipeline - cnr-ibba/nf-resequencing-mem GitHub Wiki

Running nf-resequencing-mem

Creating a snpEff database

This pipeline lets you annotate the final VCF file with snpEff by providing a snpEff database. A supported snpEff database (one configured in snpEff.config, like "Saccharomyces_cerevisiae") can be passed to the pipeline with the --snpeff_database parameter: this database will be downloaded and used to annotate the final VCF file.

It is also possible to create a custom snpEff database and provide it to the pipeline, in addition to the --snpeff_database parameter, through the --snpeff_config and --snpeff_cachedir parameters, which define the path of the custom snpEff.config file and the directory where the custom database is located, respectively, as in the following example:

{
    "snpeff_config": "path/to/snpEff.config",
    "snpeff_cachedir": "path/to/snpEff/data",
    "snpeff_database": "your custom database"
}
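With those parameters saved to a JSON file, the pipeline can be launched using Nextflow's standard -params-file option. The profile and the samplesheet/genome paths below are illustrative placeholders for your own setup, not values from this wiki:

```shell
# Hypothetical launch: snpEff parameters above saved as snpeff_params.json.
# -params-file is a standard Nextflow option; profile, samplesheet and
# genome paths are placeholders.
nextflow run cnr-ibba/nf-resequencing-mem \
    -profile docker \
    -params-file snpeff_params.json \
    --input samplesheet.csv \
    --genome_fasta genome.fasta \
    --outdir results
```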

The snpEff documentation describes how to build a custom database, with various examples. Here are some tips for building one.

Update snpEff.config

First of all, you need to add your custom database to the snpEff.config file. You can download a copy of the snpEff.config file from the GitHub repository:

wget https://raw.githubusercontent.com/pcingola/SnpEff/refs/heads/master/config/snpEff.config

Next, simply add a new entry for your custom database to the snpEff.config file:

<your database>.genome : <scientific name or common name>

The first part is the name of the database, which is what you pass to the --snpeff_database parameter of this pipeline. The second part is the scientific or common name of the organism, or any other description you want to give the database. Next, you need to create a subdirectory in snpEff's data directory named after your custom database. For example, for a custom database named my_database, create a directory like this:

mkdir -p data/my_database

In this example, the data folder is relative to the path of the snpEff.config file. The data folder (not the folder of the custom database) is what you pass to the --snpeff_cachedir parameter of this pipeline, while the path of the custom snpEff.config file goes to the --snpeff_config parameter.
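To summarize, assuming a custom database named my_database (a hypothetical name), the resulting layout and the pipeline parameter each path maps to would look like this:

```shell
# Illustrative layout for a custom snpEff database named "my_database",
# with the pipeline parameter each path should be passed to:
#
# /path/to/snpEff.config        -> --snpeff_config
# /path/to/data/                -> --snpeff_cachedir
# /path/to/data/my_database/    -> --snpeff_database my_database
```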

Getting data from NCBI genome

In this example we will download data from NCBI Genome using the NCBI datasets CLI tool. Open the NCBI Genome portal and search for your organism of interest by scientific or common name. Click on the genome assembly version of interest and then on the datasets tab: a modal will appear with the NCBI datasets CLI command to download the genome data. Copy the command and paste it into your terminal, for example:

datasets download genome accession <accession_id> --include gff3,rna,cds,protein,genome,seq-report

Where <accession_id> is the accession ID of the genome assembly version of interest. This command downloads the genome data into the ncbi_dataset.zip file in your current directory. Unzip the archive and enter the directory containing your data:

unzip ncbi_dataset.zip
cd ncbi_dataset/data/<accession_id>

There should be a folder named after the accession ID used in the datasets download inside the ncbi_dataset/data directory. This folder contains all the data you need to build a snpEff database. Next, enter the snpEff data directory in which you want to build your custom database, the same folder you specified in the snpEff.config file, and create symbolic links for your sequence, GFF, CDS and protein files in the data/<my_database> directory. It's better to use the default names expected by snpEff, for example:

ln -s /path/to/ncbi_dataset/data/<accession_id>/<accession_id>_<assembly_version>_genomic.fna sequences.fa
ln -s /path/to/ncbi_dataset/data/<accession_id>/protein.faa protein.fa
ln -s /path/to/ncbi_dataset/data/<accession_id>/rna.fna cds.fa

The genes.gff file should use the same ID and Parent attributes found in the cds.fa and protein.fa files. If this is not the case, you can use sed to fix the ID and Parent fields in the genomic.gff file, for example:

sed 's/\(rna-\|exon-\|cds-\)//g' /path/to/ncbi_dataset/data/<accession_id>/genomic.gff > genes.gff
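The effect of that sed expression can be checked on a single sample attribute string (the IDs below are made up for illustration):

```shell
# Demonstrate the rewrite: NCBI GFF files prefix ID/Parent values with
# "rna-", "exon-" or "cds-"; the sed expression strips those prefixes so
# the values match the FASTA headers in cds.fa and protein.fa.
echo 'ID=cds-XP_012345.1;Parent=rna-XM_012345.1' \
    | sed 's/\(rna-\|exon-\|cds-\)//g'
# prints: ID=XP_012345.1;Parent=XM_012345.1
```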

Next, build the snpEff database using the build command, run from the directory where the custom snpEff.config file and the data folder are located:

java -jar snpEff.jar build -gff3 -v -c snpEff.config <my_database>

You can also add the -noCheckCds flag to skip checking the CDS sequences if the build process fails; however, if the IDs are consistent across the GFF file, the build should succeed. The -v flag prints verbose output, while -gff3 and -c specify the format of the annotation file and the name of the custom config file, respectively. The last argument is the name of the database you want to build.
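Once the build succeeds, you can sanity-check the database outside the pipeline by annotating a small VCF by hand (test.vcf is a placeholder for any small VCF with coordinates matching your genome):

```shell
# Manual smoke test of the freshly built database: snpEff's default
# command annotates a VCF given the custom config and the genome name.
java -jar snpEff.jar -c snpEff.config -v <my_database> test.vcf > test.ann.vcf
```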

Normalizing VCF files

VCF normalization refers to the process of simplifying and standardizing complex alleles represented in a VCF file. This involves reducing intricate alleles into simpler forms using pairwise alignment techniques, specifically with tools like vcfwave, which utilizes the fast bi-wavefront aligner (WFA).

The main goals of VCF normalization are to:

  • Break down complex overlapping and nested variants into more straightforward representations.
  • Improve the accuracy of variant calls by eliminating false positives.
  • Enhance the clarity and usability of the output VCF records.

Consider the following example of a VCF file before normalization:

I	140142	.	A	G	114.131	.	AB=0.6;...;TYPE=snp
I	140154	.	GGAAAAGATC	AGAGAAAACT	50.5592	.	AB=0.333333;...;TYPE=complex
I	140169	.	C	T	85.3288	.	AB=0.333333;...;TYPE=snp

After normalization, the complex allele in the second record is broken down into separate records:

I	140142	.	A	G	114.131	.	AB=0.6;...;TYPE=snp;
I	140154	._1	G	A	51	.	AB=0.333333;...;TYPE=snp;
I	140157	._2	A	G	51	.	AB=0.333333;...;TYPE=snp;
I	140160	._3	G	A	51	.	AB=0.333333;...;TYPE=snp;
I	140162	._4	TC	CT	51	.	AB=0.333333;...;TYPE=mnp;
I	140169	.	C	T	85	.	AB=0.333333;...;TYPE=snp;

The normalization workflow in nf-resequencing-mem is represented as follows:

Normalization Workflow

First, vcfwave normalizes the VCF file (replacing the vcfallelicprimitives tool used in previous versions of freebayes). The output of this step is then sorted with bcftools to ensure that the VCF file is in the correct order for downstream processing. Another normalization step, performed by bcftools, ensures that alternate and reference alleles are consistent with the reference genome. Finally, the VCF is enriched with additional information using the bcftools fill-tags plugin. nf-resequencing-mem runs all these steps per chromosome in parallel and then joins the results into a single file, which is the final output of this pipeline.
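The steps above can be sketched as standalone commands. These are illustrative approximations of the workflow, not the pipeline's exact invocations or options:

```shell
# 1) decompose complex alleles with vcfwave (from vcflib)
vcfwave input.vcf.gz > decomposed.vcf
# 2) restore coordinate order after decomposition
bcftools sort decomposed.vcf -Oz -o sorted.vcf.gz
# 3) normalize alleles against the reference genome
bcftools norm -f genome.fasta sorted.vcf.gz -Oz -o normalized.vcf.gz
# 4) recompute summary tags with the fill-tags plugin
bcftools +fill-tags normalized.vcf.gz -Oz -o final.vcf.gz
```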

Running normalization only

It is possible to skip all the nf-resequencing-mem pipeline steps and run only the normalization workflow. The input parameters (on the CLI or in the params file) change slightly: the --normalization_only flag tells the pipeline to execute only the normalize_vcf workflow, bypassing all other steps such as trimming, alignment and variant calling. The --input_vcf and --input_tbi parameters specify the input VCF file and its index, respectively. Finally, the --genome_fasta parameter specifies the reference genome to be used in the normalization process. Here's a small example of a JSON file that can be provided to run the normalization workflow only:

{
    "normalization_only": true,
    "input_vcf": "path/to/input.vcf",
    "input_tbi": "path/to/input.vcf.tbi",
    "genome_fasta": "path/to/genome.fasta",
    "outdir": "path/to/output"
}
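Assuming the JSON above is saved as normalization.json (a hypothetical filename), a run could then look like this; -params-file is a standard Nextflow option and the profile is a placeholder:

```shell
# Hypothetical normalization-only launch using the params file above.
nextflow run cnr-ibba/nf-resequencing-mem \
    -profile singularity \
    -params-file normalization.json
```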

As in a normal execution of the pipeline, options like --outdir are required to specify the output directory where the normalized VCF file will be saved.