Running the pipeline - cnr-ibba/nf-resequencing-mem GitHub Wiki
Running nf-resequencing-mem
Creating a snpEff database
This pipeline lets you annotate the final VCF file using snpEff by providing a snpEff database. A supported snpEff database (a database configured in snpEff.config, like "Saccharomyces_cerevisiae") can be provided to the pipeline with the --snpeff_database parameter: this database will be downloaded and used to annotate the final VCF file.
It is also possible to create a custom snpEff database and provide it to the pipeline, in addition to the --snpeff_database parameter, through the --snpeff_config and --snpeff_cachedir parameters, which define the path of the custom snpEff.config file and the directory where the custom database is located, respectively, as in the following example:
{
"snpeff_config": "path/to/snpEff.config",
"snpeff_cachedir": "path/to/snpEff/data",
"snpeff_database": "your custom database"
}
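Assuming the JSON above is saved in a file such as params.json (a hypothetical filename), these settings can be passed to Nextflow with the -params-file option, alongside the usual pipeline inputs:

```shell
# Launch the pipeline with the snpEff settings from the params file
# (params.json and the -profile choice are illustrative; the usual
# --input samplesheet and other required parameters still apply)
nextflow run cnr-ibba/nf-resequencing-mem -params-file params.json -profile singularity
```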
The snpEff documentation describes how to build a custom database, with various examples. Here are some tips for building one.
Update snpEff.config
First of all, you need to add your custom database to the snpEff.config file. You can download a copy of the snpEff.config file from the GitHub repository:
wget https://raw.githubusercontent.com/pcingola/SnpEff/refs/heads/master/config/snpEff.config
Next, simply add a new entry for your custom database in the snpEff.config file:
<your database>.genome : <scientific name or common name>
The first part (before the .genome suffix) is the name of the database, which will be specified using the --snpeff_database parameter in this pipeline. The second part is the scientific name or common name of the organism, or any other description you want to give to the database. Next, you need to create a subdirectory in snpEff's data directory using the name of your custom database. For example, for a custom database named my_database, create a directory like this:
mkdir -p data/my_database
In this example, the data folder is relative to the path of the snpEff.config file. The data folder (not the folder of the custom database) should be specified using the --snpeff_cachedir parameter in this pipeline, while the path of the custom snpEff.config file should be specified using the --snpeff_config parameter.
Getting data from NCBI genome
In this example we will download data from NCBI Genome using the NCBI Datasets CLI tool. Open the NCBI Genome portal and search for your organism of interest using its scientific name or common name. Next, click on the genome assembly version of interest and then on the datasets tab: a modal will appear with the NCBI Datasets CLI command to download the genome data. Copy the command and paste it into your terminal to download the data, for example:
datasets download genome accession <accession_id> --include gff3,rna,cds,protein,genome,seq-report
where <accession_id> is the accession ID of the genome assembly version of interest. This command will download the genome data as a ncbi_dataset.zip file in your current directory. Unzip the archive and enter the directory with your data:
unzip ncbi_dataset.zip
cd ncbi_dataset/data/<accession_id>
There should be a folder with the same accession_id used in the datasets download inside the ncbi_dataset/data directory. This folder contains all the data you need to build a snpEff database. Next, enter the snpEff data directory in which you want to build your custom database, the same folder you specified in the snpEff.config file. Create symbolic links for your sequence, GFF, CDS and protein files in the data/<my_database> directory. It's better to use the default names expected by snpEff, for example:
ln -s /path/to/ncbi_dataset/data/<accession_id>/<accession_id>_<assembly_version>_genomic.fna sequences.fa
ln -s /path/to/ncbi_dataset/data/<accession_id>/protein.faa protein.fa
ln -s /path/to/ncbi_dataset/data/<accession_id>/rna.fna cds.fa
The genes.gff file should have the same ID and Parent attributes found in the cds.fa and protein.fa files. If this is not the case, you can use sed to fix the ID and Parent fields in the genomic.gff file, for example:
sed 's/\(rna-\|exon-\|cds-\)//g' /path/to/ncbi_dataset/data/<accession_id>/genomic.gff > genes.gff
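To see what this substitution does, here is a minimal sketch on a single made-up GFF attribute string (the feature IDs are hypothetical):

```shell
# Strip the rna-/exon-/cds- prefixes NCBI adds to GFF identifiers,
# so that ID and Parent match the plain IDs in cds.fa and protein.fa
printf 'ID=cds-XP_001;Parent=rna-XM_001\n' | sed 's/\(rna-\|exon-\|cds-\)//g'
# → ID=XP_001;Parent=XM_001
```

Note that this only removes the listed prefixes; any other prefixes (e.g. gene-) are left untouched.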
Next, build the snpEff database using the build command in the directory where the custom snpEff.config file and data folder are located:
java -jar snpEff.jar build -gff3 -v -c snpEff.config <my_database>
You can also add the -noCheckCds flag to skip checking the CDS sequences if the build process fails; however, if the IDs are consistent across the GFF file, the build should succeed without it. The -v flag prints verbose output, while the -gff3 and -c options specify the format of the GFF file and the path of the custom config file, respectively. The last argument is the name of the database you want to build.
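Once the build completes, a quick sanity check is to load the database back with snpEff's dump command (the paths and the database name my_database are illustrative):

```shell
# Load the freshly built database and print the start of its summary;
# an error here means the database was not built correctly
java -jar snpEff.jar dump -c snpEff.config my_database | head
```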
Normalizing VCF files
VCF normalization refers to the process of simplifying and standardizing complex alleles represented in a VCF file. This involves reducing intricate alleles into simpler forms using pairwise alignment techniques, specifically with tools like vcfwave, which utilizes the fast bi-wavefront aligner (WFA).
The main goals of VCF normalization are to:
- Break down complex overlapping and nested variants into more straightforward representations.
- Improve the accuracy of variant calls by eliminating false positives.
- Enhance the clarity and usability of the output VCF records.
Consider the following example of a VCF file before normalization:
I 140142 . A G 114.131 . AB=0.6;...;TYPE=snp
I 140154 . GGAAAAGATC AGAGAAAACT 50.5592 . AB=0.333333;...;TYPE=complex
I 140169 . C T 85.3288 . AB=0.333333;...;TYPE=snp
The complex allele in the second record is broken down into separate records after normalization:
I 140142 . A G 114.131 . AB=0.6;...;TYPE=snp;
I 140154 ._1 G A 51 . AB=0.333333;...;TYPE=snp;
I 140157 ._2 A G 51 . AB=0.333333;...;TYPE=snp;
I 140160 ._3 G A 51 . AB=0.333333;...;TYPE=snp;
I 140162 ._4 TC CT 51 . AB=0.333333;...;TYPE=mnp;
I 140169 . C T 85 . AB=0.333333;...;TYPE=snp;
The normalization workflow in nf-resequencing-mem proceeds as follows. First, vcfwave normalizes the VCF file (replacing the vcfallelicprimitives tool used in previous versions of freebayes). The output of this step is then sorted using bcftools to ensure that the VCF file is in the correct order for downstream processing. Another normalization step is then performed by bcftools to ensure that alternate and reference alleles are consistent with the reference genome. Finally, the VCF is enriched with additional information using the bcftools fill-tags plugin. All these steps are performed per chromosome in parallel by nf-resequencing-mem and then joined into a single file, which is the final output of this pipeline.
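Outside the pipeline, the same chain of steps can be sketched with vcfwave (from vcflib) and bcftools; the file names here are hypothetical and the per-chromosome parallelization is omitted:

```shell
# 1) break complex alleles into primitives with vcfwave
# 2) sort the records with bcftools sort
# 3) left-align and check alleles against the reference genome
# 4) add extra INFO/FORMAT tags with the fill-tags plugin
vcfwave input.vcf | bcftools sort -Oz -o sorted.vcf.gz
bcftools norm -f genome.fasta -Oz -o normalized.vcf.gz sorted.vcf.gz
bcftools +fill-tags normalized.vcf.gz -Oz -o final.vcf.gz
```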
Running normalization only
It is possible to skip all the nf-resequencing-mem pipeline steps and run only the normalization workflow. The input parameters passed on the CLI or in the params file change a little: the --normalization_only flag tells the pipeline to execute only the normalize_vcf workflow, bypassing all the other steps like trimming, alignment and variant calling. The --input_vcf and --input_tbi parameters specify the input VCF file and its index, respectively. Finally, the --genome_fasta parameter specifies the reference genome to be used in the normalization process. Here's a small example of the JSON file that can be provided to run
the normalization workflow only:
{
"normalization_only": true,
"input_vcf": "path/to/input.vcf",
"input_tbi": "path/to/input.vcf.tbi",
"genome_fasta": "path/to/genome.fasta",
"outdir": "path/to/output"
}
Options like --outdir are still required to specify the output directory where the normalized VCF file will be saved, as in a normal execution of the pipeline.
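With the JSON above saved as, say, normalization.json (a hypothetical filename), the normalization-only run then looks like this:

```shell
# Run only the normalize_vcf workflow, skipping trimming,
# alignment and variant calling (profile choice is illustrative)
nextflow run cnr-ibba/nf-resequencing-mem -params-file normalization.json -profile singularity
```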