Merging different datasets of SNPs

This guide explains how to merge two distinct datasets of SNPs. The main issue when merging different batches of SNPs from various runs of this pipeline or other variant calling pipelines is that SNPs unique to a particular batch will be absent in other batches. Consequently, when merging, these SNPs will have missing positions in the final gVCF file for every sample in the other batches, making it impossible to determine if the missing call was due to a real absence of aligned sequences or if the SNP matches the reference allele.

Using gVCF can address the problem of merging SNPs from different calling analyses because it stores sequencing information for both variant and non-variant positions. This comprehensive representation ensures that all sites in the genome are accounted for, reducing ambiguity about whether a missing call is due to the absence of aligned sequences or if the SNP matches the reference allele. By compressing contiguous non-variant regions into blocks, gVCF maintains a compact format while preserving essential information, making it easier to merge datasets without losing critical data.

Representing non-variant sequences

Contiguous non-variant segments of the genome can be represented as single records in gVCF. These records use the standard END INFO key to indicate the extent of the record. Even though the record can span multiple bases, only the first base is provided in the REF field to reduce file size. Consider for example this VCF region:

I	140176	.	G	<*>	0	.	DP=2;END=140303;MIN_DP=0
I	140304	.	A	G	68.5498	.	AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp
I	140305	.	A	<*>	0	.	DP=6;END=140310;MIN_DP=6
I	140311	.	G	C	68.5498	.	AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp

In this case, all positions between 140305 and 140310 are non-variant and represented by a single record. The END field indicates the last position of the block, while the REF field contains the first base of the block. The ALT field is set to <*> to indicate that the block is non-variant. The MIN_DP field indicates the minimum depth of coverage in the block. For example, in the region between 140176 and 140303, the MIN_DP is 0, meaning there is at least one position with no aligned reads. This should be considered when merging datasets, as the reference allele called in this block may not be supported by alignments across the entire region.

However, this is not the only way to represent a gVCF file. You can also use fixed-length blocks, which can help isolate regions with no alignments. In this approach, a single point with 0 coverage will correspond to a shorter segment of the genome. Consider the same region as before, but using gVCF blocks with a maximum length of 20bp:

I	140176	.	G	<*>	0	.	DP=5;END=140196;MIN_DP=0
I	140197	.	G	<*>	0	.	DP=2;END=140217;MIN_DP=0
I	140218	.	C	<*>	0	.	DP=1;END=140238;MIN_DP=1
I	140239	.	A	<*>	0	.	DP=1;END=140259;MIN_DP=1
I	140260	.	A	<*>	0	.	DP=1;END=140280;MIN_DP=1
I	140281	.	T	<*>	0	.	DP=2;END=140301;MIN_DP=1
I	140302	.	A	<*>	0	.	DP=6;END=140303;MIN_DP=6
I	140304	.	A	G	68.5498	.	AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp
I	140305	.	A	<*>	0	.	DP=6;END=140310;MIN_DP=6
I	140311	.	G	C	68.5498	.	AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp

In this case, the region between 140176 and 140303 is represented by 7 records, each with a maximum length of 20bp. We can see that in the first block region, the MIN_DP is 0, while approaching the 140304 position there's an increase in the minimum depth. This can be useful since we can be more confident to call a reference allele in blocks with higher coverage.

Eventually, you can shorten the block length to 1bp, which will result in a gVCF file with a record for each position in the genome. This approach is more verbose and will result in a larger file size, but it can be useful when you need to analyze specific regions of the genome in detail. You should consider the trade-offs between file size and the level of detail required for your analysis when choosing the block length.

Calling gVCF files with nf-resequencing-mem

This pipeline supports the generation of gVCF files when calling freebayes through the --gvcf option. This parameter should be specified when running the pipeline via CLI or when creating a JSON parameters file. Additionally, there are two parameters to customize the gVCF output: --gvcf_chunk, which specifies the maximum length of the gVCF blocks, and --gvcf_dont_use_chunk, which results in a gVCF file with a record for each position in the genome. You cannot specify both parameters at the same time, so you should consider the best trade-off between gVCF file size and reference block sizes.

You should also consider skipping the normalization step performed by this pipeline. While normalization can decompose a complex variant into a simpler one, it cannot handle reference gVCF blocks. In this case, it is better to skip the normalization step and use the gVCF output directly, or collect the results from the freebayes step. After merging, you can normalize the final VCF files using the normalization workflow provided by this pipeline.

Here's a basic example of how to call freebayes with gVCF output using the JSON parameters file:

{
    "gvcf": true,
    "gvcf_chunk": 20,
    "skip_normalization": true,
    "save_freebayes": true
}

Merging gVCF files

You can merge vcf files using bcftools merge, for example like this:

bcftools merge --threads <ncpus> --gvcf <reference genome.fasta> \
    --output <merged.gvcf.gz> --output-type z --write-index=tbi \
    <1st_batch.gvcf.gz> <2nd_batch.gvcf.gz> ...

where:

--threads: specifies the number of threads to use for merging.
--gvcf: let to specify the reference sequence to use when merging gVCF files. This implies the following INFO rules: -i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max and the following missing rules: -M PL:max,AD:0, which respectively mean:
- QS is the sum of the quality scores of the reads supporting the variant
- MinDP is the minimum depth of coverage in the block
- I16 (a tag used in calling SNPs with samtools/bcftools) is the sum of the I16 field in the FORMAT column
- IDV(Individual Depth of Variant alleles) is the maximum depth of coverage in the block
- IMF (Individual Minor allele Frequency) is the maximum value of the IMF field in the FORMAT column
- PL:max: When input files have different alternative alleles, vector fields for unobserved alleles are set to missing (.) by default. This rule specifies that for the PL (Phred-scaled genotype likelihoods) tag, the maximum value observed for other alleles in the sample should be used. In practice, if an allele is present in one file but not in another during merging, the PL value for that allele in the resulting file will be set to the maximum value observed in the other alleles.
- AD:0: Similarly, this rule applies to the AD (Allelic Depth) tag, which represents the read depth for each allele. When an allele is missing in one of the merged files, the AD value for that allele will be set to 0. This indicates that there are no reads supporting that allele in that particular sample.
All of thus values can be overridden by the user by setting an appropriate value
--output: specifies the output file name.
--output-type: specifies the output file type. In this case, it is a compressed VCF file (z).
--write-index: specifies to write an index file for the output file (tbi format).
<1st_batch.gvcf.gz> <2nd_batch.gvcf.gz> ...: specifies the input gVCF files to merge.

Transforming a gVCF into a VCF file

We can transform the merged gVCF file into a VCF file with bcftools convert piped into bcftools view:

bcftools convert --gvcf2vcf --fasta-ref <reference genome.fasta> \
    merged.gvcf.gz | bcftools view -AA --trim-alt-alleles \
    --types=snps,indels,mnps,other --output merged.vcf.gz \
    --output-type z --write-index=tbi

where this option are specific to the convert command:

--gvcf2vcf: specifies to convert a gVCF file to a VCF file.
--fasta-ref: specifies the reference genome to use when converting the gVCF file.

while this options are specific to the view command:

--trim-alt-alleles: specifies to trim the alternate alleles in the output VCF file.
-AA: remove the unseen allele <*> at all sites.
--types: specifies the types of variants to include in the output VCF file. In this case, it includes SNPs, indels, MNPs, and other types of variants.
--output: specifies the output file name.
--output-type: specifies the output file type. In this case, it is a compressed VCF file (z).
--write-index: specifies to write an index file for the output file (tbi format).

After merging the gVCF files, you can normalize the final VCF file using the normalization workflow provided by this pipeline.

Merging datasets - cnr-ibba/nf-resequencing-mem GitHub Wiki

Merging different datasets of SNPs

Representing non-variant sequences

Calling gVCF files with nf-resequencing-mem

Merging gVCF files

Transforming a gVCF into a VCF file

⚠️ GitHub.com Fallback ⚠️

Merging datasets - cnr-ibba/nf-resequencing-mem GitHub Wiki

Merging different datasets of SNPs

Representing non-variant sequences

Calling gVCF files with nf-resequencing-mem

Merging gVCF files

Transforming a gVCF into a VCF file

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️