Merging datasets - cnr-ibba/nf-resequencing-mem GitHub Wiki
This guide explains how to merge two distinct datasets of SNPs. The main issue when merging different batches of SNPs from various runs of this pipeline or other variant calling pipelines is that SNPs unique to a particular batch will be absent in other batches. Consequently, when merging, these SNPs will have missing positions in the final gVCF file for every sample in the other batches, making it impossible to determine if the missing call was due to a real absence of aligned sequences or if the SNP matches the reference allele.
Using gVCF can address the problem of merging SNPs from different calling analyses because it stores sequencing information for both variant and non-variant positions. This comprehensive representation ensures that all sites in the genome are accounted for, reducing ambiguity about whether a missing call is due to the absence of aligned sequences or if the SNP matches the reference allele. By compressing contiguous non-variant regions into blocks, gVCF maintains a compact format while preserving essential information, making it easier to merge datasets without losing critical data.
Contiguous non-variant segments of the genome can be represented as single records in gVCF. These records use the standard END
INFO key to indicate the extent of the record. Even though the record can span multiple bases, only the first base is provided in the REF field to reduce file size. Consider for example this VCF region:
I 140176 . G <*> 0 . DP=2;END=140303;MIN_DP=0
I 140304 . A G 68.5498 . AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp
I 140305 . A <*> 0 . DP=6;END=140310;MIN_DP=6
I 140311 . G C 68.5498 . AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp
In this case, all positions between 140305 and 140310 are non-variant and represented by a single record. The END
field indicates the last position of the block, while the REF
field contains the first base of the block. The ALT
field is set to <*>
to indicate that the block is non-variant. The MIN_DP
field indicates the minimum depth of coverage in the block. For example, in the region between 140176 and 140303, the MIN_DP
is 0, meaning there is at least one position with no aligned reads. This should be considered when merging datasets, as the reference allele called in this block may not be supported by alignments across the entire region.
However, this is not the only way to represent a gVCF file. You can also use fixed-length blocks, which can help isolate regions with no alignments. In this approach, a single point with 0 coverage will correspond to a shorter segment of the genome. Consider the same region as before, but using gVCF blocks with a maximum length of 20bp:
I 140176 . G <*> 0 . DP=5;END=140196;MIN_DP=0
I 140197 . G <*> 0 . DP=2;END=140217;MIN_DP=0
I 140218 . C <*> 0 . DP=1;END=140238;MIN_DP=1
I 140239 . A <*> 0 . DP=1;END=140259;MIN_DP=1
I 140260 . A <*> 0 . DP=1;END=140280;MIN_DP=1
I 140281 . T <*> 0 . DP=2;END=140301;MIN_DP=1
I 140302 . A <*> 0 . DP=6;END=140303;MIN_DP=6
I 140304 . A G 68.5498 . AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp
I 140305 . A <*> 0 . DP=6;END=140310;MIN_DP=6
I 140311 . G C 68.5498 . AB=0.666667;ABP=4.45795;AC=1;...;TYPE=snp
In this case, the region between 140176 and 140303 is represented by 7 records, each with a maximum length of 20bp. We can see that in the first block region, the MIN_DP
is 0, while approaching the 140304 position there's an increase in the minimum depth. This can be useful since we can be more confident to call a reference allele in blocks with higher coverage.
Eventually, you can shorten the block length to 1bp, which will result in a gVCF file with a record for each position in the genome. This approach is more verbose and will result in a larger file size, but it can be useful when you need to analyze specific regions of the genome in detail. You should consider the trade-offs between file size and the level of detail required for your analysis when choosing the block length.
This pipeline supports the generation of gVCF files when calling freebayes through the --gvcf
option. This parameter should be specified when running the pipeline via CLI or when creating a JSON parameters file. Additionally, there are two parameters to customize the gVCF output: --gvcf_chunk
, which specifies the maximum length of the gVCF blocks, and --gvcf_dont_use_chunk
, which results in a gVCF file with a record for each position in the genome. You cannot specify both parameters at the same time, so you should consider the best trade-off between gVCF file size and reference block sizes.
You should also consider skipping the normalization step performed by this pipeline. While normalization can decompose a complex variant into a simpler one, it cannot handle reference gVCF blocks. In this case, it is better to skip the normalization step and use the gVCF output directly, or collect the results from the freebayes step. After merging, you can normalize the final VCF files using the normalization workflow provided by this pipeline.
Here's a basic example of how to call freebayes with gVCF output using the JSON parameters file:
{
"gvcf": true,
"gvcf_chunk": 20,
"skip_normalization": true,
"save_freebayes": true
}
You can merge vcf files using bcftools
merge, for example like this:
bcftools merge --threads <ncpus> --gvcf <reference genome.fasta> \
--output <merged.gvcf.gz> --output-type z --write-index=tbi \
<1st_batch.gvcf.gz> <2nd_batch.gvcf.gz> ...
where:
-
--threads
: specifies the number of threads to use for merging. -
--gvcf
: let to specify the reference sequence to use when merging gVCF files. This implies the followingINFO
rules:-i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max
and the following missing rules:-M PL:max,AD:0
, which respectively mean:-
QS
is the sum of the quality scores of the reads supporting the variant -
MinDP
is the minimum depth of coverage in the block -
I16
(a tag used in calling SNPs withsamtools
/bcftools
) is the sum of theI16
field in theFORMAT
column -
IDV
(Individual Depth of Variant alleles) is the maximum depth of coverage in the block -
IMF
(Individual Minor allele Frequency) is the maximum value of theIMF
field in theFORMAT
column -
PL:max
: When input files have different alternative alleles, vector fields for unobserved alleles are set to missing (.) by default. This rule specifies that for the PL (Phred-scaled genotype likelihoods) tag, the maximum value observed for other alleles in the sample should be used. In practice, if an allele is present in one file but not in another during merging, the PL value for that allele in the resulting file will be set to the maximum value observed in the other alleles. -
AD:0
: Similarly, this rule applies to the AD (Allelic Depth) tag, which represents the read depth for each allele. When an allele is missing in one of the merged files, the AD value for that allele will be set to 0. This indicates that there are no reads supporting that allele in that particular sample.
All of thus values can be overridden by the user by setting an appropriate value
-
-
--output
: specifies the output file name. -
--output-type
: specifies the output file type. In this case, it is a compressed VCF file (z
). -
--write-index
: specifies to write an index file for the output file (tbi
format). -
<1st_batch.gvcf.gz> <2nd_batch.gvcf.gz> ...
: specifies the input gVCF files to merge.
We can transform the merged gVCF file into a VCF file with bcftools
convert
piped into bcftools
view:
bcftools convert --gvcf2vcf --fasta-ref <reference genome.fasta> \
merged.gvcf.gz | bcftools view -AA --trim-alt-alleles \
--types=snps,indels,mnps,other --output merged.vcf.gz \
--output-type z --write-index=tbi
where this option are specific to the convert
command:
-
--gvcf2vcf
: specifies to convert a gVCF file to a VCF file. -
--fasta-ref
: specifies the reference genome to use when converting the gVCF file.
while this options are specific to the view
command:
-
--trim-alt-alleles
: specifies to trim the alternate alleles in the output VCF file. -
-AA
: remove the unseen allele<*>
at all sites. -
--types
: specifies the types of variants to include in the output VCF file. In this case, it includes SNPs, indels, MNPs, and other types of variants. -
--output
: specifies the output file name. -
--output-type
: specifies the output file type. In this case, it is a compressed VCF file (z
). -
--write-index
: specifies to write an index file for the output file (tbi
format).
After merging the gVCF files, you can normalize the final VCF file using the normalization workflow provided by this pipeline.