Genotype Harmonizer - molgenis/systemsgenetics GitHub Wiki
The Genotype Harmonizer is an easy to use command-line tool that allows harmonization of genotype data stored using different file formats with different and potentially unknown strands.
Linkage disequilibrium (LD) patterns are used to determine the correct strand GC and AT SNPs and by using the Genotype IO package we can import and export different file formats.
📌 please cite our work if you used this tool, http://www.biomedcentral.com/1756-0500/7/901/abstract
Note: this manual is for version 1.4.20 (and higher) of the Genotype Harmonizer.
Getting started
Typical usage scenarios
Arguments overview
Bug reports / feature requests
The latest version from the Genotype Harmonizer can be downloaded here
type GenotypeHarmonizer.sh
, GenotypeHarmonizer.bat
or java -jar GenotypeHarmonizer.jar
to run. You will now get an overview of the different command-line options.
In the case of an heapspace or out of memory error you need allocate more memory to run the Genotype Harmonizer. If this should happen use this command to run: Java -jar -Xms##g -Xmx##g -jar GenotypeHarmonizer.jar
. Replace ## with the number of gigabytes of memory you want to allocate.
In the most basic usage scenario you need to define:
- A dataset that you want to align and the type of this dataset
- A dataset that you want to use as reference and the type of this dataset
- The output path and type where you want to write the aligned data to
Your command will look like this:
GenotypeHarmonizer.sh \
--input path_to_study_data \
--output path_of_output \
--ref path_to_reference
Note: this is a single command line command. The \
is only for readability.
You can find more examples as a script in the root of the distribution.
Note: it is important to make sure that both study and reference are using the same genome build
Before VCF files can be used they need to be compressed using bgzip and indexed with a tabix. This prevents having to read all data into memory yet still allows quick access.
wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.6.tar.bz2
tar -jxvf tabix-0.2.6.tar.bz2
cd tabix-0.2.6/
make
bgzip -c example.vcf > example.vcf.gz
tabix -p vcf example.vcf.gz
The SHAPEIT2 output and the Oxford GEN format store either a custom SNP ID on the first column or the chromosome. When the fist column is not used for the chromosome the --forceChr
option can be used to force the input data to a specified chromosome. Note this is only valid if all variants are indeed on this chromosome. This feature is currently only implemented for the input data and not for the reference data, feel free to raise a new issue on our GitHub project to request this.
The haplotype file format data as used by IMPUTE2 for the imputation reference haplotypes can not directly be used by genotype harmonizer. For many reference datasets VCF files are also available. If this is not the case bcftools convert can convert the haps & legend files to VCF files using the hapsample2vcf
option.
When imputing genotype data the strand of both the study data to impute and the reference data used for imputation need to be identical. Some imputation tools can swap the strand of non-ambiguous SNPs but this is not possible for AT and GC SNPs. AT and GC can be swapped using minor allele frequency but this is not reliable, especially for variants with a high minor allele frequency. The Genotype Harmonizer solves these problems by using LD structure of nearby variants.
In combination with the --update-id
option the Genotype Harmonizer is a convenient tool for preparation of genotype data before imputation.
Aligning pre-phased data, generated using tools like SHAPEIT2, is particularly useful. A dataset only needs to be pre-phased once and can then be aligned and imputed using different reference set or different versions of reference sets.
When combining different genotype datasets, either samples ran on multiple genotyping chips or different batches of samples, it is important to have identical strands. Merge data in Plink will give a warning when it detects strand issues in non AT or non GC SNPs but can not automatically correct this. AT and GC SNP swaps are not automatically detected.
The --keep
option is particularly useful here to keep the SNPs not shared by both datasets. The --update-id
will also make merging using Plink (http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#merge) or other tools more easy.
Although there are many tools that can convert from one file type to an other it can also easily be done using our Genotype Harmonizer by simply not specifying a reference. In such a case the input data will simply be converted to the specified output type.
Short | Long | Description |
---|---|---|
-i | --input | * The base path of the data to align. The extensions are determined based on the input data type. |
-I | --inputType | The input data type. If not defined will attempt to automatically select the first matching dataset on the specified path (see inputType options) |
-r | --ref | The base path of the reference data used for alignment. The extensions are determined based on the input data type. If not specified the input data is simply converted to the specified output type. |
-R | --refType | The input data type. If not defined will attempt to automatically select the first matching dataset on the specified path (see refType options) |
-o | --output | * The base path of the output data. |
-O | --outputType | The output data type. Defaults to --inputType or to PLINK_BED if there is no writer for the impute type. (--outputType options) |
Short | Long | Description |
---|---|---|
-ip | --inputProb | The minimum posterior probability to call genotypes in the input data. Defaults to 0.4 |
-f | --forceChr | SHAPEIT2 does not output the sequence name in the first column of the haplotype file and for GEN files this can also be the case. Use this option to force the chromosome for all variants. This option is only valid in combination with --inputType SHAPEIT2 and --inputType GEN
|
-gf | --genotypeField | VCF files often have different Genotype format fields. Which one is chosen depends on the requested data (either hard-called genotypes, probabilities, phased probabilities, etc.). This option allows the user to define which genotype format field to use. This option is only valid in combination with --inputType VCF and --inputType VCF_FOLDER
|
-cf | --callRateFilter | The minimum call rate to include variant from input data |
-ch | --chrFilter | Filter input data on chromosome |
-hf | --hweFilter | The minimum hardy weinberg equilibrium p-value to include variant from input data |
-mf | --mafFilter | The minimum minor allele frequency to include variant from input data |
-sf | --sampleFilterList | Path to file with samples IDs to include from input data. For plink data and oxford sample files only the sample id (column 2) is used |
-vf | --variantFilterList | Path to file with variant IDs to include from input data. |
-mrf | --machR2Filter | The minimum MACH R2 measure to include SNPs. |
-pf | --variantPosFilterList | Path to file with variant CHR\tPOS or CHR:POS to include from input data. |
-asf | --ambiguousSnpFilter | Filter out ambiguous SNPs (A/T, C/G) SNPs. |
Short | Long | Description |
---|---|---|
-id | --update-id | Update the variant identifiers using the reference data. The identifiers of the output data will be the same as the reference data |
-l | --min-ld | The minimum LD (r2) between the variant to align and potential supporting variants. Defaults to 0.3 |
-m | --min-variants | The minimum number of supporting variant before before we can do an alignment. Defaults to 3 |
-v | --variants | Number of flanking variants to consider. Defaults to 100 |
-c | --check-ld | Also check the LD structure of non AT and non GC variants. Variants that do not pass the check are excluded. |
-d | --debug | Activate debug mode. This will result in a more verbose log file |
-ma | --mafAlign | If there are not enough variants in LD and the minor allele frequency (MAF) of a variant <= the specified value in both study as in reference then the minor allele can be used as a backup for alignment. Defaults to 0 |
-ura | --update-reference-allele | Make sure the output data uses the same reference allele as the reference data set. |
Short | Long | Description |
---|---|---|
-bts | --probabilityPrecision | The precision of probabilities, in number of bits, to use when writing BGEN files. Defaults to 16. Only values 1 to 32 inclusive are valid. This option is only valid in clombination with --outputType BGEN
|
Short | Long | Description |
---|---|---|
-d | --debug | Activate debug mode. This will result in a more verbose log file |
* = required
Base path refers to either --input or --ref
- PED_MAP
- Expects Plink PED file at:
${base path}.ped
- Expects Plink MAP file at:
${base path}.map
- Note: it is strongly recommend to use PLINK_BED due to the large memory usage of the current implementation. When dealing with large datasets it is recommend to first use plink to convert the data binary plink using this command
plink --file pedmapData --make-bed --out binaryData
- Expects Plink PED file at:
- VCF
- Expects VCF file at:
${base path}.vcf.gz
- Must be compressed using bgzip. (see chapter: Preparing a VCF file)
- Expects tabix file at:
${base path}.vcf.gz.tbi
(see chapter: Preparing a VCF file)
- Expects VCF file at:
- PLINK_BED
- Expects Plink BED file at:
${base path}.bed
- Expects Plink BIM file at:
${base path}.bim
- Expects Plink FAM file at:
${base path}.fam
- Must be in SNP major mode. This is the default of Plink 1.07.
- Expects Plink BED file at:
- VCFFOLDER
- Matches all vcf.gz files in the folder specified with the bash path.
- SHAPEIT2
- Expects haps file at:
${base path}.haps
- Expects sample file at:
${base path}.sample
- See also chapter on 'Using SHAPEIT2 Output and Oxford Gen format'
- Expects haps file at:
- GEN
- Expects gen file at:
${base path}.gen
or without the extentions${base path}
- Expects sample file at:
${base path}.sample
- See also chapter on 'Using SHAPEIT2 Output and Oxford Gen format'
- Expects gen file at:
- BGEN
- Expects bgen file at:
${base path}.bgen
- A sample file (
${base path}.sample
) is optional, unless there are no sample identifiers in the BGEN file itself, than a sample file is required. - A bgenix index file at:
${base path}.bgen.bgi
is read. If it does not yet exist a new one is created at that location.
- Expects bgen file at:
- TRITYPER
- Expects folder with TriTyper data. Can handle dosage matrix
Base path refers to --output
Regardless of the output type a log file will always be created at: ${base path}.log
- PED_MAP
- Writes Plink PED file to:
${base path}.ped
- Writes Plink MAP file to:
${base path}.map
- Note: it is strongly recommend to use PLINK_BED since the writing is much faster
- Writes Plink PED file to:
- PLINK_BED
- Writes Plink BED file to:
${base path}.bed
- Writes Plink BIM file to:
${base path}.bim
- Writes Plink FAM file to:
${base path}.fam
- Data is written in SNP major mode
- Writes Plink BED file to:
- SHAPEIT2
- Writers haps file at:
${base path}.haps
- Writers sample file at:
${base path}.sample
- Writers haps file at:
- GEN
- Writers gen file at:
${base path}.gen
- Writers sample file at:
${base path}.sample
- Writers gen file at:
- BGEN
- Writes .bgen file at
${base path}.bgen
- Writes .sample file at
${base path}.sample
- Writes .bgen.bgi file at
${base path}.bgen.bgi
- Writes .bgen file at
- TRITYPER
- Create folder with TriTyper data
- Also created dosage matrix
-
NOTE: Older software that uses TriTyper data cannot handle missing dosages. When converting GEN propabilities to TriTyper make sure to also supply this argument
--inputProb 0
It can be worthwhile to tweak the alignment algorithm using the advanced options (--min-ld
, --min-variants
, --variants
& --mafAlign
) to improve the reliability of the alignment and to reduce the number of excluded variant that could not be aligned. The default values are quite conservative and sooner exclude variant than falsely swap the strand.
We have found that for datasets with a small number of samples it is wiser to have a high value for --min-ld
(as is the default). This resulted in a reliable aliment of the small example datasets. However for larger datasets a lower value can be used to retain more variants. Increasing --min-variants
allows lower values of --min-ld
this can improve the number of aligned variants.
When imputing the HLA region in densely genotyped dataset containing a large number of rare variants it helped to increase the --variants
although this did effect the speed performance it did allow us to align more variants. This resulted in better imputation when using snp2hla from the Broad Institute.
Finally, it is also possible to align using the minor allele when the LD alignment fails using the --mafAlign
option. By default this option is off (the maximum maf is set to 0) so it is never done. However in a lot of cases it is save to use the minor when there are not enough in LD with a variant. A save value would be 0.1, in that case only variant with a maf <= to 0.1 will be aligned by matching the minor allele.
Please use the GitHub issue tracker to post feature request or to report bugs. https://github.com/molgenis/systemsgenetics/issues/new.
This chapter is not relevant for the usage of the program but allows reproducibility of the test data.
The genotype harmonizer contains test data. For the genotype data to align we use HapMap3 data and as a reference we use 1000G data.
This dataset is always tested when building the project and by our Jenkins server (http://www.molgenis.org/jenkins/job/systemsgenetics/lastBuild/nl.systemsgenetics$Genotype-Harmonizer/). It is also supplied in the Genotype Harmonizer package to get you started.
The following tools are needed for this script:
- Plink
- UCSC liftover + chain hg18ToHg19
- SHAPEIT2
- Genetic Map (b37)
wget ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/latest_phaseIII_ncbi_b36/plink_format/hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2
wget ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/latest_phaseIII_ncbi_b36/plink_format/hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2
wget ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/latest_phaseIII_ncbi_b36/plink_format/relationships_w_pops_121708.txt
tar -zxvf hapmap3_r2_b36_fwd.consensus.qc.poly.map.bz2
tar -zxvf hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2
#Create list of CEU sampels to extract
awk '$7 == "CEU" {print $1,$2}' relationships_w_pops_041510.txt > ceuSamples.txt
#Extract first 6Mb of chr20 for CEU samples
plink --noweb --chr 20 --file ../hapmap3_r3_b36_fwd.consensus.qc.poly --out hapmap3CeuChr20B36Mb6 --from-mb 0 --to-mb 6 --recode --keep ceuSamples.txt
# - liftover to b37 -
#Create bed
awk '{$5=$2;$2=$4;$3=$4+1;$1="chr"$1;print $1,$2,$3,$5}' OFS="\t" hapmap3CeuChr20B36Mb6.map > hapmap3CeuChr20B36Mb6b36.bed
#Update mapping
liftOver -bedPlus=4 hapmap3CeuChr20B36Mb6b36.bed hg18ToHg19.over.chain hapmap3CeuChr20B36Mb6b37.bed hapmap3CeuChr20B36Mb6unmapped.txt
#All SNPs are mapped. Normally we would have to account for this
#Create mapping update list used by Plink
awk '{print $4, $2}' OFS="\t" hapmap3CeuChr20B36Mb6b37.bed > hapmap3CeuChr20B36Mb6b37.txt
#Update plink mappings
plink --noweb --file hapmap3CeuChr20B36Mb6 --recode --out hapmap3CeuChr20B37Mb6 --update-map hapmap3CeuChr20B36Mb6b37.txt
#No we have to again create a plink file to make sure the implied order is correct after liftover.
plink --noweb --file hapmap3CeuChr20B37Mb6 --out hapmap3CeuChr20B37Mb6 --make-bed
We have now created a subset of the hapmap3 which is all in forward strand. We are now going to swap a large number variants. Genotype Harmonizer can identify these swapped variants and flip them back to forward strand using the 1000G data.
#Create swap list 50% of SNPs
awk '
BEGIN { srand(1)}
{ if (rand() <= .5) print $2}
' < hapmap3CeuChr20B37Mb6.bim > flipList.txt
plink --noweb --bfile hapmap3CeuChr20B37Mb6 --make-bed --flip flipList.txt --out hapmap3CeuChr20B37Mb6RandomStrand
We also want this data phased using SHAPEIT2, for additional testing.
shapeit.v2.r644.linux.x86_64 --input-bed ./hapmap3CeuChr20B37Mb6RandomStrand -M genetic_map_chr20_combined_b37.txt --output-max ./hapmap3CeuChr20B37Mb6RandomStrand --noped --thread 4
The following tools are needed for this script:
- vcftools
- tabix
wget ftp://share.sph.umich.edu/1000genomes/fullProject/2012.03.14/phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz.tgz
tar xvzf phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz.tgz
#Create subset of data
vcftools --gzvcf chr20.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz --out 1000gCeuChr20Mb6 --chr 20 --from-bp 0 --to-bp 6000000 --recode --remove-indels --remove-filtered-all
#Comprese subset using bgzip (part of tabix package)
bgzip -c 1000gCeuChr20Mb6.recode.vcf > 1000gCeuChr20Mb6.vcf.gz
#Create index using tabix
tabix -p vcf 1000gCeuChr20Mb6.vcf.gz
VCF files allow for multiple genotype format fields. The following fields are supported in GenotypeHarmonizer.
ID | T.B.A | Description |
---|---|---|
GT | ... | T.B.A. |
GP | ... | T.B.A. |
HP | ... | T.B.A. |
DS | ... | T.B.A. |
ADS | ... | T.B.A. |
A preferred genotype format field can be set with -gf/--genotypeField <Field> [<FieldSynonym> ['supress']]
.
Herein, Field
is one of the supported fields from above.
If the preferred genotype field defined within the input VCF is not within this list,
but is synonymous to one of them,
the optional second argument FieldSynonym
should match the genotype field defined in the VCF.
The third variable can be used to suppress an exception whenever the requested
genotype field is unavailable (for a given variant): suppress
Which field is used is further described in the flowchart below:
graph TB;
A[Start] --> B;
B{Preferred genotype<br/>format field?} --> |Yes| C;
B --> |No| G;
C{Preferred format<br/>field present?} -->|No| D;
C --> |Yes| F[Use preferred format field];
D{'suppress' given?};
D --> |No| E[Raise exception];
D --> |Yes| G;
G[Choose optimal field based on output];
G --> GP1;
G --> HP1;
G --> DS1;
G --> GC1;
subgraph GP[Genotype Probabilities];
GP1[GP]-->GP2[GT]-->GP3[DS];
end;
subgraph HP[Haplotype Probabilities];
HP1[HP]-->HP2[ADS];
end;
subgraph DS[Dosages];
DS1[DS]-->DS2[GP]-->DS3[GT];
end;
subgraph HC[Hard calls];
GC1[GT]-->GC2[GP]-->GC3[DS];
end;
When writing to the BGEN format, genotype harmonizer can choose to either write phased data, or uphased genotypes. When VCF files are used in this conversion, the input file can hold both phased and unphased data. Which one is written depends on a number of factors. The diagrom below illustrates the behaviour of genotype harmonizer.
flowchart TB
subgraph isPhased["Is phased data available?"]
PreferredSet{Preferred\ngenotype\nfield set?} --> |Yes| IsFieldPresent{Is field\npresent?}
PreferredSet --> |No| Defaults
IsFieldPresent --> |No| Force{'suppress' given}
Force --> |No| Exception
IsFieldPresent --> |Yes| CanGetPhasingFromField{Does this\nfield have\nphased data?}
Force --> |Yes| Defaults{Can read\nphased data\nfrom available\nfields?}
end
Defaults --> |Yes| WritePhased[Write phased data]
Defaults --> |No| WriteUnphased[Write genotype data]
CanGetPhasingFromField --> |Yes| WritePhased
CanGetPhasingFromField --> |No| WriteUnphased