Genomic Profile Risk Score - gc5k/GEAR GitHub Wiki
Plink risk score provides a simple method for generating weighted allelic scores, for which the weights are often estimated by generalized linear regression. In practice, it has a few inconveniences, which GEAR addresses.
- GEAR will flip alleles to match them with the named predictor alleles. For example, when the allele coding is flipped, say A/G in the discovery panel but T/C in the validation panel, plink will leave those SNPs out, whereas GEAR matches them. This behaviour can be turned off by specifying "--auto-flip-off".
- Also, plink does not account for the potential risk of A/T or G/C loci, which, because of their strand-ambiguous nature, may introduce noise into prediction. GEAR leaves these loci out by default and provides the --keep-atgc option to keep them.
- Often the score of each SNP is provided as an odds ratio. In this case, the --logit option transforms the odds ratios to effects.
- GEAR supports dosage data in MaCH format.
- If the alleles are coded in lowercase letters in the score file and in capital letters in the genotypes, GEAR will automatically match a lowercase reference allele to its capital form.
With these issues handled inside the procedure, prediction becomes easier.
It should be noted that GEAR will leave out monomorphic loci if there are any in the validation set.
The format of the score file
SNP | RefAllele | value |
---|---|---|
SNPA | A | 1.95 |
SNPB | C | 2.04 |
SNPC | C | -0.98 |
SNPD | C | -0.24 |
By default, gear assumes that the score file contains a header line. If your score file doesn't contain a header line, you should switch on the --no-score-header option.
In addition, if the score file is in gzip format, --score-gz should be used instead of --score.
Binary genotype

    gear profile --s scorefile.txt --bfile test --out test

MaCH dosage

    gear profile --s scorefile.txt --mach-dosage test.mldose.gz --mach-info test.mlinfo --out test

Or run dosage data distributed across multiple files

    gear profile --s scorefile.txt --mach-dosage-batch dose.txt --mach-info-batch info.txt --out test

dose.txt reads like below

    mach_stage2_chr1.mldose.gz
    mach_stage2_chr2.mldose.gz
    mach_stage2_chr3.mldose.gz

info.txt reads like below

    mach_stage2_chr1.mlinfo
    mach_stage2_chr2.mlinfo
    mach_stage2_chr3.mlinfo

**Add-on options**

The default model is the allelic model, but three other genetic models (additive, dominant, and recessive) are also supported. Let T denote the number of reference alleles at a locus, and M the converted code under each model. The four models code the genotypes as tabulated below.
Allelic | Additive | Dominant | Recessive |
---|---|---|---|
M=T/2 | M=T | M=1, if T>0, M=0 otherwise | M=1, if T>1, M=0 otherwise |
The option for the additive model is --add, for the dominant model --dom, and for the recessive model --rec.
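
The coding table above can be written out as a small function. This is only an illustrative sketch of the table, not GEAR's own code; the function name `code_genotype` and the model labels are hypothetical.

```python
def code_genotype(t: int, model: str = "allelic") -> float:
    """Convert a reference-allele count T (0, 1 or 2) into the model code M.

    Follows the table: allelic M=T/2, additive M=T,
    dominant M=1 if T>0, recessive M=1 if T>1.
    """
    if t not in (0, 1, 2):
        raise ValueError("T must be 0, 1 or 2")
    if model == "allelic":
        return t / 2
    if model == "additive":
        return float(t)
    if model == "dominant":
        return 1.0 if t > 0 else 0.0
    if model == "recessive":
        return 1.0 if t > 1 else 0.0
    raise ValueError(f"unknown model: {model}")


# A heterozygote (T=1) under each model
for name in ("allelic", "additive", "dominant", "recessive"):
    print(name, code_genotype(1, name))
```

A heterozygote is the genotype on which the four models differ most clearly, which is why it is used in the loop.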
--extract-score
Only SNPs included in both --extract-score and --score/--score-gz will be used for generating profile scores.
--remove-score
SNPs included in --remove-score will not be used for generating profile scores.
--logit
If the scores in the score file are in odds ratio format, GEAR will take their natural logarithm.
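
As a minimal sketch of the transformation that --logit describes, the natural logarithm of an odds ratio can be computed as follows. The function name `odds_ratio_to_effect` and the example odds ratios are hypothetical and are not taken from GEAR.

```python
import math


def odds_ratio_to_effect(odds_ratio: float) -> float:
    """Return the natural logarithm of an odds ratio."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(odds_ratio)


# Hypothetical odds ratios: above 1 gives a positive effect,
# below 1 a negative effect, and exactly 1 gives zero.
for odds_ratio in (1.25, 1.0, 0.8):
    print(odds_ratio, round(odds_ratio_to_effect(odds_ratio), 4))
```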
--keep-atgc
It will keep AT/GC loci in the risk profile. However, the user should confirm whether the genotypes in both discovery and the validation panels are coded on the same reference allele for each locus.
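
To see why A/T and G/C loci are ambiguous, note that the two alleles of such a locus are complements of each other, so flipping the strand yields the same pair of letters. The check below is an illustrative sketch only; `is_strand_ambiguous` is a hypothetical helper, not a GEAR function.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """True when the two alleles are complementary (A/T or G/C)."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


print(is_strand_ambiguous("A", "T"))  # A/T locus
print(is_strand_ambiguous("G", "C"))  # G/C locus
print(is_strand_ambiguous("A", "G"))  # A/G locus, strand can be resolved
```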
--no-weight
When this option is on, the accumulated score will not be divided by 2*M, in which M is the number of the matched SNPs.
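
The effect of --no-weight can be sketched as below. The sketch assumes that the accumulated score is the sum, over matched SNPs, of the score value times the reference-allele count; that assumption and the name `profile_score` are illustrative and not confirmed by GEAR's source.

```python
def profile_score(weights, ref_counts, no_weight=False):
    """Accumulate weight * reference-allele count over matched SNPs.

    By default the sum is divided by 2*M, where M is the number of
    matched SNPs; with no_weight=True the raw sum is returned.
    """
    if len(weights) != len(ref_counts):
        raise ValueError("weights and ref_counts must have equal length")
    total = sum(w * t for w, t in zip(weights, ref_counts))
    if no_weight or not weights:
        return total
    return total / (2 * len(weights))


weights = [1.95, 2.04, -0.98]   # score values for three matched SNPs
ref_counts = [2, 0, 1]          # hypothetical genotypes of one individual
print(profile_score(weights, ref_counts))
print(profile_score(weights, ref_counts, no_weight=True))
```

Under this assumption, dividing by 2*M gives the average of the per-SNP scores under the allelic coding M=T/2.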
--no-score-header
When there is no title line for the score file, this option should be used.
--auto-flip-off
When this option is on, a locus that has flipped alleles in the testing set will not be matched. As genotypes may be called on complementary strands across genotyping platforms, gear matches them by flipping SNPs automatically. For example, the named allele is "A" in the score file, but due to flipping the reported alleles are "T/C" in the validation set. When --auto-flip-off is not specified, gear will flip "T/C" back to "A/G", and consequently match the score to the validation set. Of course, gear presumes the polymorphism is the same across the discovery and the validation sets.
There are four possible schemes for matching a SNP between the discovery and the validation sets; a locus that fits none of them is discarded.
Scheme |
---|
The named score SNP matches the reference allele in the validation set |
The named score SNP matches the alternative allele in the validation set |
The named score SNP matches the flipped reference allele in the validation set |
The named score SNP matches the flipped alternative allele in the validation set |
Matches neither, then this locus will be discarded |
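
The matching schemes in the table can be sketched as a lookup that tries the direct alleles first and the strand-flipped alleles second. This is an illustration under the assumption that flipping means taking the complementary base; `match_scheme` and its return labels are hypothetical names, not GEAR's API.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def match_scheme(score_allele, ref, alt, auto_flip=True):
    """Return which scheme matches, or None if the locus is discarded."""
    s, ref, alt = score_allele.upper(), ref.upper(), alt.upper()
    if s == ref:
        return "ref"
    if s == alt:
        return "alt"
    if auto_flip:  # skipped when --auto-flip-off is requested
        if COMPLEMENT.get(ref) == s:
            return "flipped_ref"
        if COMPLEMENT.get(alt) == s:
            return "flipped_alt"
    return None


print(match_scheme("A", "A", "G"))                   # direct match
print(match_scheme("A", "T", "C"))                   # match after flipping
print(match_scheme("A", "T", "C", auto_flip=False))  # no flipping allowed
```

The score allele is upper-cased before comparison, which mirrors the lowercase-to-capital matching described earlier.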
--qscore snpval.dat
--qrange q.ranges

To calculate multiple scores from subsets of SNPs in a single --score file, use these two options together.

snpval.dat reads like below

    rs00001 0.234
    rs00002 0.046
    rs00003 0.887

q.ranges reads like below

    S1 0.00 0.01
    S2 0.00 0.20
    S3 0.10 0.50

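
The way the two files combine can be sketched as follows, using the values shown above. The sketch assumes that a SNP belongs to a range when its value lies between the lower and upper bound inclusive; the treatment of the bounds and the name `snps_in_ranges` are assumptions, not documented GEAR behaviour.

```python
def snps_in_ranges(snp_values, ranges):
    """Map each range name to the SNPs whose value lies in [low, high]."""
    return {
        name: [snp for snp, value in snp_values.items() if low <= value <= high]
        for name, (low, high) in ranges.items()
    }


snp_values = {"rs00001": 0.234, "rs00002": 0.046, "rs00003": 0.887}
ranges = {"S1": (0.00, 0.01), "S2": (0.00, 0.20), "S3": (0.10, 0.50)}

for name, snps in snps_in_ranges(snp_values, ranges).items():
    print(name, snps)
```

One profile score would then be computed per range, each from its own subset of SNPs.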
Notes
AT/GC loci will be left out if --keep-atgc is not on. --keep-atgc should probably not be turned on unless the SNPs are coded on the same strand for each locus in both the discovery and the validation panels.
Missing genotypes will be automatically imputed using the allele frequency estimated from the available alleles.
When the --score option is not used, the sum of the dosage of the reference allele will be calculated. This is equivalent to supplying a score file with a score of 1 for each SNP. In addition, to count the sum of specific alleles, the user can make a score file in which the reference-allele column names the alleles to count and the scores are all 1. (Deprecated)
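
One common way to impute from allele frequency is to replace a missing genotype with the expected reference-allele count, 2p, where p is the frequency estimated from the observed genotypes at that locus. The sketch below shows that approach; whether GEAR uses exactly this rule is an assumption, and `impute_missing` is a hypothetical name.

```python
def impute_missing(ref_counts):
    """Fill None entries with 2*p, p being the observed allele frequency."""
    observed = [t for t in ref_counts if t is not None]
    if not observed:
        raise ValueError("no observed genotypes at this locus")
    # Each observed individual carries two alleles
    freq = sum(observed) / (2 * len(observed))
    return [2 * freq if t is None else float(t) for t in ref_counts]


# Reference-allele counts at one locus; None marks a missing genotype
print(impute_missing([2, None, 0, 1]))
```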
Example
gear profile --s scorefile.txt --bfile test --out test
gear profile --s scorefile.txt --bfile test --extract-score scExtract.txt --out test
gear profile --s scorefile.txt --bfile test --qscore snpval.dat --qrange q.ranges --keep-atgc --out test
gear profile --s scorefile.txt --mach-dosage test.mldose.gz --mach-info test.mlinfo --out test
gear profile --s scorefile.txt --mach-dosage-batch dose.txt --mach-info-batch info.txt --out test
gear profile --bfile test --out test (note: it counts the sum of the reference allele specified in the mlinfo file of the MaCH output). *(Deprecated)*
gear --bfile test --auto-flip-off --out test *(Deprecated)*
To load a score file in gzip format:
gear --bfile test --score-gz scorefile.gz --out test
Notes
The three model options (--add, --dom, --rec) cannot be applied to dosage files.