Q and A - bvilhjal/ldpred GitHub Wiki
We encourage new questions.
What LD reference panel should I use?
A reference panel that reflects the LD in the summary statistics you're using for training. E.g., if you're using summary stats based on a GWAS in individuals of European ancestry, you should use an LD reference panel with individuals of European ancestry, regardless of the ancestry of your validation/target sample. Ideally, the LD reference panel should also contain unrelated individuals and a large enough sample to estimate LD accurately.
In summary:
- LD panel ancestry should match sum stats ancestry, irrespective of target/validation data ancestry.
- LD panel should contain (relatively) unrelated individuals.
- LD panel should ideally have at least 2000 individuals (it will not work well with fewer than 300 individuals).
- You may run into memory issues if you have more than 5000 individuals, in which case I suggest you downsample.
How many SNPs should I use for LDpred?
I currently recommend anything between 100K and 2M SNPs. In my experience, using more than 2M SNPs typically reduces prediction accuracy, and if you have fewer than 100K you might as well use simpler methods, such as P+T (pruning + thresholding). I generally suggest using the ~1.2M HapMap3 SNPs, the same set that LD score regression uses (see the `--only-hm3` flag in the coord step).
How many individuals should I use for LDpred?
As many as you want for validation. For the LD reference panel, however, we suggest between 2000 and 5000 individuals.
It's using too much memory, what to do?
It's likely your LD reference panel that is too big. Try restricting the analysis to HapMap3 SNPs (see the `--only-hm3` flag in the coord step), and subsample individuals down to 5000. Also set the LD radius accordingly (not too large). Alternatively, you could try installing hickle (see here) and setting the `--hickle-ld` flag when running `LDpred.py gibbs` or `LDpred.py inf`.
What format should my summary stats be in?
You can use `CUSTOM` to define your own format. The following pre-defined formats are supported:
LDPRED
CHR POS SNP_ID REF ALT REF_FRQ PVAL BETA SE N
chr1 1020428 rs6687776 C T 0.85083 0.0587 -0.0100048507289348 0.0100 8000
STANDARD
chr pos ref alt reffrq info rs pval effalt
chr1 1020428 C T 0.85083 0.98732 rs6687776 0.0587 -0.0100048507289348
BASIC
hg19chrc snpid a1 a2 bp or p
chr1 rs4951859 C G 729679 0.97853 0.2083
PGC
CHR SNP BP A1 A2 FRQ_A_30232 FRQ_U_40578 INFO OR SE P ngt Direction HetISqt HetChiSq HetDf HetPVa
...
GIANT
MarkerName Allele1 Allele2 Freq.Allele1.HapMapCEU p N
rs10 a c 0.0333 0.8826 78380
Can I get Nagelkerke R2 using LDpred?
No, currently LDpred does not report Nagelkerke R2.
What LDpred causal variant fraction should I choose for my final polygenic score?
I suggest you try all of the polygenic scores on your validation data and report the accuracy for each of them. Alternatively, you can put them all in a regression and fit it with cross-validation (e.g. in R).
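A minimal sketch of the second option, in Python rather than R. The scores and phenotype here are randomly generated stand-ins for the per-fraction LDpred scores; in practice you would load your actual score files and validation phenotype:

```python
# Hypothetical sketch: combine LDpred scores computed with different causal
# fractions (e.g. p = 1, 0.3, 0.1) into one predictor via linear regression
# on the validation phenotype. All data below is simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500
scores = rng.normal(size=(n, 3))  # one column per causal-fraction score
phenotype = scores @ np.array([0.5, 0.2, 0.1]) + rng.normal(size=n)

# Add an intercept column and fit ordinary least squares.
X = np.column_stack([np.ones(n), scores])
weights, *_ = np.linalg.lstsq(X, phenotype, rcond=None)

combined_score = X @ weights  # fitted combination of the scores
r2 = np.corrcoef(combined_score, phenotype)[0, 1] ** 2
print(weights, r2)
```

For an honest accuracy estimate, fit the weights on one fold of the validation data and evaluate the combined score on held-out folds.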
Can I include related individuals in the validation/target sample?
Relatedness in the validation/target sample is not a concern; however, it is a concern for the LD reference panel.
I'm having trouble with covariates, can you help me?
Currently, LDpred's handling of covariates is limited. I suggest that you instead load the resulting scores and covariates into R and obtain the final variance explained using glm or lm.
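The same calculation can be sketched in Python: compare the R2 of a model with covariates only against a model with covariates plus the score, and take the difference as the variance explained by the score. The covariates and phenotype below are invented for illustration:

```python
# Hedged sketch (Python stand-in for the suggested R glm/lm approach):
# estimate the variance explained by the polygenic score beyond covariates
# by comparing R^2 of nested ordinary-least-squares models. Data simulated.
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X already includes an intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 400
covariates = rng.normal(size=(n, 2))   # e.g. sex, age, ancestry PCs
score = rng.normal(size=n)             # the LDpred polygenic score
y = 0.4 * score + covariates @ np.array([0.3, -0.2]) + rng.normal(size=n)

ones = np.ones((n, 1))
base = np.hstack([ones, covariates])
full = np.hstack([ones, covariates, score[:, None]])

# Incremental R^2: variance explained by the score on top of covariates.
incremental_r2 = r_squared(full, y) - r_squared(base, y)
print(incremental_r2)
```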
Where can I find the Hapmap3 data?
It’s described here: https://www.broadinstitute.org/medical-and-population-genetics/hapmap-3
More specifically, LDpred uses the same set of SNPs as LDscore (Bulik-Sullivan et al., Nat Genet 2015) uses, i.e. https://data.broadinstitute.org/alkesgroup/LDSCORE/hapmap3_snps.tgz
Can LDpred be applied to binary/case-control traits?
The short answer is yes. LDpred converts p-values to z-scores internally, which in the case of a binary trait can be thought of as effects on some sort of liability scale. However, it only reports R2 in the score step.
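To illustrate the kind of conversion described (this is not LDpred's actual code, just the standard two-sided p-value to signed z-score transform):

```python
# Sketch: a two-sided p-value plus the sign of the effect estimate
# gives a signed z-score via the inverse normal CDF.
from statistics import NormalDist
import math

def p_to_z(pval, effect):
    """Signed z-score from a two-sided p-value and the effect direction."""
    z = NormalDist().inv_cdf(1 - pval / 2.0)
    return math.copysign(z, effect)

print(round(p_to_z(0.05, -0.1), 3))  # prints -1.96
```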
What do the --f/--p/--r2 flags do in LDpred score?
They are used to locate the files generated in the previous step. They are only necessary if you used non-default values in the preceding step.
If OR are used in sum stats, how does the P+T implementation use them?
The P+T implementation then uses log-transformed odds ratios (log OR) as effect sizes.
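A one-line illustration of that transform: an odds ratio of 1 maps to an effect of 0, ORs above 1 to positive effects, and ORs below 1 to negative effects.

```python
# OR -> log OR, the effect-size transform mentioned above.
import math

odds_ratios = [0.97853, 1.0, 1.25]
effects = [math.log(orr) for orr in odds_ratios]
print(effects)  # log(1.0) == 0.0; signs follow OR relative to 1
```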
I emailed you, when will you answer?
I typically answer LDpred emails in batches, hence a typical waiting time is a week or two or five. If you're desperate, then email again without LDpred in the title...