Covariate data - genetics-of-dna-methylation-consortium/godmc_phase2 GitHub Wiki
Covariates
Your covariates file should be a whitespace-separated text file with the sample ID, labelled IID, in the first column, followed by one headed column for each covariate. Each column name must specify whether the covariate is a categorical (factor) or continuous (numeric) variable, for example {Name}_factor or {Name}_numeric.
An example of what your file should look like is below:
IID Sex_factor Age_numeric Slide_factor
id1 F 30 12345678
id2 M 42 12345678
id3 M 76 87654321
Required
The following covariates are required - these covariates are the minimum criteria for contributing to GoDMC:
- Sex_factor: A column of M's for males and F's for females
- Age_numeric: In years (can be integer or with decimals)
Please ensure there are no missing values in the Sex and Age columns, even if your samples are all the same age or all the same sex. Please also ensure that the capitalisation of column headers in the example above is followed.
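The rules above can be checked before submitting a file. The following is a minimal sketch and not part of the GoDMC pipeline: the function name `check_covariates` and the wording of its messages are illustrative assumptions, and it only covers the requirements stated on this page.

```python
import math

# Required columns as stated on this page (names are case-sensitive).
REQUIRED = ("Sex_factor", "Age_numeric")


def check_covariates(text):
    """Return a list of problems found in whitespace-separated covariate text.

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these particular checks.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []

    if header[0] != "IID":
        problems.append("first column must be named IID")
    for name in header[1:]:
        if not name.endswith(("_factor", "_numeric")):
            problems.append(f"column {name!r} lacks a _factor or _numeric suffix")
    for name in REQUIRED:
        if name not in header:
            problems.append(f"required column {name} is missing")

    for line_no, row in enumerate(body, start=2):
        # A blank (missing) field shifts the columns, so the count is wrong.
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        record = dict(zip(header, row))
        if "Sex_factor" in record and record["Sex_factor"] not in ("M", "F"):
            problems.append(f"line {line_no}: Sex_factor must be M or F")
        if "Age_numeric" in record:
            try:
                age = float(record["Age_numeric"])
            except ValueError:
                age = math.nan
            if not math.isfinite(age):
                problems.append(f"line {line_no}: Age_numeric is missing or not a number")
    return problems
```

To use it, read the file contents into a string (for example with `open(path).read()`, where `path` is wherever your covariates file lives) and inspect the returned list.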
Optional
The following covariates are strongly recommended (but not essential):
- Slide_factor: The slide/chip/Sentrix ID from the DNA methylation array
- Any other important batch covariates
Not needed
The pipeline will calculate genetic and DNA methylation principal components, so you don't need to add these to the covariates file. Please note that you shouldn't include any other surrogate variables (e.g. from ComBat), as these will capture genetic and/or phenotypic variation that we are interested in.
Cell counts
We use cell counts as covariates in many of the analyses, as interaction terms in the cell-type-interacting meQTL analysis, and as phenotypes in the GWA analysis. If you have directly measured cell counts, we will compare them against the predicted cell counts.
Directly measured cell counts
If you are providing directly measured cell counts, please use the same format as the covariates file and save them in a separate file from the covariates file in the input data folder. You should indicate the filename and path in the config file. Your header should be the same as in the example below.
IID Bcells Tcells Eos Mono Neu Baso
id1 0.01737535 0.0000000 -3.255819e-19 0.06323692 0.8449611 0.09284286
id2 0.07131901 0.0532302 0.000000e+00 0.05687753 0.7669148 0.07237679
id3 0.16438292 0.1637806 -1.042249e-19 0.11164848 0.3932823 0.25849168
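As a rough illustration of the header requirement, the sketch below parses a measured cell count file and rejects it if the header differs from the example above. It is not taken from the GoDMC pipeline: `read_cell_counts` is a hypothetical helper, and the expected header is simply copied from the example.

```python
# Header copied from the example on this page; order and spelling matter.
EXPECTED_HEADER = ["IID", "Bcells", "Tcells", "Eos", "Mono", "Neu", "Baso"]


def read_cell_counts(text):
    """Parse whitespace-separated cell counts into {IID: [values...]}.

    Hypothetical helper for illustration. Raises ValueError if the header
    does not match the example or a row has the wrong number of fields.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError("header must be: " + " ".join(EXPECTED_HEADER))
    counts = {}
    for line_no, fields in enumerate(rows[1:], start=2):
        if len(fields) != len(EXPECTED_HEADER):
            raise ValueError(f"line {line_no}: wrong number of fields")
        # float() accepts both plain decimals and scientific notation.
        counts[fields[0]] = [float(value) for value in fields[1:]]
    return counts
```

The returned dictionary keeps one list of six values per sample, in the same order as the header after IID.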
Cell counts derived from DNA methylation data
The pipeline will generate cell counts for 12 blood cell types from the normalized betas using EpiDISH and the Salas et al. 2022 reference panel, so these don't need to be provided.
Phenotype data for Polygenic Risk Scores (PRS)
Please note that phenotype information is not required to run any of the PRS EWAS analyses, although if available it can help to validate the PRS generated in your sample.
If you have phenotype information for any of the traits for which PRS will be generated, and you didn't contribute the data to the original GWAS, please create one file per trait. Code the phenotype variables so that higher values correspond to higher values of the trait studied (or to higher risk, in the case of a disease/disorder).
For each of these files please follow the same formatting guidelines as for the covariate data: white space separated, with IID in the first column followed by headed columns for the phenotypes specifying whether your variable is categorical (factor) or continuous (numeric): {Name}_factor or {Name}_numeric. In case of a disease/disorder please specify cases as 1 and controls as 0.
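A minimal sketch of the case/control coding described above, assuming your phenotype is held as a mapping from sample ID to the strings "case" or "control". The function name `format_binary_phenotype`, the input representation, and the trait name `Asthma` in the usage note are illustrative assumptions, not GoDMC conventions.

```python
def format_binary_phenotype(statuses, name):
    """Build the text of a phenotype file for one disease/disorder trait.

    statuses: dict mapping IID -> "case" or "control" (assumed input form).
    name: trait name used to build the {Name}_factor column header.
    """
    coding = {"control": 0, "case": 1}  # cases as 1, controls as 0
    lines = [f"IID {name}_factor"]
    for iid, status in statuses.items():
        if status not in coding:
            raise ValueError(f"{iid}: unknown status {status!r}")
        lines.append(f"{iid} {coding[status]}")
    return "\n".join(lines) + "\n"
```

For example, `format_binary_phenotype({"id1": "case", "id2": "control"}, "Asthma")` produces a header line `IID Asthma_factor` followed by one line per sample, which can then be written to a file of its own for that trait.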