Step 2. Methylome data quality control - ariadnacilleros/Cis-mQTL-mapping-protocol-for-methylome GitHub Wiki

Step 2.1. Beta values quality control

The first step to be done with the methylation data is the Quality Control and preprocess, to do so, we will use the PACEanalysis R package by A.Binder. Before anything else, we will execute the loadingSamples() function to load the IDAT files into an RGset with raw microarray intensity values. Next, with the function ExploratoryDataAnalysis(), we will identify possible samples swaps and technical outliers. For this step, we will set:

DetectionPvalMethod = 'SeSAMe' : SeSAMe R package to calculate detection p-values.
DetectionPvalCutoff = 0.05 : Detection p-value threshold of 0.05.
minNbeads = 3 : Mask methylation values based on less than 3 beads.
FilterZeroIntensities = TRUE : Mask intensity values with less than 1.

Next, with the preprocessingofData() function and the outputs returned by ExploratoryDataAnalysis(), we will discard the suggested probes and samples to be removed. On the one hand, the samples will be discarded based on evidence of sex mix-ups, a high proportion of failed probes for that sample, an indication of sample contamination (Meanlog2oddsContamination > -1), low global methylated or unmethylated intensities, and unintentional replicates. To detect possible replicates, you will have to check the dendrogram and the heatmap of the SNPs on the array. On the other hand, the probes to be filtered out are based on the percent of samples that failed for that probe. Furthermore, with preprocessingofData() function, we will also normalize and correct beta values, apply ComBat to control the batch effect, and calculate the cellular composition of the samples using the placenta reference panel implemented in the planet R package (compositeCellType = 'Placenta').

Finally, the outlierprocess() function will be ejecuted to reduce the influence of the outliers with winzorize values based on the 0.005 percentile (pct = 0.005) considering the percentiles estimated on the empirical beta-distribution (quantilemethod = "EmpiricalBeta").

Be careful! You have to make sure that these functions are been executed with the arguments described above. For more information have a look at the function details and the PACEanalysis.R script.

Step 2.2. Prepare BED file for TensorQTL mapping

Once the betas have been pre-processed, you will obtain a data.frame object which comes from the PACEanalysis outlierprocess() function. Before starting filtering CpGs, you must fulfill these requirements:

Make sure the same sample names are the same in methylation and genotype files. Important: in the genotype file, the sample name is the IID, not FID_IID!
Make sure the same samples are present in the methylation and genotype files.

Once you know for sure that you have the same samples in both datasets and that they have the same name, you can start annotating the CpGs by chromosome, start, and end, and filtering out those located on sexual chromosomes. Important, to define the cis-window, the end must be equal to start + 1 (end=start+1), have a look at the example dataset to make sure that you are correctly defining the cis-window, otherwise, you will get an error from TensorQTL software. Additionally, once we have obtained the text file in a BED format, you will calculate the variances of each CpG and the difference between the 90% and the 10% percentiles of their betas.

Finally, using the bash commands provided in the Readme, you will sort it and zip it into a BED UCSC file. An example is provided in this section of the Readme.