MethylationQC - folkehelseinstituttet/mobagen GitHub Wiki

Table of contents

#Introduction

The QC is based on the minfi libraries.

It is rerun whenever needed and has the same structure for all data-sets.

Set splitting

As mentioned below, some of the larger sets have been split into smaller sets. In a future version, we anticipate to merge results into larger files. However, having smaller files is not necessary bad: The result files are of such size that unless you have very much memory, you will not be able to handle them efficiently.

In addition child and parents will be split, due to properties of the blood that was used: blood for parents, umbilical cord blood for newborn.

Feel free to merge them yourself if you want.

Directories and results available

Results of the QC are found under each individual set, in a subdirectory called QC. Child/parent splitting my occur if relevant.

Resultfiles contain numberes that can be traced back to the QC-step. For example, met009\_child-01_4_probes_failed_per_sample_p0.01.csv is from set met009, child-01 batch, step 4.

input

The yaml-files found here document various parameters used during the QC like, QC thresholds and methylation-chip used.

Please note that a global config files exists on a set denoted MetCommon. The file is called MetCommon/QC/input/globalConfig.yaml and should be identical for all sets.

The *.yaml files are specific for the set. Some sets are split into child and parents, this being reflected in the yaml-files.

Files named *_manual_remove_samples_bad_density.txt are actually result of previous runs of the QC, documenting what samples were removed due to poor quality/density. These samples still exist in the raw-data.

results

Numeric/matrix results of the QC are found in a directory called ... results. More documentation to come - CpGs

All files are prefixed by the setname and an underscore. In addition, they might be split in child/parent and sub-batches as previously described under sensitive_plots.

Result-matrices are available as .csv and .rds files. .rds files are easier to work with for R users, as they are much faster to load into R and already in the correct format in terms of colnames and rownames.

  • 2_probes_removed.csv 2 column matrix of CpG-ids removed and why: "cross_hybridizing", "polymorphic" (SNP influenced), "high_detection_p"

  • 2_SNP_Betas.rds Matrix of beta values for the 65 (450k) or 60 (EPIC) SNPs on the array. Rows - rs id, column - samples

  • 3_control_probe_PCA.rds The PCA object of PCA analysis from control probes described under plots.

  • 4_bisulphite_conversions.csv The calculated bisulphite conversion rate for each sample

  • 4_num_probes_with_proportion_failed_samples_p0.01.csv For each sample, the fraction of failed probes and the total of failed probes, compared to the detection p value cut off (0.01)

  • 4_probes_failed_per_sample_p0.01.csv For 5 different proportions (0.5, 0.1, 0.05, 0.01, 0.0033), how many CpGs have a bad detection p value for more than the give proportion of samples

  • 8_filtered_samples.csv 2 column matrix of sample ids removed and why: "NA_control_probes" (too many NA/INF values), "bisulphite" (low bisulphite conversion), "low_median_meth_or_unmeth_channel" (less than 10.5 mean median meth and unmeth log2-intensity), "manual_removed_bad_density" (removed due to having outlying values in either the control_probe_boxplot or genome-wide density), "pvalue" (removed samples with less than 90% detected probes based on detection pvalue cutoff)

  • 9_estimated_cell_proportions.csv Matrix with samples as rows and cell type proportions estimated as columns)

  • 11_added_sex_prediction_to_pheno.csv (Sample sheet data with additional columns for predicted sex, median intensity across Y chromosome probes, and median intenisty across X chromosome probes

  • 12_pvalueMat_sex_chr.csv

  • 12_pvalueMat_autosomal.csv

  • 12_Noob_beta_matrix_sex_chr.csv Beta values from CpGs at sex chromosomes; rows - CpGs, columns - samples

  • 13_beta_values.csv Beta values from autosomal CpGs before BMIQ; rows - CpGs, columns - samples

  • 13_bmiqed_beta_values.csv Beta values from autosomal CpGs after BMIQ; rows - CpGs, columns - samples)

plots

All plots are found the set/QC directory. As for the idat-files there might be an extra directory layers for sets that have both child and parent individuals.

There are two kind of plots: sensitive and non-sensitive. The directories contains plots produced during the QC. A subset of these plots (those without sample-info) are currently (January 2022) piloted for browsing on a public github project. The rest are only available for those having been granted access to the data itself.

sensitive_plots

These are plots with one .jpeg plot for each individual in the set (possibly split in child and parent as mentioned above). The QC does remove certain samples as it moves forward (starting with elements removed manually in step 3 and documented on *_manual_remove_samples_bad_density.txt on the input directory (see above)).

Removed samples are documented under results sub-directory (see below) - meaning that the number of plots found will gradually decrease.

Subdirectories are

  • 3_boxplots boxplot of log2()-values from intensity measures from control probes for each sample.

  • 6_density_plots genomewide density from RAW methylation values compared to after ssNoob is applied. Estimated from a random (set.seed) selection of 50000 CpGs.

  • 14_density_plots Similar to 6_density_plots, but with genomewide density after BMIQ is applied as well

plots

Due to current limitations, large sets have been split into smaller batches and this is reflected on both the results (see below) and the plots. The number of splits might be reduced in later versions of the QC.

All names below are generic, but might have the set number (child/parent) in them. Using 3_control_probe_PCA_plot.pdf as an example met004_child-02_3_control_probe_PCA_plot.pdf is for the met004 set - a set split in child and parents. Furthermore it is the second batch of childs.

The underlying data for the plots are found under the results directory, see below.

The plots found are

  • 3_control_probe_PCA_plot.pdf PC2 against PC1 from PCA of log2()-values from intensity measures from control probe.

  • 8_qc_plot.pdf minfi qc plot - median intensity from methylated vs unmethylated intensities. A mean of these being less than 10.5 indicates bad sample/low signal

  • 11_sex_plot.pdf The median intensity signal from chromosome Y probes against X probes, indicating Male or Female

  • 15_histogram_of_differences.pdf Comparison of RAW vs ssNoob methylation values. The difference of the two matrices are calculated, then the rowMeans (plot 1) and rowSds (plot 2) are calculated and plotted as histograms. The same is done between ssNoob methylation values and BMIQ methylation values. These plots are for documenting the difference after each major normalization method is applied.

QC Pipeline Workflow

(The source code will probably be available for all to see on GitHub).

Step 0 - splitting large datasets

Memory limitations will cause systems to crash/use slowly so big sets are split. This is configured in MetCommon/QC/input/globalConfig.yaml

Step 1 - Build RGset

  • load iDat files to RGset (all information in minfi object format)

Step 2 - Filter probes from RGset

  • poor quality probes (detection p-value larger than 0.01 for more than 5 % of the samples)
  • cross-hybridizing probes (probes shown to have more than 1 possible binding site in the genome)
  • non-CpG methylation (CpH, (H = A, C or T))
  • probes influenced by SNPs with higher than 0.05 MAF. SNP placed in interrogated CpG or at the neighboring positions

Step 3 - PCA plot/boxplots of control probes

  • PCA plot: PC2 against PC1 from PCA of approximately 1200 Red and Green channel control probes
    • meant to use as documentation and to get intuition of possible amount of outliers
  • boxplot of control probe values for each sample
    • if a sample has outlying pattern, such as very narrow range of values -> remove sample

Step 4 - Filter samples

  • filter samples with:
    • more than 25% missing/infinite values for control probes
    • more than 10% probes with high detection p value (> 0.01)
    • less than 90% bisulphite conversion rate (calculated with bscon() function, which uses bisulphite conversion control probes to estimate conversion rate)

Step 5 - Preprocess RGChannelSet to MethylSet widt ss-Noob

  • single-sample-(normal-exponential out-of-band) method used for background subtraction and dye-bias correction. It uses out-of-band probes to estimate a normal-exponential background signal for each color channel, and normalizes the signal from regular probes based on these estimated distributions to obtain a purer signal where the background is removed. This algorithm only uses information within each sample to correct each sample. It has been shown to be one of the better background subtraction and dye-bias correction methods: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408810/
  • from the authors of ssNoob: "We note that on the Beta value scale, there is no difference between values returned by Noob or ssNoob. Differences are confined to the methylated and unmethylated signals."

Step 6 - Genomwide density plots

  • plotting of genomewide densities of RAW methylation data and after ssNoob is applied
    • if a sample has clear outlying distribution from the others, remove sample

Step 7 - Filter out samples that failed manual QC

Manual inspection of earlier results (mainly PCA (step 3) and genomwide density plots (step 6) resulted in samples that will be removed from further QC due to poor quality.

These are documented in QC/input/met*_manual_remove_samples_bad_density.txtfor every set.

Step 8 - QC plot for selected preprocessing method

Step 9 - Estimate cell type proportions

Step 10 - Convert GenomicMethylSet to GenomicRatioSet and add annotations

Step 11 - Sex prediction and removement of discordant gender/predicted sex samples

  • measures the median intensity values from combined methylated and unmethylated signal from Y chromsome probes and the same for X chromosome probes. If the differences between these medians is more than 2 (less than -2) a female is predicted.

Step 12 - Seperate out sex chromosome measurements from autosomal measurements

Step 13 - BMIQ on beta values from GenomicRatioSet

The Beadchips (450k/EPIC) have 2 different probe designs, I and II. These have a clear difference in dynamic range for the obtained methylation measurements.

BMIQ projects the global distribution of probe type II measurements onto the distribution of probe type I, meaning that the methylation values from probe type II measuements are slightly shifted. This is especially an usual step when analyzing regions of data, i.e. looking for different methylation regions. It is also believed that the dynamic range of probe type I is more correct than probe type II.

BMIQ is calculated only within each sample.

Step 14 - Plot density after BMIQ

plots created of genomewide distribution of methylation values after BMIQ is applied.

Step 15 - Plot histogram of differences

Plot histogram of differences between raw methylation values and after ssNoob is applied, and between methylation values after ssNoob is applied and BMIQ is applied. These plots are for documenting the difference after each major normalization method is applied.

(Step 16 - Clean up)

Clean up and delete unnecessary temporary data stored at intermediate steps.

More information

  • Detection p value: a detection p value greater than 0.01 should not be trusted. It is calculated by comparing the total dna signal (methylated + unmethylated) for each position and sample to the background signal level. The background is estimated using the negative control probes, assuming a normal distribution. Calculations are performed on the original (non-log) scale.

  • Negative control probes: specifically design not to match the human genome, reads/signal from these probes in red and green channel to estimate the background signal distribution.

  • Out of band probes: for a given type I bead, the intensity from the unused color channel has been proposed as a means of estimating background signal, and termed the out-of-band intensity.