Quality Control - sups-k/methylation GitHub Wiki
Minfi's All-In-One Function
Minfi contains a function qcReport(rgSet, sampNames = targets$Sample_Name, sampGroups = targets$Sample_Group, pdf = filename)
that performs quality controls on the raw IDAT intensities.
Density Plot
The first plot it produces is a density distribution of beta values. To obtain beta values,
Beta value = (methylated intensities) ÷ (methylated intensities + unmethylated intensities)
The Y-axis corresponds to density, which means the number of elements in the MethylSet (in scientific notation), and the X-axis corresponds to beta value. There should be 2 peaks in the graph, one corresponding to 0% methylation and one for 100% methylation. Ideally, these 2 peaks should have the same density to indicate that there is an equal distribution of methylated and unmethylated CpG regions, as is the case in the DNA.
If the sample groups are given, the density curve for each sample will be coloured according to the group it belongs to (like RA and healthy).
Density Bean Plot
This is the second plot produced. It can be thought of as a box and whisker plot of the range of beta values per sample. All of the medians should be approximately the same. If they aren't, it means the data needs to be normalise.
Control Probes Plot
The 450k array contains several internal control probes that can be used to assess the quality control of different sample preparation steps. The values of these control probes are stored in the initial RGChannelSet and can be plotted by using the function controlStripPlot and by specifying the control probe type.
The probes used to generate plots are:
- Bisulfite Conversion: Tests efficiency of bisulphite conversion by query of C/T polymorphism. Indicates methylation at a site known to be methylated. There are 2 types of these probes.
- Extension: Tests efficiency of extension of A, T, C and G nucleotides from a hairpin probe (sample-independent). The perfect match hairpin controls should result in high signal and the mismatch probes in low signal.
- Hybridisation: Tests the overall performance (hybridisation efficiency) of the Infinium assay using synthetic targets (instead of amplified DNA) at 3 concentrations.
- Non-polymorphic: Queries a non polymorphic base A, T, C and G to test methylation at a base in a non-polymorphic region of the genome.
- Specificity: Checks for non-specific detection of methylation signal over unmethylated background. Specificity controls are designed against non-polymorphic T sites (G/T mismatch). There are 2 types of these probes.
- Target Removal: Efficiency of stripping step after extension reaction.
Apart from these probes, there are 2 negative control probes present:
- Normalisation: Randomly permutated bisulphite-converted sequences containing no CpGs. Determines system background.
- Staining: Efficiency and sensitivity of staining step.
Both these negative control probes should not hybridise to DNA. The mean of these probes determines the system background.
Sources:
- https://cordis.europa.eu/docs/results/279/279143/final1-final-report-combined-tables-and-figures-and-references.pdf
- http://www.bristol.ac.uk/caite/geocode/newcastleshortcourse/assaydesignsandprotocols.pdf
Other Function From Minfi - QC Plot
The quality control plot uses the log median intensity in both the methylated (M) and unmethylated (U) channels. This means for each sample, the median of all probe intensities in each channel, methylated and unmethylated, is calculated. Then the log2 of these medians is taken. The U median is plotted on the Y-axis while the M median is plotted on the X-axis. Each such point represents one sample. When plotting these two medians against each other, it has been observed that good samples cluster together, while failed samples tend to separate and have lower median intensities.
getQC(x)
where x is a MethylSet or GenomicMethylSet.
Box And Whisker Plots
To make box and whisker plots, we first obtain the methylated and unmethylated intensities separately using the functions getMeth
and getUnmeth
, which can take in a MethylSet or GenomicMethylSet. The next step is to obtain the natural log of these intensities. Then we create 2 different plots, one for methylated and one for unmethylated intensities using the function box plot()
.
Density Plots With MissMethyl
The overall density distribution of the beta values for a single sample is plotted. The density distributions of the Infinium I and II probes are then plotted individually, showing how they contribute to the overall distribution. This is useful for visualising how using SWAN affects the data. The function used is densityByProbeType(x)
where x is a MethylSet or a vector of beta values. If a MethylSet is given, the minfi function getBeta()
is used to obtain beta values.
Cluster Plots
Scatter Plot Using limma
This is plotted to find out if batch normalisation is required. Ideally, the RA and the healthy samples should cluster separately. Using a function in limma called plotMDS()
, samples are plotted on a 2D scatterplot so that distances on the plot approximate the typical log2 fold changes between the samples. It takes in M-values as its input.
M-value = log2( (meth + 100) ÷ (unmeth + 100) )
An offset of 100 is added in calculating the M-value as per recent literature.
Hierarchical Clustering
This is another method to see how the samples cluster. Ideally, there should be 2 main clusters of RA and healthy samples. First, a distance matrix for the M-values is computed using the Eucledian distance measure. The function used is dist()
. After that, the function hclust()
performs a hierarchical cluster analysis using the distance matrix for the 'n' objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage, distances between clusters are recomputed by the Lance–Williams dissimilarity update formula according to the particular clustering method being used. We have used the complete linkage method which finds similar clusters.
The Lance-Williams formula specifies how dissimilarities are computed when clusters are agglomerated (equation (32) in K&R(1990), p.237). If clusters C1 and C2 are agglomerated into a new cluster, the dissimilarity between their union and another cluster Q is given by
D((C1 ∪ C2) , Q) = α1 * D(C1, Q) + α2 * D(C2, Q) + β * D(C1,C2) + γ * |D(C1, Q) - D(C2, Q)|,
where the four coefficients (α1, α2, β, γ) are specified by a vector of length 4.