Case Study: Raman Spectroscopic Grading of Gliomas - janeshdev/ggplot2 GitHub Wiki

ggplot2 and this Case Study

With more than 35000 spectra (= rows in wide format), the data set is too large to be handled conveniently inside the ggplot2 plotting functions. This case study therefore shows different examples of pre-calculating the actual data that is to be plotted, and then using ggplot2 to produce the plot.

The provided range of statistics that can easily be calculated by ggplot2 is an advantage that is basically lost due to practical issues (mainly RAM consumption, but also calculation time) for this data set.

However, a second advantage of ggplot2 is that it is actually also easy to plot pre-calculated data. Besides allowing to handle large data more efficiently (by taking into account the particular data structure), this also means that the plotted statistics can easily be extended. This possibility should not be under-rated: e.g. producing 3-class 2d-histograms below in lattice requires hacking package hexbin's hexagon drawing function.

I present here three different plots demonstrating:

  • tabulating and geom_hstep ()

,

  • plotting of weighted quantile spectra

, and

  • plotting of 2d histogram with colours according to different classes

.

The code is available in two .R files: function definitions and the source of the actual plots. The remainder of this page is a mix of explanations and the respective function definitions and commands, following my lines of thought for producing the plots. However, if you plan to use the examples in order to produce similar graphs yourself, the two .R files are probably easier to start with: source the -functions file, and then you'll hopefully only need to change the code in -plots.

The Case Study is very much focused on the why and how of producing the plots, and contains just few hints about the application behind. A paper dealing with the analysis of these data was recently submitted, and I'll put a link in here once it is available. You may also ask me by email for the manuscript of the paper.

The data itself cannot yet be made publicly available.

Background of the Work

The background of this work is the classification of tumour tissues using their Raman-Spectra. A detailed discussion can be found in C. Beleites et al.: Raman spectroscopic grading of astrocytoma tissues: using soft reference information, Anal Bioanal Chem, 400, 2801-2816 (2011).

Gliomas are the most frequent brain tumours, and astrocytomas are their largest subgroup. These tumours are treated by surgery. However, the exact borders of the tumour are hardly visible. Thus the need for new tools that help the surgeon to find the tumour border. A grading scheme is given by the World Health Organization (WHO).

From a surgical point of view, three classes are of interest:

  • normal tissue, class N: every tissue that is not tumour tissue (normal tissue + a small amount of gliotic (scar) tissue. These tissues must be preserved during surgery.
  • low grade tissue, class A°II: morphologically like astrocytoma WHO grade II tissue. These tissues should ideally be removed, as the patient's prognosis depends on the complete removal. However, they are often preserved in order to be sure not to remove any normal tissue.
  • high grade tumour tissue, class A°III+: morphologically like astrocytoma WHO grade III or glioblastoma or necrotic tissue. These tissues must be removed.

I visualize the classes with the following colours:

  • green for N
  • blue for A°II
  • red for A°III+
    cols <- c (N = "#008000", `A°II` = "#0000A0", `A°III+` = "#FF0000", all = "black")
    dimcols <- c (N = "#00800080", `A°II` = "#0000A080", `A°III+` = "#FF000080")
    The colours were chosen mainly for printed publication on a white background with light gray grid. For most on screen purposes, however, ggplot's standard choice of white grid lines on light gray background looks much better, and I therefore do not apply my version of theme_bw () to the presented graphs. With the light gray background, ggplot's colour choice is also nicer, but not quite practicable for the last plot. Thus, I opt for consistency in the colours for all plots (and also regardless whether printed, viewed here on a computer monitor, or via beamer in a presentation).

Raman spectra were measured for each section on a grid, and reference diagnosis was obtained. The tumours tend to de-differentiate and are polymorphous (spatially heterogeneous). In addition, astrocytomas do grow particularly infiltratively. Therefore, one may encounter tissues that are between the classes defined above.In addition, it is sometimes not possible to properly transfer the reference diagnosis onto the measurement grid (due to deformation of reference section and/or measured section). Such spectra are considered to belong partially to more than one class.

Spectra, that belong completely and exactly to one of the classes are called hard, soft spectra have membership values somewhere between 0 and 1 that express a degree of belonging to the different classes. E.g. a spectrum of a tissue that is at the transition from °II to °III would be labelled 50% for each of the two classes.

Overview of the Data Set:

Class Patients Spectra Patients Spectra
Membership crisp crisp soft soft
Normal 15 7290 34 15401
thereof Controls 9 4902 9 4902
Low grade tumour 16 4168 45 18670
High grade tumour 27 8279 52 21342
Total 52 19737 78 36391

I have the data as a hyperSpec-object, but I cannot (yet) make it available to the public.

library (ggplot2)
library (hyperSpec)
load ("astrocytomas.RData")
astro
hyperSpec object
   36391 spectra
   9 data columns
   256 data points / spectrum
wavelength: Delta * tilde(nu)/cm^-1 [numeric] 755 760 ... 3025 
data:  (36391 rows x 9 columns)
   1. patient:  [numeric] 46 46 ... 1630 
   2. sample:  [numeric] 46 46 ... 1630 
   3. section:  [numeric] 6 6 ... 3 
   4. measurement:  [factor] 46.6 46.6 ... 1630.3 
   5. x: x/(mu * m) [numeric] 7659 7992 ... 9800 
   6. y: x/(mu * m) [numeric] -12654 -12654 ... -7400 
   7. spc: I/"a.u." [matrix256] -0.1722 -0.2330 ... 0.06374 
   8. label: label [AsIs matrix x 3] 1 1 ... 1 
   9. label.factor: label.factor [factor] N N ... A°III+ 
Columns patient, sample, and section are pretty self-explanatory. The patient number is always the number of the first sample of that patient. The measurement number is (here) simply a concatenation of the sample and section number. x and y are the coordinates (in μm) where the spectrum was taken, and spc holds the spectra matrix (with 256 columns). label holds the reference class memberships (in the range between 0 and 1, with all rowSums 1). label.factor divides the spectra in four groups: one for each class (holding the respective crisp spectra) and one with all soft (ambiguous) spectra.

The Reference Class Memberships

About half of the spectra belong (partially) to more than one class. This can be visualized by plotting the fraction of all spectra that do belong with at least the given value to the respective class.

Basically, I want to plot the sorted membership values against the number of the spectrum. However, waiting for a plot with close to 100000 points is not going to be fun. Besides, only few points are really needed, if geom_step () is used instead of geom_point () or geom_line ().

Function fraction () does the sorting and tabulating, and prepends the correct start point:

fraction <- function (max.membership){
  max.membership <- round (max.membership, digits = 3)
  y <- sort (unique (max.membership), decreasing = TRUE)
  x <- tabulate (factor (max.membership, levels = y))
  x <- cumsum (x)
  x <- x / tail (x, 1)
  x <- c (0, x)
  y <- c (y [1], y)

  data.frame (x = x, y = y)
}

A helper function that is used also further below is rbind.w.name () which rbinds the elements of a list and uses their names as grouping variable (factor) in column class:

rbind.w.name <- function (l){
  n <- names (l)
  for (i in n)
    l [[i]]$class <- i

  l <- do.call (rbind, l)
  l$class <- factor (l$class, levels = n)

  l
} 

Now everything needed to calculate the memberships is at hand: I first extract for all three classes the memberships that are larger than 0, plus the maximum class membership for each spectrum. Then they are tabulated and combined into a single data.frame for plotting.

memberships <- c (list (all = apply (astro$label, 1, max)),
                  apply (astro$label, 2, function (x) x [x > 0]))
before <- sapply (memberships, length)  # curiosity: how much does this approach reduce the data to plot?
memberships <- lapply (memberships, fraction)
after <- sapply  (memberships, nrow)
memberships <- rbind.w.name (memberships)
> summary (memberships)
       x               y            class   
 Min.   :0.000   Min.   :0.020   all   :31  
 1st Qu.:0.541   1st Qu.:0.300   N     :24  
 Median :0.688   Median :0.700   A°II  :33  
 Mean   :0.668   Mean   :0.605   A°III+:29  
 3rd Qu.:0.855   3rd Qu.:0.900              
 Max.   :1.000   Max.   :1.000   
> dim (memberships)
[1] 117   3
> before / after
   all      N   A°II A°III+ 
1173.9  641.7  565.8  735.9 
Tabulating reduced the number of points per line about 2 - 3 orders of magnitude.

Now, here's the plot:

ggplot (data = memberships, aes(x = x, y = y, colour = class)) +
  geom_step () +
  scale_colour_manual ("class", value = cols) +
  scale_x_continuous (name = "Fraction of Spectra", expand = c (0, 0), limits = c (0, 1.005)) +
  scale_y_continuous (name = "Class Membership", expand = c (0, 0), limits = c (0, 1.005)) 

Almost half of all spectra do not belong exclusively to one of the three classes. The situation for the low grade tissues is most difficult: not only is this class biochemically "between" the normal and high grade tumours (and will thus probably suffer from confusion with both other classes), but only a small fraction (about 1/4) of the spectra with low grade tissue membership belong completely to this class.

The Spectra

Spectral intensities at a certain wavelength (or here, rather, Raman shift) are often not normally distributed. Instead of looking at the mean ± standard deviation, I plot the 16th, 50th (median), and 84th percentile.

Weighted quantiles take into account the partial class memberships of almost half of the spectra. Package hmisc provides the function. Again lapply () and rbind.w.name () are used.

library (Hmisc)
wtd.percentilespc <- function (weights, spc, probs = c (.16, .5, .84)){
  spc <- spc [weights > 0]              # reduce calculation time by removing unrelated spectra
  weights <- weights [weights > 0]

  spc <- apply (spc, 2, wtd.quantile, weights = weights, probs = probs, normwt = FALSE)
  spc$percentile <- probs

  spc
}

spc <- apply (astro$label, 2, wtd.percentilespc, spc = astro [, "spc"])
spc <- rbind.w.name (spc)

The spectra were "centered" by subtracting the mean spectrum of normal gray matter as part of the pre-processing (graunorm). While this gives a good data origin for the classification, the spectra are better displayed uncentered.

spc <- sweep (astro, 2, graunorm, `+`)

The conversion from hyperSpec object to a long-format data.frame is done by as.long.df (). The Raman spectra consist of two spectral ranges (less than 1800 1/cm and above 2800 1/cm). The unnecessary part of the x-axis is usually cut, and I use faceting (grouping variable =wlrange=) to achieve the effect. I want to have the 16th and 84th percentile spectra as band around the median. This is easiest achieved by casting the data.frame again, so that each percentile is in its own column.

df <- rbind (cbind (as.long.df (spc [,, min ~ 1800]), wlrange = "low"),
             cbind (as.long.df (spc [,, 2800 ~ max]), wlrange = "high"))

df <- cast (df, .wavelength + class + wlrange ~ percentile, value = "spc")

Now everything is ready for plotting

p <- ggplot (data = df) +
  geom_ribbon (aes (x = .wavelength, ymin = `0.16`, ymax = `0.84`, fill = class), col = "black", size = 0.15)  +
  scale_fill_manual ("class", value = dimcols) + 
  geom_line (aes (x = .wavelength, y = `0.5`), size = 0.25)

The spectra are faceted according to the spectral range and the class. However, the spectral range is redundant as description, so I set the labeller to leave it out (I didn't find out how to suppress the column facet labels only, while still having the row labels). Also, the axes get their proper labels (which are stored as expressions in the hyperSpec object). The colour legend is not necessary, as colours are only in addition to the row labels and is therefore suppressed, too.

p <- p + facet_grid (class ~ wlrange, scales = "free_x", space = "free",
                     labeller = function (variable, value){
                       if (variable == "wlrange") "" else value
                     }) + 
  ylab (labels (spc, "spc"))  +
  scale_x_continuous (name = labels (spc, ".wavelength"), breaks = seq (800, 3000, 200), expand = c (0, 50))  +
  opts (legend.position = "none")

p

The spectra of the three classes show differences mainly in the "high wavenumber region" above 2800 1/cm. This is were C-H stretching vibrations give signals. As lipids contain lots of C-H bonds, lipids are a main contributor to these signals (other biomolecules such as proteins have C-H bonds, too, but much less. In addition, the lipid content of brain tissues is high). Lower Raman intensities with increasing malignancy of the tissue in this region are attributed to lower lipid contents of the tissue - which is in fact reported in the literature. However, the distributions of the spectra overlap heavily. This is of course partly due to the fact that almost half of the spectra do belong to a mixture of the classes. On the other hand, it certainly means that sepation of the classes is difficult. Classifiers will need to pick up small differences over larger spectral regions. Subtracting the mean spectrum of gray tissue as a reference spectrum, the differences are already more pronounced (not shown, but that data is in fact used for the classification).

Descriptive LDA

As third plot I show how to create a projection of the spectra into linear discriminant space. For 3 class classification, the LD space has two dimensions which can be nicely displayed. But 36000 points are even with the "transparency trick" not going to give much insight. I thus create a 2d histogram, coloured according to the classes (or class mixtures).

library (MASS)
lda <- lda (label.factor ~ spc, subset (astro$., label.factor != "soft"))
desc <- predict (lda, astro [astro$label.factor != "soft"]$.)$x

The classes should be indicated by their usual colours. The underlying histogram is easily computed by hexbin (). Package hexbin's function hexTapply () allows to apply a function to each of the hexagons calculated by hexbin (). I count the class memberships for each class separately and store the result in a matrix inside a data.frame together with the hexagons' x and y coordinates.

hist2d <- function (labels, desc, xbins, rng = range (desc)){
  h <- hexbin (desc, xbins = xbins, xbnds = rng, ybnds = rng,IDs = TRUE)
  
  counts <- hexTapply (h, 1 : nrow (labels),
                       function (i, labels){
                         colSums (labels [i,, drop = FALSE])
                       },
                       labels = labels)
  counts <- t (matrix (unlist (counts), nrow = length (counts [[1]])))
  colnames (counts) <- colnames (labels)

  data.frame (hcell2xy (h),  counts = I(counts))
}

h <- hist2d (astro$label, desc, 80)

Next, the counts need to be translated into colours for filling the hexagons. As I do not yet know how I can do a multi-colour mix in a perceptive space, I stick to a brute RGB mix. I prefer mixing against white:

colmix.rgb <- function (x, purecol, against = 1, min = 0, max = 1, sub = TRUE){
  if (is.character (purecol))
    purecol <- t (col2rgb (purecol)) / 255

  if (sub)
    x <- against - x %*% (against - purecol)
  else
    x <- x %*% purecol
  
  x [x < min] <- min
  x [x > max] <- max

  cols <- rep (NA, nrow (x))
  cols [! is.na (x [,1])] <-   rgb (x [!is.na (x [, 1]),])

  cols
}

The colour mixing function expects the input matrix to have values between 0 and 1, thus I divide each column of the counts by its maximum. As the classes are quite different in sample size, I find this much better. The legend will give the translation to the counts.

h$col <- I (sweep (h$counts, 2, apply (h$count, 2, max), `/`))
h$col <- colmix.rgb (h$col, purecol = cols [1:3])
Now the data is ready for plotting. The only thing to remember is that geom_stat () by default does binning, which must be overwritten by StatIdentity here as the binning is done already. Besides, for some reason there must be a grouping variable or the size of the hexagons will be completely wrong
p <- ggplot (data = h, aes (x = x, y = y, fill = colour, colour = colour, group = 1)) + geom_hex (stat = StatIdentity) +
  scale_fill_identity() + scale_colour_identity()  +
  opts (panel.background = theme_rect(fill = NA, colour = "gray75"),
        panel.grid.major = theme_line(colour = NA), 
        panel.grid.minor = theme_line(colour = NA),
        plot.margin = unit(c(0.5, 0, 0 ,0), "lines")
        )  +
  scale_x_continuous (limits = c (-3.2,3.5)) + scale_y_continuous (limits = c (-3, 2)) + coord_equal () + 
  labs (x = "LDA 1", y = "LDA 2")
 
The default background and grid options do not go nicely with the hexagons, and are removed via opts (). Plot and calculation need to take into account the aspect ratio of the axes. The hexagons were calculated using the same ranges for x and y and are plotted with coord_equal () so that the hexagons do not get squeezed. This calls for manual setting of the axis limits.

In addition to this histogram, I want to draw contours into the plot that include 50% of the hard spectra of each class. I construct these contours by using hull polygons (package sp) and peeling of shells, until only half of the spectra are left inside the hull polygon. I think of this as 2d quantiles, and the 50% contour is something like the interquartile range in 2d. Going on with the peeling, I arrive at the "2d median". As the peeling function is a bit longer and not of immediate concern for the graph, I do not display the code here (it is available nicely commented in function definitions). I add each contour immediately to the plot as a data.frame holding all three contours and a grouping variable needs much longer to display.

library (sp)

desc <- desc [astro$label.factor != "soft", ]
median <- data.frame ()
for (class in seq_along (colnames (astro$label))){
  contours <- peel (desc, weights = astro$label [, class] > 1-1e-5, probs = c (0, .5))
  tmp <- apply (desc [contours [[1]],], 2, median)
  median <- rbind (median, data.frame (x = tmp [1], y = tmp [2], class = class))

  tmp <- data.frame (desc [contours [[2]],], class = class)
  colnames (tmp) <- c ("x", "y")
  p <- p + geom_polygon (data = tmp, aes (x = x, y = y), fill = NA, col = cols [class])  
}
p <- p + geom_point (data = median, aes (x = x, y = y), fill = "white", col = "black", shape = 21, size = 3)
As the second last step, I produce a legend for the histogram colours. It shows a colour scale for each of the classes, which is calculated from rectangles. As geom_rect () is spectified with continuous x and y axes, I reconstruct the x axis to appear like a discrete one though it is in fact continuous:
legend <- function (purecolours, counts, dx = 0.33, classlabels = names (purecolours)) {
  if (! is.matrix (counts))
    counts <- matrix (counts, ncol = length (counts))
  
  maxcnt <- apply (counts, 2, max)
  df <- data.frame ()
  for (class in seq_along (maxcnt)){
    if (max (maxcnt) == 1)
      tmp <- c (0, seq_len (maxcnt [class] * 100) / 100)
    else
      tmp <- c (0, seq_len (maxcnt [class]))
    
    df <- rbind (df, data.frame (class = class,
                                 col = colmix.rgb (tmp / maxcnt [class], purecolours [class]),
                                 counts = tmp,
                                 dx = dx,
                                 dy = tmp[2] - tmp [1]))
  }


  l <- ggplot (df, aes (x = class), col = col) +
    geom_point (aes (x=class, y = 1), col = NA)  # trick to access continuous x values
  l <- l + geom_rect (aes (xmin = as.numeric (class) - dx,
                           xmax = as.numeric (class) + dx,
                           ymin = counts - dy,
                           ymax = counts + dy,
                           fill = col,
                           colour = col), dx = force (dx)
                      )

  l <- l + opts (plot.margin = unit(c(0.5, 0, 0 ,0), "lines"),
                 axis.text = p$options$legend.text,
                 axis.title = p$options$legend.text 
                 ) +
       scale_fill_identity () + scale_colour_identity () +
       scale_y_continuous (name = "counts", expand = c(0, max (df$dy)),
                           minor = pretty (c (0, max (maxcnt)), 25)) +
       scale_x_continuous (name = "class", expand = c (0, 0.5 * dx), minor = NA, major = NA,
                           breaks = seq_along (maxcnt), labels = classlabels)

  l
}

l <- legend (cols [-4], h$counts)
Finally, Legend and main plot are displayed together:
library (grid)

plot.with.legend.right <- function (graph, legend, legend.width = 8, legend.unit = "lines"){
  plot.new()
  pushViewport (viewport (layout = grid.layout (1, 2, 
                            widths = unit (c (1, legend.width), c("null",legend.unit))
                          )))
  print (graph,  viewport (layout.pos.col = 1), newpage = FALSE)
  print (legend, viewport (layout.pos.col = 2), newpage = FALSE)
  popViewport ()
}

plot.with.legend.right (p, l)  # save this one with png () or pdf (): ggsave () would save only the legend

The data set is quite dominated by the borderline samples - particularly normal and low grade tissues are frequently mixed. But also the 50% contours of the classes practically touch each other.

However, the classes are quite overlapping even if only the spectra with hard reference are considered (same hexagon areas and colour scales as above):

Claudia Beleites,
Bioanalytische Chemie,
Technische Universität Dresden,
Dresden/Germany,

and

Dipartimento dei Materiali e delle Risorse Naturali,
Università degli Studi di Trieste,
Trieste/Italy

⚠️ **GitHub.com Fallback** ⚠️