Fundamentals of Data Visualization - ua-datalab/Bioinformatics GitHub Wiki

Data Visualization using R



Important

🕐 Schedule

  • 3:00pm-3:20pm: Welcome and introduction to topic (What is Data Visualization?)
  • 3:20pm-4:00pm: Run examples of visualizations using RStudio

Important

Requirements

  • A registered CyVerse account (Register for a CyVerse account)

Important

Expected Outcomes

  • Understanding a general overview what does data visualization implies
  • Revisiting some example plot using R
  • Understanding the basics of the visualization library Ggplot2


Introduction

Data visualization transforms complex information into easily understandable visual formats like graphs and charts.

Key aspects include:

  • Scientific Research: Helps identify patterns and trends. For example, climate scientists use graphs to show temperature changes over time.

  • Key Principles: Clarity (easily interpretable), accuracy (truthful representation), and simplicity (avoid overwhelming viewers).

  • Tools: RStudio (R language) and Python offer powerful visualization capabilities. Common libraries include ggplot2 (R), Matplotlib, and Seaborn (Python).

  • Data Preparation: Essential steps include importing data, cleaning (addressing errors), and tidying (organizing for easy manipulation).


RStudio

RStudio is a powerful tool for working with data using the R programming language. ggplot2, an R package, simplifies the creation of various graphs and plots.

Let's break down the key points mentioned in the research text:

Basic Plotting:

This section covers fundamental graph types:

  • Scatter Plots: Illustrate relationships between two variables. Picture a graph with height on one axis and weight on the other, each point representing an individual.
  • Line Plots: Connect data points with lines, ideal for showing trends over time—like a plant's growth week by week.
  • Bar Plots: Use bars to represent quantities. For instance, displaying the sales of different fruits in a store.
  • Boxplots: Visualize the basic data distribution, median, quartiles.
  • Histograms: Visualize data distribution. Imagine grouping people into age ranges (0-10 years, 11-20 years, etc.) and counting how many fall into each category.
  • and much more...

Customization:

Enhance your plots for better comprehension:

  • Titles: Summarize the graph's content, e.g., "Fruit Sales in January."
  • Labels: Name the axes. For example, "Fruits" on the x-axis and "Number Sold" on the y-axis.
  • Legends: Explain what different colors or shapes represent. If red bars signify apples and blue ones bananas, the legend clarifies this.
  • Themes: Modify the plot's overall appearance, adjusting color schemes or typography.

Advanced Plotting:

Explore more sophisticated visualization techniques:

  • Principal Component Analysis (PCA): A method to simplify complex data. Consider having extensive information about fruits (color, size, taste); PCA identifies and displays the most crucial features.
  • Interactive Heatmaps: Use color intensity to represent data. For example, showing a store's busiest days of the week with darker colors indicating higher activity.
  • Volcano Plots: Common in scientific research, especially gene expression studies, to display relationships between two variables.

Ggplot2 basic references

Other Option: Data Visualization with Python


Best Practices

When scientists want to share their findings, they often use visuals like graphs and charts. Here are some important tips to make these visuals clear and effective:

  • Clarity and Simplicity: Keep visuals simple and clear. Avoid excessive decorations or complex designs that may confuse viewers. A clean graph with clear lines and labels facilitates quick understanding, much like a well-organized room allows easy navigation.

  • Correctness: Ensure data accuracy in visuals. Numbers and information must be true and reliable. Inaccurate representations, like overstating a medicine's effectiveness, can lead to false conclusions. It's akin to a map showing a non-existent road. Always verify data before sharing.

  • Accessibility: Consider color blindness when creating visuals. Use universally visible colors like blue and yellow instead of red and green to ensure broader accessibility and comprehension of shared information.

  • Consistency: Maintain consistent styles and formats across visuals to enhance understanding. Follow established scientific conventions. For example, using consistent colors for specific data points aids quick interpretation, similar to how team uniforms help identify players at a glance.

  • Publication-quality: Ensure visuals for scientific papers or presentations are professional, high-quality, and easily readable. Use clear fonts, appropriate sizes, and good resolution to make your visuals impactful and credible.


Examples

Visualization Examples

1. PCA example with the penguins dataset

Here we will replicate Allison Horst Palmer Penguins PCA](https://allisonhorst.github.io/palmerpenguins/articles/pca.html)

(R Markdown source file)

---
title: "PCA with penguins and recipes"
---

```{r, include = FALSE}
install.packages("palmerpenguins")
knitr::opts_chunk$set(
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  eval = FALSE,
  out.width = "100%"
)
library(palmerpenguins)
library(recipes)
library(tidyr)
library(dplyr)
library(ggplot2)
theme_set(theme_minimal())

You'll need these packages to follow along with the code in this article locally:

install.packages("corrr")
install.packages("GGally")
install.packages("tidytext")

library(palmerpenguins)
library(corrr)
library(GGally)
library(recipes)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
theme_set(theme_minimal())

Correlation matrix

The palmerpenguins::penguins data contains several size measurement variables that are correlated. Let's take a look at the correlation matrix with the corrr package and the corrr::correlate() function:

library(corrr)
penguins_corr <- penguins %>%
  dplyr::select(body_mass_g, ends_with("_mm")) %>%
  correlate() %>%
  rearrange()
penguins_corr
#> # A tibble: 4 x 5
#>   rowname           flipper_length_mm body_mass_g bill_length_mm bill_depth_mm
#>   <chr>                         <dbl>       <dbl>          <dbl>         <dbl>
#> 1 flipper_length_mm            NA           0.871          0.656        -0.584
#> 2 body_mass_g                   0.871      NA              0.595        -0.472
#> 3 bill_length_mm                0.656       0.595         NA            -0.235
#> 4 bill_depth_mm                -0.584      -0.472         -0.235        NA  

Body mass and flipper length appear highly correlated, but neither of the bill variables appears to be as highly correlated with any other variables.

Pairwise plot matrix

We can visualize these correlations with the GGally package. The function we'll use is called GGally::ggpairs().

penguins %>%
  select(species, body_mass_g, ends_with("_mm")) %>% 
  GGally::ggpairs(aes(color = species),
          columns = c("flipper_length_mm", "body_mass_g", 
                      "bill_length_mm", "bill_depth_mm")) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"))
Yesggsave("vignettes/articles/figs/penguin-pairs.png", width = 8)
knitr::include_graphics("vignettes/articles/figs/penguin-pairs.png")
Principal component analysis (PCA)

We'll use the recipes package from tidymodels to perform a principal component analysis (PCA).

First, we'll also use a few recipe steps to preprocess the data for PCA; namely, we need to:

  • remove any NA values,
  • center all predictors, and
  • scale all predictors.

If you've never used the recipes package before, try this article to get started.

library(recipes)
penguin_recipe <-
  recipe(~., data = penguins) %>% 
  update_role(species, island, sex, year, new_role = "id") %>% 
  step_naomit(all_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), id = "pca") %>% 
  prep()

penguin_pca <- 
  penguin_recipe %>% 
  tidy(id = "pca") 

penguin_pca

The value column here is the loading. For each component, the value tells us the linear combination of weights for each variable that contributes to that component.

This output is a tidy version of this using stats::prcomp():

penguins %>% 
  dplyr::select(body_mass_g, ends_with("_mm")) %>% 
  tidyr::drop_na() %>% 
  scale() %>% 
  prcomp() %>%  
  .$rotation

We can also apply the recipes::tidy() method to the output from recipes::step_pca() to examine how much variance each component accounts for:

penguin_recipe %>% 
  tidy(id = "pca", type = "variance") %>% 
  dplyr::filter(terms == "percent variance") %>% 
  ggplot(aes(x = component, y = value)) + 
  geom_col(fill = "#b6dfe2") + 
  xlim(c(0, 5)) + 
  ylab("% of total variance")
Plot PCA loadings

We can plot these loadings by principal component too, following Julia Silge's example:

library(ggplot2)
penguin_pca %>%
  mutate(terms = tidytext::reorder_within(terms, 
                                          abs(value), 
                                          component)) %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(~component, scales = "free_y") +
  tidytext::scale_y_reordered() +
  scale_fill_manual(values = c("#b6dfe2", "#0A537D")) +
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  ) 
ggsave("vignettes/articles/figs/pca-loadings-plot.png", width = 8)
knitr::include_graphics("vignettes/articles/figs/pca-loadings-plot.png")
Plot PCA loadings + scores

We have the PCA loadings in penguin_pca. But we need them in a wide format now for plotting.

# get pca loadings into wider format
pca_wider <- penguin_pca %>% 
  tidyr::pivot_wider(names_from = component, id_cols = terms)

We also need to go back to our prepped penguin recipe, prepped_penguins, and recipes::juice() it to get the PCA scores back.

# define arrow style
arrow_style <- arrow(length = unit(.05, "inches"),
                     type = "closed")


pca_plot <-
  juice(penguin_recipe) %>%
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = species, shape = species), 
             alpha = 0.8, 
             size = 2) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) 

pca_plot +
  geom_segment(data = pca_wider,
               aes(xend = PC1, yend = PC2), 
               x = 0, 
               y = 0, 
               arrow = arrow_style) + 
  geom_text(data = pca_wider,
            aes(x = PC1, y = PC2, label = terms), 
            hjust = 0, 
            vjust = 1,
            size = 5, 
            color = '#0A537D') 

In the above plot, you can see a lot!

First, if you focus on the x-axis showing us the first principal component, you can see that flipper length and body mass are very important (confirming what we saw in the above bar chart). Along this dimension, Gentoo penguins stand out clearly from the other two species. We can confirm this looking at summary statistics:

penguins %>% 
  group_by(species) %>% 
  summarize(across(c(flipper_length_mm, body_mass_g), 
                   mean, 
                   na.rm = TRUE)) 

We can see this with a simple scatterplot:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point() +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) 

If you now focus more on the y-axis showing us the second principal component, you can see that our two bill size variables, bill_length_mm and bill_depth_mm, are very important (again, confirming what we saw in the above bar chart).

Let's do the same thing for principal component 2 and 3.

pca_plot %+% 
  aes(PC2, PC3) +
  geom_segment(data = pca_wider,
               aes(xend = PC2, yend = PC3), 
               x = 0, 
               y = 0, 
               arrow = arrow_style) + 
  geom_text(data = pca_wider,
            aes(x = PC2, y = PC3, label = terms), 
            hjust = 0, 
            vjust = 1,
            size = 5, 
            color = '#0A537D') 

We see again that PC2 seems most associated with our bill size variables, bill_length_mm and bill_depth_mm. But now we can see more clearly that this dimension seems to separate Chinstrap penguins from the other two species. We can confirm this by glancing at summary statistics again by species:

penguins %>% 
  group_by(species) %>% 
  summarize(across(c(bill_depth_mm, bill_length_mm), 
                   mean, 
                   na.rm = TRUE))

We can see this with a simple scatterplot too:

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point() +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) 

This is actually a pretty neat scatterplot---it highlights a perfect example of why you'd need the combination of two variables to differentiate between these three species. Comparing distributions for any single variable only differentiates one species from the other two![^1]

[^1]: this is also a great example Simpson's paradox, see vignette("examples")

ggplot(data = penguins, aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c("darkorange","darkorchid","cyan4"))

ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c("darkorange","darkorchid","cyan4"))

Summary

So, Gentoos appear to just be giants, compared to the Adelie and Chinstrap penguins. While Adelie and Chinstraps are similar size-wise as measured by flipper length and body mass, Chinstraps seem to have longer bills and Adelie penguins have stubbier bills (a pug-gein, if you will?). And Gentoos, despite being large overall, have flatter bills than either of the other two species.

Reminder:

References and other resources


2. Heatmaps R Script Example

We replicate a part of the publication PCA, MDS, k-means, Hierarchical clustering and heatmap for microarray data , by Ming Tang.

See this additional information on building heatmaps with R.

# install the package if you do not have it.
install.packages("ISLR")
library(ISLR)

ncidat = t(NCI60$data)
colnames(ncidat) = NCI60$labs

dim(ncidat)

unique(colnames(ncidat))

ncidat[1:6,1:6]

# PCA - take SVD to get solution.

X = t(scale(t(ncidat),center=TRUE,scale=FALSE))

# we transpose X again for svd
sv = svd(t(X))
U = sv$u
V = sv$v
D = sv$d

## in R calculate the rank of a matrix is by
qr(t(X))$rank

length(D)

min(D)

# the last diagnal in D is very small.
# it is very close to 0, it has to do with the precision of the decimals in computer

# let’s plot scatter plot between PCs
# PC scatterplots

cols = as.numeric(as.factor(colnames(ncidat)))

plot(U[,1],U[,2],type="n",xlab="PC1",ylab="PC2")
text(U[,1],U[,2],colnames(X),col=cols)

par(mfrow=c(1,1))
Z = t(X)%*%V

# plot PC1 vs PC2
plot(Z[,1], Z[,2], type ="n", xlab="PC1", ylab="PC2")
text(Z[,1], Z[,2], colnames(X), col=cols)

# It looks much the same as the the figure of U1 vs U2 above using U, but with different scales.

plot(Z[,2], Z[,3], type ="n", xlab = "PC2", ylab="PC3")
text(Z[,2], Z[,3], colnames(X), col=cols)

# plot PC2 vs PC3
plot(Z[,2], Z[,3], type ="n", xlab = "PC2", ylab="PC3")
text(Z[,2], Z[,3], colnames(X), col=cols)

# Use ggplot for a prettier plot
pc_dat<- data.frame(type = rownames(Z), PC1 = Z[,1], PC2= Z[,2])
library(ggplot2)
ggplot(pc_dat,aes(x=PC1, y=PC2, col=type)) + geom_point() + geom_text(aes(label = type), hjust=0, vjust=0)

## the text is a bit messy, may try packages below or play more with ggplot2
## use directlabels http://directlabels.r-forge.r-project.org/
## use cowplot https://github.com/wilkelab/cowplot

# PC loadings - visualize data by limiting to top genes in magnitude in the PC loadings

## get a gradient of colors for grey, green, red.
## one can do better use other libraries such RcolorBrewer. see examples later.

aa<- grep("grey",colors())
bb<- grep("green",colors())
cc<-  grep("red",colors())
gcol2<- colors()[c(aa[1:30],bb[1:20],rep(cc,2))]

## use the genes that drive the first PC1. This is the first major patter in the data
k=1
ord1<- order(abs(V[,k]),decreasing=TRUE)
x1<- as.matrix(X[ord1[1:250],])
heatmap(x1,col=gcol2)

# use the genes that drive the second PC (PC2)
j<- 2
ord<- order(abs(V[,j]),decreasing=TRUE)

## we just use the first 250 features(genes) to plot a heatmap, This is the second major pattern.
x<- as.matrix(X[ord[1:250],])
heatmap(x,col=gcol2)

# Variance explained

varex = 0
cumvar = 0
denom = sum(D^2)
for(i in 1:64){
  varex[i] = D[i]^2/denom
  cumvar[i] = sum(D[1:i]^2)/denom
}

## variance explained by each PC cumulatively
cumvar

# Scree plot
par(mfrow=c(1,2))
plot(1:64,varex,type="l",lwd=2,xlab="PC",ylab="% Variance Explained")
plot(1:64,cumvar,type="l",lwd=2,xlab="PC",ylab="Cummulative Variance Explained")

# Sparse PCA
# When p>>n (we have way more genes than the samples), many featuers(genes) are irrelevant. 
# PCA can perform very badly. Sparse PCA zero out irrelevant features from PC loadings.

install.packages("PMA")
#install.packages("impute")

library("PMA")
library('plyr')
#library("impute")

## Loading required package: plyr
## Loading required package: impute

## we also look at the first 4 PCs
spc = SPC(t(X),sumabsv=10,K=4)

#how many genes selected? we look at the V matrix again, if the weights are zeros, they are not important features. sparse PCA zero them out. For each of the four PCs, how many features are retained?
apply(spc$v!=0, 2, sum)

#PC scatterplots
cols = as.numeric(as.factor(colnames(ncidat)))
K = 3
pclabs = c("SPC1","SPC2","SPC3","SPC4")
par(mfrow=c(1,K))
for(i in 1:K){
  j = i+1
  plot(spc$u[,i],spc$u[,j],type="n",xlab=pclabs[i],ylab=pclabs[j])
  text(spc$u[,i],spc$u[,j],colnames(X),col=cols)
}

#SPC loadings - visualize data by limiting to genes selected by the sparse PC loadings
aa = grep("grey",colors())
bb = grep("green",colors())
cc = grep("red",colors())
gcol2 = colors()[c(aa[1:30],bb[1:20],rep(cc,2))]

j = 1
ind = which(spc$v[,j]!=0)
x = as.matrix(X[ind,])
heatmap(x,col=gcol2)

# It looks different from the heatmap we get using the top 250 features that drive PC1 in our regular PCA, 
# because in this case, we are not selecting the top 250, rather using only the vs (173 features) that are not zeros. 
# Select different features for heatmap, it will look different.

length(ind)

# variance explained
spc$prop.var.explained



Updated: 10-15-2024 (C. Lizárraga).

UArizona Data Lab, Data Science Institute, University of Arizona.

CC BY-NC-SA 4.0

⚠️ **GitHub.com Fallback** ⚠️