BMA231 V: Intro to R - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)


The aim of these exercises is to introduce you to R. Due to the limited time we have, these exercises focus only on elements that are used when analyzing variants and gene expression datasets. Learning R requires a lot of time and practice and there are several great resources that you can visit later on if you want to deepen your knowledge (and you should!).



Getting started

  • Open RStudio

  • Type the following in the console and press ⏎

    234 / 543

    You should get 0.4309392

  • Check your current working directory, type:

    getwd()

    You should get something like "C:/Users/Marcela/University of Gothenburg/Bioinformatics Core Facility - Documents/Masters_NG01CF/2022/03_r"

  • Change your working directory to your Desktop

    • Session -> Set Working Directory -> Choose Directory -> select your Desktop

    In your console a similar text should appear:

    setwd("C:/Users/Marcela/Desktop")

    This is the command line that was used to change the directory.

  • List the files in your Desktop:

    list.files()

Q1. How many files/directories do you have?

Click here for output
I have 42!

Saving your work

As mentioned before, scripts are a great way to document your analyses, so you can revisit them later on.

  • Open a new file
    • File -> New file -> R script

    • Name it R1_yourName.R You will see the name of your script as a tab in the Source panel

      Click here for output

It is a good idea to save any R script using .R as file extension, so you know that it is R code in the file. Also, using spaces in file names may cause trouble later on, so we try to use “_” instead.

Commenting your code will make it easier to read or to remember what the code does. Add the following code at the top of your script (modify accordingly):

# Name: TYPE_YOUR_NAME_HERE
# Date: TYPE_TODAYS_DATE_HERE
# Description: practicing R

Anything written on a line after # will be disregarded when running the code, so it is a perfect way to add comments across the script.

  • Save your code

    • Ctrl + S

      NOTE: Do this continuously

  • Type list.files() in the Source panel, or the script (as we will call it from now on)

  • Place the cursor on the line and

    • either click the Run button in the top left corner of the editor,
    • or press Ctrl+Enter

    This will run the code and print out the results in the Console

Some data types

For these exercises, if it gets tricky have a look a the answers we provide, however try to come up with the answer yourself. Also, remember that there may be more than one way to answer, so you may get the answer with a different code which is more than fine!

Let's start simple

Vectors

  • Create a vector with the numbers 1, 2, … , 1000 and assign it to the variable x

  • Display the values of x to check you have the correct values

    Click here for output
    > x <- seq(1:1000)
    > x
      [1]    1    2    3    4    5    6    7    8    9   10   11   12   13
     [14]   14   15   16   17   18   19   20   21   22   23   24   25   26
     [27]   27   28   29   30   31   32   33   34   35   36   37   38   39
     [40]   40   41   42   43   44   45   46   47   48   49   50   51   52
     [53]   53   54   55   56   57   58   59   60   61   62   63   64   65
     [66]   66   67   68   69   70   71   72   73   74   75   76   77   78
     [79]   79   80   81   82   83   84   85   86   87   88   89   90   91
     [92]   92   93   94   95   96   97   98   99  100  101  102  103  104
     [105]  105  106  107  108  109  110  111  112  113  114  115  116  117
     [118]  118  119  120  121  122  123  124  125  126  127  128  129  130
     [131]  131  132  133  134  135  136  137  138  139  140  141  142  143
     [144]  144  145  146  147  148  149  150  151  152  153  154  155  156
     [157]  157  158  159  160  161  162  163  164  165  166  167  168  169
     [170]  170  171  172  173  174  175  176  177  178  179  180  181  182
     [183]  183  184  185  186  187  188  189  190  191  192  193  194  195
     [196]  196  197  198  199  200  201  202  203  204  205  206  207  208
     [209]  209  210  211  212  213  214  215  216  217  218  219  220  221
     [222]  222  223  224  225  226  227  228  229  230  231  232  233  234
     [235]  235  236  237  238  239  240  241  242  243  244  245  246  247
     [248]  248  249  250  251  252  253  254  255  256  257  258  259  260
     [261]  261  262  263  264  265  266  267  268  269  270  271  272  273
     [274]  274  275  276  277  278  279  280  281  282  283  284  285  286
     [287]  287  288  289  290  291  292  293  294  295  296  297  298  299
     [300]  300  301  302  303  304  305  306  307  308  309  310  311  312
     [313]  313  314  315  316  317  318  319  320  321  322  323  324  325
     [326]  326  327  328  329  330  331  332  333  334  335  336  337  338
     [339]  339  340  341  342  343  344  345  346  347  348  349  350  351
     [352]  352  353  354  355  356  357  358  359  360  361  362  363  364
     [365]  365  366  367  368  369  370  371  372  373  374  375  376  377
     [378]  378  379  380  381  382  383  384  385  386  387  388  389  390
     [391]  391  392  393  394  395  396  397  398  399  400  401  402  403
     [404]  404  405  406  407  408  409  410  411  412  413  414  415  416
     [417]  417  418  419  420  421  422  423  424  425  426  427  428  429
     [430]  430  431  432  433  434  435  436  437  438  439  440  441  442
     [443]  443  444  445  446  447  448  449  450  451  452  453  454  455
     [456]  456  457  458  459  460  461  462  463  464  465  466  467  468
     [469]  469  470  471  472  473  474  475  476  477  478  479  480  481
     [482]  482  483  484  485  486  487  488  489  490  491  492  493  494
     [495]  495  496  497  498  499  500  501  502  503  504  505  506  507
     [508]  508  509  510  511  512  513  514  515  516  517  518  519  520
     [521]  521  522  523  524  525  526  527  528  529  530  531  532  533
     [534]  534  535  536  537  538  539  540  541  542  543  544  545  546
     [547]  547  548  549  550  551  552  553  554  555  556  557  558  559
     [560]  560  561  562  563  564  565  566  567  568  569  570  571  572
     [573]  573  574  575  576  577  578  579  580  581  582  583  584  585
     [586]  586  587  588  589  590  591  592  593  594  595  596  597  598
     [599]  599  600  601  602  603  604  605  606  607  608  609  610  611
     [612]  612  613  614  615  616  617  618  619  620  621  622  623  624
     [625]  625  626  627  628  629  630  631  632  633  634  635  636  637
     [638]  638  639  640  641  642  643  644  645  646  647  648  649  650
     [651]  651  652  653  654  655  656  657  658  659  660  661  662  663
     [664]  664  665  666  667  668  669  670  671  672  673  674  675  676
     [677]  677  678  679  680  681  682  683  684  685  686  687  688  689
     [690]  690  691  692  693  694  695  696  697  698  699  700  701  702
     [703]  703  704  705  706  707  708  709  710  711  712  713  714  715
     [716]  716  717  718  719  720  721  722  723  724  725  726  727  728
     [729]  729  730  731  732  733  734  735  736  737  738  739  740  741
     [742]  742  743  744  745  746  747  748  749  750  751  752  753  754
     [755]  755  756  757  758  759  760  761  762  763  764  765  766  767
     [768]  768  769  770  771  772  773  774  775  776  777  778  779  780
     [781]  781  782  783  784  785  786  787  788  789  790  791  792  793
     [794]  794  795  796  797  798  799  800  801  802  803  804  805  806
     [807]  807  808  809  810  811  812  813  814  815  816  817  818  819
     [820]  820  821  822  823  824  825  826  827  828  829  830  831  832
     [833]  833  834  835  836  837  838  839  840  841  842  843  844  845
     [846]  846  847  848  849  850  851  852  853  854  855  856  857  858
     [859]  859  860  861  862  863  864  865  866  867  868  869  870  871
     [872]  872  873  874  875  876  877  878  879  880  881  882  883  884
     [885]  885  886  887  888  889  890  891  892  893  894  895  896  897
     [898]  898  899  900  901  902  903  904  905  906  907  908  909  910
     [911]  911  912  913  914  915  916  917  918  919  920  921  922  923
     [924]  924  925  926  927  928  929  930  931  932  933  934  935  936
     [937]  937  938  939  940  941  942  943  944  945  946  947  948  949
     [950]  950  951  952  953  954  955  956  957  958  959  960  961  962
     [963]  963  964  965  966  967  968  969  970  971  972  973  974  975
     [976]  976  977  978  979  980  981  982  983  984  985  986  987  988
     [989]  989  990  991  992  993  994  995  996  997  998  999 1000
  • Select the five first elements in x, assign them to the variable x1 and display them

    Click here for output
    > x1 <- x[1:5]
    > x1
    [1]  1  2  3  4  5 
  • Select the five last elements in x, assign them to the variable x2 and display them

    Click here for output
    > x2 <- x[996:1000]
    > x2
    [1]  996  997  998  999 1000
  • Calculate the sum of x1 and x2, assign the result to the variable sum and display it

    You should get 996 997 998 999 1000

    Click here for output
    > sum <- x1 + x2
    > sum
    [1]  997  999  1001 1003 1005

Here is a warning! You can take sums of vectors of different length and get a result. The shorter vector will be recycled until the end of the longer vector. If you’re lucky you get a warning but always check the results so you get what you expect.

Data Frames

Now that you have some practice, let's start analyzing some more complex data.

  • In your script, add a comment indicating we are using data frames

    Click here for output
    > # analyzing data frames (or any other comment you want)
  • Create the vectors animal, weight_g, years and common_pet using the following code

    > # Defining variables
    > animal     <- c("chicken", "anaconda", "gecko", "ladybug", "ant", "dog", "rat", "elephant")
    > weight_g   <- c(2500, 250000, 70, 0.02, 0.005, 40000, 140, 4000000)
    > years      <- c(5, 10, 15, 1, 1, 13, 2, 48)
    > common_pet <- c("y","n","n","n","n","y","y","n")
  • Create a data frame called exp with these vectors as columns

    Click here for output
    > # Creating data frame
    > exp <- data.frame(animal, weight_g, years, common_pet)
    > exp
    >   animal weight_g years common_pet
    1  chicken  2.5e+03     5          y
    2 anaconda  2.5e+05    10          n
    3    gecko  7.0e+01    15          n
    4  ladybug  2.0e-02     1          n
    5      ant  5.0e-03     1          n
    6      dog  4.0e+04    13          y
    7      rat  1.4e+02     2          y
    8 elephant  4.0e+06    48          n
  • Use the following functions with exp as argument, to explore the data and answer the corresponding questions

    • head()

    • tail()

      Q2. What is the difference between head() and tail()

      Click here for output
      > # looking at the beginning of the file
      > head(exp)
          animal weight_g years common_pet
      1  chicken  2.5e+03     5          y
      2 anaconda  2.5e+05    10          n
      3    gecko  7.0e+01    15          n
      4  ladybug  2.0e-02     1          n
      5      ant  5.0e-03     1          n
      6      dog  4.0e+04    13          y
      >
      > # looking at the end of the file
      > tail(exp)
          animal weight_g years common_pet
      3    gecko  7.0e+01    15          n
      4  ladybug  2.0e-02     1          n
      5      ant  5.0e-03     1          n
      6      dog  4.0e+04    13          y
      7      rat  1.4e+02     2          y
      8 elephant  4.0e+06    48          n
      
      ANS = head displays the first rows, while tail displays the last rows
    • dim()

      Q3. What information do you get with dim()

      Click here for output
      > dim(exp)
      [1] 8 4
      
      ANS = dim gives us the number of rows and columns
    • str()

      Q4. Looking at the output from str(), which variables are numerical?

      Click here for output
      > str(exp)
      'data.frame':	8 obs. of  4 variables:
       $ animal    : chr  "chicken" "anaconda" "gecko" "ladybug" ...
       $ weight_g  : num  2.5e+03 2.5e+05 7.0e+01 2.0e-02 5.0e-03 4.0e+04 1.4e+02 4.0e+06
       $ years     : num  5 10 15 1 1 13 2 48
       $ common_pet: chr  "y" "n" "n" "n" ...head(data)
       
      ANS = weight and years are numeric while animal and common_pet are strings or characters (char)                 
    • summary()

      Q5. From the summary() results, how much does the heaviest animal weight? what is the mean life expectancy of our group of animals?

      Click here for output
      > summary(exp)
          animal             weight_g           years      
       Length:8           Min.   :      0   Min.   : 1.00  
       Class :character   1st Qu.:     53   1st Qu.: 1.75  
       Mode  :character   Median :   1320   Median : 7.50  
                          Mean   : 536589   Mean   :11.88  
                          3rd Qu.:  92500   3rd Qu.:13.50  
                          Max.   :4000000   Max.   :48.00  
        common_pet       
       Length:8          
       Class :character  
       Mode  :character  
                 
      ANS = the heaviest animal weights 4000000 gr while the average life expectancy is 11.88 years                 

Now, let's investigate how many animals that in average live at least 10 years, are considered not to be common pets. Once again, there are a lot of ways to answer this question, the proposed workflow below is just one way.

First, we would need to select those animals that live at least 10 years and then count which are not considered common pets.

  • Select the animals that in average live at least 10 years

    • Display the column called years using exp$years

      Click here for output
      > # Selecting data
      > exp$years
      [1]  5 10 15  1  1 13  2 48
    • Query which elements (years) are equal to 10 years or more, using the >= operator
      You will see a vector of logical operators (the answer of our comparison for each value)

      Click here for output
      > # querying data
      > exp$years >= 10
      [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
    • Extract the indexes of the TRUE comparisons using which() with the expression used above as argument

      Click here for output
      > # extracting TRUE positions
      > which(exp$years >= 10)
      [1] 2 3 6 8
    • Save the result to a variable called pos

      Click here for output
      > # saving the positions to a variable
      > pos <- which(exp$years >= 10)
    • Extract all the rows from exp that match the row numbers in pos
      Remember that inside the square brackets you first tell which rows, then which columns you want to see

      Click here for output
      > # extracting rows belonging to animals that live at least 10 years
      > exp[pos,]
          animal weight_g years common_pet
      2 anaconda   250000    10          n
      3    gecko       70    15          n
      6      dog    40000    13          y
      8 elephant  4000000    48          n
    • Save the result to a variable called exp_filt

      Click here for output
      > # saving results
      > exp_filt <- exp[pos,]

      Q6. Which animals did you get?

      Click here for output
      > # displaying the animal column
      > exp_filt$animal
      [1] "anaconda" "gecko"    "dog"      "elephant"
  • Select and count animals considered not to be common pets

    • Display the column called common_pet from exp_filt

      Click here for output
      > # displaying the common_pet category of the filtered data
      > exp_filt$common_pet
      [1] "n" "n" "y" "n"
    • Run table() with the expression used above

      Click here for output
      > # creating a contingency table (counts of each value in the variable)
      > table(exp_filt$common_pet)
      
      n y 
      3 1 

      Q7. How many of these long living animals are considered not to be common pets?

      Click here for output
      3
      

Installing packages

R packages are a set of R functions, compiled code, and sample data in a standardized collection format that any user can install and apply to analyze data. In these exercises we will be using tidyverse, if you google how to install it, you fill find this code:

> install.packages("tidyverse")
  • Go ahead and run it

    Click here for output
    > install.packages("tidyverse")
    WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
    
    https://cran.rstudio.com/bin/windows/Rtools/
    Installing package intoC:/Users/Marcela/AppData/Local/R/win-library/4.2’
    (aslibis unspecified)
    trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.2/tidyverse_2.0.0.zip'
    Content type 'application/zip' length 430916 bytes (420 KB)
    downloaded 420 KB
    
    packagetidyversesuccessfully unpacked and MD5 sums checked
    
    The downloaded binary packages are in
    	C:\Users\Marcela\AppData\Local\Temp\Rtmpcn9gYd\downloaded_packages
  • Load the package using library()

    Click here for output
    > library("tidyverse")
    ── Attaching core tidyverse packages ────────────────────────────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.0readr     2.1.4forcats   1.0.0stringr   1.5.0ggplot2   3.4.2tibble    3.2.0lubridate 1.9.2tidyr     1.3.0purrr     1.0.1     
    ── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package to force all conflicts to become errors
    Warning messages:
    1: packagetidyversewas built under R version 4.2.3 
    2: packageggplot2was built under R version 4.2.3 
    3: packagetibblewas built under R version 4.2.3 
    4: packagetidyrwas built under R version 4.2.3 
    5: packagereadrwas built under R version 4.2.3 
    6: packagepurrrwas built under R version 4.2.3 
    7: packagestringrwas built under R version 4.2.3 
    8: packageforcatswas built under R version 4.2.3 
    9: packagelubridatewas built under R version 4.2.3 
    

Other packages we will be using are: readxl and writexl

  • Install them as you did with tidyverse

    Click here for output
    > install.packages("readxl")
    
     WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
    
     https://cran.rstudio.com/bin/windows/Rtools/
     Installing package intoC:/Users/Marcela/AppData/Local/R/win-library/4.2’
     (aslibis unspecified)
     trying URL 
     'https://cran.rstudio.com/bin/windows/contrib/4.2/readxl_1.4.3.zip'
     Content type 'application/zip' length 1183808 bytes (1.1 MB)
     downloaded 1.1 MB
    
     packagereadxlsuccessfully unpacked and MD5 sums checked
    
     The downloaded binary packages are in
     C:\Users\Marcela\AppData\Local\Temp\Rtmpcn9gYd\downloaded_packages
    
     > install.packages("writexl")
    
     WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
    
     https://cran.rstudio.com/bin/windows/Rtools/
     Installing package intoC:/Users/Marcela/AppData/Local/R/win-library/4.2’
     (aslibis unspecified)
     trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.2/writexl_1.4.2.zip'
     Content type 'application/zip' length 190446 bytes (185 KB)
     downloaded 185 KB
    
     packagewritexlsuccessfully unpacked and MD5 sums checked
    
     The downloaded binary packages are in
     C:\Users\Marcela\AppData\Local\Temp\Rtmpcn9gYd\downloaded_packages
  • Load them using library()

    Click here for output
    > library("readxl")
    Warning message:
    packagereadxlwas built under R version 4.2.3 
    
    > library("writexl")
    Warning message:
    packagewritexlwas built under R version 4.2.3 

We will also be using EnhancedVolcano(), however this package is from Bioconductor, so its installation is a little different. Run the following:

 > if (!requireNamespace('BiocManager', quietly = TRUE))
 >  install.packages('BiocManager')
 > BiocManager::install('EnhancedVolcano')
Click here for output
    > if (!requireNamespace('BiocManager', quietly = TRUE))
    >  install.packages('BiocManager')
    > BiocManager::install('EnhancedVolcano')
 
    'getOption("repos")' replaces Bioconductor standard
    repositories, see 'help("repositories", package =
    "BiocManager")' for details.
    Replacement repositories:
        CRAN: https://cran.rstudio.com/
    Bioconductor version 3.16 (BiocManager 1.30.20), R 4.2.2
      (2022-10-31 ucrt)
    Installation paths not writeable, unable to update
      packages
      path: C:/Program Files/R/R-4.2.2/library
         packages:
        boot, class, codetools, foreign, KernSmooth,
        lattice, MASS, Matrix, mgcv, nlme, nnet, spatial,
        survival
    Old packages: 'askpass', 'BiocManager', 'broom',
      'bslib', 'cli', 'cpp11', 'curl', 'dbplyr',
      'DelayedArray', 'digest', 'dplyr', 'dqrng', 'DT',
      'evaluate', 'fontawesome', 'fs', 'gargle', 'ggplot2',
      'googledrive', 'googlesheets4', 'gtable', 'haven',
      'htmltools', 'httr', 'jsonlite', 'knitr', 'labeling',
      'lme4', 'lubridate', 'markdown', 'MatrixModels',
      'minqa', 'mvtnorm', 'openssl', 'pkgload', 'polyclip',
      'prettyunits', 'processx', 'promises', 'ps', 'purrr',
      'quantreg', 'Rcpp', 'RcppArmadillo', 'RcppHNSW',
      'rematch', 'reticulate', 'rlang', 'rmarkdown',
      'rstudioapi', 'sass', 'sctransform', 'Seurat',
      'SeuratObject', 'snakecase', 'spatstat.explore',
      'spatstat.geom', 'spatstat.random', 'sys', 'testthat',
      'tibble', 'tinytex', 'tzdb', 'uuid', 'V8', 'vctrs',
      'viridisLite', 'waldo', 'withr', 'xfun', 'xml2', 'zoo'
    Update all/some/none? [a/s/n]: 
    n
    Warning message:
    package(s) not installed when version(s) same as or
      greater than current; use `force = TRUE` to
      re-install: 'EnhancedVolcano' 

And load it with

> library('EnhancedVolcano')
Click here for output
> library('EnhancedVolcano')
Loading required package: ggrepel
Warning message:
packageggrepelwas built under R version 4.2.3

Reading data

Download expression_data.xlsx from Canvas and save it in your Desktop (or in your working directory)

This file has proteomic data from 3 samples grown in low oxygen conditions and 3 control samples grown under normal oxygen conditions. We will analyze this dataset to identify proteins that are different between the low and the normal oxygen conditions.

  • Read the data using read_xlsx() and save it as dat

    Click here for output
    > # reading data
    > dat <- read_xlsx("expression_data.xlsx")
  • Use dim() and head() with dat as argument to explore the data

    Click here for output
    > dim(dat)
    [1] 2870   8
    
    > head(dat)
    # A tibble: 6 × 8
      Accession Description     Control_1 Control_2 Control_3
      <chr>     <chr>               <dbl>     <dbl>     <dbl>
    1 Q09666    Neuroblast dif1.03      0.997     0.966
    2 Q15149    Plectin OS=Hom0.968     1.01      1.01 
    3 P21333    Filamin-A OS=H0.986     1.02      0.984
    4 O75369    Filamin-B OS=H1.00      1.01      0.987
    5 P78527    DNA-dependent0.942     1.04      1.01 
    6 P07900    Heat shock pro0.998     1.02      0.983
    # ℹ 3 more variables: Hypoxia_1 <dbl>, Hypoxia_2 <dbl>,
    #   Hypoxia_3 <dbl>

    Q8. Each row is a protein, how many are we going to analyze?

    Click here for output
    2870 proteins

Data selection

Filtering our data (selecting values above a threshold, genes that belong to the same pathway or targeting certain samples, etc) is one important step during data analysis. We will practice this during the practicals, showcasing several common strategies.

In this example, there are proteins without expression values (common in a proteomics experiment). These values (NA) need to be removed so we can do the proper calculations.

  • Use the drop_na function as follows

    > # removing NA values
    > dat <- dat %>% drop_na

    Q9. How many NA values did you remove? Hint: use dim()

    Click here for output
    > dim(dat)
    5

Now let's select the columns where the sample data is stored and save them in a new variable called expr_data:

> # Selecting columns with expression data
> expr_data <- dat %>% select(Control_1:Hypoxia_3)
> head(expr_data)
# A tibble: 6 × 6
  Control_1 Control_2 Control_3 Hypoxia_1 Hypoxia_2
      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1.03      0.997     0.966     0.908     0.869
2     0.968     1.01      1.01      1.21      1.20 
3     0.986     1.02      0.984     0.963     0.925
4     1.00      1.01      0.987     1.25      1.19 
5     0.942     1.04      1.01      0.941     0.894
6     0.998     1.02      0.983     0.992     0.921
# ℹ 1 more variable: Hypoxia_3 <dbl>

Descriptive statistics

Before any analysis, we need to understand the data so we can apply the correct statistical tests. To do this, we can generate simple summaries about our data and visualize them with different graphs. Let's try some.

Boxplot

A boxplot shows locality, spread and skewness of numerical data and R has a basic function for this.

  • Use boxplot() with expr_data as argument

    Click here for output
    > # plotting a boxplot
    > boxplot(expr_data)

    Q10. What is plotted in the x and y-axis?

    Click here for output
    ANS =
    x = samples
    y = values, in this case expression data of the proteins
    • Add the xlab and ylab arguments, describing what is plotted

      Click here for output
      > # adding labels to axes
      > boxplot(expr_data, ylab="ADD_Y_AXIS_TITLE_HERE", xlab="ADD_X_AXIS_TITLE_HERE")

    • Add the col argument selecting any color you want (here is a list of some colors)

      Click here for output
      > # adding colors to data points
      > boxplot(expr_data, ylab="ADD_Y_AXIS_TITLE_HERE", xlab="ADD_X_AXIS_TITLE_HERE", col="ADD_COLOR_NAME_HERE")

    • Create a variable sample_colors with a total of 6 colors, the first three with the color you selected above and the last three with another color

      Click here for output
      > # color vector per sample category
      > sample_colors= rep(c("ADD_COLOR_FOR_GROUP_1_HERE","ADD_COLOR_FOR_GROUP_2_HERE"),each=3)
      > sample_colors
      [1] "red"  "red"  "red"  "blue" "blue" "blue"
    • Use sample_colors as argument for col

      Click here for output
      > # adding category colors
      > boxplot(expr_data, ylab="ADD_Y_AXIS_TITLE_HERE", xlab="ADD_X_AXIS_TITLE_HERE", col=sample_colors)

      Q11. What do you conclude from the boxplot?

      Click here for output
      ANS = the distribution of the data is similar among samples of the same group, where the hypoxia samples have a higher variation
      

Histogram

A histogram shows the distribution of the data points. This graph is widely used to assess if our data is normally distributed or not. If it is, then we can use the common parametric tests, like the t-test to evaluate if the changes we see when we compare to groups are statistically significant. If our data is not normally distributed, then we would need to do some kind of data transformation to approximate our data to a normal distribution. However if the distribution of the data is not normal, then we would need to use non-parametric tests to evaluate the significance of the changes we see.

  • Use hist() with unlist(expr_data) as argument. We use unlist() so we transform our data to a vector rather than a data.frame, which is the input type for an histogram.

    Click here for output
    > # plotting a histogram
    > hist(unlist(expr_data))

    • Add the main argument, which prints out the title of the plot

      Click here for output
      > # adding a title to the plot
      > hist(unlist(expr_data), main="ADD_PLOT_TITLE_HERE")

    • Add the col argument with a color of your choice

      Click here for output
      > # adding a custom color to the plot
      > hist(unlist(expr_data), main="ADD_PLOT_TITLE_HERE", col="ADD_COLOR_NAME_HERE")

    • Adjust the y-axis to 4000 using the ylim argument

      Click here for output
      > # adjusting the limits to be plotted
      > hist(unlist(expr_data), main="ADD_PLOT_TITLE_HERE", col="ADD_COLOR_NAME_HERE", ylim=c(0,4000))

    Q12. Do you think our data is normally distributed? Here some examples of different distributions

    Click here for output
    ANS = slightly skewed

Log transformation

As shortly mentioned before, to meet statistical assumptions the data should be close to a normal distribution. To achieve this, we need to transform our data using an adequate technique, like log-ing the data, using the square-root or the reciprocal transformation, among so many others. As our data is a little skewed, we use the Log transformation.

  • Use log2() with expr_data as argument and save it in a variable called log_expr_data

    Click here for output
    > # log transformation
    > expr_data_log <- log2(expr_data)
  • Use the latest code you used to create a histogram, but now using unlist(log_expr_data) as data argument and modifying the graph title accordingly

    Click here for output
    > # plotting the log transformed data
    > hist(unlist(expr_data_log), main="ADD_PLOT_TITLE_HERE", col="ADD_COLOR_NAME_HERE", ylim=c(0,4000))

Principal Component Analysis (PCA)

A PCA is a statistical procedure that summarizes the information in large data tables, like when we are working with NGS data. We won't get into details, however it is the most common tool to explore the variance in the data.

Since we have data per protein, and we want to check the difference per sample and not per gene (at least not yet), we need to transpose the data (changing the columns to rows)

  • Use t() with expr_data_log as data argument and save it as expr_data_lot_t

    Click here for output
    > # transposing data
    > expr_data_log_t <- t(expr_data_log)
  • Display the first 4 rows and 4 columns of expr_data_log and expr_data_log_t to visualize the differences

    Click here for output
    > # data per gene/protein
    > expr_data_log[1:4,1:4]
         Control_1   Control_2   Control_3  Hypoxia_1
    1  0.046840254 -0.00433459 -0.04990491 -0.1392358
    2 -0.046921047  0.01863417  0.02005765  0.2714257
    3 -0.020340448  0.02856915 -0.02326978 -0.0543923
    4  0.004321606  0.02005765 -0.01887801  0.3207735
    
    > # data per sample
    > expr_data_log_t[1:4,1:4]
                     [,1]        [,2]        [,3]
    Control_1  0.04684025 -0.04692105 -0.02034045
    Control_2 -0.00433459  0.01863417  0.02856915
    Control_3 -0.04990491  0.02005765 -0.02326978
    Hypoxia_1 -0.13923580  0.27142568 -0.05439230
                      [,4]
    Control_1  0.004321606
    Control_2  0.020057652
    Control_3 -0.018878010
    Hypoxia_1  0.320773477
  • Run the following code to calculate the PCA

     > # calculating the principal component
     > pc <- prcomp(expr_data_log_t, center=TRUE, scale=TRUE)
  • Save the value of pc$x in a variable called pc_comp and display the result

    Click here for output
    > # displaying all PCAs
    > pc_comp <- pc$x
    > pc_comp
    
    Control_1 -21.78749  13.492513 -18.5562064   1.420378
    Control_2 -23.78424  -6.919167   0.6504717   1.642475
    Control_3 -23.94487  -3.534305  18.1821496  -3.415268
    Hypoxia_1  19.43210 -21.433067  -9.3744625   4.139792
    Hypoxia_2  24.95254   5.663831  -0.3380005 -16.567501
    Hypoxia_3  25.13197  12.730195   9.4360480  12.780124
                    PC5           PC6
    Control_1 -5.031875  1.238961e-14
    Control_2 14.165258  2.897595e-14
    Control_3 -8.777509  1.045301e-14
    Hypoxia_1 -4.657209 -2.610542e-15
    Hypoxia_2  2.786876 -3.384717e-14
    Hypoxia_3  1.514459 -1.826534e-14

We have 6 principal components as we have 6 samples, and the numbers tells us the correlation of the items (samples) to each component. PC1 represents the component that explains the most variation in the data, if we plot it against PC2 (representing the second component that explains most variation in the data) we may be able to identify these components.

  • Use plot() with pr_comp as argument

    Click here for output
    > # plotting the components
    > plot(pc_comp)

    • Adjust the color points using col=sample_colors

      Click here for output
      > # adding colors per sample
      > plot(pc_comp, col=sample_colors)

    • Change the plotting symbol to the symbol you want, using the pch argument. Here you can find a list of the symbols

      Click here for output
      > # changing the plotting symbol
      > plot(pc_comp, col=sample_colors, pch=ADD_PLOTTING_SYMBOL_HERE)

    • Adjust the size of the plotting symbol to your liking by using the cex argument

      Click here for output
      > # adjusting the size of the plotting symbols
      > plot(pc_comp, col=sample_colors, pch=ADD_PLOTTING_SYMBOL_HERE, cex=ADD_PLOTTING_SIZE_HERE)

At this point, we can see that there samples form two clusters. If we add the sample names it will be easier to understand how and why they cluster.

  • Use the following code

    > # plotting pc's
    > plot(pc_comp, col=sample_colors, pch=ADD_PLOTTING_SYMBOL_HERE, cex=ADD_PLOTTING_SIZE_HERE)
    
    > # adding sample names
    > text(pc_comp[1:3,1], pc_comp[1:3,2], colnames(expr_data_log)[1:3], col=sample_colors[1:3], pos=4)
    > text(pc_comp[4:6,1], pc_comp[4:6,2], colnames(expr_data_log)[4:6], col=sample_colors[4:6], pos=2)
    Click here for output

    Q13. Explain the code: text(pc_comp[1:3,1], pc_comp[1:3,2], colnames(expr_data_log)[1:3], col=sample_colors[1:3], pos=4)

    Click here for output
    ANS = 
    
    text() -> text draws the strings given in the vector labels at the coordinates given by x and y
    
    pc_comp[1:3,1] -> selects the PC1 coordinates of the control group
    Control_1 Control_2 Control_3 
    -21.78749 -23.78424 -23.94487 
    
    pc_comp[1:3,2] -> selects the PC2 coordinates of the control group
    Control_1 Control_2 Control_3 
    13.492513 -6.919167 -3.534305
    
    head(expr_data_log[1:3]) -> selects the columns of the control group
         Control_1   Control_2   Control_3
    1  0.046840254 -0.00433459 -0.04990491
    2 -0.046921047  0.01863417  0.02005765
    3 -0.020340448  0.02856915 -0.02326978
    4  0.004321606  0.02005765 -0.01887801
    5 -0.086201035  0.05797007  0.02005765
    
    colnames(expr_data_log)[1:3] -> with colnames() we select the names of the control samples
    [1] "Control_1" "Control_2" "Control_3"
    
    sample_colors[1:3] -> selects the colors we set for the control group
    [1] "red" "red" "red"
    
    pos=4 -> a position specifier for the text. 
             Values of 123 and 4, respectively indicate positions 
                       below, to the left of, above and to the right of the specified (x,ycoordinates
  • Use xlim=c(-30,30) and ylim=c(-30,20) in the plot() function to control the x and y-axis

    Click here for output
    > # adjusting the printing limits
    > plot(pc_comp, col=sample_colors, pch=ADD_PLOTTING_SYMBOL_HERE, cex=ADD_PLOTTING_SIZE_HERE, xlim=c(-30,30), ylim=c(-30,20))
    
    > # adding sample names
    > text(pc_comp[1:3,1], pc_comp[1:3,2], colnames(expr_data_log)[1:3], col=sample_colors[1:3], pos=4)
    > text(pc_comp[4:6,1], pc_comp[4:6,2], colnames(expr_data_log)[4:6], col=sample_colors[4:6], pos=2)

With this information, if you draw a vertical line across the x-axis, the samples can be easily clustered in 2 groups: Control and Hypoxia. So PC1 is explaining the difference in protein expression given the type of sample (which we usually are after!).

If you draw a horizontal line across the y-axis, you can see that we also have 2 groups, one including Control_1, Hypoxia_3 and Hypoxia_2 and the other including Control_3, Control_2 and Hypoxia_1. It seems there is another factor affecting the expression of our proteins that was revcovered by PC2. As we do not have more information about the samples (metadata) we can't determine the source of this grouping. Is it important? we hope not!

Some statistics

With this first exploratory analysis we checked that our replicates "behaved the same" (similar overall protein expression), so now it is time to compare our samples to identify proteins that are expressed at different levels depending the condition. In a clinical setting these proteins are called biomarkers and they are quite important to detect diseases (as an example).

The process can be summarized in the following steps:

  • Calculate the average expression per group, so we can compare the groups
  • Calculate the ratio (or fold change) between groups, to see the difference in expression
  • Calculate the statistical significance of these changes, to know which changes to trust
  • Calculate the false positive rate, to compensate for so many statistical tests that will be done

Fold change

  • Use the following code to create a new variable called expr_data_out

    > # calculating the mean for both groups
    > expr_data_out <- cbind(Accession=dat$Accession,
                             expr_data, 
                             mean_Control = apply(expr_data[,1:3], 1, mean), 
                             mean_Hypoxia = apply(expr_data[,4:6], 1, mean))
    > head(expr_data_out)
    Click here for output
    > head(expr_data_out)
      Accession Control_1 Control_2 Control_3 Hypoxia_1
    1    Q09666 1.0328136 0.9969049 0.9661567 0.9075991
    2    Q15149 0.9682882 1.0134108 1.0136614 1.2065772
    3    P21333 0.9863230 1.0202332 0.9843270 0.9631440
    4    O75369 1.0025070 1.0141024 0.9868860 1.2494849
    5    P78527 0.9415689 1.0411724 1.0139032 0.9408542
    6    P07900 0.9977023 1.0187355 0.9834650 0.9918571
      Hypoxia_2 Hypoxia_3 mean_Control mean_Hypoxia
    1 0.8687352 0.8796478    0.9986251    0.8853274
    2 1.1955181 1.1898906    0.9984535    1.1973286
    3 0.9253854 0.9130711    0.9969611    0.9338668
    4 1.1944714 1.1596355    1.0011651    1.2011973
    5 0.8937926 0.9129364    0.9988815    0.9158610
    6 0.9207342 0.8771933    0.9999676    0.9299282
  • Compute the log2-fold change with the following code

    > # calculating the mean for both groups
    > expr_data_out <- expr_data_out %>% 
                       mutate(log2FC = log2(mean_Hypoxia/mean_Control))
    > head(expr_data_out)
    Click here for output
    > head(expr_data_out)
      Accession Control_1 Control_2 Control_3 Hypoxia_1
    1    Q09666 1.0328136 0.9969049 0.9661567 0.9075991
    2    Q15149 0.9682882 1.0134108 1.0136614 1.2065772
    3    P21333 0.9863230 1.0202332 0.9843270 0.9631440
    4    O75369 1.0025070 1.0141024 0.9868860 1.2494849
    5    P78527 0.9415689 1.0411724 1.0139032 0.9408542
    6    P07900 0.9977023 1.0187355 0.9834650 0.9918571
      Hypoxia_2 Hypoxia_3 mean_Control mean_Hypoxia
    1 0.8687352 0.8796478    0.9986251    0.8853274
    2 1.1955181 1.1898906    0.9984535    1.1973286
    3 0.9253854 0.9130711    0.9969611    0.9338668
    4 1.1944714 1.1596355    1.0011651    1.2011973
    5 0.8937926 0.9129364    0.9988815    0.9158610
    6 0.9207342 0.8771933    0.9999676    0.9299282
           log2FC
    1 -0.17373212
    2  0.26205210
    3 -0.09432032
    4  0.26279317
    5 -0.12518487
    6 -0.10476197

Statistical significance

  • Use the following code to determine if the fold changes you just calculated are statistically significant

    > # calculating the mean for both groups
    > expr_data_out <- cbind(expr_data_out, 
                             pvalue = apply(expr_data_log, 1, function(x) {t.test(x[1:3], x[4:6])$p.value}))
    
    
    > head(expr_data_out)
    Click here for output
    > head(expr_data_out)
    Accession Control_1 Control_2 Control_3 Hypoxia_1
    1    Q09666 1.0328136 0.9969049 0.9661567 0.9075991
    2    Q15149 0.9682882 1.0134108 1.0136614 1.2065772
    3    P21333 0.9863230 1.0202332 0.9843270 0.9631440
    4    O75369 1.0025070 1.0141024 0.9868860 1.2494849
    5    P78527 0.9415689 1.0411724 1.0139032 0.9408542
    6    P07900 0.9977023 1.0187355 0.9834650 0.9918571
      Hypoxia_2 Hypoxia_3 mean_Control mean_Hypoxia
    1 0.8687352 0.8796478    0.9986251    0.8853274
    2 1.1955181 1.1898906    0.9984535    1.1973286
    3 0.9253854 0.9130711    0.9969611    0.9338668
    4 1.1944714 1.1596355    1.0011651    1.2011973
    5 0.8937926 0.9129364    0.9988815    0.9158610
    6 0.9207342 0.8771933    0.9999676    0.9299282
           log2FC      pvalue
    1 -0.17373212 0.009422054
    2  0.26205210 0.004481209
    3 -0.09432032 0.034120750
    4  0.26279317 0.007762774
    5 -0.12518487 0.085029848
    6 -0.10476197 0.167316618

Multiple testing correction

When we are testing for potential differences between datasets, the incidence of false positives (in this case, peptides that are falsely called differentially expressed when they are not) will increase when we increase the amount of tests performed. It is therefore necessary to perform a multiple testing correction, which will adjust the individual p-value for each peptide while keeping the overall error rate (or false positive rate).

The p.adjust() function harbours several methods to correct our pvalues. In this case we will use the Benjamini and Hochberg test, which deals with the false discover rate (FDR): the expected proportion of false discoveries amongst the rejected null hypotheses.

  • Run the following:

    > # calculating the mean for both groups
    > expr_data_out <- cbind(expr_data_out, fdr = p.adjust(expr_data_out$pvalue, method="BH"))
    
    > head(expr_data_out)
    Click here for output
    Accession Control_1 Control_2 Control_3 Hypoxia_1
    1    Q09666 1.0328136 0.9969049 0.9661567 0.9075991
    2    Q15149 0.9682882 1.0134108 1.0136614 1.2065772
    3    P21333 0.9863230 1.0202332 0.9843270 0.9631440
    4    O75369 1.0025070 1.0141024 0.9868860 1.2494849
    5    P78527 0.9415689 1.0411724 1.0139032 0.9408542
    6    P07900 0.9977023 1.0187355 0.9834650 0.9918571
      Hypoxia_2 Hypoxia_3 mean_Control mean_Hypoxia
    1 0.8687352 0.8796478    0.9986251    0.8853274
    2 1.1955181 1.1898906    0.9984535    1.1973286
    3 0.9253854 0.9130711    0.9969611    0.9338668
    4 1.1944714 1.1596355    1.0011651    1.2011973
    5 0.8937926 0.9129364    0.9988815    0.9158610
    6 0.9207342 0.8771933    0.9999676    0.9299282
           log2FC      pvalue        fdr
    1 -0.17373212 0.009422054 0.05249245
    2  0.26205210 0.004481209 0.03699903
    3 -0.09432032 0.034120750 0.10469770 
    4  0.26279317 0.007762774 0.04793582
    5 -0.12518487 0.085029848 0.18275358
    6 -0.10476197 0.167316618 0.28171728

Saving results

At this point it is recommendable to save all the results in a file. We can of course save the filtered results (see next section) however, some downstream tools require all the results to calculate background noise, like when we do a pathway analysis.

  • Use write_xlsx() to save expr_data_out in a file named results.xls

    Click here for output
    # saving results to a file
    # write_xlsx(exprdata_out, "results.xlsx")

Data filtering

Now let's apply some filters to narrow down our candidate proteins.

  • Select a threshold for the FDR and save is as alfa

  • Select a threshold for the fold change and save is as FC

    Click here for output
    > # setting some thresholds
    > alfa=ADD_YOUR_THRESHOLD_HERE
    > FC=ADD_YOUR_THRESHOLD_HERE

Let's identify the statistically significant changes by running the following code:

> # calculating the number of statistically significant changes using FDR
> sum(expr_data_out$fdr < alfa)
[1] 529

> # calculating the number of statistically significant changes using p-value
> sum(expr_data_out$pvalue < alfa)
[1] 39

Q14. Why are there less statistically significant changes when we filter our data based on FDR?

  • Save the indexes of the proteins that passed our threshold in a variable called sig

    Click here for output
    > # protein indexes that are statistically significant
    > sig <- expr_data_out$fdr < alfa
  • Use sig to display the name of the selected proteins from expr_data_out

    Click here for output
    > # displaying selected proteins 
    > expr_data_out$Accession[sig]

Some plots

Heatmap

A common visualization is a heatmap of the log values, where we can see the expression of the significant proteins across all the samples.

> # plotting expression data per sample
> heatdata <- as.matrix(expr_data_out[sig,2:7])
> rownames(heatdata) <- expr_data_out[sig,1] -> rownames(heatdata)
> heatmap(heatdata, margins = c(7,5))
Click here for output

Scatterplot

A different visualization of the results is a volcano plot. This graph displays all the data and colors them based on different thresholds.

> # plotting a volcano plot
> EnhancedVolcano(exprdata_out,                         # data to plot
                  lab = exprdata_out$Accession,         # labels (protein name)
                  x = "log2FC",                         # value to plot in the X-axis
                  y = "pvalue",                         # value to plot in the Y-axis
                  pCutoff = alfa,                       # cut off for the pvalue
                  FCcutoff = FC,                        # cut off for the fold change
                  title = "ADD_YOUR_TITLE_HERE",        # Title of the plot
                  subtitle = NULL,                      # Remove subtitle to create more space for the plot
                  caption = NULL,                       # Remove caption to create more space for the plot 
                                                        # (if you remove this line you will get the number of proteins plotted)
                  legendPosition = "top",               # Position the legend on top of the plot
                  axisLabSize = 11,                     # Set font size for axis labels
                  labSize = 2.8,                        # Set font size for protein labels
                  xlim = c(-3,3),                       # Set x-axis limits to -3 and 3 so the plot is symmetric
                  ylim = c(0, 7))                       # Set y-axis limits to 0 and 7
Click here for output

Q15. There are some proteins depicted as red dots, what does that mean? Are these proteins also plotted in the heatmap?

Data grouping

To showcase some other operations and plots, download the SraRunTable_rnaseq_CANVAS.xlsx file from Canvas.

  • Read the file in a variable called meta

    Click here for output
    > # reading the file
    > meta <- read_xlsx("SraRunTable_rnaseq_CANVAS.xlsx")
    > meta
    # A tibble: 696 × 34
       `Sample group` Run       `Assay Type` AvgSpotLen Bases
       <chr>          <chr>     <chr>             <dbl> <chr>
     1 A              SRR34747RNA-Seq             202 76162 A              SRR34747RNA-Seq             202 71323 A              SRR34752RNA-Seq             202 26604 A              SRR34752RNA-Seq             202 27265 B              SRR34752RNA-Seq             202 11256 B              SRR34747RNA-Seq             202 74967 B              SRR34747RNA-Seq             202 34988 B              SRR34747RNA-Seq             202 29299 B              SRR34752RNA-Seq             202 112210 C              SRR34747RNA-Seq             202 9587# ℹ 686 more rows
    # ℹ 29 more variables: BioProject <chr>,
    #   BioSample <chr>, Bytes <dbl>, `Center Name` <chr>,
    #   Consent <chr>, `DATASTORE filetype` <chr>,
    #   `DATASTORE provider` <chr>,
    #   `DATASTORE region` <chr>, Experiment <chr>,
    #   `GEO_Accession (exp)` <chr>, Instrument <chr>, …
    # ℹ Use `print(n = ...)` to see more rows
  • Display the column names with colnames() using meta as argument

    Click here for output
    > # displaying all columns
    > colnames(meta)
     [1] "Sample group"        "Run"                
     [3] "Assay Type"          "AvgSpotLen"         
     [5] "Bases"               "BioProject"         
     [7] "BioSample"           "Bytes"              
     [9] "Center Name"         "Consent"            
    [11] "DATASTORE filetype"  "DATASTORE provider" 
    [13] "DATASTORE region"    "Experiment"         
    [15] "GEO_Accession (exp)" "Instrument"         
    [17] "LibraryLayout"       "LibrarySelection"   
    [19] "LibrarySource"       "Organism"           
    [21] "Platform"            "ReleaseDate"        
    [23] "Sample Name"         "source_name"        
    [25] "SRA Study"           "AGE"                
    [27] "gender"              "Histology"          
    [29] "ps_who"              "Smoking"            
    [31] "stage_tnm"           "surgery_date"       
    [33] "dead"                "vital_date"     

Let's summarize the data based on gender.

  • Display the different genders using the following code

    > # displaying all columns
    > meta %>% distinct(gender)
    Click here for output
    > # displaying all columns
    > meta %>% distinct(gender)
    # A tibble: 3 × 1
      gender
      <chr> 
    1 female
    2 male  
    3 NA    
  • Create a table with the total number of samples corresponding to the different genders, and save it as gender:

    > # grouping by gender
    > gender <- meta %>% group_by(gender) %>% 
                summarise(samplesNo=n())
    Click here for output
    > # grouping by gender
    > gender <- meta %>% group_by(gender) %>% 
                summarise(samplesNo=n())
    # A tibble: 3 × 2
      gender samplesNo
      <chr>      <int>
    1 female       325
    2 male         275
    3 NA            96

Plots with ggplot

Although there are base graphics in R (as we have seen in previous examples), ggplot2 is quickly replacing them, as it is an scheme for data visualization which breaks up graphs into semantic components such as scales and layers. It is a little harder to learn, but once you do, your graphs will be spotless!

Pie chart

Let's create a pie chart using ggplot() to visualize this information. Run:

> # creating pie chart
> gender %>% 
  ggplot(aes(x="", y=samplesNo, fill=gender))  +
  geom_bar(stat="identity") + 
  coord_polar("y", start=0)
Click here for output

  • Add a title with this argument labs(title="ADD_YOUR_TITLE_HERE")

    Click here for output
    > # adding labels
    > gender %>% 
      ggplot(aes(x="", y=samplesNo, fill=gender)) +
      geom_bar(stat="identity") + 
      coord_polar("y", start=0) + 
      labs(title="Gender distribution")

  • Add theme_void() to make a cleaner plot

    Click here for output
    > # prettifying the plot
    > gender %>% 
      ggplot(aes(x="", y=samplesNo, fill=gender)) +
      geom_bar(stat="identity") + 
      coord_polar("y", start=0) + 
      labs(title="Gender distribution") +
      theme_void()

Q16. Make a pie chart using Smoking instead of gender

Click here for output
# selecting and grouping the data based on the smoking column
smoking <- meta %>% group_by(Smoking) %>% 
  summarise(samplesNo=n()) 

# plotting
smoking %>% 
  ggplot(aes(x="", y=samplesNo, fill=Smoking)) +
  geom_bar(stat="identity") + 
  coord_polar("y", start=0) + 
  labs(title="ADD_YOUR_TITLE_HERE") +
  theme_void() 

Barplot

Let's create a subset of our data based on the Histology column, where we would like to show for the different histologies, the number of females and males.

First, we investigate the values of all histologies, run:

# extracting the number of categories
> table(meta$Histology)
  1   2   3 
209 318  73 

There are 3 histologies, where 1 = squamos cell cancer, 2 = unspecified and 3 = Large cell/NOS. This information is found in the metadate of where the data was downloaded. Thus for each one of these categories, we need to extract gender and create a list that we can then plot. Try to follow these code:

> # extracting gender for each category
> scc <- meta %>%  filter(Histology %in% 1) %>% select(gender) %>% mutate (Sample="SCC")
> uns <- meta %>%  filter(Histology %in% 2) %>% select(gender) %>% mutate (Sample="uns")
> nos <- meta %>%  filter(Histology %in% 3) %>% select(gender) %>% mutate (Sample="NOS")

> # concatenating all the variables
> histology <- bind_rows(scc,uns,nos)

> # checking results
> table(histology)
        Sample
gender   NOS SCC uns
  female  45  81 199
  male    28 128 119
  • Create a barplot for histology

    Click here for output
    > # creating a barplot
    > histology %>%
         ggplot(aes(x=Sample, fill=gender)) +  
         geom_bar(stat="count", position="dodge") +
         labs(title="ADD_YOUR_TITLE_HERE") +
         theme_classic()

Venn diagram

A venn diagram is great to show overlapping features. For instance, let's compare the Runs from non Smokers and individuals classified with metastasis. Non smokers can be found in the Smoking column with a value of 3, while metastasis is in the stage_tnm column annotated as 7.

Run this code, line by line so you understand what data we are retrieving:

> # creating nonSmokers list
> nonSmokers <- unlist(meta %>% filter(Smoking %in% 3) %>% select(Run))

> # creating the metastasis list
> metastasis<- unlist(meta %>% filter(stage_tnm %in% 7) %>% select(Run))

> # comparing lists
> intersect(nonSmokers,metastasis)

Q17. How many Runs are shared between Non smokers and individuals with metastasis?

Click here for output
[1] "SRR3475142" "SRR3475143" "SRR3475144"
  • Visualize the results with ggvenn()  

    Click here for output
    > # install and load ggvenn if not done already
    > # install.packages("ggvenn") 
    > # library(ggvenn)
    
    > creating a venn diagram
    > ggvenn(list(nonSmokers=nonSmokers, metastasis=metastasis))

  • Modify the color of the plot by using the fill_color argument

    Click here for output
    > adjusting colors
    > ggvenn(list(nonSmokers=nonSmokers, metastasis=metastasis), fill_color = c("ADD_YOUR_COLOR_HERE", "ADD_YOUR_COLOR_HERE"))

Histograms

Let's look at the age distribution and generate a histogram. Run the following examples:

> # plotting a histogram for age
> ggplot(meta, aes(x=AGE)) + geom_histogram()
Click here for output

> # modifying binwdth
> ggplot(data, aes(x=AGE)) + geom_histogram(binwidth = ADD_A_BINWIDTH_HERE) 
Click here for output

Q18. What is binwidth?

ANS = The number of bars of the histogram
> # plotting a histogram for age
> ggplot(meta, aes(x=AGE)) + geom_histogram(binwidth = 5, fill="ADD_COLOR_HERE", color="ADD_ANOTHER_COLOR_HERE") 
Click here for output

> # plotting a histogram for age
> ggplot(meta, aes(x=AGE, fill=gender)) + geom_histogram() 
Click here for output

Q19. What can you conclude from this graph?



Created by Marcela Dávila with material from Maria Nethander, 2023.

⚠️ **GitHub.com Fallback** ⚠️