R - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Note: Don’t just copy and paste the code; type it yourself. Knowing is not the same as doing.


title: "R for Beginners Session 1 - Get Started"

Today's workshop is a brief introduction to R and RStudio, as well as to R's basic data structure, the vector.

Why R

Key features of R include:

Open Source Software: R is open source, which means it is freely available for anyone to use, modify, and distribute. This fosters a collaborative environment where users can contribute to its development and share their work.

Comprehensive Statistical Analysis: R provides a wide array of statistical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.

Data Manipulation and Visualization: With packages like dplyr and ggplot2, R excels in data manipulation and visualization, enabling users to create complex and informative plots with ease.

Extensible with Packages: The Comprehensive R Archive Network (CRAN) hosts thousands of packages contributed by the community, extending R's capabilities in various fields, from bioinformatics to social sciences.

Active Community and Support: R has a large and active community that contributes to its development and provides extensive documentation, tutorials, and forums for support.

RStudio

RStudio is an integrated development environment (IDE) that provides a comprehensive set of tools to enhance your coding experience, making it more productive and enjoyable. RStudio is currently one of the most popular IDEs for R.

RStudio is not required to use R, but RStudio does not include R itself: to use RStudio, you need both R and RStudio installed on your computer.

RStudio Interface

RStudio's interface is organized into four main panes, each designed to streamline different aspects of your workflow:

  • Source (built-in script editor) pane: You write and edit your scripts, functions, and markdown documents. It supports syntax highlighting, code completion, and easy navigation between files.

  • Console pane: You can enter and execute R commands directly. It also displays output, error messages, and results from your code.

  • Environment/History pane: The Environment tab lists all active objects, such as data frames, variables, and functions, allowing you to manage and inspect them easily. The History tab keeps a log of all commands you’ve executed, which can be reused or modified.

  • Files/Plots/Packages/Help pane:

    • Files: Navigate your project directory and manage files.
    • Plots: View plots generated by your code.
    • Packages: Manage installed packages and load new ones.
    • Help: Access R documentation and help files.

# Task 1: check the working directory

getwd()

# Task 2: create a folder under the working directory 

dir.create("data")

A quick exercise


# Exercise: create two folders "output" and "scripts" in the working directory

dir.create("output")

dir.create("scripts")
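dir.create() gives a warning when the folder already exists, so rerunning the lines above can be noisy. A minimal sketch of one way to guard against that is shown here. The name base_dir is hypothetical, and the sketch uses a temporary directory so it can be run anywhere; for a real project you would probably use getwd() instead.

```r
# Assumption: base_dir is a scratch location chosen for illustration;
# replace tempdir() with getwd() to work in your own working directory.
base_dir <- tempdir()

for (folder in c("data", "output", "scripts")) {
  folder_path <- file.path(base_dir, folder)
  # create the folder only if it is not there yet
  if (!dir.exists(folder_path)) {
    dir.create(folder_path)
  }
}

# check that all three folders now exist
dir.exists(file.path(base_dir, c("data", "output", "scripts")))
```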

Download, read and view the data


# Task 3: download the test data

download.file(
  "https://raw.githubusercontent.com/datacarpentry/r-socialsci/main/episodes/data/SAFI_clean.csv",
  "./SAFI_clean.csv", mode = "wb")

# you will get the SAFI_clean.csv in your current working directory
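Once the file is downloaded, it can be read into R and viewed. The sketch below is one possible way to do that with read.csv(); it assumes the file was saved as ./SAFI_clean.csv by Task 3, and safi is a hypothetical variable name. Data frames, the structure that read.csv() returns, are covered later in this session.

```r
# Assumption: SAFI_clean.csv was downloaded to the working directory in Task 3;
# `safi` is a hypothetical variable name.
csv_path <- "./SAFI_clean.csv"

if (file.exists(csv_path)) {
  safi <- read.csv(csv_path)  # read the file into a data frame
  print(head(safi))           # show the first rows
  print(dim(safi))            # number of rows and columns
} else {
  message("SAFI_clean.csv not found; run the download step first")
}
```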

A function is a block of code designed to perform a specific task. Functions are essential in programming because they allow you to reuse code, making your work more efficient. Usually, functions take inputs, called arguments, process them, and return an output.


# Task 4: call functions and read their help pages

# an example of a function
  sqrt(9)

# to learn more about a function, use help() or ?
  help(sqrt)
  ?sqrt

# another example of a function
  round(pi)

# round(x, digits = 0, ...): the argument digits has a default value of 0.
# You can override the default by passing your own value

  round(x = pi, digits = 2)

# a function is a block of code: typing its name without parentheses prints its definition
  download.file # download.file is a function in the package 'utils'
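To make the idea of arguments, processing, and output concrete, here is a small sketch of a function you could write yourself. The function celsius_to_fahrenheit and its argument names are hypothetical, made up for this illustration; they are not part of R.

```r
# celsius_to_fahrenheit is a hypothetical function written for this example
celsius_to_fahrenheit <- function(celsius, digits = 1) {
  fahrenheit <- celsius * 9 / 5 + 32   # process the input
  round(fahrenheit, digits)            # the last value is returned as the output
}

celsius_to_fahrenheit(100)                # uses the default digits = 1
celsius_to_fahrenheit(36.65, digits = 0)  # overrides the default
```

Like round(), this function has one required argument and one argument with a default value.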

All programming languages allow programmers to include comments in their code, and doing so offers many advantages. Comments help you explain your reasoning and encourage tidiness in your code. They also serve as valuable references for your collaborators and your future self. In fact, comments are crucial for ensuring that your analysis is reproducible.

In R, comments are added using the # character. Anything written to the right of the # and up to the end of the line is treated as a comment and ignored by R. You can start a line with a comment or place a comment after code on the same line.
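A short illustration of both placements, using made-up variable names:

```r
# a comment on its own line: compute the area of a circle
radius <- 2
area <- pi * radius^2  # a comment after code on the same line

# area <- 0   (this line is commented out, so R does not run it)
area
```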

Data Structures in R

R has several key data structures to store and organize data. Each structure serves a different purpose and can handle different types of data. In this workshop series, we mainly cover two types of data structures, vector and data frame.

Vector

A vector is the basic data structure in R. It is composed of a series of values that can be numbers, characters, or logical values. All elements in a vector must be of the same type.

The most common types of vectors are numeric (or double), integer, character, and logical.

The function c() is used to create a vector. "c" stands for "combine" or "concatenate".

Here are examples of vectors:


# Task 5: create a numeric vector

num_vector <- c(1.5, 50, pi)

num_vector <- c(1.5, 50, pi) assigns a vector of three numeric elements to a variable num_vector.

  • Creating a vector: c()

    c() is the function in R to create a vector.

  • Assignment operator: <-

    The symbol <- is the assignment operator in R. It assigns the value on the right-hand side to the variable on the left-hand side.

  • Variable: num_vector

    The created vector is stored in the variable num_vector, which can be used later in your R code to perform calculations, make plots, or analyze data.

    naming convention:

    • names must start with a letter, cannot start with a number
    • names cannot contain spaces
    • avoid special characters, such as @, #, etc. "_" is acceptable, for example, my_data.
    • case sensitive. my_data and My_data are two different variables.
    • avoid reserved words or existing function names, such as while, if, else, TRUE, etc.
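The naming rules above can be tried out directly. This is only a sketch with made-up variable names; make.names() is a base R function that shows how R would repair a name that breaks the rules.

```r
my_data <- c(1, 2, 3)   # valid: starts with a letter and uses "_"
My_data <- c(10, 20)    # a different variable: names are case sensitive
sample2 <- "ok"         # digits are fine after the first character

length(my_data)
length(My_data)

# make.names() shows how R would turn invalid names into valid ones
make.names(c("2nd_sample", "my data", "if"))
```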

# Task 6: create other types of vectors

# integer vector example
int_vector2 <- c(4:7)
int_vector2

int_vector <- c(3L, 2L, 30L, 2L, -1L)
int_vector

# character vector example
char_vector <- c("R", "workshop", "hello", "world")
char_vector

# logical vector example
log_vector <- c(TRUE, FALSE, T)
log_vector

Inspect vectors


# Task 7: inspect vectors with functions

# is it a vector?
  is.vector(num_vector)
  
  x <- 3
  x
  is.vector(x)
  
  y <- c(1:100)
  y
  is.vector(y)

# data type of a vector
  typeof(num_vector)
  
  typeof(int_vector)
  
  typeof(char_vector)
  
  typeof(log_vector)

# how many elements in a vector
  length(y)
  
  length(x)
  
  length(char_vector)

# unique elements in a vector
  unique(int_vector)

# summary of a vector
  str(char_vector)  # an overview of the structure of a vector and its elements
  
  str(x)
  
  summary(char_vector)
  
  summary(y)
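Because all elements of a vector must share one type, c() converts mixed input to a common type without warning. The sketch below, with made-up variable names, uses typeof() to show the result of a few combinations.

```r
mixed_num_chr <- c(1.5, "a", TRUE)    # numbers and logicals become character
mixed_num_lgl <- c(1.5, TRUE, FALSE)  # logicals become numbers (TRUE = 1, FALSE = 0)
mixed_int_dbl <- c(1L, 2.5)           # integers become double

typeof(mixed_num_chr)
typeof(mixed_num_lgl)
typeof(mixed_int_dbl)
```

It is worth checking typeof() whenever a vector does not behave as expected, since a single quoted value turns the whole vector into characters.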

Add more elements to a vector


# Task 8: add more elements to a vector using the function c()

  x <- c(1, 2, x, 4, 7)
  x
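Elements can be added at the end, at the beginning, or in the middle. The sketch below uses a separate, made-up vector called values so that x from Task 8 stays unchanged; append() is a base R function whose after argument sets the insertion position.

```r
values <- c(1, 2, 3, 4, 7)                # hypothetical example vector

values <- c(values, 8)                    # add to the end
values <- c(0, values)                    # add to the beginning
values <- append(values, 2.5, after = 3)  # insert after the 3rd element
values
```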

Subset a vector - using []

Subsetting (sometimes referred to as extracting or indexing) involves extracting one or more values based on their numeric placement or “index” within a vector. If we want to subset one or several values from a vector, we must provide one index or several indices in square brackets. For instance:


# Task 9: subset a vector using the index

  x[3]
  
  x[c(4,2)]

  z <- x[c(3,4,2,1,2,4,5)] # subsetting does not mean the new vector has fewer elements

# how about index 0, a negative index, or an index greater than the number of elements?
  
  x[-1] # returns x without the first element (x itself is unchanged)
  
  x[0]  # what is it? is it a vector?
  is.vector(x[0])
  typeof(x[0])
  length(x[0])
  

  x[7] # NA means a not available (missing) element

R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

Conditional subsetting

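Another common way to subset is with a logical vector: R keeps the elements where the condition is TRUE and drops those where it is FALSE. For instance, if you wanted to select only the values above 2: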



# Task 10: subset a vector with conditions

  z > 2
  
  z[z > 2]
  
  # subset a vector with multiple conditions
  
  z[z >= 2 & z < 4]
  
  z[z < 2 | z > 3]

  char_vector[char_vector == "world" | char_vector == "work"]

  
  # operator %in%
  
  z[z %in% c(2:4)]  # for this z (whole numbers only), the same result as z[z >= 2 & z <= 4]
  
  char_vector[char_vector %in% c("hi", "workshop", "R")]  # identify shared elements; the same results as char_vector[char_vector == "hi" | char_vector == "workshop" | char_vector == "R"]
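It can help to look at the logical vector on its own before using it as an index. The sketch below uses a made-up vector called scores and also shows why %in% and a range test only agree when the values are whole numbers.

```r
scores <- c(3, 4, 2, 1, 2, 4, 7)   # hypothetical example vector

keep <- scores > 2     # one TRUE/FALSE per element
keep
scores[keep]           # same as scores[scores > 2]
sum(keep)              # TRUE counts as 1, so this is the number of matches
which(keep)            # positions of the matching elements

# %in% tests for exact matches, so it differs from a range test
# once the values are not whole numbers
2.5 %in% 2:4
2.5 >= 2 & 2.5 <= 4
```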
  

Data Frames

A data frame in R is a table or a 2-dimensional structure where data is organized in rows and columns, similar to a spreadsheet or database table, with columns representing variables and rows representing observations. It’s one of the most commonly used data structures in R.

Create data frames

You can create a data frame using the data.frame() function by passing vectors as arguments.


# Task 11: create a data frame

df <- data.frame(
  ID = c(1:3),
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(23, 34, 21),
  Grad = c(TRUE, TRUE, FALSE)
)
# display the data frame: each column is a variable, each row represents a person

df

Inspect data frames


# Task 12: inspect data frames

# Is it a data frame?
  is.data.frame(df)

# size of a data frame

  nrow(df) # the number of rows
  ncol(df) # the number of columns
  dim(df) # the dimensions of the data frame

# content of a data frame
  head(df) # shows the first 6 rows of the data frame
  tail(df) # shows the last 6 rows of the data frame
  View(df) # opens the data frame in a spreadsheet-style viewer

# name of columns
  names(df)

# summary
  str(df)
  summary(df)

Subset a data frame

There are three common symbols to subset a data frame: [], [[]], and $


# Task 13: use a two-dimensional index to subset a data frame

  df[2,3]  # the first index is the row and the second index is the column
  
  df[2:3, 1:2]
  df[c(1,3), c(2,4)]
  df[-2,c(-1,-3)]

# obtain row(s)
  df[3, ]
  df[2:3, ]

# obtain column(s)
  df[, c(1,4)]

  df[c(1,4)]
  is.data.frame(df[c(1,4)])

# use column names to extract column(s)
  df[c("ID", "Name")]
  
  df["Age"]
  is.data.frame(df["Age"])

# Use [[]] to extract a column, the output is a vector

  df[[3]]  # use [[]] to extract a specific column
  
  is.data.frame(df[[3]])
  is.vector(df[[3]])

  df[[3:4]]  # [[]] extracts ONE column; 3:4 is read as df[[3]][[4]], which fails here with "subscript out of bounds"

# Use $ to extract a column by name, the output is a vector

  df$Age  # use $ to extract a specific column
  
  is.data.frame(df$Age)
  is.vector(df$Age)

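  • [] - result is a data frame when given one index (e.g. df[c(1,4)]); with [row, column], selecting a single column returns a vector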
  • [[]] - result is a vector
  • $ - result is a vector
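The differences between the three operators can be checked with class(). The sketch below rebuilds the Task 11 data under the made-up name people so that it runs on its own; drop = FALSE is a standard argument of [ that keeps a single column as a data frame.

```r
# `people` is a hypothetical copy of the data frame from Task 11
people <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(23, 34, 21),
  Grad = c(TRUE, TRUE, FALSE)
)

class(people["Age"])                  # one index: still a data frame
class(people[["Age"]])                # a vector
class(people$Age)                     # a vector
class(people[, "Age"])                # [row, column] with one column: a vector
class(people[, "Age", drop = FALSE])  # drop = FALSE keeps it a data frame
```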

Conditional subset


# Task 14: Conditional subsetting

  df[df$Age > 22, ]
  
  df[df$ID == 2, ]
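Conditions can be combined with & and |, and the column index can be used at the same time to keep only some columns. The sketch below again uses a made-up copy of the Task 11 data called people; subset() is a base R function that offers an alternative way to write the same filter.

```r
# `people` is a hypothetical copy of the data frame from Task 11
people <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(23, 34, 21),
  Grad = c(TRUE, TRUE, FALSE)
)

# rows where both conditions are TRUE, keeping only two columns
older_grads <- people[people$Age > 22 & people$Grad, c("Name", "Age")]
older_grads

# the same filter written with subset()
subset(people, Age > 22 & Grad, select = c(Name, Age))
```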