R - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Note: Don’t just copy and paste the code; type it yourself. Knowing is not the same as doing.
Today's workshop is a brief introduction to R and RStudio, as well as to the basic data structure in R, the vector.
Key features of R include:
Open Source Software: R is open source, which means it is freely available for anyone to use, modify, and distribute. This fosters a collaborative environment where users can contribute to its development and share their work.
Comprehensive Statistical Analysis: R provides a wide array of statistical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
Data Manipulation and Visualization: With packages like dplyr and ggplot2, R excels in data manipulation and visualization, enabling users to create complex and informative plots with ease.
Extensible with Packages: The Comprehensive R Archive Network (CRAN) hosts thousands of packages contributed by the community, extending R's capabilities in various fields, from bioinformatics to social sciences.
Active Community and Support: R has a large and active community that contributes to its development and provides extensive documentation, tutorials, and forums for support.
RStudio is an integrated development environment (IDE) that provides a comprehensive set of tools to enhance your coding experience, making it more productive and enjoyable. RStudio is currently one of the most popular IDEs for R.
RStudio is not required to use R, but RStudio relies on R to run your code, so you need both R and RStudio installed on your computer to use RStudio.
RStudio's interface is organized into four main panes, each designed to streamline different aspects of your workflow:
- Source (built-in script editor) pane: You write and edit your scripts, functions, and markdown documents. It supports syntax highlighting, code completion, and easy navigation between files.
- Console pane: You can enter and execute R commands directly. It also displays output, error messages, and results from your code.
- Environment/History pane: The Environment tab lists all active objects, such as data frames, variables, and functions, allowing you to manage and inspect them easily. The History tab keeps a log of all commands you’ve executed, which can be reused or modified.
- Files/Plots/Packages/Help pane:
  - Files: Navigate your project directory and manage files.
  - Plots: View plots generated by your code.
  - Packages: Manage installed packages and load new ones.
  - Help: Access R documentation and help files.
# Task 1: check the working directory
getwd()
# Task 2: create a folder under the working directory
dir.create("data")
# Exercise: create two folders "output" and "scripts" in the working directory
dir.create("output")
dir.create("scripts")
# Task 3: download the test data
download.file(
"https://raw.githubusercontent.com/datacarpentry/r-socialsci/main/episodes/data/SAFI_clean.csv",
"./SAFI_clean.csv", mode = "wb")
# you will get the SAFI_clean.csv in your current working directory
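The folder steps above can be made safe to re-run with a small sketch like the one below. The location (a folder inside R's temporary directory) and the variable names data_dir and dest_file are assumptions chosen for illustration, so the example does not touch your project; in the workshop you would use your working directory instead.

```r
# Hypothetical location for illustration: a "data" folder inside R's temporary directory
data_dir <- file.path(tempdir(), "data")

# dir.create() gives a warning if the folder already exists, so check first
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# file.path() joins folder and file name with the correct separator
dest_file <- file.path(data_dir, "SAFI_clean.csv")

dir.exists(data_dir)     # confirm that the folder is there
file.exists(dest_file)   # tells you whether a file has already been saved at this path
```

Passing dest_file as the second argument of download.file() would save the file inside the folder instead of directly in the working directory.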
A function is a block of code designed to perform a specific task. Functions are essential in programming because they allow you to reuse code, making your work more efficient. Usually, functions take inputs, called arguments, process them, and return an output.
# Task 4: use functions
# an example of a function
sqrt(9)
# to learn more about a function, use help() or ?
help(sqrt)
?sqrt
# another example of a function
round(pi)
# round(x, digits = 0, ...): the argument digits has a default value of 0, which you can override
round(x = pi, digits = 2)
# a function is a block of code: typing its name without parentheses prints its definition
download.file # download.file is a function in the package 'utils'
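To make the idea of arguments, default values, and return values concrete, here is a minimal sketch of a user-defined function. The function circle_area and its argument radius are hypothetical names invented for this illustration; they are not part of R or of the workshop material.

```r
# A hypothetical function: area of a circle, with a default radius of 1
circle_area <- function(radius = 1) {
  pi * radius^2   # the value of the last expression is returned
}

circle_area()             # no argument given, so the default radius = 1 is used
circle_area(radius = 2)   # argument matched by name
circle_area(2)            # same call, argument matched by position

# args() shows the arguments of an existing function and their defaults
args(round)
```

Naming arguments explicitly, as in round(x = pi, digits = 2), makes a call easier to read when a function has several arguments.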
All programming languages allow programmers to include comments in their code, and doing so offers many advantages. Comments help you explain your reasoning and encourage tidiness in your code. They also serve as valuable references for your collaborators and your future self. In fact, comments are crucial for ensuring that your analysis is reproducible.
In R, comments are added using the # character. Anything written to the right of the # and up to the end of the line is treated as a comment and ignored by R. You can start a line with a comment or place a comment after code on the same line.
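The two placements of comments described above can be sketched as follows. The variable names and the temperature value are made up for this example.

```r
# A full-line comment: convert a temperature from Fahrenheit to Celsius
temp_f <- 98.6                    # hypothetical example value
temp_c <- (temp_f - 32) * 5 / 9   # a comment placed after code on the same line
temp_c

# temp_c <- 0   (this line starts with #, so R never runs it)
```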
R has several key data structures to store and organize data. Each structure serves a different purpose and can handle different types of data. In this workshop series, we mainly cover two types of data structures, vector and data frame.
A vector is the basic data structure in R. It is composed of a series of values that can be numbers, characters, or logical values. All elements in a vector must be of the same type.
The most common types of vectors are numeric (or double), integer, character, and logical.
The function c() is used to create a vector. "c" stands for "combine" or "concatenate".
Here are examples of vectors:
# Task 5: create a numeric vector
num_vector <- c(1.5, 50, pi)
The statement num_vector <- c(1.5, 50, pi) assigns a vector of three numeric elements to a variable named num_vector.
- Creating a vector: c() is the function in R to create a vector.
- Assignment operator: the symbol <- is the assignment operator in R. It assigns the value on the right-hand side to the variable on the left-hand side.
- Variable: the created vector is stored in the variable num_vector, which can be used later in your R code to perform calculations, make plots, or analyze data.
Naming convention for variables:
- names must start with a letter, cannot start with a number
- names cannot contain spaces
- avoid special characters, such as @, #, etc. "_" is acceptable, for example, my_data.
- names are case sensitive. my_data and My_data are two different variables.
- avoid reserved words or existing function names, such as while, if, else, TRUE, etc.
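One way to check a name against these rules is the base R function make.names(), which rewrites a string into a syntactically valid name; a string that comes back unchanged was already valid. The candidate names below are made up for this sketch.

```r
# Hypothetical candidate names to test
candidate_names <- c("my_data", "My_data", "2nd_try", "my data", "if")

# make.names() fixes invalid names (e.g. by prefixing "X" or replacing spaces with ".");
# comparing its result with the input tells us which names were valid to begin with
is_valid <- make.names(candidate_names) == candidate_names

data.frame(name = candidate_names, valid = is_valid)
```

Note that this only checks syntax. A name such as c or mean passes the check but would still shadow an existing function, so the last rule above needs your own judgment.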
# Task 6: create other types of vectors
# integer vector example
int_vector2 <- c(4:7)
int_vector2
int_vector <- c(3L, 2L, 30L, 2L, -1L)
int_vector
# character vector example
char_vector <- c("R", "workshop", "hello", "world")
char_vector
# logical vector example
log_vector <- c(TRUE, FALSE, T) # T is shorthand for TRUE; writing TRUE in full is safer because T can be reassigned
log_vector
# Task 7: inspect vectors with functions
# is it a vector?
is.vector(num_vector)
x <- 3
x
is.vector(x)
y <- c(1:100)
y
is.vector(y)
# data type of a vector
typeof(num_vector)
typeof(int_vector)
typeof(char_vector)
typeof(log_vector)
# how many elements in a vector
length(y)
length(x)
length(char_vector)
# unique elements in a vector
unique(int_vector)
# summary of a vector
str(char_vector) # an overview of the structure of a vector and its elements
str(x)
summary(char_vector)
summary(y)
# Task 8: add more elements to a vector using the function c()
x <- c(1, 2, x, 4, 7)
x
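Because all elements of a vector must share one type, c() silently converts mixed inputs to a common type. The sketch below illustrates this behaviour; the variable names are made up for the example.

```r
mixed_num  <- c(1L, 2.5)        # integer + double      -> double
mixed_lgl  <- c(TRUE, 2)        # logical + double      -> double (TRUE becomes 1)
mixed_char <- c(1, "a", TRUE)   # anything + character  -> character

typeof(mixed_num)
typeof(mixed_lgl)
typeof(mixed_char)
mixed_char   # every element is now a string
```

This is worth remembering when a numeric vector unexpectedly turns into text: a single quoted value is enough to convert the whole vector to character.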
Subsetting (sometimes referred to as extracting or indexing) means accessing one or more values based on their numeric position, or “index”, within a vector. To subset one or several values from a vector, provide one index or several indices in square brackets. For instance:
# Task 9: subset a vector using the index
x[3]
x[c(4,2)]
z <- x[c(3,4,2,1,2,4,5)] # indices can be repeated, so a subset is not necessarily shorter than the original vector
# what about index 0, a negative index, or an index greater than the number of elements?
x[-1] # returns all elements except the first (x itself is unchanged)
x[0] # what is it? is it a vector?
is.vector(x[0])
typeof(x[0])
length(x[0])
x[7] # NA means not available: x has only 5 elements, so there is no 7th element
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
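A short sketch of what 1-based indexing means in practice, using a made-up vector v:

```r
v <- c(10, 20, 30, 40)   # hypothetical example vector

v[1]            # the first element, because counting starts at 1
v[length(v)]    # the last element: its index equals the length of the vector
v[-length(v)]   # everything except the last element
v[2:3]          # a range of positions
```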
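You can also subset a vector with a logical condition: R keeps the elements for which the condition is TRUE. For instance, if you wanted to select only the values above 2:
# Task 10 (continued): subset a vector with conditions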
z > 2
z[z > 2]
# subset a vector with multiple conditions
z[z >= 2 & z < 4]
z[z < 2 | z > 3]
char_vector[char_vector == "world" | char_vector == "work"]
# operator %in%
z[z %in% c(2:4)] # the same results as z[z >= 2 & z<= 4]
char_vector[char_vector %in% c("hi", "workshop", "R")] # identify shared elements; the same results as char_vector[char_vector == "hi" | char_vector == "workshop" | char_vector == "R"]
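Conditional subsetting works in two steps, which the following sketch separates. The vector scores and the variable keep are made-up names for this illustration.

```r
scores <- c(3, 4, 2, 1, 2, 4, 7)   # hypothetical example vector

keep <- scores > 2   # step 1: the condition returns a logical vector, one value per element
keep
scores[keep]         # step 2: keep the elements where the logical vector is TRUE

sum(keep)            # TRUE counts as 1, so this is the number of matching elements
which(keep)          # the positions of the matching elements
```

Storing the logical vector in a variable first is optional, but it makes long conditions easier to inspect and reuse.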
A data frame in R is a table or a 2-dimensional structure where data is organized in rows and columns, similar to a spreadsheet or database table, with columns representing variables and rows representing observations. It’s one of the most commonly used data structures in R.
You can create a data frame using the data.frame() function by passing vectors as arguments.
# Task 11: create a data frame
df <- data.frame(
ID = c(1:3),
Name = c("Alice", "Bob", "Charlie"),
Age = c(23, 34, 21),
Grad = c(TRUE, TRUE, FALSE)
)
# display the data frame: each column is a variable, each row represents a person
df
# Task 12: inspect data frames
# Is it a data frame?
is.data.frame(df)
# size of a data frame
nrow(df) # the number of rows
ncol(df) # the number of columns
dim(df) # the dimensions of the data frame
# content of a data frame
head(df) # shows the first 6 rows by default (df has only 3 rows, so all of them are shown)
tail(df) # shows the last 6 rows by default
View(df)
# name of columns
names(df)
# summary
str(df)
summary(df)
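Since a data frame is built from vectors, each column keeps its own data type while all columns share the same length. The sketch below checks this with sapply(), which applies a function to every column; the data frame people is a made-up copy of the example above so that the block runs on its own.

```r
# Hypothetical data frame, mirroring the example above
people <- data.frame(
  ID   = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age  = c(23, 34, 21),
  Grad = c(TRUE, TRUE, FALSE)
)

sapply(people, typeof)   # one data type per column
sapply(people, length)   # every column has the same length, equal to nrow(people)
```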
There are three common symbols to subset a data frame: [], [[]], and $.
# Task 13: use a two-dimensional index to subset a data frame
df[2,3] # the first index is the row and the second index is the column
df[2:3, 1:2]
df[c(1,3), c(2,4)]
df[-2,c(-1,-3)]
# obtain row(s)
df[3, ]
df[2:3, ]
# obtain column(s)
df[, c(1,4)]
df[c(1,4)]
is.data.frame(df[c(1,4)])
# use column names to extract column(s)
df[c("ID", "Name")]
df["Age"]
is.data.frame(df["Age"])
# use [[]] to extract a single column; the output is a vector
df[[3]] # extract the third column
is.data.frame(df[[3]])
is.vector(df[[3]])
try(df[[3:4]]) # error (subscript out of bounds): [[]] extracts one column only, and 3:4 is read as "element 4 of column 3"
# use $ to extract a column by name; the output is a vector
df$Age # use $ to extract a specific column
is.data.frame(df$Age)
is.vector(df$Age)
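- []: the result is a data frame when you select columns with a single index, such as df["Age"] or df[c(1,4)]; with a row and a column index, selecting a single column returns a vector, as in df[, 3] or df[2,3]
- [[]]: the result is a vector
- $: the result is a vector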
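The difference between the three operators can be checked with class(). This is a minimal sketch on a made-up data frame called people, defined here so that the block runs on its own.

```r
# Hypothetical data frame for illustration
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age  = c(23, 34, 21)
)

class(people["Age"])     # single brackets with one index keep the data frame structure
class(people[["Age"]])   # double brackets return the column itself, a numeric vector
class(people$Age)        # $ also returns the column itself

identical(people[["Age"]], people$Age)   # both give exactly the same vector

# with a row and a column index, a single column is simplified to a vector
# unless drop = FALSE is given
is.data.frame(people[, "Age"])
is.data.frame(people[, "Age", drop = FALSE])
```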
# Task 14: Conditional subsetting
df[df$Age > 22, ]
df[df$ID == 2, ]
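Conditions on several columns can be combined with & and |, and rows and columns can be selected in one step. The sketch below assumes a made-up data frame people with the same columns as the example above; older_grads is a name invented for the illustration.

```r
# Hypothetical data frame, mirroring the example above
people <- data.frame(
  ID   = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age  = c(23, 34, 21),
  Grad = c(TRUE, TRUE, FALSE)
)

# rows where both conditions hold: older than 22 AND graduated
older_grads <- people[people$Age > 22 & people$Grad, ]
older_grads

# filter rows and pick columns at the same time
people[people$Age > 22, c("Name", "Age")]

# the same kind of condition can subset a single column
people$Name[people$Age < 30]
```

The comma inside the brackets matters: the condition before the comma selects rows, and leaving the part after the comma empty keeps all columns.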