Journal 2: R Prep - bcb420-2025/Keren_Zhang GitHub Wiki
Date: January 10, 2025
Estimated Time: 5 hours
Time Taken: 5.5 hours
These chapters were skipped because they present information that I have already learned.
This chapter covers simple commands, basic syntax, operators, variables, class, mode, and attributes in R.
- R Commands and Syntax: Learned how to write and debug R expressions, especially those that are nested with multiple layers of parentheses.
- Operators: Explored various types of operators including arithmetic, logical, and assignment operators. This included practical exercises in RStudio to solidify my understanding.
- Variables: Discussed best practices for naming variables and the importance of using meaningful and unique names to avoid confusion in code.
- Debugging Expressions: Initially struggled with debugging deeply nested expressions but improved through practice and by breaking down expressions into smaller parts.
- Variable Naming: Learned the importance of choosing clear and descriptive names for variables to make code more readable and maintainable.
# Testing arithmetic operators
5 + 1 / 2 # Outputs 5.5, not 3 due to operator precedence
# Logical operators
!FALSE # Outputs TRUE
# Variable assignment and comparison
a <- 5
b <- 8
a + b # Outputs 13
a == b # Outputs FALSE
# Using parentheses to force immediate evaluation and output
(numbers <- sample(1:20, 5))
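Breaking a deeply nested expression into named intermediate steps, as described in the debugging note above, might look like the following sketch. The values and the variable names (`values`, `total`, `magnitude`, `root`) are made up for illustration and do not come from the course material.

```r
# A nested expression that is hard to check at a glance
nested <- round(sqrt(abs(sum(c(-4, -9, -12)))), 2)

# The same computation, one layer of parentheses at a time
values <- c(-4, -9, -12)
total <- sum(values)         # -25
magnitude <- abs(total)      # 25
root <- sqrt(magnitude)      # 5
stepwise <- round(root, 2)   # 5

identical(nested, stepwise)  # TRUE
```

Each intermediate value can be printed and inspected on its own, which makes it easier to see which layer produces an unexpected result.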
This chapter introduces scalar and vector objects in R, explaining their creation, manipulation, and how to perform operations such as subsetting. It covers the fundamental data types in R, including scalars, vectors, matrices, data frames, and lists.
- Definition: Scalars in R are essentially vectors of length one.
- Creation: Assign a value to a variable (e.g., `x <- pi`).
- Properties: Scalars are accessed like vectors (`x[1]`), and trying to access a non-existent element returns NA (e.g., `x[2]`).
- Supported Types: Logical (TRUE, FALSE), numeric (integers, floats), and character.
- Coercion: R automatically converts data types in vectors to the most general type that accommodates all elements.
- Type Checking: Use `typeof()`, `mode()`, and `class()` to check an object's type.
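A short sketch of coercion and type checking, using small made-up vectors, may help make these points concrete:

```r
# Mixing types: every element is coerced to the most general type present
typeof(c(TRUE, 2L))             # "integer"
typeof(c(TRUE, 2L, 3.5))        # "double"
typeof(c(TRUE, 2L, 3.5, "a"))   # "character"

# A scalar is a vector of length one
x <- pi
length(x)   # 1
x[2]        # NA, because there is no second element

# Three views of the same object
typeof(1L)  # "integer"
mode(1L)    # "numeric"
class(1L)   # "integer"
```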
- Creation: Use the `c()` function to concatenate elements into a vector.
- Example: `myVec <- c(1, 1, 3, 5, 8, 13, 21)`.
- Length: Use `length(myVec)` to get the number of elements.
- Subsetting:
  - By index: `myVec[1]` or `myVec[1:4]`.
  - By name: If elements are named.
  - By boolean vectors: `myVec[myVec > 4]`.
- Vectorized Operations: Operations on vectors are performed element-wise without the need for explicit loops (e.g., `myVec + 1`).
- Example: Calculating the Fibonacci sequence and exploring properties like the golden ratio using element-wise operations on vectors.
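The Fibonacci and golden ratio exercise could be sketched roughly as follows. The vector `fib` and the named vector `sizes` are defined here for illustration only; this is not necessarily how the course exercise is written.

```r
# First eight Fibonacci numbers (illustrative vector)
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
n <- length(fib)

# Element-wise division of each number by its predecessor, no loop needed
ratios <- fib[2:n] / fib[1:(n - 1)]

# The ratios approach the golden ratio
phi <- (1 + sqrt(5)) / 2
abs(ratios[n - 1] - phi) < 0.01   # TRUE

# Subsetting by index, by logical vector, and by name
fib[1:4]       # 1 1 2 3
fib[fib > 4]   # 5 8 13 21
sizes <- c(small = 1, large = 21)
sizes["large"] # named element, value 21
```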
- Definition: Matrices are two-dimensional vectors, and higher-dimensional arrays extend this concept.
- Creation: Use `matrix()` or manipulate dimensions with `dim()`.
- Subsetting: Access elements, rows, columns, or slices using indices (e.g., `matrix[1, ]` for the first row).
- Example: Adjusting matrix dimensions and exploring matrix operations.
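As a minimal sketch of both ways of creating a matrix and of subsetting it (the names `m` and `v` are illustrative):

```r
# matrix() fills column by column by default
m <- matrix(1:6, nrow = 2)
m[1, ]    # first row: 1 3 5
m[, 2]    # second column: 3 4
m[2, 3]   # single element: 6

# Assigning dim() turns an existing vector into a matrix
v <- 1:6
dim(v) <- c(3, 2)
v[3, 1]   # 3
```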
This chapter introduces data frames in R, which are crucial for handling tabular data, especially in bioinformatics. Data frames allow for the storage and manipulation of data with heterogeneous types across different columns.
- Definition
- Data frames are table-like structures that store data with varying types across columns, analogous to spreadsheets or SQL tables.
- Creation
- Data frames are generally created by importing data from external files like CSV or TSV. They can also be constructed from other R data structures like lists or matrices through transformations.
- Loading Data
- External data is loaded into data frames using functions such as `read.table()` or `read.csv()`. For example, to load a TSV file: `read.table("data_files/plasmidData.tsv", sep="\t", header=TRUE, stringsAsFactors=FALSE)`
- Viewing Data
- The structure of a data frame can be examined using `str()` or by viewing it in RStudio's environment pane.
- Accessing Data
- Data within a data frame can be accessed by row and column through indices, names, or logical vectors.
- Modifying Data
- Data frames can be modified by adding or removing rows and columns. Data can be directly altered by reassigning values within the frame.
- Subsetting
- Subsets of data frames can be extracted using specific row and column indices.
- Appending Data
- New rows can be appended using `rbind()`, and new columns can be added using `cbind()`.
- Handling Row Names
- Row names can be managed by directly setting them or by assigning a column as row names: `rownames(df) <- df$column`
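The operations listed above can be tried on a small data frame built by hand. In the sketch below, the two plasmid rows reuse values that appear elsewhere in this journal, while the appended row (`examplePlasmid`) and the `IsLarge` column are hypothetical and exist only to demonstrate `rbind()` and `cbind()`.

```r
# Built by hand here; in the journal the table is read from a TSV file
plasmidData <- data.frame(
  Name = c("pUC19", "pACYC184"),
  Size = c(2686, 4245),
  Marker = c("ampicillin", "Tet, Cam"),
  Ori = c("ColE1", "p15A"),
  stringsAsFactors = FALSE
)

str(plasmidData)   # 2 obs. of 4 variables

# Append a row: the new row needs the same column names (hypothetical values)
newRow <- data.frame(Name = "examplePlasmid", Size = 3000,
                     Marker = "Kan", Ori = "ColE1",
                     stringsAsFactors = FALSE)
plasmidData <- rbind(plasmidData, newRow)

# Append a column computed from an existing one
plasmidData <- cbind(plasmidData, IsLarge = plasmidData$Size > 4000)

# Use the Name column as row names, then look up a value by name
rownames(plasmidData) <- plasmidData$Name
plasmidData["pUC19", "Size"]   # 2686
```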
Lists in R are versatile data structures that can hold elements of varying types and sizes, unlike matrices, which require a single data type, and data frames, whose columns must all have the same length. Lists can contain various types of data including characters, logicals, numerics, and even functions.
Lists are created using the `list()` function. Elements within a list can be accessed via their index using double square brackets `[[ ]]`, or by their names using the `$` operator if names are defined.
pUC19 <- list(size=2686, marker="ampicillin", ori="ColE1", accession="L01397", BanI=c(235, 408, 550, 1647))
pUC19[[1]] # Outputs: 2686
pUC19$ori # Outputs: "ColE1"
pUC19$BanI[2] # Outputs: 408
Lists can be nested within other lists, allowing for the creation of complex hierarchical structures. This is useful for representing databases or collections of related items.
Create and manipulate lists representing plasmid data:
- Define a new list for pACYC184 with specific attributes.
- Create a `plasmidDB` list and add multiple plasmid lists to it.
- Retrieve data and perform operations using functions like `lapply()` to process list elements.
# Adding pACYC184 to plasmidDB
plasmidDB[["pACYC184"]] <- list(size=4245, marker="Tet, Cam", ori="p15A")
# Retrieving ori elements across all plasmids
lapply(plasmidDB, function(x) { return(x$ori) })
# Outputs: $pUC19 "ColE1", $pACYC184 "p15A"
Use `lapply()` for operations over list elements, and `unlist()` to flatten lists for simpler operations like finding minimum values.
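Putting the pieces together, a self-contained sketch of the nested list with `lapply()` and `unlist()` might look like this. It assumes `plasmidDB` starts as an empty list and is filled by name, which may differ from how the course builds it.

```r
# Assumed construction: start empty and add plasmids by name
plasmidDB <- list()
plasmidDB[["pUC19"]] <- list(size = 2686, marker = "ampicillin", ori = "ColE1")
plasmidDB[["pACYC184"]] <- list(size = 4245, marker = "Tet, Cam", ori = "p15A")

# lapply() returns a list; unlist() flattens it to a named numeric vector
sizes <- unlist(lapply(plasmidDB, function(x) { return(x$size) }))

min(sizes)                      # 2686
names(sizes)[which.min(sizes)]  # "pUC19"
```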
Subsetting and filtering are essential techniques for data manipulation in R. This chapter focuses on using R's powerful syntax to effectively select and filter data.
Subsetting in R can be done using three main operators:
- `[]` - Used to extract multiple elements.
- `[[]]` - Used to extract a single element.
- `$` - Used to extract a single named element.
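The difference between the three operators is easiest to see on a list. In this small sketch (the list `p` is illustrative), single brackets keep the list wrapper, while double brackets and `$` return the element itself:

```r
p <- list(size = 2686, ori = "ColE1")

class(p["size"])     # "list": a sub-list of length one
class(p[["size"]])   # "numeric": the element itself
p$ori                # "ColE1"

# Single brackets can return several elements at once
length(p[c("size", "ori")])   # 2
```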
Subsetting a Data Frame:
# Subsetting rows
plasmidData[1, ] # Retrieves the first row
plasmidData[c(1, 2), ] # Retrieves multiple specified rows
# Subsetting by column
plasmidData[, 2] # Retrieves the second column
plasmidData[, "Name"] # Retrieves the 'Name' column using column name
# Combined row and column subsetting
plasmidData[1:2, "Name"] # Retrieves 'Name' column for the first two rows
Subsetting with Logical Vectors:
# Filtering rows based on a condition
plasmidData$Name[plasmidData$Ori != "ColE1"] # Names where 'Ori' is not 'ColE1'
# Using grep() to filter rows based on text matching
plasmidData[grep("Tet", plasmidData$Marker), ] # Rows where 'Marker' contains "Tet"
Ordering and Sorting:
# Ordering rows by the 'Size' column
plasmidData[order(plasmidData$Size), ] # Sorts data frame by 'Size'
You can also replace elements in R objects by assigning new values to subsetted elements.
x <- sample(1:10)
x[4] <- 99 # Replaces the fourth element with 99
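Replacement also works with logical subsetting, so a whole group of elements can be changed in one assignment. A minimal sketch with a fixed vector (chosen here so the result is predictable, unlike `sample()`):

```r
x <- c(3, 8, 1, 10, 6)

# Reorder using the index vector returned by order()
x[order(x)]     # 1 3 6 8 10

# Replace every element that satisfies a condition
x[x > 5] <- 0
x               # 3 0 1 0 0
```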
This chapter explores R's control structures, including conditional statements and loops, which are essential for writing efficient, conditional, and iterative code in R.
Control structures in R dictate the flow of execution and include conditions and loops. The primary structures covered in this unit are `if`, `else if`, `ifelse`, `for`, and `while`.
R uses conditional statements to execute code based on logical conditions.
- if and else if:
if (condition) {
  # code to execute if condition is TRUE
} else if (another_condition) {
  # code to execute if another condition is TRUE
} else {
  # code to execute if none of the above conditions are TRUE
}
- ifelse(): A vectorized alternative to `if` that works on vectors.
result <- ifelse(test_condition, true_value, false_value)
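As a concrete sketch of the vectorized form, `ifelse()` tests every element and returns a vector of the same length, whereas `if` expects a single TRUE or FALSE. The 4000 threshold and the labels are arbitrary choices for this example.

```r
sizes <- c(2686, 4245, 4361)

# One result per element of the test vector
labels <- ifelse(sizes > 4000, "large", "small")
labels   # "small" "large" "large"
```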
Loops in R are used for executing a block of code repeatedly.
- for Loop: Executes a block of code for a set number of times.
for (variable in sequence) {
  # code to execute
}
- while Loop: Continues to execute a block of code as long as the specified condition is TRUE.
while (condition) {
  # code to execute
}
Simple If-Else Condition:
x <- 5
if (x > 7) {
  print("x is greater than 7")
} else {
  print("x is not greater than 7")
}
For Loop Example:
for (i in 1:5) {
  print(i)
}
While Loop Example:
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}
This chapter explores the structure and design of functions in R, emphasizing how functions handle arguments and parameters, and introducing the concept of functional programming.
R functions are fundamental for conducting robust and repeatable data analysis. They allow for encapsulating logic that can be reused and help in maintaining clean code.
R functions consist of the following components:
- Arguments: Inputs to functions which can be mandatory or optional.
- Parameters: The named variables in the function definition that receive the arguments and are used inside the function body.
- Return Values: The output of a function after processing inputs.
To define a function in R, you use the function keyword followed by a set of parameters and the body of the function containing statements that define what the function does.
myFunction <- function(parameter1, parameter2) {
  result <- parameter1 + parameter2
  return(result)
}
R supports functional programming which encourages simple functions that avoid side-effects, which means they don’t alter the state outside their scope.
- Built-in Functions:
# Using the sin() function to calculate the sine of pi radians
sin(pi)
- User-defined Functions:
# A simple function to calculate the area of a rectangle
areaRectangle <- function(length, width) {
  area <- length * width
  return(area)
}
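To illustrate what avoiding side-effects means in practice, the sketch below contrasts a function that changes a variable outside its own scope with one that only returns a value. The function names `incrementGlobal` and `increment` are made up for this example.

```r
counter <- 0

# Has a side-effect: <<- changes `counter` outside the function
incrementGlobal <- function() {
  counter <<- counter + 1
  invisible(NULL)
}

# No side-effect: the result depends only on the argument
increment <- function(n) {
  return(n + 1)
}

incrementGlobal()
counter                        # 1, changed behind the caller's back
counter <- increment(counter)  # the change is explicit at the call site
counter                        # 2
```

The second form is easier to test and reason about because everything the function touches is visible in its call.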
Task 24: Writing Functions
Write a function that calculates the square of a number and use it to find the square of numbers from 1 to 5.
squareNumber <- function(x) {
  return(x^2)
}
# Using sapply to apply the function to numbers 1 through 5
sapply(1:5, squareNumber)
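Because `^` is itself vectorized, the same function can also be called on the whole vector directly. A brief sketch comparing the two calls:

```r
squareNumber <- function(x) {
  return(x^2)
}

viaSapply <- sapply(1:5, squareNumber)   # 1 4 9 16 25
direct <- squareNumber(1:5)              # 1 4 9 16 25

identical(viaSapply, direct)             # TRUE
```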
Chapter 13 delves into the use of complex numbers in R and explores the function seq() for generating sequences of numbers, illustrating practical uses and handling of missing parameters in functions.
Complex numbers in R can be managed using the complex() function, which allows setting real and imaginary parts:
# Euler's identity example
exp(complex(i = pi, 1, 0))
[1] -1+1.224647e-16i
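The call above relies on R's argument matching: `i` partially matches the `imaginary` parameter, and the two unnamed values are then assigned in order to the remaining leading parameters, `length.out` and `real`. The following sketch spells the same call out with full argument names (the variable `z` is illustrative):

```r
# Same number, with every argument named in full
z <- complex(real = 0, imaginary = pi)
z == complex(i = pi, 1, 0)   # TRUE

# exp(i * pi) + 1 is zero up to floating-point rounding
Mod(exp(z) + 1) < 1e-15      # TRUE
```

The tiny imaginary part in the printed result comes from floating-point rounding of pi, not from the identity itself.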
- Task 24:
- Generate a sequence of integers from -5 to 3 without using argument names:
seq(-5, 3)
- Generate a sequence from -2 to 2 in intervals of 1/3 using argument names:
seq(from = -2, to = 2, by = 1/3)
- Create a sequence of 30 numbers between 1 and 100, passing named arguments in a non-default order:
seq(length.out = 30, to = 100, from = 1)
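A quick way to check each answer is to look at the length and end points of the result. Note that an unnamed call such as `seq(30, 100, 1)` is matched positionally as `from`, `to`, `by`, so it counts from 30 to 100 in steps of 1; to get exactly 30 values between 1 and 100, `length.out` has to be named. The variable names below are illustrative.

```r
a <- seq(-5, 3)
length(a)          # 9

b <- seq(from = -2, to = 2, by = 1/3)
length(b)          # 13

# Named arguments can be given in any order
d <- seq(length.out = 30, to = 100, from = 1)
length(d)          # 30
range(d)           # 1 100

# Positional matching: from = 30, to = 100, by = 1
length(seq(30, 100, 1))   # 71
```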
A common issue in function usage is dealing with missing parameters. R can either throw an error or use default values if specified.
This function calculates either the smaller or larger golden ratio pair of a number based on a Boolean parameter.
goldenRatio <- function(x, smaller = TRUE) {
  phi <- (1 + sqrt(5)) / 2
  if (smaller) {
    return(x / phi)
  } else {
    return(x * phi)
  }
}
Usage:
goldenRatio(1)
# [1] 0.618034
goldenRatio(1, smaller = FALSE)
# [1] 1.618034
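Alternatively, the parameter can be declared without a default and the default assigned inside the function after checking with missing():
goldenRatio <- function(x, smaller) {
  if (missing(smaller)) {
    smaller <- TRUE
  }
  phi <- (1 + sqrt(5)) / 2
  if (smaller == TRUE) {
    return(x / phi)
  } else {
    return(x * phi)
  }
}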
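For contrast, the sketch below shows what happens when a parameter has no default and the function does not check for it: R raises an error as soon as the missing argument is used. `strictRatio` is a hypothetical variant written for this example, and `tryCatch()` is used only so that the error message can be captured instead of stopping the script.

```r
# No default and no missing() check
strictRatio <- function(x, smaller) {
  phi <- (1 + sqrt(5)) / 2
  if (smaller == TRUE) {
    return(x / phi)
  } else {
    return(x * phi)
  }
}

# Capture the error message instead of halting
outcome <- tryCatch(
  strictRatio(1),
  error = function(e) conditionMessage(e)
)
outcome   # the message reports that argument "smaller" is missing
```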
R users can view the source code of functions by simply typing the function name without parentheses. For more complex functions, such as those involving S3 methods or primitives, users can explore code using methods() and getAnywhere() functions.
- Built-in Functions:
# Viewing the definition of a built-in function
seq
- S3 Methods and Primitives:
# Finding methods associated with a function
methods(seq)
# Accessing the source code of a method
getAnywhere(seq.default)
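Whether typing a function's name shows useful R code depends on the kind of function. The following sketch uses `is.primitive()` to tell the cases apart; the choice of `sum` and `seq` as examples is mine.

```r
# Primitives are implemented in C, so printing them shows no R source
is.primitive(sum)   # TRUE

# seq is an ordinary R function that dispatches to S3 methods
is.primitive(seq)   # FALSE

# getAnywhere() returns the matching objects in its $objs component
found <- getAnywhere("seq.default")
is.function(found$objs[[1]])   # TRUE
```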
Writing functions in R allows for more modular, reusable, and manageable code. Functions should be written clearly with well-defined inputs (parameters) and outputs (return values).
Write a function to perform a countdown from a given number to zero.
countDown <- function(start) {
  while (start >= 0) {
    print(start)
    start <- start - 1
  }
  print("Lift Off!")
}
Example usage:
countDown(5)
Chapter 14 introduces the basic concepts and techniques for creating graphics in R, focusing on data visualization and descriptive statistics.
?plot
x <- rnorm(200)
y <- x^3 * 0.25 + rnorm(200, 0, 0.75)
plot(x, y)
rug(x)
rug(y, side=2, col="red")
?barplot
barplot(table(round(y)))
?hist
set.seed(12357)
x <- rnorm(50)
hist(x, breaks=5)
stripchart(x, pch="|", add=TRUE, col="red3", xlim=c(-3, 3), at=-0.5)
?boxplot
x <- rnorm(200)
m <- cbind(x, x^2, x^3, x^4, x^5)
boxplot(log(abs(m)))
This section explains how to specify colors in R using numbers, names, hex triplets, and palettes.
barplot(rep(1,9), col=0:8, axes=FALSE, names.arg=c(0:8))
colors() # Lists all 657 named colors available in R.
# Examples of using hex colors with transparency
plot(x, y, pch = 19, col = "#EE3A8C12")
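The eight-digit hex string used above is a red, green, blue triplet followed by a two-digit alpha (opacity) value. As a sketch of how such a string relates to its components, `rgb()` can build it and `col2rgb()` can take it apart again; the variable names are illustrative.

```r
# Build the color from its components on a 0-255 scale
hexAlpha <- rgb(238, 58, 140, alpha = 18, maxColorValue = 255)
hexAlpha   # "#EE3A8C12"

# Decompose a hex string back into red, green, blue, and alpha
parts <- col2rgb("#EE3A8C12", alpha = TRUE)
parts["alpha", 1]   # 18, i.e. about 7% opaque

# Number of named colors
length(colors())    # 657
```

A low alpha value like this is useful for scatterplots with many overlapping points, since dense regions appear darker.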
This section explains how to customize lines in plots using 'lty' for line types and 'lwd' for line widths.
plot(c(0,10), c(0,10), type = "n", axes = FALSE, xlab = "", ylab = "")
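for (i in 1:8) {
  # yPos rather than y, so the data vector y used by the other plots is not overwritten
  yPos <- 10.5 - (i / 2)
  segments(1, yPos, 5, yPos, lty = i)
  text(6, yPos, paste("lty = ", i), col = "grey60", adj = 0, cex = 0.75)
}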
This section covers how to manage plot layouts with the 'par()' function, including multiple plots per window and margins.
opar <- par(bg="steelblue", fg="lightyellow")
plot(x, y, col.axis="lightyellow", col.lab="lightyellow")
par(opar)
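The same save-and-restore pattern applies to layout settings. The sketch below places two plots side by side with `mfrow` and adjusts the margins with `mar`; the specific margin values and plot titles are arbitrary choices for this example.

```r
set.seed(12357)   # fixed seed so the plots are reproducible
x <- rnorm(200)

# par() returns the previous values of the settings it changes
opar <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")

# Restore the previous layout and margins
par(opar)
```

Restoring `opar` afterwards keeps later plots from inheriting the two-panel layout.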
- Readability over Brevity
- Write code that is easy to read and understand. Avoid overly complex expressions.
- Consistent Style
- Adhere to a consistent style in naming, syntax, and layout to enhance readability and maintainability.
- Commenting
- Use comments to explain the "why" behind the code, not just the "what".
- Line Length
- Keep lines to a maximum of 80 characters.
- Indentation
- Use spaces for indentation to align elements within the code clearly.
- Descriptive Names
- Use meaningful and descriptive names for variables and functions.
- Case Usage
- Employ camelCase for variables and functions to maintain readability and consistency.
- Clear Function Definitions
- Functions should have clear, concise, and descriptive headers. Document the purpose, parameters, and return values.
- Global Variables
- Define global variables at the start of the script and use capital letters (e.g., MAX_WIDTH).
- Explicit Conditions
- Use explicit conditions in if statements to make the logic clear and understandable.
- Handling False Cases
- Always define the behavior for the else case in conditionals, even if it is to do nothing.
- Avoid Global Modifications
- Refrain from using <<- for global assignments and from calling set.seed() within functions.
- Library Management
- Use specific function calls (e.g., package::function()) instead of loading entire libraries.
- Pre-allocation
- Pre-allocate memory for vectors and matrices to improve performance.
- Avoid Dynamic Expansion
- Avoid dynamically resizing data structures within loops.
- Code Reviews
- Regularly review and refactor code to improve quality and efficiency.
- Version Control
- Use version control systems like Git to manage changes and collaborate effectively.
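Two of the guidelines above, pre-allocation and explicit `package::function()` calls, can be illustrated with a short sketch. The vector length and variable names are arbitrary; both loops compute the same values, but the first copies the vector on every iteration while the second writes into memory that already exists.

```r
n <- 1000

# Dynamic expansion: the vector is reallocated in every iteration
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Pre-allocation: create the full-length vector once, then fill it
prealloc <- numeric(n)
for (i in seq_len(n)) {
  prealloc[i] <- i^2
}

identical(grown, prealloc)   # TRUE

# Calling a function through its namespace instead of attaching the package
stats::median(prealloc)
```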