Journal 2: R Prep - bcb420-2025/Keren_Zhang GitHub Wiki

Time

Date: January 10, 2025

Estimated Time: 5 hours

Time Taken: 5.5 hours

Chapter 1-5

These chapters are skipped because they present material that I have already learned.

Chapter 6: Basics of R Syntax

This chapter covers simple commands, basic syntax, operators, variables, class, mode, and attributes in R.

Key Points

  • R Commands and Syntax: Learned how to write and debug R expressions, especially those that are nested with multiple layers of parentheses.
  • Operators: Explored various types of operators including arithmetic, logical, and assignment operators. This included practical exercises in RStudio to solidify my understanding.
  • Variables: Discussed best practices for naming variables and the importance of using meaningful and unique names to avoid confusion in code.

Challenges Encountered

  • Debugging Expressions: Initially struggled with debugging deeply nested expressions but improved through practice and by breaking down expressions into smaller parts.
  • Variable Naming: Learned the importance of choosing clear and descriptive names for variables to make code more readable and maintainable.
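
A minimal sketch of the breaking-down approach described above. The expression and the variable names (values, squares, total, root) are made up for illustration and are not taken from the course material.

```r
# A nested expression that is hard to check at a glance
result <- round(sqrt(sum((c(3, 4, 12))^2)), digits = 1)

# The same computation, one layer of parentheses at a time
values   <- c(3, 4, 12)
squares  <- values^2             # 9 16 144
total    <- sum(squares)         # 169
root     <- sqrt(total)          # 13
stepwise <- round(root, digits = 1)

identical(result, stepwise)      # TRUE
```

Each intermediate variable can be printed on its own, which makes it easier to see which layer produces an unexpected value.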

Examples Tried

# Testing arithmetic operators
5 + 1 / 2  # Outputs 5.5, not 3 due to operator precedence

# Logical operators
!FALSE  # Outputs TRUE

# Variable assignment and comparison
a <- 5
b <- 8
a + b  # Outputs 13
a == b # Outputs FALSE

# Using parentheses to force immediate evaluation and output
(numbers <- sample(1:20, 5))

Chapter 7: Scalars and Vectors

This chapter introduces scalar and vector objects in R, explaining their creation, manipulation, and how to perform operations such as subsetting. It covers the fundamental data types in R, including scalars, vectors, matrices, data frames, and lists.

Scalars

  • Definition: Scalars in R are essentially vectors of length one.
  • Creation: Assign a value to a variable (e.g., x <- pi).
  • Properties: Scalars are accessed like vectors (x[1]), and trying to access a non-existent element returns NA (e.g., x[2]).
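
The scalar properties listed above can be checked directly. This is a small sketch using only base R:

```r
x <- pi

length(x)   # 1: a scalar is a vector of length one
x[1]        # the same value as x
x[2]        # NA: there is no second element
```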

Data Types and Coercion

  • Supported Types: Logical (TRUE, FALSE), numeric (integers, floats), and character.
  • Coercion: R automatically converts data types in vectors to the most general type that accommodates all elements.
  • Type Checking: Use typeof(), mode(), and class() to check an object's type.
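
A short sketch of coercion and type checking. The vectors v1 to v3 are made-up examples, and each added element forces a more general type:

```r
v1 <- c(TRUE, 2L)                 # logical + integer
v2 <- c(TRUE, 2L, 3.5)            # ... + double
v3 <- c(TRUE, 2L, 3.5, "four")    # ... + character

typeof(v1)   # "integer"
typeof(v2)   # "double"
typeof(v3)   # "character"
v3           # "TRUE" "2" "3.5" "four"

class(v2)    # "numeric"
mode(v2)     # "numeric"
```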

Vectors

  • Creation: Use the c() function to concatenate elements into a vector.
  • Example: myVec <- c(1, 1, 3, 5, 8, 13, 21).
  • Length: Use length(myVec) to get the number of elements.
  • Subsetting:
    • By index: myVec[1] or myVec[1:4].
    • By name: If elements are named.
    • By boolean vectors: myVec[myVec > 4].
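
The three subsetting styles side by side, as a sketch. The element names "a" to "g" are hypothetical and were added only so that subsetting by name can be shown:

```r
myVec <- c(1, 1, 3, 5, 8, 13, 21)
names(myVec) <- c("a", "b", "c", "d", "e", "f", "g")  # hypothetical names

myVec[1:4]         # by index: the first four elements
myVec["f"]         # by name: the element named "f" (13)
myVec[myVec > 4]   # by logical vector: 5 8 13 21
myVec[-1]          # a negative index drops the first element
```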

Vector Operations

  • Vectorized Operations: Operations on vectors are performed element-wise without the need for explicit loops (e.g., myVec + 1).
  • Example: Calculating the Fibonacci sequence and exploring properties like the golden ratio using element-wise operations on vectors.
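
One possible way to explore the golden ratio with element-wise operations. This is my own sketch and not necessarily the course's exact code; fib, ratios, and phi are names chosen here:

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
n <- length(fib)

fib + 1                       # element-wise addition, no loop needed

# Divide each number by its predecessor in one vectorized step
ratios <- fib[-1] / fib[-n]
ratios

phi <- (1 + sqrt(5)) / 2      # golden ratio, about 1.618
abs(ratios[n - 1] - phi)      # small: the ratios approach phi
```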

Matrices and Higher-dimensional Objects

  • Definition: Matrices are two-dimensional vectors, and higher-dimensional arrays extend this concept.
  • Creation: Use matrix() or manipulate dimensions with dim().
  • Subsetting: Access elements, rows, columns, or slices using indices (e.g., matrix[1, ] for the first row).
  • Example: Adjusting matrix dimensions and exploring matrix operations.
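
A small sketch of matrix creation and subsetting. The names m and v are arbitrary:

```r
m <- matrix(1:6, nrow = 2)   # filled column by column
m
dim(m)      # 2 3
m[1, ]      # first row: 1 3 5
m[, 2]      # second column: 3 4
m[2, 3]     # single element: 6

v <- 1:6
dim(v) <- c(3, 2)   # setting dim() turns the vector into a 3 x 2 matrix
v
```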

Chapter 8: Data Frames

Overview

This chapter introduces data frames in R, which are crucial for handling tabular data, especially in bioinformatics. Data frames allow for the storage and manipulation of data with heterogeneous types across different columns.

Data Frames

Definition
Data frames are table-like structures that store data with varying types across columns, analogous to spreadsheets or SQL tables.
Creation
Data frames are generally created by importing data from external files like CSV or TSV. They can also be constructed from other R data structures like lists or matrices through transformations.

Basic Operations

Loading Data
External data is loaded into data frames using functions such as read.table() or read.csv(). For example, to load a TSV file:
read.table("data_files/plasmidData.tsv", sep="\t", header=TRUE, stringsAsFactors=FALSE)
Viewing Data
The structure of a data frame can be examined using str() or by viewing it in RStudio's environment pane.
Accessing Data
Data within a data frame can be accessed by row and column through indices, names, or logical vectors.
Modifying Data
Data frames can be modified by adding or removing rows and columns. Data can be directly altered by reassigning values within the frame.
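
A sketch of the operations above on a small data frame built in memory rather than read from a file. The table contents and the SizeKb column are illustrative assumptions, not the contents of plasmidData.tsv:

```r
# Hypothetical plasmid table (illustrative values)
plasmidData <- data.frame(
  Name   = c("pUC19", "pBR322", "pACYC184"),
  Size   = c(2686, 4361, 4245),
  Marker = c("Amp", "Amp, Tet", "Tet, Cam"),
  Ori    = c("ColE1", "pMB1", "p15A"),
  stringsAsFactors = FALSE
)

str(plasmidData)                          # examine the structure

plasmidData[2, "Size"]                    # access by row index and column name
plasmidData$Name                          # access a whole column
plasmidData[plasmidData$Size > 4000, ]    # access rows with a logical vector

plasmidData$SizeKb <- plasmidData$Size / 1000   # add a column
plasmidData[1, "Marker"] <- "Ampicillin"        # reassign a single value
plasmidData$SizeKb <- NULL                      # remove the column again
```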

Manipulating Data Frames

Subsetting
Subsets of data frames can be extracted using specific row and column indices.
Appending Data
New rows can be appended using rbind(), and new columns can be added using cbind().
Handling Row Names
Row names can be managed by directly setting them or by assigning a column as row names:
rownames(df) <- df$column
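
The appending and row-name steps can be sketched as follows. The data frame df and its values are made up for illustration:

```r
df <- data.frame(Name = c("pUC19", "pACYC184"),
                 Size = c(2686, 4245),
                 stringsAsFactors = FALSE)

# Append a row: the new row needs the same column names
newRow <- data.frame(Name = "pBR322", Size = 4361, stringsAsFactors = FALSE)
df <- rbind(df, newRow)

# Append a column
df <- cbind(df, Ori = c("ColE1", "p15A", "pMB1"))

# Use the Name column as row names, then select a row by name
rownames(df) <- df$Name
df["pBR322", "Size"]   # 4361
```

Row names must be unique, so this only works when the chosen column has no duplicated values.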

Chapter 9: Lists

Lists in R are versatile data structures that can hold elements of varying types and sizes, unlike vectors and matrices, which require all elements to be of the same type, and data frames, which require all columns to have the same length. Lists can contain various types of data including characters, booleans, numerics, and even functions.

Creating and Accessing Lists

Lists are created using the list() function. Elements within a list can be accessed via their index using double square brackets [[ ]], or by their names using the $ operator if names are defined.

Example

pUC19 <- list(size=2686, marker="ampicillin", ori="ColE1", accession="L01397", BanI=c(235, 408, 550, 1647))
# Access elements:
pUC19[[1]]  # Outputs: 2686
pUC19$ori   # Outputs: "ColE1"
pUC19$BanI[2]  # Outputs: 408

Nested Lists

Lists can be nested within other lists, allowing for the creation of complex hierarchical structures. This is useful for representing databases or collections of related items.

Example Task

Create and manipulate lists representing plasmid data:

  • Define a new list for pACYC184 with specific attributes.
  • Create a plasmidDB list and add multiple plasmid lists to it.
  • Retrieve data and perform operations using functions like lapply() to process list elements.

Manipulation Example

# Starting plasmidDB with pUC19 (defined above), then adding pACYC184
plasmidDB <- list(pUC19 = pUC19)
plasmidDB[["pACYC184"]] <- list(size=4245, marker="Tet, Cam", ori="p15A")

# Retrieving ori elements across all plasmids
lapply(plasmidDB, function(x) { return(x$ori) })
# Outputs: $pUC19 "ColE1", $pACYC184 "p15A"

Retrieving Data

Use lapply() for operations over list elements, and unlist() to flatten lists for simpler operations like finding minimum values.
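
A minimal sketch of the lapply() plus unlist() pattern for finding the smallest plasmid. The plasmid entries are repeated here, in shortened form, so the snippet runs on its own:

```r
pUC19     <- list(size = 2686, marker = "ampicillin", ori = "ColE1")
pACYC184  <- list(size = 4245, marker = "Tet, Cam", ori = "p15A")
plasmidDB <- list(pUC19 = pUC19, pACYC184 = pACYC184)

# lapply() returns a list; unlist() flattens it into a named vector
sizes <- unlist(lapply(plasmidDB, function(x) x$size))
sizes

min(sizes)                       # 2686
names(sizes)[which.min(sizes)]   # "pUC19"
```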

Chapter 10: Subsetting and Filtering R Objects

Subsetting and filtering are essential techniques for data manipulation in R. This chapter focuses on using R's powerful syntax to effectively select and filter data.

Subsetting Principles

Subsetting in R can be done using three main operators:

  • [] - Used to extract multiple elements.
  • [[]] - Used to extract a single element.
  • $ - Used to extract a single named element.
These operators are versatile and can be used to access elements, rows, and columns within vectors, lists, and data frames.
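
The difference between the three operators is easiest to see on a list. A short sketch, using a shortened version of the pUC19 list from Chapter 9:

```r
pUC19 <- list(size = 2686, ori = "ColE1", BanI = c(235, 408, 550, 1647))

pUC19["size"]             # [ ] returns a list of length one
class(pUC19["size"])      # "list"
pUC19[["size"]]           # [[ ]] returns the element itself: 2686
pUC19$size                # $ is shorthand for [["size"]]
pUC19[c("size", "ori")]   # [ ] can return several elements at once
```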

Examples of Subsetting

Subsetting a Data Frame:

# Subsetting rows
plasmidData[1, ]  # Retrieves the first row
plasmidData[c(1, 2), ]  # Retrieves multiple specified rows

# Subsetting by column
plasmidData[, 2]  # Retrieves the second column
plasmidData[, "Name"]  # Retrieves the 'Name' column using column name

# Combined row and column subsetting
plasmidData[1:2, "Name"]  # Retrieves 'Name' column for the first two rows

Subsetting with Logical Vectors:

# Filtering rows based on a condition
plasmidData$Name[plasmidData$Ori != "ColE1"]  # Names where 'Ori' is not 'ColE1'

# Using grep() to filter rows based on text matching
plasmidData[grep("Tet", plasmidData$Marker), ]  # Rows where 'Marker' contains "Tet"

Ordering and Sorting:

# Ordering rows by the 'Size' column
plasmidData[order(plasmidData$Size), ]  # Sorts data frame by 'Size'

Replacing Elements

You can also replace elements in R objects by assigning new values to subsetted elements.

x <- sample(1:10)
x[4] <- 99  # Replaces the fourth element with 99

Chapter 11: Control Structures of R

This chapter explores R's control structures, including conditional statements and loops, which are essential for writing efficient, conditional, and iterative code in R.

Control Structures

Control structures in R dictate the flow of execution and include conditions and loops. The primary structures covered in this unit are `if`, `else if`, `ifelse`, `for`, and `while`.

Conditional Statements

R uses conditional statements to execute code based on logical conditions.

  • if and else if:
if (condition) {
  # code to execute if condition is TRUE
} else if (another_condition) {
  # code to execute if another condition is TRUE
} else {
  # code to execute if none of the above conditions are TRUE
}
  • ifelse(): A vectorized alternative to `if` that works on vectors.
result <- ifelse(test_condition, true_value, false_value)
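
A short comparison of if and ifelse(), as a sketch. The sizes vector and the "large"/"small" labels are made up for illustration:

```r
sizes <- c(2686, 4245, 4361)

# if expects a single TRUE/FALSE, so it suits one value at a time
if (sizes[1] > 4000) {
  label <- "large"
} else {
  label <- "small"
}
label                                    # "small"

# ifelse() tests every element and returns a vector of the same length
labels <- ifelse(sizes > 4000, "large", "small")
labels                                   # "small" "large" "large"
```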

Loops

Loops in R are used for executing a block of code repeatedly.

  • for Loop: Executes a block of code once for each element of a sequence.
for (variable in sequence) {
  # code to execute
}
  • while Loop: Continues to execute a block of code as long as the specified condition is TRUE.
while (condition) {
  # code to execute
}

Examples

Simple If-Else Condition:

x <- 5
if (x > 7) {
  print("x is greater than 7")
} else {
  print("x is not greater than 7")
}

For Loop Example:

for (i in 1:5) {
  print(i)
}

While Loop Example:

count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}

Chapter 12: R Functions

This chapter explores the structure and design of functions in R, emphasizing how functions handle arguments and parameters, and introducing the concept of functional programming.

Anatomy of R Functions

R functions are fundamental for conducting robust and repeatable data analysis. They allow for encapsulating logic that can be reused and help in maintaining clean code.

Basics of R Functions

R functions consist of the following components:

  • Parameters: The named variables in the function definition; they can be mandatory or optional (given a default value).
  • Arguments: The actual values passed to those parameters when the function is called.
  • Return Values: The output of a function after processing inputs.

Creating Functions

To define a function in R, you use the function keyword followed by a set of parameters and the body of the function containing statements that define what the function does.

myFunction <- function(parameter1, parameter2) {
  result <- parameter1 + parameter2
  return(result)
}

Functional Programming

R supports functional programming which encourages simple functions that avoid side-effects, which means they don’t alter the state outside their scope.
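
To make "side-effects" concrete, here is a sketch contrasting a function that changes a global variable with one that does not. The names counter, incrementGlobal, and increment are hypothetical:

```r
counter <- 0

# Side effect: modifies a variable outside the function (discouraged)
incrementGlobal <- function() {
  counter <<- counter + 1
}

# No side effect: the result depends only on the input
increment <- function(x) {
  return(x + 1)
}

incrementGlobal()
counter                         # 1: global state was changed by the call

counter <- increment(counter)   # the change is explicit at the call site
counter                         # 2
```

With the second style, reading the calling code is enough to see which variables change.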

Example: Using Built-in and Custom Functions

  • Built-in Functions:
# Using the sin() function to calculate the sine of pi radians
sin(pi)
  • User-defined Functions:
# A simple function to calculate the area of a rectangle
areaRectangle <- function(length, width) {
  area <- length * width
  return(area)
}

Practical Application

Task 24: Writing Functions. Write a function that calculates the square of a number and use it to find the square of numbers from 1 to 5.

squareNumber <- function(x) {
  return(x^2)
}

# Using sapply to apply the function to numbers 1 through 5
sapply(1:5, squareNumber)

Chapter 13: Handling Complex Numbers and Using Functions

Chapter 13 delves into the use of complex numbers in R and explores the function seq() for generating sequences of numbers, illustrating practical uses and handling of missing parameters in functions.

Working with Complex Numbers

Complex numbers in R can be managed using the complex() function, which allows setting real and imaginary parts:

# Euler's identity example: i = pi partially matches the "imaginary" argument;
# the unnamed 1 and 0 fill length.out and real
exp(complex(i = pi, 1, 0))
[1] -1+1.224647e-16i

Tasks Involving seq() Function

  • Task 24:
    • Generate a sequence of integers from -5 to 3 without using argument names:
seq(-5, 3)
    • Generate a sequence from -2 to 2 in intervals of 1/3 using argument names:
seq(from = -2, to = 2, by = 1/3)
    • Create a sequence of 30 numbers between 1 and 100, passing the named arguments in a non-default order:
seq(length.out = 30, to = 100, from = 1)

Handling Missing Parameters in Functions

A common issue in function usage is dealing with missing parameters. R can either throw an error or use default values if specified.

Example: Golden Ratio Function

This function calculates either the smaller or larger golden ratio pair of a number based on a Boolean parameter.

goldenRatio <- function(x, smaller = TRUE) {
  phi <- (1 + sqrt(5)) / 2
  if (smaller) {
    return(x / phi)
  } else {
    return(x * phi)
  }
}
# Usage:
goldenRatio(1)
[1] 0.618034
goldenRatio(1, smaller = FALSE)
[1] 1.618034
# Using the missing() function to handle missing arguments dynamically:
goldenRatio <- function(x, smaller) {
  if (missing(smaller)) {
    smaller <- TRUE
  }
  phi <- (1 + sqrt(5)) / 2
  if (smaller == TRUE) {
    return(x / phi)
  } else {
    return(x * phi)
  }
}

Advanced Usage: Viewing and Writing Functions

R users can view the source code of functions by simply typing the function name without parentheses. For more complex functions, such as those involving S3 methods or primitives, users can explore code using methods() and getAnywhere() functions.

Exploring Function Definitions

  • Built-in Functions:
# Viewing the definition of a built-in function
seq
  • S3 Methods and Primitives:
# Finding methods associated with a function
methods(seq)
# Accessing the source code of a method
getAnywhere(seq.default)

Practical Applications: Writing Your Own Functions

Writing functions in R allows for more modular, reusable, and manageable code. Functions should be written clearly with well-defined inputs (parameters) and outputs (return values).

Task 25: Countdown Function

Write a function to perform a countdown from a given number to zero.

countDown <- function(start) {
  while (start >= 0) {
    print(start)
    start <- start - 1
  }
  print("Lift Off!")
}
# Example usage:
countDown(5)

Chapter 14: Introduction to R Plots

Chapter 14 introduces the basic concepts and techniques for creating graphics in R, focusing on data visualization and descriptive statistics.

plot()

?plot
x <- rnorm(200)
y <- x^3 * 0.25 + rnorm(200, 0, 0.75)
plot(x, y)
rug(x)
rug(y, side=2, col="red")

barplot()

?barplot
barplot(table(round(y)))

hist()

?hist
set.seed(12357)
x <- rnorm(50)
hist(x, breaks=5)
stripchart(x, pch="|", add=TRUE, col="red3", xlim=c(-3, 3), at=-0.5)
The histogram uses a custom number of breaks, and the stripchart adds the actual values as tick marks.

boxplot()

?boxplot
x <- rnorm(200)
m <- cbind(x, x^2, x^3, x^4, x^5)
boxplot(log(abs(m)))

Colour in Plots

Explains color specifications in R using numbers, names, hex-triplets, and through palettes.

Colours by Number

barplot(rep(1,9), col=0:8, axes=FALSE, names.arg=c(0:8))
Shows usage of basic color numbers.

Colours by Name

colors()  # Lists all 657 named colors available in R.
Details on how to use named colors like "peachpuff" and "firebrick".

Colours as Hex-Triplets

# Examples of using hex colors with transparency
plot(x, y, pch = 19, col = "#EE3A8C12")
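
Named colours and hex-triplets describe the same red, green, and blue values, which col2rgb() and rgb() from grDevices can show. A small sketch:

```r
# A named colour broken down into its red, green, and blue components
col2rgb("firebrick")                      # 178, 34, 34

# The same components written back as a hex-triplet
rgb(178, 34, 34, maxColorValue = 255)     # "#B22222"

# An eight-digit hex value carries transparency in the last two digits
col2rgb("#EE3A8C12", alpha = TRUE)        # 238, 58, 140, alpha 18
```

An alpha of 18 out of 255 is almost fully transparent, which is why overlapping points in the plot above build up into darker regions.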

Lines

Explains how to customize lines in plots using 'lty' for line types and 'lwd' for line widths.

plot(c(0,10), c(0,10), type = "n", axes = FALSE, xlab = "", ylab = "")
for (i in 1:8) {
    y <- 10.5-(i/2)
    segments(1, y, 5, y, lty=i)
    text(6, y, paste("lty = ", i), col="grey60", adj=0, cex=0.75)
}

Layout

Details on how to manage plot layouts using 'par()' function, controlling multiple plot windows, and margins.

opar <- par(bg="steelblue", fg="lightyellow")
plot(x, y, col.axis="lightyellow", col.lab="lightyellow")
par(opar)

Chapter 15: R Coding Style

General Principles

Readability over Brevity
Write code that is easy to read and understand. Avoid overly complex expressions.
Consistent Style
Adhere to a consistent style in naming, syntax, and layout to enhance readability and maintainability.
Commenting
Use comments to explain the "why" behind the code, not just the "what".

Layout and Formatting

Line Length
Keep lines to a maximum of 80 characters.
Indentation
Use spaces for indentation to align elements within the code clearly.

Naming Conventions

Descriptive Names
Use meaningful and descriptive names for variables and functions.
Case Usage
Employ camelCase for variables and functions to maintain readability and consistency.

Functions and Global Variables

Clear Function Definitions
Functions should have clear, concise, and descriptive headers. Document the purpose, parameters, and return values.
Global Variables
Define global variables at the start of the script and use capital letters (e.g., MAX_WIDTH).

Conditional Logic

Explicit Conditions
Use explicit conditions in if statements to make the logic clear and understandable.
Handling False Cases
Always define the behavior for the else case in conditionals, even if it is to do nothing.

Best Practices

Avoid Global Modifications
Refrain from using <<- for global assignments and set.seed() within functions.
Library Management
Use specific function calls (e.g., package::function()) instead of loading entire libraries.
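
For example, functions from the stats and utils packages can be called with an explicit prefix. The vector x is made up for illustration:

```r
x <- c(5, 1, 9, 3)

stats::median(x)     # 4
utils::head(x, 2)    # 5 1
```

The prefix makes it clear which package a function comes from and avoids name clashes between packages.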

Efficiency Tips

Pre-allocation
Pre-allocate memory for vectors and matrices to improve performance.
Avoid Dynamic Expansion
Avoid dynamically resizing data structures within loops.
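
A sketch of the difference between growing a vector and pre-allocating it. The names n, grown, and filled are arbitrary:

```r
n <- 1000

# Growing inside the loop: a new, longer vector is built on every iteration
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i^2)
}

# Pre-allocating: the vector is created once and then filled by index
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

identical(grown, filled)   # TRUE
```

Both loops give the same result; the pre-allocated version avoids the repeated copying, which matters more as n grows.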

Debugging and Maintenance

Code Reviews
Regularly review and refactor code to improve quality and efficiency.
Version Control
Use version control systems like Git to manage changes and collaborate effectively.