Introduction to R - biobakery/biobakery GitHub Wiki
R is a language and environment for statistical computing and graphics. R and its packages implement a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and the language is highly extensible.
- Vast capabilities, a wide range of statistical and graphical techniques
- Excellent community support: mailing list, blogs, tutorials
- Easy to extend by writing new functions
- 1. Lab Setup and basics in R
- 2. Working with Data in R
- 3. Exporting data
- 4. Basic statistics
- 5. Simple graphs
- 6. tidyverse
- Option: Use RStudio (free and open-source integrated development environment for R)
  - Start the RStudio program.
  - Open a new document and save the file.
  - The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
  - The window in the lower-left is the R console. This is where results will be displayed.
- Option: Use the terminal. This is what I am going to be using today:
Start R session:
R

End R session:
q()

Let's start by getting comfortable in the R environment:
The user interacts with R by inputting commands at the prompt (>). We did so above
by using the sessionInfo() command. We can also, for example, ask R to do basic
calculations for us:
1 + 1
[1] 2
Additional operators include -, *, /, and ^ (exponentiation). As an
example, the following command calculates the (approximate) area of a circle with
a radius of 2:
3.14 * 2^2
[1] 12.56
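When several operators appear in one command, R applies them in a fixed order. The short sketch below illustrates this with made-up numbers; the variable names are only for illustration.

```r
# Multiplication is applied before addition.
a = 2 + 3 * 4        # 14
# Parentheses override the default order.
b = (2 + 3) * 4      # 20
# Exponentiation is applied before the unary minus, so this is -(2^2).
d = -2^2             # -4
# Integer division and remainder.
q = 7 %/% 2          # 3
m = 7 %% 2           # 1
print(c(a, b, d, q, m))
```

If in doubt about the order, adding parentheses makes the intent explicit.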
Install and load a package. From experience, if you receive the error could not find function, make sure you have loaded the library! (I have made this mistake one too many times.)
install.packages("ggplot2")
library(ggplot2)
Alternative: In RStudio, go to the "Packages" tab and click the "Install" button. Search for the package in the pop-up window and click "Install".
help(help)
help(sqrt)
?sqrt
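Besides help() and ?, a few other built-in functions can be handy when you do not remember exactly what a function is called or which arguments it takes. A minimal sketch (the choice of sd and mean as examples is arbitrary):

```r
# Show the arguments of a function without opening the full help page.
args(sd)
# List objects whose names contain "mean".
found = apropos("mean")
head(found)
# Run the examples that ship with a help page.
example(mean)
# In an interactive session, ??median searches all installed help pages by keyword.
```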
You can create variables in R: named units that store values. You can refer to a variable later to examine or reuse its stored value:
r = 2
In the above command, I created a variable named r, and assigned the value 2
to it (using the = operator). Note that the above command didn't prompt R to
generate any output messages; the assignment happens silently. However, I can now
call on r to check its stored value:
r
[1] 2
I can use stored variables for future operations:
3.14 * r^2
[1] 12.56
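This tutorial assigns with =, but much R code you will encounter uses the arrow operator <- instead. For simple assignments like these the two behave the same; the sketch below shows both (area_approx is a name chosen here for illustration).

```r
# Both lines store the value 2 in r.
r = 2
r <- 2
# Reuse the stored value in a calculation and keep the result.
area_approx <- 3.14 * r^2
area_approx   # 12.56
```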
R has some built-in variables that we can directly make use of. For example, the
pi variable stores a more accurate version of the constant 3.14:
pi
[1] 3.141593
Now, can you make sense of the following operations (notice how I can change the
value stored in r with a new assignment operation):
r = 3
area = pi * r^2
area
[1] 28.27433
Lastly, R can use and handle other "classes" of values than just numbers. For example, character strings:
circle = "cake"
circle
[1] "cake"
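To check which class a value belongs to, you can ask R directly with class(). A small sketch, with variable names picked only for this example:

```r
x = 2
label = "cake"
flag = TRUE
class(x)       # "numeric"
class(label)   # "character"
class(flag)    # "logical"
# A number stored as text stays a character string until it is converted.
converted = as.numeric("2") + 1   # 3
```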
- Question: try the following command in R:
circle = cake
Does it run successfully? What is the problem?
Functions are conveniently enclosed operations that take zero or more inputs
and generate the desired outcome. We use a couple of examples to illustrate the
concept of R functions. The first one, the very basic c() function, combines
values into a vector:
c(1, 2, 3)
[1] 1 2 3
Notice that you call functions by providing parameters (values in the parentheses) as input. They then (most times) return values as output. You can, of course, use variables as input, or assign the returned value to new variables. Imagine two researchers individually collected sample measurements of the same population, and now would like to combine their data. They can do so with:
samples1 = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2)
samples2 = c(2, 3)
samples_all = c(samples1, samples2)
samples_all
[1] 3 4 2 4 7 5 5 6 3 2 2 3
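Once values are in a vector, you can inspect parts of it and apply arithmetic to all elements at once. The following sketch uses the combined vector from above; doubled is a name introduced here for illustration.

```r
samples_all = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2, 2, 3)
length(samples_all)         # number of elements: 12
samples_all[1]              # first element (R counts from 1): 3
samples_all[2:4]            # elements 2 through 4: 4 2 4
sum(samples_all)            # 46
doubled = samples_all * 2   # arithmetic is applied element by element
doubled[5]                  # 14
```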
The second example, t.test(), does exactly what its name suggests: it performs
a t-test between two vectors,
to see if the difference in their means is statistically significant:
t.test(samples1, samples2)
Welch Two Sample t-test
data: samples1 and samples2
t = 2.2047, df = 3.9065, p-value = 0.09379
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4340879 3.6340879
sample estimates:
mean of x mean of y
4.1 2.5
Certain function parameters have names, and you can explicitly invoke them during
function calls. For example, here you will notice that the test performed
is a two-sided test.
What if we wanted to perform a one-sided test, to see if the average of samples1 is
significantly higher than that of samples2? For this, we can invoke the
alternative parameter in t.test(), which lets us select one of the options
("two.sided", "less", or "greater"), depending on the alternative hypothesis
we are interested in.
t.test(x = samples1, y = samples2, alternative = "greater")
Welch Two Sample t-test
data: samples1 and samples2
t = 2.2047, df = 3.9065, p-value = 0.04689
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.04217443 Inf
sample estimates:
mean of x mean of y
4.1 2.5
You can check the full list of parameters for functions in R with the command
? + function name. For example, ?t.test gives you the full documentation
for the function t.test.
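The printed t-test report is convenient to read, but the value returned by t.test() can also be stored and queried piece by piece. A minimal sketch using the same two vectors (result is a name chosen here):

```r
samples1 = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2)
samples2 = c(2, 3)
result = t.test(x = samples1, y = samples2, alternative = "greater")
names(result)       # components stored in the returned object
result$p.value      # about 0.04689, matching the printed report
result$statistic    # the t statistic, about 2.2047
result$estimate     # the two sample means, 4.1 and 2.5
```

Storing the result this way is useful when you want to reuse the p-value in later code rather than copy it from the screen.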
The functions we used so far are built-in. Just like variables, we can also
create our own functions, by invoking the function keyword.
area_circle = function(r) {
  return(pi * r^2)
}
area_circle(r = 3)
[1] 28.27433
- Question: study the following two functions, aimed at calculating the overall
mean of samples collected by two separate researchers.
- What happened in each function?
- What are their differences?
- Which one is better?
overall_mean1 = function(samples1, samples2) {
  samples_all = c(samples1, samples2)
  return(mean(samples_all))
}
overall_mean2 = function(samples1, samples2) {
  mean1 = mean(samples1)
  mean2 = mean(samples2)
  return((mean1 + mean2) / 2)
}
- Hint: imagine the following scenarios:
- If the first researcher collected a lot more samples than the second one, which way is better?
- If the first researcher collected a lot more samples than the second one, but their experimental protocol is flawed, leading to overestimation of measurements, which way is better?
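To get a feel for the difference, it can help to run both functions on the two sample vectors used earlier, where the first researcher collected ten measurements and the second only two. This is a sketch for exploration, not a verdict on which function is better; pooled and averaged are names introduced here.

```r
overall_mean1 = function(samples1, samples2) {
  mean(c(samples1, samples2))
}
overall_mean2 = function(samples1, samples2) {
  (mean(samples1) + mean(samples2)) / 2
}
samples1 = c(3, 4, 2, 4, 7, 5, 5, 6, 3, 2)
samples2 = c(2, 3)
# Every measurement counts equally: 46 / 12, about 3.83.
pooled = overall_mean1(samples1, samples2)
# Every researcher counts equally: (4.1 + 2.5) / 2 = 3.3.
averaged = overall_mean2(samples1, samples2)
c(pooled = pooled, averaged = averaged)
```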
We will use an example project of the most popular baby names in the United States and the United Kingdom. A cleaned and merged version of the data file is available at http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv.
In order to read data from a file, you have to know what kind of file it is.
?read.csv
Q. What would you use for other file types?
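As a starting point for that question, base R has sibling functions of read.csv() for other delimited text formats. The sketch below passes small made-up tables through the text argument so that it runs without any files; the counts are toy values.

```r
# Tab-separated text: read.delim() expects tabs by default.
tsv_text = "Name\tCount\nsophie\t10\nchloe\t20"
tab_data = read.delim(text = tsv_text)
# Any other single-character separator: read.table() with sep.
semi_text = "Name;Count\nsophie;10\nchloe;20"
semi_data = read.table(text = semi_text, header = TRUE, sep = ";")
tab_data
semi_data
# Excel files are not covered by base R; an add-on package such as readxl is one option.
```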
Read in the file and assign the result to the name baby.names.
baby.names = read.csv(file="https://www.dropbox.com/s/kr76pha2p82snj4/babyNames.csv?dl=1")
Look at the first few lines (head() shows the first 6 rows by default):
head(baby.names)
What kind of object is our variable?
class(baby.names)
str(baby.names)
Here we will work further with the baby.names data that you loaded above.
Usually, data read into R will be stored as a data.frame.
- A data.frame is a list of vectors of equal length
- Each vector in the list forms a column
- Each column can be a different type of vector
- Typically columns are variables and the rows are observations
A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
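To make the structure concrete, here is a small hand-built data.frame. The names and counts are toy values, not taken from the baby names file.

```r
toy = data.frame(
  Name = c("sophie", "chloe", "jack"),
  Sex = c("Female", "Female", "Male"),
  Count = c(10, 20, 30),
  stringsAsFactors = FALSE
)
dim(toy)             # 3 rows, then 3 columns
nrow(toy)            # 3
ncol(toy)            # 3
sapply(toy, class)   # each column has its own type
```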
Check the dimensions of the data.frame then extract the first three rows of the data.frame.
dim(baby.names)
baby.names[1:3,]
Extract the first three columns of the data.frame.
baby.names[,1:3]
Output a specific column of the data.frame.
baby.names$Name
Output only the unique values within a column.
unique(baby.names$Name)
Using the base function sum(), count how many times the name "jill" appears in the table. Names in this table appear in lowercase (for example "sophie"), so the comparison uses "jill".
sum(baby.names$Name == "jill")
Extract rows where Name == "jill". This can also be done with the subset command in R.
baby.names[baby.names$Name == "jill",]
subset(baby.names, Name == "jill")
Operator | Meaning |
---|---|
== | equal to |
!= | not equal to |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
%in% | contained in |
& | and |
\| | or |
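The operators in the table can be combined to build more specific filters. A sketch on a toy data.frame (all values are made up for illustration):

```r
toy = data.frame(
  Name = c("sophie", "chloe", "jack", "jill"),
  Sex = c("Female", "Female", "Male", "Female"),
  Year = c(1996, 2004, 2004, 2010),
  stringsAsFactors = FALSE
)
# & : both conditions must hold.
female_after_2003 = toy[toy$Sex == "Female" & toy$Year > 2003, ]
# | : at least one condition must hold.
early_or_male = toy[toy$Year < 2000 | toy$Sex == "Male", ]
# %in% : the value matches one of several candidates.
picked = subset(toy, Name %in% c("jack", "jill"))
female_after_2003$Name   # "chloe" "jill"
early_or_male$Name       # "sophie" "jack"
picked$Name              # "jack" "jill"
```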
Exercise # 1:
How many female babies are listed in the table?
How many babies were born after 2003? Save the subset in a new dataframe.
Add a new column specifying the country.
head(baby.names)
table(baby.names$Location)
Output:
> head(baby.names)
Location Year Sex Name Count Percent Name.length
1 England and Wales 1996 Female sophie 7087 2.394273 6
2 England and Wales 1996 Female chloe 6824 2.305421 5
3 England and Wales 1996 Female jessica 6711 2.267245 7
4 England and Wales 1996 Female emily 6415 2.167244 5
5 England and Wales 1996 Female lauren 6299 2.128055 6
6 England and Wales 1996 Female hannah 5916 1.998662 6
> table(baby.names$Location)
AK AL AR AZ
8685 31652 23279 42775
CA CO CT DC
133257 35181 21526 11074
DE England and Wales FL GA
8625 227449 77218 58715
HI IA ID IL
12072 22748 16330 66576
IN KS KY LA
40200 24548 27041 35370
MA MD ME MI
33190 35279 10030 51601
MN MO MS MT
32946 37078 24520 9759
NC ND NE NH
51874 8470 17549 9806
NJ NM NV NY
47219 18673 21894 89115
OH OK OR PA
55633 29857 27524 53943
RI SC SD TN
9020 29738 9687 39714
TX UT VA VT
113754 29828 44859 5519
WA WI WV WY
41231 32858 13726 5786
baby.names$Country = "US"
baby.names$Country[baby.names$Location == "England and Wales"] = "UK"
head(baby.names)
Output:
> head(baby.names)
Location Year Sex Name Count Percent Name.length Country
1 England and Wales 1996 Female sophie 7087 2.394273 6 UK
2 England and Wales 1996 Female chloe 6824 2.305421 5 UK
3 England and Wales 1996 Female jessica 6711 2.267245 7 UK
4 England and Wales 1996 Female emily 6415 2.167244 5 UK
5 England and Wales 1996 Female lauren 6299 2.128055 6 UK
6 England and Wales 1996 Female hannah 5916 1.998662 6 UK
table(baby.names$Country)
Some cleaning may be needed before you run statistics, especially when it comes to metadata. Here, let's take a look at the Sex column.
table(baby.names$Sex)
Do you notice any discrepancy in the output?
If we ran statistics on this column, the results would be confused by the inconsistent classification of males here. To fix the column you can run:
baby.names$Sex = gsub("M$", "Male", baby.names$Sex)
Why do we need the $ sign? What happens if we omit it?
Check the output table again.
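If you want to experiment with the pattern before touching the real column, a tiny test vector works well. The sketch below assumes the column mixes the labels "Male" and "M"; the vector is made up.

```r
sex = c("Female", "Male", "M", "M")
# With $, the pattern only matches an M at the very end of the string.
anchored = gsub("M$", "Male", sex)
# Without $, every capital M is replaced, including the one in "Male".
unanchored = gsub("M", "Male", sex)
anchored     # "Female" "Male" "Male" "Male"
unanchored   # "Female" "Maleale" "Male" "Male"
```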
Now that we have made some changes to our data set, we might want to save those changes to a file.
getwd() # Check current working directory. Is this where you want to save your file?
setwd("/home/hutlab_public/Desktop") # Change the current working directory
getwd()
dir.create("R_Tutorial") # Create a new directory
setwd("/home/hutlab_public/Desktop/R_Tutorial")
write.csv(baby.names, file="babyNames_v2.csv")
How would you save other file formats?
Locate and open the file outside of R.
save(baby.names, file="babyNames.Rdata")
How do you load an R object?
?load
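The round trip of writing and reading back can be tried safely in a temporary location. This sketch uses a toy data.frame and tempfile() paths, which are assumptions made here so that nothing on your Desktop is overwritten.

```r
toy = data.frame(Name = c("sophie", "chloe"), Count = c(10, 20),
                 stringsAsFactors = FALSE)
# CSV: row.names = FALSE avoids writing an extra column of row numbers.
csv_path = tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)
toy_csv = read.csv(csv_path, stringsAsFactors = FALSE)
# Tab-separated text via write.table().
tsv_path = tempfile(fileext = ".tsv")
write.table(toy, file = tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)
# R object: save() stores the object under its name, load() restores it.
rdata_path = tempfile(fileext = ".Rdata")
save(toy, file = rdata_path)
rm(toy)
loaded_names = load(rdata_path)   # returns the names of the restored objects
loaded_names                      # "toy"
```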
Descriptive statistics of single variables are straightforward:
Find the mean of baby name lengths:
mean(baby.names$Name.length)
Find the median of baby name lengths:
median(baby.names$Name.length)
Find the standard deviation of baby name lengths:
sd(baby.names$Name.length)
Summarize the baby name lengths:
summary(baby.names$Name.length)
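The same kind of summaries can be computed per group. The sketch below uses a toy data.frame with made-up name lengths and the base function tapply(); group_means is a name introduced here.

```r
toy = data.frame(
  Sex = c("Female", "Female", "Male", "Male"),
  Name.length = c(6, 5, 4, 3),
  stringsAsFactors = FALSE
)
range(toy$Name.length)                              # smallest and largest: 3 6
quantile(toy$Name.length, probs = c(0.25, 0.75))    # lower and upper quartiles
# Mean name length within each level of Sex.
group_means = tapply(toy$Name.length, toy$Sex, mean)
group_means                                         # Female 5.5, Male 3.5
```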
Exercise #3:
Which are the longest names?
Which are the shortest names?
summary(baby.names)
Compare the length of baby names for boys and girls using a boxplot.
p = ggplot(data = baby.names, aes(x = Sex, y = Name.length)) +
  geom_boxplot()
ggsave(plot = p,
       filename = "basic_box_introR.png",
       width = 7, height = 6)
p

Adding color to the boxplot:
p2 = ggplot(baby.names, aes(x = Sex, y = Name.length, fill = Sex)) +
  geom_boxplot() +
  theme_bw() +
  labs(y = "Length of Name")
ggsave(plot = p2,
       filename = "fill_box_introR.png",
       width = 7, height = 6)
p3 = ggplot(baby.names, aes(x = Sex, y = Name.length, color = Sex)) +
  geom_boxplot(lwd = 2) +
  theme_bw() +
  labs(y = "Length of Name")
ggsave(plot = p3,
       filename = "color_box_introR.png",
       width = 7, height = 6)
p2

p3

Change the layout of the plot:
- Add a plot title.
- Add a title to the y-axis.
- Change the color of the boxplot. A good place to look up color names are:
How many names were recorded for each year?
- Check the timeframe that is included in the table.
- How many records were obtained in total?
p4 = ggplot(baby.names, aes(x = Name.length)) +
  geom_histogram()
ggsave(plot = p4,
       filename = "basic_histogram_introR.png",
       width = 7, height = 6)
p4

Exercise # 4:
Take a look at ?geom_histogram and change the layout of the plot. Googling works well too!
The "tidyverse" is
- a dialect of R
- a collection of R packages with a shared design philosophy focused on manipulating data frames
One prominent feature of the tidyverse is the pipe: %>%. Pipes take input from their left and pass it as the first argument to the function on their right. So these two lines of code will produce the same result:
f(x,y)
x %>% f(y)
This makes code more readable when chaining multiple operations performed on an input data frame.
The example command below takes the baby.names data, filters out the "England and Wales" observations, groups by Year and Sex, then computes the average name length by group, and arranges the result in descending order. This is done using several functions from the dplyr package from the tidyverse family.
library(dplyr)
baby.names %>%
  filter(Location != "England and Wales") %>%
  group_by(Year, Sex) %>%
  summarise(mean_length = mean(Name.length)) %>%
  arrange(-mean_length)
You can see that the command ends up looking similar to the English sentence describing what it does. The final result shows that Females in 1989 had the longest names on average.
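For comparison, the same sequence of steps can be sketched in base R without dplyr. The version below runs on a toy data.frame with made-up values rather than the real file, and us_only and by_group are names introduced here; it is one possible translation, not the only one.

```r
toy = data.frame(
  Location = c("CA", "CA", "CA", "England and Wales"),
  Year = c(1996, 1996, 1997, 1996),
  Sex = c("Female", "Female", "Female", "Female"),
  Name.length = c(6, 4, 7, 5),
  stringsAsFactors = FALSE
)
# Step 1: drop the "England and Wales" rows.
us_only = toy[toy$Location != "England and Wales", ]
# Steps 2 and 3: mean name length for each Year and Sex combination.
by_group = aggregate(Name.length ~ Year + Sex, data = us_only, FUN = mean)
names(by_group)[names(by_group) == "Name.length"] = "mean_length"
# Step 4: sort from longest to shortest mean length.
by_group = by_group[order(-by_group$mean_length), ]
by_group
```

Each intermediate result has to be named or nested here, which is the bookkeeping the pipe removes.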
Base R introduced its own pipe in version 4.1.0 that looks like this: |>. There are some subtle differences in behavior, but for the most part the two pipes are interchangeable.
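A minimal sketch of the native pipe, assuming R 4.1.0 or later and no extra packages; the numbers are arbitrary.

```r
x = c(1, 4, 9)
# Nested call: read from the inside out.
nested = sum(sqrt(x))
# Piped call: read from left to right.
piped = x |> sqrt() |> sum()
nested   # 6
piped    # 6
```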
Other tutorials using the baby names data: rpub