Introduction to R

By: Terry Hu

This guide is here to help with R syntax and how to use RStudio, the IDE developed by Posit (the company formerly known as RStudio). By copying the code in the code blocks, you will be able to follow along with this tutorial. A version of this guide with all the compiled results can be found here.

History

R is a programming language focused on statistical computing and graphics, and it is part of the GNU Project. The GNU Project is part of the Free Software Movement, which seeks to make programs free for all to use. R is available under the terms of the Free Software Foundation's GNU General Public License and is widely supported across systems including Windows, macOS and Linux. As a statistical programming language, R lacks the versatility of general-purpose languages such as Python or C++. However, R excels at statistical work on large datasets, including matrix and array operations, data analysis and data visualization.

R's core development team has been committed to documentation, and with continual dedication from the R community, core R libraries and functions are well cataloged. Many packages and libraries exist for the language, and users can also write their own functions, increasing its flexibility. For advanced programmers, R allows code written in C/C++ to be called and R objects to be manipulated by that C/C++ code.

One of the most common environments for running R is RStudio, whose developer is now known as Posit; RStudio is the IDE this wiki uses (but feel free to use other environments as well).

You can read more about the GNU Project here or about the R Project here.

Some key ideas behind R Syntax

Below is a short list of R syntax points that users of other languages should take note of when using R.

R is case sensitive
Hash signs (#) are used for comments
<- is used for assignments
Whitespace can matter: x<-3 assigns 3 to x, while x < -3 compares x with -3
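
The points above can be tried out with a minimal sketch like the one below. The variable names (total, Total, x, is_small) are made up for illustration only.

```r
# Names are case sensitive: total and Total are two different variables
total <- 10
Total <- 20

# Whitespace can change the meaning of a line
x <- 5
x<-3                 # assignment: x is now 3
is_small <- x < -3   # comparison: is x less than -3? x itself is unchanged

print(x)
print(is_small)
```

Running the sketch should show that x ends up as 3 and that the comparison is FALSE, since 3 is not less than -3.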

R data structures

R, like many other programming languages, has data structures. The main types of data structures in R are vectors, matrices, arrays, lists and data frames. The first three are similar and differ only in their dimensionality: vectors are 1-D, matrices are 2-D, and arrays are n-D. Unlike in many other programming languages, the data type need not be declared when a variable is assigned; rather, R infers the data structure, often from the arguments passed in.

Below are some of the basic data structures in R and how to use and declare them.

Vectors

A vector in R is a group of data elements of the same type stored together
Vectors have two properties: type and length
There are 6 primary types of atomic vectors:

Logical (Boolean)
Integer
Double
Character (String)
Complex
Raw

To create vectors that contain integers, add L after each number

x <- c(2L, 3L, 5L)
print(x)

The code above declares a vector x that contains three integers.

To create vectors that contain strings, put quotation marks (" ") around each string

x <- c("Amber", "Anin", "Kit")
print(x)

The code above declares a vector x that contains three strings.

To find a vector's length, use the function length(vector)
To find a vector's type, use the function typeof(vector)
You can name the elements of a vector through the function names(vector)

x <- c(1, 2, 3)
names(x) <- c("a", "b", "c")
print(x)

The code above names the elements of vector x a, b and c.
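
As a small sketch of length() and typeof() in action, the block below builds one vector for four of the atomic types and inspects each. The vector names and values are invented for this example. The last line shows what happens when types are mixed: because a vector holds a single type, R converts the elements to a common one.

```r
flags   <- c(TRUE, FALSE, TRUE)      # logical
counts  <- c(2L, 3L, 5L)             # integer (note the L suffix)
weights <- c(1.5, 2, 3.25)           # double
people  <- c("Amber", "Anin", "Kit") # character

print(typeof(flags))    # "logical"
print(typeof(counts))   # "integer"
print(typeof(weights))  # "double"
print(typeof(people))   # "character"
print(length(counts))   # 3

# Mixing types in c() coerces everything to one type
mixed <- c(1L, "a")
print(typeof(mixed))    # "character"
```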

Lists

Lists can contain elements of different data types
To create a list, you can use function list()

cars<-list("Taycan", 911L, "A45S", TRUE)

In this example, the list consists of two strings ("Taycan" and "A45S"), an integer (911) and a boolean (TRUE).

You can store a list inside a list
Just like everything else, you should use the assignment operator to create a list
To see the structure of a list, use the str() function

str(cars)

The code above displays the type of each element in the list cars.

To name the elements of a list, use = inside list()

porsche <- list('Taycan' = 1, '911' = 2)
print(porsche)

The code above creates a list porsche whose elements are named Taycan and 911.
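
To illustrate storing a list inside a list, here is a minimal sketch. The list garage and its element names (owner, cars) are hypothetical; the inner list reuses the values from the cars example above.

```r
# A list with named elements, one of which is itself a list
garage <- list(
  owner = "Sam",                             # hypothetical value
  cars  = list("Taycan", 911L, "A45S", TRUE)
)

str(garage)                    # shows the nested structure

print(garage$owner)            # access a named element with $
print(garage[["cars"]][[2]])   # second element of the inner list
print(names(garage))           # names of the outer list
```

Named elements can be reached with $ or with [[ ]], while unnamed elements are reached by position with [[ ]].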

Dates and Time in R

Often when working with data, times and dates are important. By default, R uses the YYYY-MM-DD format. Below are some functions from the lubridate package that deal with dates and times:

today() returns today's date
now() returns the current date and time
ymd("string"), mdy("string"), dmy("string") convert a string to a date
as_date() converts a date-time to a date
ymd_hms("string") converts a string with hours, minutes and seconds to a date-time

install.packages("lubridate")
library("lubridate")
print(today("GMT"))
print(now("GMT"))

In the code above, we can see the use of the package "lubridate," which contains the listed functions for manipulating dates and times. You can install packages with the function install.packages().
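
The parsing functions listed above can be sketched as follows, assuming lubridate is already installed. The date strings are arbitrary examples, and the three parsers are given the same calendar day written in three different orders.

```r
library(lubridate)

d1 <- ymd("2023-03-15")   # year-month-day
d2 <- mdy("03/15/2023")   # month-day-year
d3 <- dmy("15-03-2023")   # day-month-year
print(d1 == d2 && d2 == d3)   # all three are the same date

# A date-time with hours, minutes and seconds (UTC by default)
stamp <- ymd_hms("2023-03-15 14:30:05")
print(stamp)
print(hour(stamp))      # extract the hour component
print(as_date(stamp))   # drop the time, keep the date
```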

Boolean and Logical Operators

When working with data, boolean and logical operators help the user compare values. Below is a list of commonly used boolean operators and control-flow statements in R. Note that both & and && stand for the AND operation (& works element by element on vectors, while && compares single values), and that although <- is the assignment operator, the EQUALS operation still uses ==.

AND (& or &&)
OR (| or ||)
NOT (!)
EQUALS (==)
if (condition) { }
else { }
else if (condition) { }
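
A short sketch of these operators and statements is shown below. The vector temps and the labels are invented for illustration.

```r
temps <- c(12, 25, 31)

# Element-wise operators return one result per element
print(temps > 20 & temps < 30)   # FALSE  TRUE FALSE
print(temps < 15 | temps > 30)   #  TRUE FALSE  TRUE
print(!(temps == 25))            #  TRUE FALSE  TRUE

# if / else if / else work on a single TRUE or FALSE,
# so && is used here to combine two single comparisons
reading <- temps[2]
if (reading > 30) {
  label <- "hot"
} else if (reading > 20 && reading <= 30) {
  label <- "warm"
} else {
  label <- "cool"
}
print(label)   # "warm"
```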

Other Basic Functions and Structures

Data frames are 2-D tables, similar to spreadsheets; they can be created with the function data.frame()

a <-data.frame(x=c(1,2,3), y=c(2,3,4))
print(a)

In the example above, a data frame a is created that consists of two vectors x and y.

The following functions deal with file and object manipulation.

dir.create("directory_name") creates a new directory
file.create("file_name") creates a new file
file.copy("file_name", "destination") copies a file to a destination
matrix(vector, nrow = n) or matrix(vector, ncol = n) creates a matrix with n rows or n columns

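# Name the variable m so it does not shadow the matrix() function
m <- matrix(3:8, nrow = 2)
print(m)

The code above creates a matrix m with two rows from the values 3 to 8, filled column by column, and prints it.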
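
The file functions listed above can be tried safely in a temporary directory. The sketch below is only an illustration: the directory and file names (r_intro_demo, notes.txt, notes_backup.txt) are made up, and tempdir() is used so that nothing is written into your project folder.

```r
# Work inside R's per-session temporary directory
work_dir <- file.path(tempdir(), "r_intro_demo")
dir.create(work_dir, showWarnings = FALSE)

original <- file.path(work_dir, "notes.txt")
backup   <- file.path(work_dir, "notes_backup.txt")

file.create(original)                   # create an empty file
writeLines("hello from R", original)    # put one line of text in it
file.copy(original, backup, overwrite = TRUE)

print(list.files(work_dir))             # both files should be listed
```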

facet_wrap() (from ggplot2) creates separate plots for different subsets of the data
install.packages("package") installs packages to diversify functionality
library("package") loads a package
Usually you will need to install and load packages before being able to use them
tidyverse is one of the most useful R packages. You can use tidyverse_update() to update it. tidyverse has the following core packages:

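ggplot2 - data visualization
tidyr - data cleaning
readr - importing data
dplyr - data manipulation
tibble - streamlined data frames
purrr - functional programming
stringr - string manipulation
forcats - working with factors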

For additional help, you can visit CRAN

Pipe

A pipe in R passes the output of one statement as the input of the next statement. Pipes are written with the %>% operator.
The standard form of a pipe is FUNCTION_1 %>% FUNCTION_2
In RStudio, the line following a pipe is auto-indented
The pipe operator should be added after each operation except the last one
In RStudio, a pipe can also be inserted using CTRL + SHIFT + M

data("ToothGrowth")
install.packages("tidyverse")
library("tidyverse")
filtered_toothgrowth <-ToothGrowth %>%
  filter(dose==0.5) %>%
  arrange(len)
print(filtered_toothgrowth)
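
To see what the pipe does, it can help to compare it with the same steps written as nested function calls. The sketch below assumes dplyr (part of tidyverse) is installed; the names nested and piped are chosen only for this comparison.

```r
library(dplyr)
data("ToothGrowth")

# Without a pipe: read from the inside out
nested <- arrange(filter(ToothGrowth, dose == 0.5), len)

# With a pipe: read from top to bottom
piped <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

# Both approaches produce the same data frame
print(identical(nested, piped))
```

The piped form tends to be easier to read once more than two steps are chained, which is the main reason to prefer it.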

Dataframes and R

Dataframes are collections of columns
- columns should be named
- data stored can be of different types
- each column should contain the same number of data items
Tibbles are streamlined dataframes
- never change the data types of inputs
- never change variable names
- never create row names
- make printing easier
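
One of the tibble properties above, that variable names are never changed, can be checked with a small sketch. It assumes the tibble package is installed; the column name "bill length" (with a space) is made up to trigger the difference.

```r
library(tibble)

# data.frame() rewrites a name containing a space by default
df <- data.frame(`bill length` = c(39.1, 40.3))

# tibble() keeps the name exactly as written
tb <- tibble(`bill length` = c(39.1, 40.3))

print(names(df))   # "bill.length"
print(names(tb))   # "bill length"
```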

Functions to work with dataframes and tibbles

head(dataset) views first rows
str(dataset) views structure
colnames(dataset) views column names
mutate(dataset, new_col_name = definition) adds a new column
readr functions: for all readr functions, remember to put "" around the file name -> "file_name.format"

read_csv()
read_tsv()
read_table()
read_delim()
read_log()
read_fwf()
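
As a minimal sketch of importing data with readr's read_csv(), the block below first writes a small CSV file to a temporary location and then reads it back. It assumes readr is installed; the file and its contents are invented for the example, and in practice you would pass the path to your own file in quotes.

```r
library(readr)

# Create a small CSV file to read (stand-in for a real data file)
csv_path <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c(1, 2, 3), y = c(2, 3, 4)), csv_path)

# Read the file; the path is a quoted string
a <- read_csv(csv_path)

print(a)
print(colnames(a))
```

read_csv() returns a tibble rather than a plain data frame.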

To generate summaries of data:

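skim_without_charts(dataset) (from the skimr package) gives a comprehensive summary
glimpse(dataset) lists every column with its type and first few values
head(dataset) shows the first rows
select(dataset, column) selects specific columns, e.g. a single column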

install.packages("palmerpenguins")
library("palmerpenguins")
penguins %>%
  select(species)

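clean_names(dataset) (from the janitor package) standardizes the column names
rename(dataset, column_name_new = column_name_old) renames a column; the dataset has to be passed in first, either directly or through a pipe
rename_with(dataset, function) applies a function such as toupper or tolower to the column names
drop_na() leaves out rows with missing values
summarize() computes a summary of the data; you choose which summaries (for example max or mean) to include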

penguins %>%
  group_by(island)%>%
  drop_na() %>%
  summarize(max_bill_length = max(bill_length_mm))
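
The renaming and column-creation functions above can be sketched on the built-in ToothGrowth dataset, which is used here instead of penguins only so the example runs without extra packages beyond dplyr. The new column name DOUBLE_DOSE is made up for illustration.

```r
library(dplyr)
data("ToothGrowth")

renamed <- ToothGrowth %>%
  rename(length = len) %>%           # new name = old name
  rename_with(toupper) %>%           # upper-case every column name
  mutate(DOUBLE_DOSE = DOSE * 2)     # add a column computed from another

print(colnames(renamed))
print(head(renamed))
```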

dataframe_name <- data.frame(member_1, member_2, ...) creates a custom data frame, similar to a struct in C
separate() separates a column into new columns
unite() combines columns together
pivot_longer() increases the number of rows and decreases the number of columns
pivot_wider() increases the number of columns and decreases the number of rows
arrange(dataset, column) shows the data in ascending order of a column
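
The reshaping functions above can be sketched on a tiny hand-made data frame. This assumes tidyr and dplyr are installed; the data frame scores and all of its values are invented for the example.

```r
library(tidyr)
library(dplyr)

scores <- data.frame(
  student = c("Ann_Lee", "Bo_Kim"),
  math    = c(90, 75),
  art     = c(85, 95)
)

# separate(): split one column into two at the underscore
split_names <- separate(scores, student, into = c("first", "last"), sep = "_")

# unite(): glue the two columns back together
joined <- unite(split_names, "student", first, last, sep = "_")

# pivot_longer(): one row per student and subject (more rows, fewer columns)
long <- pivot_longer(scores, cols = c(math, art),
                     names_to = "subject", values_to = "score")

# pivot_wider(): back to one column per subject (fewer rows, more columns)
wide <- pivot_wider(long, names_from = subject, values_from = score)

# arrange(): sort rows by score in ascending order
sorted <- arrange(long, score)

print(long)
print(wide)
print(sorted)
```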

GGPLOT2

ggplot2 is one of the most widely used packages for creating visuals in R. Its extensive feature set allows the user to control different parts of the visualization process and create graphs that help tell the story they want. In this section, we will see some basic functions of ggplot2 and how to build visualizations of data.

Remember to install the package first with install.packages("ggplot2")

Benefits of ggplot2	Elements of ggplot2
Can create different types of plots	Aesthetics
Customize the look and feel of a plot	Geom - geometric objects used to represent data
Create high quality visuals	Facets - allows you to display subsets of data
Combine data visualization and manipulation	Labels and Annotations

Making Viz (visuals) with ggplot2

Usually we will also need to load the packages with the following code:

install.packages("ggplot2")
library(ggplot2)
install.packages("palmerpenguins")
library(palmerpenguins)

However, since they were already loaded earlier, we will not run this code again, to save computing resources.

A basic plot from ggplot2 will look as follows:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))

To begin the plot, use the function ggplot(data)
Next, use + to add a layer. The + always has to go at the end of a line, similar to a pipe
Use a geom_ function to display the data
Map variables to aesthetics using the aes() function

To set an aesthetic to a fixed value instead of mapping it to a variable, put it outside the aes() function

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g), color="red", shape="square")

You can also map multiple things at once

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, shape=species, color=sex))

Types of geom_functions

geom_point() creates a scatter plot
geom_bar() creates a bar chart
geom_line() creates a line chart
geom_smooth() adds a trendline
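
Geoms can be layered on the same plot. The sketch below adds a trendline on top of a scatter plot. To stay self-contained it uses the mpg dataset that ships with ggplot2 rather than penguins, which is a substitution made for this example; displ and hwy are columns of mpg.

```r
library(ggplot2)

# Scatter plot plus a smoothed trendline, each added as its own layer
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

print(p)
```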

Types of aesthetics

X
Y
Size
Shape
Alpha - transparency
Color

You can get more information from the ggplot2 cheat sheet

To make a stacked bar chart

To make a stacked bar chart, you make a normal bar chart but map x to one variable and fill to another variable. For example:

ggplot(data=diamonds)+
  geom_bar(mapping=aes(x=cut, fill=color))

Facet Functions

Facet functions let you view subsets of a dataset; they usually make many smaller plots side by side

facet_wrap() - wraps a 1-D sequence of plots into 2-D, usually when faceting by a single variable
facet_grid() - arranges the plots into a matrix of rows and columns for easier viewing
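
Both facet functions can be sketched on the same base plot. As an assumption made for this example, the mpg dataset bundled with ggplot2 is used; class, drv and cyl are columns of mpg, and the object names are arbitrary.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One small plot per vehicle class, wrapped into rows and columns
wrapped <- base_plot + facet_wrap(~class)

# A matrix of plots: rows by drive type, columns by number of cylinders
gridded <- base_plot + facet_grid(drv ~ cyl)

print(wrapped)
print(gridded)
```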

Label Functions

Labels and annotations allow us to add words to the plot. Labels are outside while annotations are inside.

To make labels, we can use labs(title = " ", subtitle = " ", caption = " ")
When making labels, remember to use + to add them as a layer

Annotate Functions

Use annotate() to add text inside the plot
Annotate Arguments
- x = , y = - position of text
- label = - actual text
- color =
- fontface =
- size =
- angle =
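
Labels and annotations can be combined on one plot, as in the sketch below. The mpg dataset from ggplot2 is assumed here for self-containment, and the title, caption, annotation text and coordinates are placeholder values chosen for illustration.

```r
library(ggplot2)

labelled <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  # Labels sit outside the plotting area
  labs(title = "Engine size and highway mileage",
       subtitle = "Sample subtitle",
       caption = "Data: mpg dataset from ggplot2") +
  # The annotation is drawn inside the plotting area at (x, y)
  annotate("text", x = 5.5, y = 40, label = "Sample annotation",
           color = "purple", fontface = "bold", size = 4, angle = 15)

print(labelled)
```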

To Save Plot

You can use the Export option in the RStudio Plots pane
You can also use ggsave()

ggsave("file_name.format")
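
A slightly fuller sketch of ggsave() is shown below. It saves to a temporary folder so nothing is written into your project; the file name mpg_scatter.pdf and the plot itself are made up for the example. ggsave() picks the output format from the file extension, so a name ending in .png would give a PNG instead.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Save the plot; width and height are in inches by default
out_file <- file.path(tempdir(), "mpg_scatter.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)

print(file.exists(out_file))
```

If the plot argument is left out, ggsave() saves the last plot that was displayed.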

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

In RStudio, you can insert a new code chunk with the shortcut CTRL + ALT + I

For more on R Markdown, you can click here

Including Plots

You can also embed plots, for example:

plot(pressure)

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Conclusion

This guide outlines the basic syntax of R and the applications where the language is useful. By following this guide, the user should be able to understand data structures in R, manipulate data and present the results visually. The guide also covers R Markdown, which can be used to create aesthetically pleasing reports that combine the code and its results. Further reading is available through the links attached, and a plethora of other learning resources can be found at the R website.

Sources and Links

Below is a list of resources that this wiki has alluded to or drawn information from: