Getting Started with R - jonathancolmer/lab-guide GitHub Wiki

R is an open-source programming language and environment designed specifically for statistical computing and graphics. It is often used for data cleaning, running regressions, and creating figures.

1. Setting Up R

Installation

Download and install R from the Comprehensive R Archive Network (CRAN) website. Once R is installed, you must choose an environment to write and run R code:

RStudio is the most popular integrated development environment (IDE) for R, but does not have integrated AI features.

  • Download RStudio from the Posit website.
  • RStudio will automatically detect your R installation, so you can begin using R by opening RStudio without further installation steps.

You can also write and run R code in VS Code.

  • Install the R extension in VS Code (developer: REditorSupport; version v2.8.4 or the most recent version).
  • Install the “languageserver” package in R.
    • Open RStudio and run this line of code: install.packages("languageserver")

Basics

Here are some essential things to know when working with R:

  • Running Code:

    • R executes commands one line at a time.
    • Highlight chunks of code, then click “Run” to execute multiple lines at once.
  • Syntax Essentials:

    • Case Sensitivity: R distinguishes between uppercase and lowercase letters.
    • Assignment Operator: Use <- (e.g., x <- 5) to assign values.
    • Comments: Begin lines with # to add notes that the interpreter ignores.
  • Documentation:

    • Use ?function_name or help(function_name) to access built-in documentation.
  • Working Areas (RStudio):

    • Write longer pieces of code in the script editor (the box in the top left) so they can be saved.
    • The console can be used to test individual commands.
  • Packages:

    • A package is a bundled collection of functions, data, and documentation that adds new features and tools to your environment.
    • Install new packages with install.packages("packageName") and load them using library(packageName).
  • Working Directory:

    • Use getwd() to display the current working directory.
    • Use setwd("your/path/here") to change the working directory.
    • Example code for setting a working directory can be found here.
  • Functions:

    • Functions are reusable blocks of code that perform specific tasks.
    • Call a function by writing its name followed by parentheses containing any required inputs (arguments).
    • Many useful functions are already available in packages, but you can also write your own.
    • Example: the sum function – sum(3, 5)
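The basics above can be tried out in the console with a short sketch like the one below. It uses only base R; the object names (x, X, total) and the function add_two are hypothetical names chosen for illustration.

```r
# Assignment: store the value 5 in an object named x
x <- 5

# Case sensitivity: 'x' and 'X' are two different objects
X <- 10
x + X        # 15

# Call a built-in function by writing its name and passing inputs in parentheses
total <- sum(3, 5)
total        # 8

# Write your own function ('add_two' is a hypothetical name)
add_two <- function(value) {
  value + 2
}
add_two(x)   # 7

# Display the current working directory
getwd()

# To open the built-in documentation for sum(), run: ?sum
```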

2. Data Wrangling

Data wrangling is essential in research for cleaning and transforming raw data into analysis-ready formats. The tidyverse provides a suite of integrated packages that simplify these tasks through consistent syntax and design.

Keep your script as easy to read as possible, and use comments generously. The tidyverse style guide describes useful best practices for writing R code.

Importing Data

CSV Files:

  • Use the readr package (part of the tidyverse):

    • read_csv("file.csv")

Excel Files:

  • Use the readxl package:

    • read_excel("file.xlsx")

Stata, SPSS, SAS Files:

  • Use the haven package:

    • read_dta("file.dta") to import Stata files; haven also provides read_sav() for SPSS files and read_sas() for SAS files.

Output:

  • These functions return a data frame, R’s core table structure.
  • Inspect your data with head() or str().
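As a minimal sketch of the import-and-inspect step, the example below first writes a small CSV to a temporary file so that it runs without any external data. The column names and values are made up for illustration, and the readr package is assumed to be installed.

```r
library(readr)

# Create a small CSV file in a temporary location (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    country = c("USA", "CAN"),
    year = c(2020, 2020),
    population = c(331, 38)
  ),
  csv_path
)

# Read the file into a data frame
df <- read_csv(csv_path, show_col_types = FALSE)

head(df)  # Show the first rows
str(df)   # Show column names and types
```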

Data Manipulation

Use dplyr, a core tidyverse package for data manipulation. Key functions of dplyr include:

  • filter(): Subset rows based on conditions.
  • select(): Choose specific columns.
  • mutate(): Create or transform columns.
  • arrange(): Sort rows.
  • summarise(): Compute aggregate statistics.

Many simple data tasks can be completed using either dplyr or base R code. dplyr has several advantages: its code is easier to read and work with, and it integrates with the other packages in the tidyverse.
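To make the comparison concrete, here is a small sketch that performs the same task twice, once in base R and once with dplyr. The data frame and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(
  country = c("USA", "USA", "CAN"),
  year = c(2019, 2020, 2020),
  GDP = c(21.4, 21.1, 1.6),
  population = c(328, 331, 38)
)

# Base R: subset rows, add a column, then keep selected columns
base_result <- df[df$year == 2020, ]
base_result$gdp_pc <- base_result$GDP / base_result$population
base_result <- base_result[, c("country", "gdp_pc")]

# dplyr: the same steps written as a chain of verbs
dplyr_result <- df %>%
  filter(year == 2020) %>%
  mutate(gdp_pc = GDP / population) %>%
  select(country, gdp_pc)
```

Both versions produce the same values; the dplyr version names each step, which tends to be easier to follow in longer scripts.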

Example Workflow:

library(dplyr)  # Load dplyr for data manipulation
library(readr)  # Load readr for read_csv()

data <- read_csv("mydata.csv") %>%  # Read "mydata.csv" into a data frame
  filter(year == 2020, country == "USA") %>%  # Keep only rows where 'year' is 2020 and 'country' is "USA"
  mutate(gdp_pc = GDP / population) %>%  # Create 'gdp_pc' by dividing 'GDP' by 'population'
  select(country, year, gdp_pc, unemployment_rate)  # Keep only the columns needed for analysis

Note: The pipe operator %>% allows you to chain these operations in a clear, left-to-right flow, which saves you from needing to specify the data being modified in every function.
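The following sketch illustrates what the pipe does by writing the same two steps with and without it. The data frame is hypothetical.

```r
library(dplyr)

# Hypothetical example data
df <- data.frame(year = c(2021, 2019, 2020), gdp_pc = c(3, 1, 2))

# Without the pipe: calls are nested and must be read from the inside out
nested <- arrange(filter(df, year >= 2020), year)

# With the pipe: each result is passed as the first argument of the next function
piped <- df %>%
  filter(year >= 2020) %>%
  arrange(year)

# Check that both approaches give the same result
identical(nested, piped)
```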

Summarizing Data

The summary() function gives a quick statistical summary of a dataset's columns. For numeric variables, it reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. For factor (categorical) variables, it displays the count of each level, and for logical variables, it shows the number of TRUE, FALSE, and missing (NA) values. For more advanced summary statistics, use summarise() (also spelled summarize()), which is part of the dplyr package in the tidyverse.
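As a quick illustration of how summary() treats different column types, consider the sketch below. It uses base R only, and the data frame is made up for the example.

```r
# Hypothetical data with one numeric, one factor, and one logical column
df <- data.frame(
  gdp_pc = c(1.2, 3.4, NA, 2.8),
  region = factor(c("north", "south", "south", "north")),
  treated = c(TRUE, FALSE, NA, TRUE)
)

summary(df)          # One summary per column
summary(df$gdp_pc)   # Numeric: min, quartiles, median, mean, max, and NA count
summary(df$region)   # Factor: count of each level
summary(df$treated)  # Logical: counts of FALSE, TRUE, and NA
```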

Grouped Summaries:

  • Combine group_by() with summarise() to calculate statistics by group:

Example Workflow:

# Load the dplyr package for data manipulation functions
library(dplyr)

# Assume 'data' is your data frame containing the columns 'year' and 'gdp_pc'
summary_data <- data %>%
  group_by(year) %>%  # Group the data by the 'year' column
  summarise(mean_gdp_pc = mean(gdp_pc, na.rm = TRUE))  # Compute the mean of 'gdp_pc' for each group, ignoring missing values (NA)

# Print the summary data frame to display the results
print(summary_data)

Additional Tips

Handling Missing Values:

  • Many summary functions, such as mean(), sum(), and median(), offer an na.rm argument to manage missing values:
    • na.rm = TRUE: Removes missing values (NAs) before the calculation, so the statistic is computed only on the available data.
    • na.rm = FALSE (the default): Keeps missing values in the calculation. If any NA is present, the result is typically NA, signaling that the data are incomplete.
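A short base R sketch shows the difference between the two settings. The vector below is hypothetical.

```r
# Hypothetical vector with one missing value
values <- c(10, 20, NA, 30)

mean(values)                # NA, because na.rm defaults to FALSE
mean(values, na.rm = TRUE)  # 20, the mean of 10, 20, and 30
sum(values, na.rm = TRUE)   # 60

# Count the missing values before deciding how to handle them
sum(is.na(values))          # 1
```

Checking how many values are missing first helps you judge whether dropping them is reasonable for your analysis.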

Example Code

Example code for the following data manipulation tasks can be found at these links:


3. Visualization

Data visualization in R is streamlined using ggplot2, a core tidyverse package that employs a layered grammar of graphics. This means you start with a base plot specifying your data and aesthetic mappings (linking variables in your data to your plot), then add layers (geoms) that tell ggplot2 how to represent the data visually.

Note: You may need to install ggplot2 using install.packages("ggplot2") or as part of install.packages("tidyverse").
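The layered approach described above might look like the following minimal sketch. The data frame, variable names, and labels are hypothetical, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Hypothetical example data
df <- data.frame(
  year = c(2018, 2019, 2020, 2021),
  gdp_pc = c(1.0, 1.2, 1.1, 1.4)
)

# Base plot: the data plus aesthetic mappings (year on x, gdp_pc on y)
p <- ggplot(df, aes(x = year, y = gdp_pc)) +
  geom_line() +   # Layer 1: a line connecting the observations
  geom_point() +  # Layer 2: a point at each observation
  labs(x = "Year", y = "GDP per capita", title = "GDP per capita over time")

# Draw the plot
print(p)
```

Each geom added with + is a separate layer, so swapping geom_line() for another geom changes how the same data and mappings are drawn.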

Example Code

Example code for the following data visualization tasks can be found at these links:


4. Regression

While base R has regression functions such as lm(), the fixest package contains the best toolkit for performing regressions in R. As an RA, you will likely be asked to run simple linear regressions, run intercept-only regressions to find means with standard errors, and calculate linear combinations of regression coefficients.

Example code for regression can be found here.
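As a rough sketch of the three tasks mentioned above, the example below uses simulated data and fixest's feols() function. The variable names are hypothetical, and the linear combination is computed by hand from coef() and vcov() rather than with a dedicated helper; this is one possible approach, not necessarily the one used in the lab's own example code.

```r
library(fixest)

# Simulated data; variable names are hypothetical
set.seed(1)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n)

# Simple linear regression of y on x1 and x2
m <- feols(y ~ x1 + x2, data = df)
summary(m)

# Intercept-only regression: the intercept equals the sample mean of y
# and is reported with a standard error
m_mean <- feols(y ~ 1, data = df)
summary(m_mean)

# Linear combination of coefficients (here x1 + x2), computed by hand
# from the coefficient vector and its variance-covariance matrix
b <- coef(m)
V <- vcov(m)
w <- c("(Intercept)" = 0, x1 = 1, x2 = 1)  # Weight on each coefficient
lincom_est <- sum(w * b[names(w)])
lincom_se <- sqrt(as.numeric(t(w) %*% V[names(w), names(w)] %*% w))
c(estimate = lincom_est, std_error = lincom_se)
```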
