R Cheatsheet for Data Science

Hi, I'm Elle (ellecoding). Here's an R cheatsheet for Data Science and Machine Learning I've made. Hope it helps!

Basic Syntax

Assignment: Bind a value to a name with <- or =. Both work at the top level, but <- is the conventional choice in R code.

```
x <- 5  # Assigns the value 5 to the variable x
y = 10  # Assigns the value 10 to the variable y
```

Data Structures

Vectors: Vectors are basic data structures that hold elements of the same type. Create a vector using the c() function.
```
v <- c(1, 2, 3, 4)  # Creates a numeric vector with elements 1, 2, 3, 4
```
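As a brief illustration of working with the vector above, the sketch below shows element-wise arithmetic and indexing. The names `doubled`, `big`, and `first` are illustrative only.

```r
v <- c(1, 2, 3, 4)

# Arithmetic is applied element by element
doubled <- v * 2

# Logical indexing keeps the elements where the condition is TRUE
big <- v[v > 2]

# R indexes from 1, so v[1] is the first element
first <- v[1]

print(doubled)
print(big)
```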
Matrices: Matrices are two-dimensional arrays that hold elements of the same type. Create a matrix with matrix(), specifying the data, number of rows, and columns.
```
m <- matrix(1:9, nrow=3, ncol=3)  # Creates a 3x3 matrix with values from 1 to 9
```

Lists: Lists can contain elements of different types. Use list() to create a list.

```
l <- list(a=1, b="text", c=TRUE)  # Creates a list with elements of different types
```
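A short sketch of how elements of such a list can be retrieved. The variable names `sub` and `types` are made up for this example.

```r
l <- list(a = 1, b = "text", c = TRUE)

# $ and [[ ]] return the element itself
l$b
l[["a"]]

# [ ] returns a sub-list, so the result is still a list
sub <- l["a"]
class(sub)

# Each element keeps its own type
types <- sapply(l, class)
print(types)
```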

Data Frames: Data frames are table-like structures that can hold different types of data in each column. Create a data frame with data.frame(), combining vectors of equal length.
```
df <- data.frame(id=1:4, name=c("John", "Doe", "Jane", "Smith"))  # Creates a data frame with columns id and name
```
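Once a data frame exists, the usual next step is to inspect and subset it. The following is a minimal sketch using base R only; `late_rows` is an illustrative name.

```r
df <- data.frame(id = 1:4, name = c("John", "Doe", "Jane", "Smith"))

str(df)    # Column types at a glance
df$name    # One column as a vector

# Rows where id is greater than 2 (note the trailing comma: rows, then columns)
late_rows <- df[df$id > 2, ]
print(late_rows)

dim(df)    # Number of rows and columns
```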

Factors: Factors are used to handle categorical data. Use factor() to create a factor.

```
f <- factor(c("male", "female", "male"))  # Creates a factor for the categorical variable gender
```
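To show what a factor stores, the sketch below looks at the levels, the counts per level, and the underlying integer codes. `counts` and `f2` are illustrative names.

```r
f <- factor(c("male", "female", "male"))

levels(f)            # Levels are sorted alphabetically by default
counts <- table(f)   # Number of observations per level
codes <- as.integer(f)  # Integer codes that index into levels(f)

# Set the level order explicitly when it matters, e.g. for a model baseline
f2 <- factor(c("male", "female", "male"), levels = c("male", "female"))
levels(f2)
```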

Data Import

CSV: Read CSV files, which are commonly used for storing tabular data, using read.csv().

```
df <- read.csv("file.csv")  # Reads data from a CSV file into a data frame
```
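Since "file.csv" is only a placeholder, here is a self-contained round trip that writes a small data frame to a temporary file and reads it back. The names `path`, `original`, and `df_csv` are assumptions made for this sketch.

```r
# Write a small data frame to a temporary CSV file
path <- tempfile(fileext = ".csv")
original <- data.frame(id = 1:3, name = c("a", "b", "c"))
write.csv(original, path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
df_csv <- read.csv(path, stringsAsFactors = FALSE)
str(df_csv)

# Clean up the temporary file
unlink(path)
```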

Excel: Import Excel files, which are widely used for data storage and analysis, with read_excel() from the readxl package.
```
library(readxl)
df <- read_excel("file.xlsx")  # Reads data from an Excel file into a data frame
```

Database: Connect to a database and run SQL queries with the DBI package; the RSQLite package provides the SQLite driver.

```
library(DBI)
conn <- dbConnect(RSQLite::SQLite(), "database.sqlite")  # Establishes a connection to an SQLite database
df <- dbGetQuery(conn, "SELECT * FROM my_table")  # Runs a SQL query and returns a data frame (my_table is a placeholder; TABLE itself is a reserved word)
dbDisconnect(conn)  # Closes the connection when finished
```

Data Manipulation

dplyr: dplyr provides a consistent set of verbs for common data manipulation tasks such as filtering rows, selecting columns, adding new columns, and sorting.

```
library(dplyr)
df %>%
  filter(column > 10) %>%
  select(column1, column2) %>%
  mutate(new_column = column1 * 2) %>%
  arrange(desc(column1))  # Chains multiple data manipulation functions using dplyr
```

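tidyr: tidyr reshapes data between wide and long formats so that it is easier to work with. pivot_longer() and pivot_wider() are the current functions; the older gather() and spread() still work but are superseded.

```
library(tidyr)
df <- df %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%  # Wide to long: one row per id/key pair
  pivot_wider(names_from = key, values_from = value)  # Long back to wide
```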

Data Visualization

ggplot2: ggplot2 builds plots in layers: start from a data frame and aesthetic mappings, then add geoms, smoothers, and themes.

```
library(ggplot2)
ggplot(df, aes(x=column1, y=column2)) +
  geom_point() +
  geom_smooth(method="lm") +
  theme_minimal()  # Creates a scatter plot with a linear regression line and minimal theme
```

Statistical Analysis

Descriptive Statistics: Summarize and understand your data using descriptive statistics with summary().
```
summary(df)  # Provides summary statistics of the data frame
```
Correlation: Measure the strength and direction of the relationship between two variables with cor().
```
cor(df$column1, df$column2)  # Calculates the correlation between two columns
```
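For a concrete case, the sketch below uses the built-in mtcars dataset (an assumption; any data frame with numeric columns would do) to compute Pearson and Spearman correlations, a significance test, and a correlation matrix.

```r
# Pearson correlation (the default) between car weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)

# Rank-based Spearman correlation, less sensitive to outliers
rho <- cor(mtcars$wt, mtcars$mpg, method = "spearman")

# cor.test() adds a confidence interval and a p-value
ct <- cor.test(mtcars$wt, mtcars$mpg)
ct$p.value

# Passing several columns returns a correlation matrix
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
print(round(cor_mat, 2))
```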

t-test: Compare means between two groups using a t-test with t.test().

```
t.test(column1 ~ group, data=df)  # Performs a t-test to compare means between groups
```
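A runnable version of the same call, assuming the built-in mtcars dataset with mpg as the outcome and am (0 = automatic, 1 = manual) as the two-level grouping variable. Note that t.test() runs Welch's test by default, which does not assume equal variances.

```r
# Compare mean mpg between automatic (am = 0) and manual (am = 1) cars
tt <- t.test(mpg ~ am, data = mtcars)

tt$estimate   # Mean of each group
tt$conf.int   # Confidence interval for the difference in means
tt$p.value    # p-value of the test

# Student's t-test with pooled variance, if equal variances are assumed
tt_pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
```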

Machine Learning

Linear Regression: Perform linear regression to understand the relationship between variables using lm().

```
model <- lm(column1 ~ column2 + column3, data=df)
summary(model)  # Fits a linear regression model and summarizes the results
```
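Beyond summary(), a fitted model is usually queried for its coefficients and used for prediction. The sketch below assumes the built-in mtcars dataset; `fit`, `new_cars`, and `pred` are illustrative names.

```r
# Model fuel efficiency from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)                      # Intercept and slopes
r2 <- summary(fit)$r.squared   # Share of variance explained

# Predict for new observations; column names must match the predictors
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(fit, newdata = new_cars)
print(pred)
```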

Generalized Linear Models (GLM): GLM extends linear models to support non-normal distributions. Use glm() to fit these models.
```
model <- glm(column1 ~ column2 + column3, family=binomial, data=df)
summary(model)  # Fits a generalized linear model (e.g., logistic regression) and summarizes the results
```
- Difference between lm and glm: While lm is used for linear regression models assuming a normal distribution of errors, glm can handle various types of distributions (e.g., binomial, Poisson) by specifying a family argument.
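To make the family argument concrete, here is a minimal logistic regression sketch. It assumes the built-in mtcars dataset, with the binary column am as the outcome; the 0.5 cut-off is an arbitrary choice for illustration.

```r
# Logistic regression: probability of a manual gearbox given car weight
logit_fit <- glm(am ~ wt, family = binomial, data = mtcars)

# type = "response" returns probabilities rather than log-odds
probs <- predict(logit_fit, type = "response")

# Turn probabilities into class labels with a 0.5 threshold
pred_class <- ifelse(probs > 0.5, 1, 0)
accuracy <- mean(pred_class == mtcars$am)
print(accuracy)
```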

Random Forest: Build a random forest model, an ensemble learning method, with the randomForest package.

```
library(randomForest)
model <- randomForest(column1 ~ column2 + column3, data=df)
print(model)  # Trains a random forest model and prints the model summary
```

Cross-Validation: Cross-validation estimates how well a model performs on unseen data. Use caret to run k-fold cross-validation and train the model in one step.

```
library(caret)
train_control <- trainControl(method="cv", number=10)
model <- train(column1 ~ column2 + column3, data=df, method="lm", trControl=train_control)
print(model)  # Performs cross-validation and trains a model using caret
```
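To see what k-fold cross-validation does conceptually, the sketch below implements it by hand in base R. This is not caret's internal code, only an illustration of the idea; it assumes the built-in mtcars dataset, 5 folds, and RMSE as the metric.

```r
set.seed(42)  # Make the random fold assignment reproducible
k <- 5

# Assign every row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]  # Fit on the other k - 1 folds
  test <- mtcars[folds == i, ]   # Evaluate on the held-out fold
  fit <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

# The cross-validated estimate is the average over the folds
cv_rmse <- mean(rmse)
print(cv_rmse)
```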

Time Series Analysis

ARIMA: Fit and forecast time series data using ARIMA models with the forecast package.

```
library(forecast)
ts_data <- ts(df$column, start=c(2020,1), frequency=12)
model <- auto.arima(ts_data)
forecast(model, h=12)  # Fits an ARIMA model to the time series data and forecasts future values
```

Text Processing

Text Mining: Preprocess and analyze text data to extract meaningful insights using the tm package.

```
library(tm)
corpus <- Corpus(VectorSource(text_vector))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(corpus)  # Creates a text corpus, preprocesses the text, and creates a document-term matrix
```

Advanced Topics

Shiny App: Create interactive web applications for data visualization and analysis using the shiny package.

```
library(shiny)
ui <- fluidPage(
  titlePanel("Simple Shiny App"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", 1, 50, 30)
    ),
    mainPanel(
      plotOutput("distPlot")
    )
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
}

shinyApp(ui = ui, server = server)  # Sets up a simple Shiny app with a slider input and a histogram plot
```

Tidyverse

Tidyverse: The tidyverse is a collection of R packages designed for data science, including ggplot2, dplyr, tidyr, readr, purrr, and tibble. Load all tidyverse packages at once.
```
library(tidyverse)  # Loads all core tidyverse packages in a single command
```

Helpful Libraries

Data Manipulation: dplyr, tidyr for transforming and cleaning data.
Data Visualization: ggplot2, lattice for creating various types of plots.
Machine Learning: caret, randomForest, e1071 for building and evaluating models.
Time Series: forecast, zoo for analyzing and forecasting time series data.
Text Mining: tm, text2vec for preprocessing and analyzing text data.
Web Applications: shiny for building interactive web apps.

R Programming Cheatsheet for Data Scientists - Part 2 (Intermediate/Advanced)

Advanced Data Structures

Matrices and Arrays

Matrices: Two-dimensional arrays that hold elements of the same type. Perform operations on matrices.

```
m <- matrix(1:9, nrow=3, ncol=3)  # Creates a 3x3 matrix
m_transpose <- t(m)  # Transposes the matrix
m_multiply <- m %*% m_transpose  # Multiplies the matrix by its transpose
```
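Two points that often trip people up are the difference between `*` and `%*%`, and matrix inversion. The 3x3 matrix above is singular, so this sketch uses a small invertible matrix `a` (an assumption made for the example) instead.

```r
# matrix() fills column by column, so this is [2 1; 1 3]
a <- matrix(c(2, 1, 1, 3), nrow = 2)

elementwise <- a * a    # Multiplies matching entries
product <- a %*% a      # True matrix multiplication

# solve() returns the inverse of an invertible square matrix
a_inv <- solve(a)
identity_check <- a %*% a_inv   # Approximately the 2x2 identity matrix

# solve(a, b) solves the linear system a %*% x = b without forming the inverse
b <- c(1, 2)
x <- solve(a, b)
print(x)
```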

Arrays: Multi-dimensional generalizations of vectors and matrices.

```
arr <- array(1:12, dim=c(2, 3, 2))  # Creates a 2x3x2 array
arr[1,,2]  # Accesses the first row of the second matrix
```
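Arrays pair naturally with apply(), which collapses an array along chosen dimensions. A minimal sketch on the same 2x3x2 array, with illustrative variable names:

```r
arr <- array(1:12, dim = c(2, 3, 2))

# Sum over each 2x3 slice (third dimension)
slice_sums <- apply(arr, 3, sum)

# Sum over each row index (first dimension), across columns and slices
row_sums <- apply(arr, 1, sum)

# Keep two dimensions: a 2x3 matrix of sums across the slices
cell_sums <- apply(arr, c(1, 2), sum)

print(slice_sums)
print(row_sums)
```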

Advanced Data Manipulation

`data.table`

data.table: An enhanced version of data.frame for fast data manipulation.

```
library(data.table)
dt <- data.table(id=1:5, name=c("John", "Jane", "Tom", "Lucy", "Anna"))
dt[, .(mean_id = mean(id)), by = name]  # Calculates the mean of 'id' grouped by 'name'
dt[id > 2, .(name, id)]  # Selects 'name' and 'id' where 'id' is greater than 2
```

`dplyr` Advanced

Advanced dplyr: Perform complex data manipulation tasks using dplyr.

```
library(dplyr)
df <- data.frame(id=1:5, name=c("John", "Jane", "Tom", "Lucy", "Anna"), score=c(88, 92, 85, 90, 95))
df %>%
  group_by(name) %>%
  summarise(mean_score = mean(score), max_score = max(score)) %>%
  filter(mean_score > 90)  # Groups by 'name', calculates mean and max score, and filters by mean score > 90
```

Advanced Data Visualization

`ggplot2` Advanced

Advanced ggplot2: Create complex and customized visualizations.

```
library(ggplot2)
p <- ggplot(df, aes(x=id, y=score, color=name)) +
  geom_point(size=3) +
  geom_smooth(method="lm", se=FALSE) +
  theme_minimal() +
  labs(title="Scores by ID and Name", x="ID", y="Score")
p + theme(axis.text.x = element_text(angle=45, hjust=1))  # Rotates x-axis text
```

`plotly`

plotly: Create interactive plots.

```
library(plotly)
plot_ly(df, x = ~id, y = ~score, type = 'scatter', mode = 'lines+markers', color = ~name)  # Creates an interactive scatter plot
```

Advanced Statistical Analysis

Mixed Models

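Mixed Models: Fit mixed-effects models using lme4. lmer() requires each grouping factor to have fewer levels than there are observations, so the model below needs data with repeated measurements per name and id; the five-row df defined earlier has one row per group and is too small.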

```
library(lme4)
model <- lmer(score ~ (1 | name) + (1 | id), data = df)
summary(model)  # Fits a mixed-effects model and summarizes the results
```

Bayesian Analysis

Bayesian Analysis: Perform Bayesian data analysis using brms.

```
library(brms)
bayesian_model <- brm(score ~ (1 | name) + (1 | id), data = df, family = gaussian())
summary(bayesian_model)  # Fits a Bayesian mixed-effects model and summarizes the results
```

Advanced Machine Learning

XGBoost

XGBoost: Implement extreme gradient boosting for predictive modeling.

```
library(xgboost)
feature_cols <- setdiff(names(df), c("name", "score"))  # Drops the text column and the label so the target does not leak into the features
data_matrix <- as.matrix(df[, feature_cols, drop = FALSE])  # Converts the features to a numeric matrix
labels <- df$score
dtrain <- xgb.DMatrix(data = data_matrix, label = labels)
params <- list(objective = "reg:squarederror", eta = 0.1, max_depth = 3)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100)
print(xgb_model)  # Trains an XGBoost model
```

Neural Networks

Neural Networks: Implement neural networks using keras.

```
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(ncol(data_matrix))) %>%
  layer_dense(units = 1)
model %>% compile(
  loss = 'mse',
  optimizer = optimizer_rmsprop()
)
model %>% fit(data_matrix, labels, epochs = 50, batch_size = 10)  # Trains a small feed-forward regression network on the XGBoost features and labels
```

Natural Language Processing (NLP)

Text Mining with `tm` and `quanteda`

Text Mining: Process and analyze text data.

```
library(tm)
text <- c("This is a sample text", "Another example of text data")
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)  # Creates a document-term matrix and inspects it
```

quanteda: Advanced text analysis.

```
library(quanteda)
text_dfm <- dfm(tokens(text))  # Tokenizes the character vector, then builds a document-feature matrix
topfeatures(text_dfm, n = 5)  # Displays the top 5 most frequent terms
```

Topic Modeling with `topicmodels`

Topic Modeling: Identify topics in text data.

```
library(topicmodels)
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))
topics <- terms(lda_model, 5)
print(topics)  # Fits an LDA model and prints the top 5 terms in each topic
```

Time Series Analysis

`forecast` Advanced

Advanced Time Series Forecasting: Use the forecast package for advanced time series analysis.

```
library(forecast)
ts_data <- ts(df$score, frequency = 12)
arima_model <- auto.arima(ts_data)
forecasted_values <- forecast(arima_model, h = 12)
plot(forecasted_values)  # Fits an ARIMA model and plots the forecasted values
```

`prophet`

prophet: Perform time series forecasting with prophet.

```
library(prophet)
df_prophet <- data.frame(ds = as.Date('2000-01-01') + 0:29, y = rnorm(30))
prophet_model <- prophet(df_prophet)
future <- make_future_dataframe(prophet_model, periods = 365)
forecast <- predict(prophet_model, future)
plot(prophet_model, forecast)  # Fits a prophet model and plots the forecast
```

Big Data and Parallel Computing

`sparklyr`

sparklyr: Interface for Apache Spark.

```
library(sparklyr)
library(dplyr)  # Provides filter(), group_by(), and summarise() for Spark tables
spark_install(version = "3.0.0")
sc <- spark_connect(master = "local")
sdf <- copy_to(sc, df, "df_spark", overwrite = TRUE)
spark_df <- sdf %>%
  filter(score > 90) %>%
  group_by(name) %>%
  summarise(mean_score = mean(score)) %>%
  collect()
print(spark_df)  # Connects to Spark, processes data, and collects results
spark_disconnect(sc)  # Closes the Spark connection
```

`parallel`

parallel: Use parallel processing to speed up computations.

```
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "df")
results <- parLapply(cl, 1:10, function(x) mean(df$score) + x)
stopCluster(cl)
print(results)  # Uses parallel processing to compute results
```

Interactive Applications

`shiny` Advanced

Advanced Shiny: Create interactive web applications.

```
library(shiny)
ui <- fluidPage(
  titlePanel("Advanced Shiny App"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", 1, 50, 30),
      selectInput("variable", "Variable:", choices = names(df)[sapply(df, is.numeric)])  # Numeric columns only, since hist() cannot plot text
    ),
    mainPanel(
      plotOutput("distPlot"),
      tableOutput("summaryTable")
    )
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- df[[input$variable]]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
  output$summaryTable <- renderTable({
    summary(df[[input$variable]])
  })
}

shinyApp(ui = ui, server = server)  # Sets up an advanced Shiny app with interactive plots and tables
```

Cloud Computing for Data Science

AWS and Google Cloud Integration

AWS S3: Interact with AWS S3 for data storage.

```
library(aws.s3)
bucketlist()  # Lists all buckets in S3
s3write_using(df, FUN = write.csv, object = "s3://my-bucket/df.csv")  # Writes a dataframe to S3
```

Google Cloud Storage: Interact with Google Cloud Storage for data storage.

```
library(googleCloudStorageR)
gcs_auth("my-auth-file.json")
gcs_upload(file = "local_file.csv", bucket = "my-bucket")  # Uploads a file to Google Cloud Storage
```
