R Cheatsheet for Data Science - ElleCoding/Data_Science_Cheatsheets GitHub Wiki
Hi, I'm Elle (ellecoding). Here's an R cheatsheet for Data Science and Machine Learning I've made. Hope it helps!
- Assignment: Assign values to variables to store data.

```r
x <- 5   # Assigns the value 5 to the variable x
y = 10   # Assigns the value 10 to the variable y
```
- Vectors: Vectors are basic data structures that hold elements of the same type. Create a vector using the `c()` function.

```r
v <- c(1, 2, 3, 4)  # Creates a numeric vector with elements 1, 2, 3, 4
```
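Because a vector holds a single type, arithmetic and comparisons apply to every element at once. The following is a small base R sketch of that behaviour; the variable names (`doubled`, `evens`, `mixed`) are illustrative only.

```r
v <- c(1, 2, 3, 4)

doubled <- v * 2            # Arithmetic is applied element by element
evens   <- v[v %% 2 == 0]   # Logical indexing keeps elements where the test is TRUE
mixed   <- c(1, "a", TRUE)  # Mixing types coerces every element to character

print(doubled)
print(evens)
print(class(mixed))
```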
- Matrices: Matrices are two-dimensional arrays that hold elements of the same type. Create a matrix with `matrix()`, specifying the data, number of rows, and number of columns.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)  # Creates a 3x3 matrix with values from 1 to 9, filled column by column
```
- Lists: Lists can contain elements of different types. Use `list()` to create a list.

```r
l <- list(a = 1, b = "text", c = TRUE)  # Creates a list with elements of different types
```
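Accessing list elements is a common stumbling block, so here is a short base R sketch of the three usual access forms. The added element `d` is purely illustrative.

```r
l <- list(a = 1, b = "text", c = TRUE)

b_value  <- l$b       # Access an element by name
a_value  <- l[["a"]]  # Double brackets return the element itself
a_sublist <- l["a"]   # Single brackets return a list containing the element
l$d <- 1:3            # Assigning to a new name adds an element

str(l)
```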
- Data Frames: Data frames are table-like structures in which each column can hold a different type of data. Create a data frame with `data.frame()`, combining vectors of equal length.

```r
df <- data.frame(id = 1:4, name = c("John", "Doe", "Jane", "Smith"))  # Creates a data frame with columns id and name
```
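Once a data frame exists, the usual next steps are adding columns and selecting rows. This base R sketch shows both; the column `name_length` and the object `subset_df` are made up for illustration.

```r
df <- data.frame(id = 1:4,
                 name = c("John", "Doe", "Jane", "Smith"),
                 stringsAsFactors = FALSE)

df$name_length <- nchar(df$name)             # Add a derived column
subset_df <- df[df$id > 2, c("id", "name")]  # Rows by condition, columns by name

str(df)
print(subset_df)
```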
- Factors: Factors are used to handle categorical data. Use `factor()` to create a factor.

```r
f <- factor(c("male", "female", "male"))  # Creates a factor for the categorical variable gender
```
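A factor stores its categories as integer codes plus a set of levels. The sketch below, in base R, inspects both and shows an ordered factor; `f2` and its levels are hypothetical example data.

```r
f <- factor(c("male", "female", "male"))

f_levels <- levels(f)      # Levels are sorted alphabetically by default
f_codes  <- as.integer(f)  # Underlying integer codes, one per element
f_counts <- table(f)       # Number of observations per level

# An ordered factor supports comparisons between levels
f2 <- factor(c("low", "high", "medium"),
             levels = c("low", "medium", "high"), ordered = TRUE)
below_high <- f2 < "high"

print(f_levels)
print(f_counts)
print(below_high)
```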
- CSV: Read CSV files, which are commonly used for storing tabular data, using `read.csv()`.

```r
df <- read.csv("file.csv")  # Reads data from a CSV file into a data frame
```
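To try `read.csv()` without an existing file, one option is a round trip through a temporary file. This is a minimal base R sketch; the data and the names `df_out`, `df_in`, and `path` are invented for the example.

```r
df_out <- data.frame(id = 1:3, score = c(88.5, 92, 79))

path <- tempfile(fileext = ".csv")          # Temporary file path for the example
write.csv(df_out, path, row.names = FALSE)  # Write without the row-name column
df_in <- read.csv(path)                     # Read the file back into a data frame
unlink(path)                                # Remove the temporary file

str(df_in)
```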
- Excel: Import Excel files, which are widely used for data storage and analysis, with `read_excel()` from the `readxl` package.

```r
library(readxl)
df <- read_excel("file.xlsx")  # Reads data from an Excel file into a data frame
```
- Database: Connect to a database and query it using the `DBI` and `RSQLite` packages.

```r
library(DBI)
conn <- dbConnect(RSQLite::SQLite(), "database.sqlite")  # Establishes a connection to an SQLite database
# my_table is a placeholder name; a table literally called "table" would need quoting because it is a reserved word
df <- dbGetQuery(conn, "SELECT * FROM my_table")  # Executes a SQL query and returns the result as a data frame
dbDisconnect(conn)  # Closes the connection when finished
```
- dplyr: `dplyr` is a grammar of data manipulation, providing a consistent set of verbs for tasks like filtering, selecting, mutating, and arranging data.

```r
library(dplyr)
df %>%
  filter(column1 > 10) %>%              # Keeps rows where column1 exceeds 10
  select(column1, column2) %>%          # Keeps only the named columns
  mutate(new_column = column1 * 2) %>%  # Adds a derived column
  arrange(desc(column1))                # Sorts by column1, descending
# Chains multiple data manipulation functions using dplyr
```
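- tidyr: `tidyr` helps tidy up data by transforming it between wide and long formats. `pivot_longer()` and `pivot_wider()` supersede the older `gather()` and `spread()`.

```r
library(tidyr)
df_long <- df %>% pivot_longer(-id, names_to = "key", values_to = "value")  # Converts data from wide to long format
df_wide <- df_long %>% pivot_wider(names_from = key, values_from = value)   # Converts it back from long to wide
```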
- ggplot2: `ggplot2` is a powerful and flexible system for creating complex and customizable plots.

```r
library(ggplot2)
ggplot(df, aes(x = column1, y = column2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()
# Creates a scatter plot with a linear regression line and minimal theme
```
- Descriptive Statistics: Summarize and understand your data using descriptive statistics with `summary()`.

```r
summary(df)  # Provides summary statistics for each column of the data frame
```
- Correlation: Measure the strength and direction of the relationship between two variables with `cor()`.

```r
cor(df$column1, df$column2)  # Calculates the (Pearson) correlation between two columns
```
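As a concrete case, the built-in `mtcars` data set can stand in for `df`. This sketch, which assumes nothing beyond base R, computes Pearson and Spearman correlations between fuel economy and weight and uses `cor.test()` for a p-value.

```r
pearson  <- cor(mtcars$mpg, mtcars$wt)                       # Linear association
spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")  # Rank-based association
test_res <- cor.test(mtcars$mpg, mtcars$wt)                  # Tests whether the correlation is zero

print(pearson)
print(spearman)
print(test_res$p.value)
```

A negative value here means heavier cars tend to have lower miles per gallon.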
- t-test: Compare means between two groups using a t-test with `t.test()`.

```r
t.test(column1 ~ group, data = df)  # Performs a t-test to compare means between groups (group must have exactly two levels)
```
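For a runnable illustration, the built-in `sleep` data set has a numeric outcome (`extra`) and a two-level grouping factor (`group`). The sketch below treats the two groups as independent samples, which is a simplification chosen for the example, and relies on the default Welch test.

```r
res <- t.test(extra ~ group, data = sleep)  # Welch two-sample t-test (unequal variances)

print(res$statistic)  # t statistic
print(res$p.value)    # p-value for the difference in means
print(res$estimate)   # Mean of each group
```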
- Linear Regression: Perform linear regression to understand the relationship between variables using `lm()`.

```r
model <- lm(column1 ~ column2 + column3, data = df)  # Fits a linear regression model
summary(model)                                       # Summarizes the results
```
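A fitted `lm` object can also be queried for coefficients and used for prediction. This sketch uses the built-in `mtcars` data in place of `df`; the new observation (`wt = 3`) is an arbitrary example value.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # Regress miles per gallon on weight

coefs <- coef(fit)                                    # Intercept and slope
pred  <- predict(fit, newdata = data.frame(wt = 3))   # Predicted mpg for a car with wt = 3

print(coefs)
print(pred)
```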
- Generalized Linear Models (GLM): GLMs extend linear models to responses with non-normal error distributions. Use `glm()` to fit these models.

```r
# family = binomial fits a logistic regression, so column1 should be a binary outcome
model <- glm(column1 ~ column2 + column3, family = binomial, data = df)
summary(model)  # Summarizes the fitted generalized linear model
```
- Difference between `lm` and `glm`: `lm` fits linear regression models that assume normally distributed errors, while `glm` can handle other distributions (e.g., binomial, Poisson) through its `family` argument.
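The relationship between the two functions can be checked directly: with `family = gaussian`, `glm()` should reproduce the `lm()` coefficients, while `family = binomial` gives a logistic regression. This is a sketch on the built-in `mtcars` data; the choice of `am` (transmission type, coded 0/1) as the binary outcome is just for illustration.

```r
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, family = gaussian, data = mtcars)  # Same model expressed as a GLM

same_coefs <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))

# Logistic regression: probability of a manual transmission as a function of weight
fit_logit <- glm(am ~ wt, family = binomial, data = mtcars)
probs <- predict(fit_logit, type = "response")  # Fitted probabilities between 0 and 1

print(same_coefs)
print(coef(fit_logit))
```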
- Random Forest: Build a random forest model, an ensemble learning method, with the `randomForest` package.

```r
library(randomForest)
model <- randomForest(column1 ~ column2 + column3, data = df)  # Trains a random forest model
print(model)                                                   # Prints the model summary
```
- Cross-Validation: Use `caret` for model training and cross-validation, a technique for assessing how well a model performs on data it was not trained on.

```r
library(caret)
train_control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
model <- train(column1 ~ column2 + column3, data = df, method = "lm", trControl = train_control)
print(model)  # Shows the cross-validated performance of the trained model
```
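To make the idea behind k-fold cross-validation concrete, here is a hand-rolled base R sketch on the built-in `mtcars` data. It is not how `caret` is implemented internally; it only illustrates the split, fit, and score loop. The fold count, seed, and model formula are arbitrary choices for the example.

```r
set.seed(42)  # Makes the random fold assignment reproducible
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # Assign each row to one of k folds

rmse <- numeric(k)
for (i in 1:k) {
  train_set <- mtcars[folds != i, ]                # Fit on all folds except fold i
  held_out  <- mtcars[folds == i, ]                # Evaluate on fold i
  fit  <- lm(mpg ~ wt + hp, data = train_set)
  pred <- predict(fit, newdata = held_out)
  rmse[i] <- sqrt(mean((held_out$mpg - pred)^2))   # Root mean squared error on held-out rows
}

print(rmse)
print(mean(rmse))  # Average held-out error across folds
```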
- ARIMA: Fit and forecast time series data using ARIMA models with the `forecast` package.

```r
library(forecast)
ts_data <- ts(df$column, start = c(2020, 1), frequency = 12)  # Monthly series starting in January 2020
model <- auto.arima(ts_data)  # Fits an ARIMA model to the time series data
forecast(model, h = 12)       # Forecasts the next 12 periods
```
- Text Mining: Preprocess and analyze text data to extract meaningful insights using the `tm` package.

```r
library(tm)
corpus <- Corpus(VectorSource(text_vector))             # Creates a text corpus (text_vector is a character vector of documents)
corpus <- tm_map(corpus, content_transformer(tolower))  # Preprocesses the text by lowercasing it
dtm <- DocumentTermMatrix(corpus)                       # Creates a document-term matrix
```
- Shiny App: Create interactive web applications for data visualization and analysis using the `shiny` package.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Simple Shiny App"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", 1, 50, 30)
    ),
    mainPanel(
      plotOutput("distPlot")
    )
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
}

shinyApp(ui = ui, server = server)
# Sets up a simple Shiny app with a slider input and a histogram plot
```
- Tidyverse: The `tidyverse` is a collection of R packages designed for data science, including `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, and `tibble`. Load all core tidyverse packages at once.

```r
library(tidyverse)  # Loads all core tidyverse packages in a single command
```
- Data Manipulation: `dplyr`, `tidyr` for transforming and cleaning data.
- Data Visualization: `ggplot2`, `lattice` for creating various types of plots.
- Machine Learning: `caret`, `randomForest`, `e1071` for building and evaluating models.
- Time Series: `forecast`, `zoo` for analyzing and forecasting time series data.
- Text Mining: `tm`, `text2vec` for preprocessing and analyzing text data.
- Web Applications: `shiny` for building interactive web apps.
- Matrices: Two-dimensional arrays that hold elements of the same type. Perform operations on matrices.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)  # Creates a 3x3 matrix
m_transpose <- t(m)                   # Transposes the matrix
m_multiply <- m %*% m_transpose       # Multiplies the matrix by its transpose
```
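One detail worth spelling out is that `*` multiplies matrices element by element, whereas `%*%` is the matrix product. The base R sketch below contrasts the two on a small 2x2 matrix and also shows the `byrow` argument; the object names are illustrative.

```r
m <- matrix(1:4, nrow = 2)  # Filled column by column: rows are (1, 3) and (2, 4)

elementwise <- m * m        # Squares each entry
product     <- m %*% m      # Row-by-column matrix multiplication

by_row <- matrix(1:4, nrow = 2, byrow = TRUE)  # Filled row by row: rows are (1, 2) and (3, 4)

print(elementwise)
print(product)
print(by_row)
```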
- Arrays: Multi-dimensional generalizations of vectors and matrices.

```r
arr <- array(1:12, dim = c(2, 3, 2))  # Creates a 2x3x2 array
arr[1, , 2]                           # Accesses the first row of the second matrix
```
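To see how the dimensions of an array map to its contents, this base R sketch extracts one row and then summarizes each 2x3 slice with `apply()`. The names `first_row` and `slice_sums` are invented for the example.

```r
arr <- array(1:12, dim = c(2, 3, 2))  # Two 2x3 slices: values 1-6 and 7-12

first_row  <- arr[1, , 2]         # First row of the second slice
slice_sums <- apply(arr, 3, sum)  # Sum over each slice along the third dimension

print(dim(arr))
print(first_row)
print(slice_sums)
```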
- data.table: An enhanced version of `data.frame` for fast data manipulation.

```r
library(data.table)
dt <- data.table(id = 1:5, name = c("John", "Jane", "Tom", "Lucy", "Anna"))
dt[, .(mean_id = mean(id)), by = name]  # Calculates the mean of 'id' grouped by 'name'
dt[id > 2, .(name, id)]                 # Selects 'name' and 'id' where 'id' is greater than 2
```
- Advanced `dplyr`: Perform complex data manipulation tasks using `dplyr`.

```r
library(dplyr)
df <- data.frame(id = 1:5,
                 name = c("John", "Jane", "Tom", "Lucy", "Anna"),
                 score = c(88, 92, 85, 90, 95))
df %>%
  group_by(name) %>%
  summarise(mean_score = mean(score), max_score = max(score)) %>%
  filter(mean_score > 90)
# Groups by 'name', calculates mean and max score, and filters by mean score > 90
```
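- Advanced `ggplot2`: Create complex and customized visualizations.

```r
library(ggplot2)
# Color is mapped only in geom_point(); mapping it globally would make geom_smooth()
# fit one line per name, which fails when each name has a single row
p <- ggplot(df, aes(x = id, y = score)) +
  geom_point(aes(color = name), size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Scores by ID and Name", x = "ID", y = "Score")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotates x-axis text
```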
- plotly: Create interactive plots.

```r
library(plotly)
plot_ly(df, x = ~id, y = ~score, type = 'scatter', mode = 'lines+markers', color = ~name)
# Creates an interactive scatter plot
```
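- Mixed Models: Fit mixed-effects models using `lme4`.

```r
library(lme4)
# A random intercept needs several rows per group level: df must contain repeated scores per name.
# lmer() rejects a grouping factor with one level per row, so the per-row term (1 | id) is omitted here.
model <- lmer(score ~ (1 | name), data = df)
summary(model)  # Fits a mixed-effects model and summarizes the results
```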
- Bayesian Analysis: Perform Bayesian data analysis using `brms`.

```r
library(brms)
bayesian_model <- brm(score ~ (1 | name) + (1 | id), data = df, family = gaussian())  # Compiles and samples a Stan model
summary(bayesian_model)  # Fits a Bayesian mixed-effects model and summarizes the results
```
- XGBoost: Implement extreme gradient boosting for predictive modeling.

```r
library(xgboost)
# Features must exclude the target; with the example df, 'id' is the only numeric predictor
data_matrix <- as.matrix(df[, "id", drop = FALSE])  # Converts the feature columns to a matrix
labels <- df$score                                  # Target variable
dtrain <- xgb.DMatrix(data = data_matrix, label = labels)
params <- list(objective = "reg:squarederror", eta = 0.1, max_depth = 3)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100)
print(xgb_model)  # Trains an XGBoost model
```
- Neural Networks: Implement neural networks using `keras`.

```r
library(keras)
# data_matrix and labels are the feature matrix and target from the XGBoost example
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = 'relu', input_shape = c(ncol(data_matrix))) %>%
  layer_dense(units = 1)  # Single output unit for regression

model %>% compile(
  loss = 'mse',
  optimizer = optimizer_rmsprop()
)

model %>% fit(data_matrix, labels, epochs = 50, batch_size = 10)  # Trains the network
```
- Text Mining: Process and analyze text data.

```r
library(tm)
text <- c("This is a sample text", "Another example of text data")
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)  # Creates a document-term matrix and inspects it
```
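- quanteda: Advanced text analysis.

```r
library(quanteda)
# quanteda builds a document-feature matrix from tokens, not from a tm corpus;
# text is the character vector from the Text Mining example
toks <- tokens(text)
dfm_mat <- dfm(toks)         # Named dfm_mat to avoid masking the dfm() function
topfeatures(dfm_mat, n = 5)  # Displays the top 5 most frequent terms
```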
- Topic Modeling: Identify topics in text data.

```r
library(topicmodels)
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))  # Fits an LDA model with 2 topics to a document-term matrix
topics <- terms(lda_model, 5)
print(topics)  # Prints the top 5 terms in each topic
```
- Advanced Time Series Forecasting: Use the `forecast` package for advanced time series analysis.

```r
library(forecast)
ts_data <- ts(df$score, frequency = 12)
arima_model <- auto.arima(ts_data)
forecasted_values <- forecast(arima_model, h = 12)
plot(forecasted_values)  # Fits an ARIMA model and plots the forecasted values
```
- prophet: Perform time series forecasting with `prophet`.

```r
library(prophet)
df_prophet <- data.frame(ds = as.Date('2000-01-01') + 0:29, y = rnorm(30))  # prophet expects columns named ds and y
prophet_model <- prophet(df_prophet)
future <- make_future_dataframe(prophet_model, periods = 365)
forecast <- predict(prophet_model, future)
plot(prophet_model, forecast)  # Fits a prophet model and plots the forecast
```
- sparklyr: Interface for Apache Spark.

```r
library(sparklyr)
library(dplyr)  # Provides filter(), group_by(), summarise(), and collect() for Spark tables
spark_install(version = "3.0.0")  # One-time local Spark installation
sc <- spark_connect(master = "local")
sdf <- copy_to(sc, df, "df_spark", overwrite = TRUE)
spark_df <- sdf %>%
  filter(score > 90) %>%
  group_by(name) %>%
  summarise(mean_score = mean(score)) %>%
  collect()
print(spark_df)  # Connects to Spark, processes data, and collects results
spark_disconnect(sc)
```
- parallel: Use parallel processing to speed up computations.

```r
library(parallel)
cl <- makeCluster(detectCores() - 1)  # Starts worker processes, leaving one core free
clusterExport(cl, "df")               # Makes df available on each worker
results <- parLapply(cl, 1:10, function(x) mean(df$score) + x)
stopCluster(cl)
print(results)  # Uses parallel processing to compute results
```
- Advanced Shiny: Create interactive web applications.

```r
library(shiny)

numeric_cols <- names(df)[sapply(df, is.numeric)]  # A histogram only makes sense for numeric columns

ui <- fluidPage(
  titlePanel("Advanced Shiny App"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", 1, 50, 30),
      selectInput("variable", "Variable:", choices = numeric_cols)
    ),
    mainPanel(
      plotOutput("distPlot"),
      tableOutput("summaryTable")
    )
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- df[[input$variable]]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
  output$summaryTable <- renderTable({
    s <- summary(df[[input$variable]])
    data.frame(statistic = names(s), value = as.vector(s))  # Converts the summary to a data frame for display
  })
}

shinyApp(ui = ui, server = server)
# Sets up an advanced Shiny app with interactive plots and tables
```
- AWS S3: Interact with AWS S3 for data storage.

```r
library(aws.s3)
bucketlist()  # Lists all buckets in S3
s3write_using(df, FUN = write.csv, object = "s3://my-bucket/df.csv")  # Writes a dataframe to S3
```
- Google Cloud Storage: Interact with Google Cloud Storage for data storage.

```r
library(googleCloudStorageR)
gcs_auth("my-auth-file.json")
gcs_upload(file = "local_file.csv", bucket = "my-bucket")  # Uploads a file to Google
```