R Cheatsheet for Data Science - ElleCoding/Data_Science_Cheatsheets GitHub Wiki

R Cheatsheet for Data Science

Hi, I'm Elle (ellecoding). Here's an R cheatsheet for Data Science and Machine Learning I've made. Hope it helps!

Basic Syntax

  • Assignment: Store a value in a variable with <- (the conventional operator) or =.

    x <- 5  # Assigns the value 5 to the variable x
    y = 10  # Assigns the value 10 to the variable y
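
One practical difference between the two operators shows up inside function calls. The following is a minimal base R sketch; the names `a`, `total`, and `s` are made up for the illustration.

```r
# `<-` inside a call assigns the variable AND passes the value to the function.
total <- sum(a <- c(1, 5, 9))
a        # the vector now exists in the workspace
total    # sum of the three values

# `=` inside a call only names an argument; no variable `length.out` is created.
s <- seq_len(length.out = 3)
s
```

This is one reason `<-` is usually preferred for assignment: `=` is then left to mean "argument name" inside calls.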
    
    

Data Structures

  • Vectors: Vectors are basic data structures that hold elements of the same type. Create a vector using the c() function.

    v <- c(1, 2, 3, 4)  # Creates a numeric vector with elements 1, 2, 3, 4
    
    
  • Matrices: Matrices are two-dimensional arrays that hold elements of the same type. Create a matrix with matrix(), specifying the data, number of rows, and columns.

    m <- matrix(1:9, nrow=3, ncol=3)  # Creates a 3x3 matrix with values 1 to 9, filled column by column
    
    
  • Lists: Lists can contain elements of different types. Use list() to create a list.

    l <- list(a=1, b="text", c=TRUE)  # Creates a list with elements of different types
    
    
  • Data Frames: Data frames are table-like structures that can hold different types of data in each column. Create a data frame with data.frame(), combining vectors of equal length.

    df <- data.frame(id=1:4, name=c("John", "Doe", "Jane", "Smith"))  # Creates a data frame with columns id and name
    
    
  • Factors: Factors are used to handle categorical data. Use factor() to create a factor.

    f <- factor(c("male", "female", "male"))  # Creates a factor for the categorical variable gender
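
Once these objects exist, most day-to-day work is indexing into them. The sketch below recreates the objects from this section and shows one common access pattern for each, using base R only.

```r
v  <- c(1, 2, 3, 4)
m  <- matrix(1:9, nrow = 3, ncol = 3)
l  <- list(a = 1, b = "text", c = TRUE)
df <- data.frame(id = 1:4, name = c("John", "Doe", "Jane", "Smith"))
f  <- factor(c("male", "female", "male"))

v[2]                 # second element of the vector
m[2, 3]              # row 2, column 3 (matrix() fills column by column)
l$b                  # list element by name
l[["c"]]             # same idea with double brackets
df$name[df$id > 2]   # names for the rows where id is greater than 2
levels(f)            # distinct categories, sorted alphabetically
table(f)             # number of observations per category
str(df)              # compact overview of the column types
```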
    
    

Data Import

  • CSV: Read CSV files, which are commonly used for storing tabular data, using read.csv().

    df <- read.csv("file.csv")  # Reads data from a CSV file into a data frame
    
    
  • Excel: Import Excel files, which are widely used for data storage and analysis, with read_excel() from the readxl package.

    library(readxl)
    df <- read_excel("file.xlsx")  # Reads data from an Excel file into a data frame
    
    
  • Database: Connect to a database and run SQL queries with the DBI package; the example uses RSQLite as the SQLite driver.

    library(DBI)
    conn <- dbConnect(RSQLite::SQLite(), "database.sqlite")  # Establishes a connection to an SQLite database
    df <- dbGetQuery(conn, "SELECT * FROM my_table")  # Runs a SQL query and returns a data frame; my_table is a placeholder (TABLE itself is a reserved word)
    dbDisconnect(conn)  # Closes the connection when finished
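
To try the CSV workflow without an existing file, a round trip through a temporary file works. This is a small sketch with made-up data (`original`, `path`, and the city names are illustrative only).

```r
# Write a small data frame to a temporary CSV file, then read it back.
path <- tempfile(fileext = ".csv")
original <- data.frame(id = 1:3, city = c("Oslo", "Lima", "Pune"))
write.csv(original, path, row.names = FALSE)  # row.names = FALSE avoids an extra index column

imported <- read.csv(path, stringsAsFactors = FALSE)  # keep text columns as character
str(imported)   # inspect the column types that read.csv() inferred
unlink(path)    # remove the temporary file
```

Checking `str()` right after an import is a cheap way to catch columns that were read with an unexpected type.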
    
    

Data Manipulation

  • dplyr: dplyr provides a consistent set of verbs for common data manipulation tasks: filtering rows, selecting columns, creating new columns, and sorting.

    library(dplyr)
    df %>%
      filter(column > 10) %>%
      select(column1, column2) %>%
      mutate(new_column = column1 * 2) %>%
      arrange(desc(column1))  # Chains multiple data manipulation functions using dplyr
    
    
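  • tidyr: tidyr reshapes data between wide and long formats so that it is easier to work with.

    library(tidyr)
    df <- df %>%
      pivot_longer(cols = -id, names_to = "key", values_to = "value") %>%  # Wide to long; the stacked columns must share a type
      pivot_wider(names_from = key, values_from = value)  # Long back to wide
    # pivot_longer() and pivot_wider() supersede the older gather() and spread()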
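
For comparison, the dplyr verbs in this section map onto base R roughly as follows. This is a sketch on made-up toy data; the column names simply mirror the placeholders used in the dplyr example.

```r
# Toy data; the values are invented for illustration.
df <- data.frame(
  column  = c(5, 12, 20, 15),
  column1 = c(1, 4, 2, 3),
  column2 = c("a", "b", "c", "d")
)

kept <- df[df$column > 10, c("column1", "column2")]        # filter() + select()
kept$new_column <- kept$column1 * 2                        # mutate()
result <- kept[order(kept$column1, decreasing = TRUE), ]   # arrange(desc(column1))
result
```

The base version needs intermediate variables at each step, which is the main readability argument for the piped dplyr form.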
    
    

Data Visualization

  • ggplot2: ggplot2 is a flexible plotting system that builds plots layer by layer: data and aesthetics first, then geoms, then themes.

    library(ggplot2)
    ggplot(df, aes(x=column1, y=column2)) +
      geom_point() +
      geom_smooth(method="lm") +
      theme_minimal()  # Creates a scatter plot with a linear regression line and minimal theme
    
    

Statistical Analysis

  • Descriptive Statistics: Summarize and understand your data using descriptive statistics with summary().

    summary(df)  # Provides summary statistics of the data frame
    
    
  • Correlation: Measure the strength and direction of the relationship between two variables with cor().

    cor(df$column1, df$column2)  # Calculates the correlation between two columns
    
    
  • t-test: Compare means between two groups using a t-test with t.test().

    t.test(column1 ~ group, data=df)  # Performs a t-test to compare means between groups; group must have exactly two levels
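
The three calls above can be tried on `mtcars`, a dataset that ships with R. The sketch below is one possible walk-through; the choice of columns (`mpg`, `wt`, `am`) is just for illustration.

```r
# mtcars columns used here: mpg = miles per gallon, wt = weight, am = transmission (0/1).
summary(mtcars$mpg)                      # min, quartiles, mean, max

r  <- cor(mtcars$mpg, mtcars$wt)         # Pearson correlation by default
tt <- t.test(mpg ~ am, data = mtcars)    # Welch two-sample t-test by default

r             # negative: heavier cars tend to have lower mpg
tt$p.value    # p-value for the difference in mean mpg between the two groups
tt$estimate   # the two group means
```

Both `cor()` and `t.test()` return values that can be stored and inspected, which is handy when the numbers feed into a report rather than the console.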
    
    

Machine Learning

  • Linear Regression: Perform linear regression to understand the relationship between variables using lm().

    model <- lm(column1 ~ column2 + column3, data=df)
    summary(model)  # Fits a linear regression model and summarizes the results
    
    
  • Generalized Linear Models (GLM): GLM extends linear models to support non-normal distributions. Use glm() to fit these models.

    model <- glm(column1 ~ column2 + column3, family=binomial, data=df)  # With family=binomial, column1 is typically a 0/1 or two-level outcome
    summary(model)  # Fits a generalized linear model (e.g., logistic regression) and summarizes the results
    
    
    • Difference between lm and glm: While lm is used for linear regression models assuming a normal distribution of errors, glm can handle various types of distributions (e.g., binomial, Poisson) by specifying a family argument.
  • Random Forest: Build a random forest model, an ensemble learning method, with the randomForest package.

    library(randomForest)
    model <- randomForest(column1 ~ column2 + column3, data=df)
    print(model)  # Trains a random forest model and prints the model summary
    
    
  • Cross-Validation: Cross-validation estimates how well a model performs on unseen data. Use caret to run it as part of model training.

    library(caret)
    train_control <- trainControl(method="cv", number=10)
    model <- train(column1 ~ column2 + column3, data=df, method="lm", trControl=train_control)
    print(model)  # Performs cross-validation and trains a model using caret
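
Two ideas from this section can be checked in base R: that `glm()` with a Gaussian family reproduces `lm()`, and what k-fold cross-validation does step by step. The sketch below uses the built-in `mtcars` data and a hand-written loop. It illustrates the idea behind `trainControl(method = "cv")` and is not caret's actual implementation; the formula and the choice of 5 folds are assumptions made for the example.

```r
set.seed(42)  # make the random fold assignment reproducible

# lm() and glm(family = gaussian) fit the same model.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, family = gaussian, data = mtcars)
all.equal(coef(fit_lm), coef(fit_glm))

# Manual 5-fold cross-validation of the linear model.
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold label for each row
rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]               # fit on the other k - 1 folds
  test  <- mtcars[folds == i, ]               # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))  # out-of-fold error
}
mean(rmse)  # average out-of-fold RMSE
```

Each row is used for testing exactly once, so the averaged error is a less optimistic estimate than the error on the training data.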
    
    

Time Series Analysis

  • ARIMA: Fit and forecast time series data using ARIMA models with the forecast package.

    library(forecast)
    ts_data <- ts(df$column, start=c(2020,1), frequency=12)
    model <- auto.arima(ts_data)
    forecast(model, h=12)  # Fits an ARIMA model to the time series data and forecasts future values
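
If the forecast package is not installed, the same workflow can be sketched with base R's `stats::arima()`. Note the swap: here the model order is fixed by hand (a seasonal ARIMA(0,1,1)(0,1,1) on logged data), whereas `auto.arima()` searches for an order automatically. `AirPassengers` is a monthly series bundled with R.

```r
# AirPassengers is a monthly ts object (1949 to 1960) that ships with R.
y <- log(AirPassengers)  # the log stabilizes the growing seasonal swings

# Hand-picked seasonal ARIMA; this order is an assumption, not a search result.
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

fc <- predict(fit, n.ahead = 12)  # point forecasts ($pred) and standard errors ($se)
exp(fc$pred)                      # back-transform to the original scale
```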
    
    

Text Processing

  • Text Mining: Preprocess and analyze text data to extract meaningful insights using the tm package.

    library(tm)
    corpus <- Corpus(VectorSource(text_vector))
    corpus <- tm_map(corpus, content_transformer(tolower))
    dtm <- DocumentTermMatrix(corpus)  # Creates a text corpus, preprocesses the text, and creates a document-term matrix
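
To see what a document-term matrix contains, it can be built by hand in base R. The sketch below is a simplified illustration with two made-up sentences; tm applies its own tokenization and filtering defaults, so its output may differ from this one.

```r
docs <- c("This is a sample text", "Another example of text data")

# Lowercase and split on whitespace: one token vector per document.
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary = all distinct terms; then count each term within each document.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

dtm            # rows are documents, columns are terms, cells are counts
colSums(dtm)   # total frequency of each term across documents
```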
    
    

Advanced Topics

  • Shiny App: Create interactive web applications for data visualization and analysis using the shiny package.

    library(shiny)
    ui <- fluidPage(
      titlePanel("Simple Shiny App"),
      sidebarLayout(
        sidebarPanel(
          sliderInput("bins", "Number of bins:", 1, 50, 30)
        ),
        mainPanel(
          plotOutput("distPlot")
        )
      )
    )
    
    server <- function(input, output) {
      output$distPlot <- renderPlot({
        x <- faithful[, 2]
        bins <- seq(min(x), max(x), length.out = input$bins + 1)
        hist(x, breaks = bins, col = 'darkgray', border = 'white')
      })
    }
    
    shinyApp(ui = ui, server = server)  # Sets up a simple Shiny app with a slider input and a histogram plot
    
    

Tidyverse

  • Tidyverse: The tidyverse is a collection of R packages designed for data science, including ggplot2, dplyr, tidyr, readr, purrr, and tibble. Load all tidyverse packages at once.

    library(tidyverse)  # Loads all core tidyverse packages in a single command
    
    

Helpful Libraries

  • Data Manipulation: dplyr, tidyr for transforming and cleaning data.
  • Data Visualization: ggplot2, lattice for creating various types of plots.
  • Machine Learning: caret, randomForest, e1071 for building and evaluating models.
  • Time Series: forecast, zoo for analyzing and forecasting time series data.
  • Text Mining: tm, text2vec for preprocessing and analyzing text data.
  • Web Applications: shiny for building interactive web apps.

R Programming Cheatsheet for Data Scientists - Part 2 (Intermediate/Advanced)

Advanced Data Structures

Matrices and Arrays

  • Matrices: Two-dimensional arrays that hold elements of the same type. Perform operations on matrices.

    m <- matrix(1:9, nrow=3, ncol=3)  # Creates a 3x3 matrix
    m_transpose <- t(m)  # Transposes the matrix
    m_multiply <- m %*% m_transpose  # Multiplies the matrix by its transpose
    
    
  • Arrays: Multi-dimensional generalizations of vectors and matrices.

    arr <- array(1:12, dim=c(2, 3, 2))  # Creates a 2x3x2 array
    arr[1,,2]  # Accesses the first row of the second matrix
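
A common source of confusion is the difference between `*` and `%*%`, and how indices map onto an array's dimensions. The following base R sketch reuses the objects from this section.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

m * m             # element-wise product: each cell is squared
m %*% m           # matrix product: rows of m times columns of m
dim(m %*% t(m))   # a 3x3 matrix times its 3x3 transpose is again 3x3

arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)             # 2 rows, 3 columns, 2 slices
arr[1, , 2]          # first row of the second 2x3 slice
apply(arr, 3, sum)   # sum of each slice along the third dimension
```

Leaving an index empty, as in `arr[1, , 2]`, selects everything along that dimension.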
    
    

Advanced Data Manipulation

data.table

  • data.table: An enhanced version of data.frame for fast data manipulation.

    library(data.table)
    dt <- data.table(id=1:5, name=c("John", "Jane", "Tom", "Lucy", "Anna"))
    dt[, .(mean_id = mean(id)), by = name]  # Calculates the mean of 'id' grouped by 'name'
    dt[id > 2, .(name, id)]  # Selects 'name' and 'id' where 'id' is greater than 2
    
    

dplyr Advanced

  • Advanced dplyr: Perform complex data manipulation tasks using dplyr.

    library(dplyr)
    df <- data.frame(id=1:5, name=c("John", "Jane", "Tom", "Lucy", "Anna"), score=c(88, 92, 85, 90, 95))
    df %>%
      group_by(name) %>%
      summarise(mean_score = mean(score), max_score = max(score)) %>%
      filter(mean_score > 90)  # Groups by 'name', calculates mean and max score, and filters by mean score > 90
    
    

Advanced Data Visualization

ggplot2 Advanced

  • Advanced ggplot2: Create complex and customized visualizations.

    library(ggplot2)
    p <- ggplot(df, aes(x=id, y=score, color=name)) +
      geom_point(size=3) +
      geom_smooth(aes(group = 1), method="lm", se=FALSE) +  # group = 1 fits one line across all points; each name has only a single point
      theme_minimal() +
      labs(title="Scores by ID and Name", x="ID", y="Score")
    p + theme(axis.text.x = element_text(angle=45, hjust=1))  # Rotates x-axis text
    
    

plotly

  • plotly: Create interactive plots.

    library(plotly)
    plot_ly(df, x = ~id, y = ~score, type = 'scatter', mode = 'lines+markers', color = ~name)  # Creates an interactive scatter plot
    
    

Advanced Statistical Analysis

Mixed Models

  • Mixed Models: Fit mixed-effects models using lme4.

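    library(lme4)
    # Random effects need repeated observations per group, so this uses lme4's built-in sleepstudy data
    # rather than the 5-row df above, where every name and id occurs only once
    model <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
    summary(model)  # Fits a mixed-effects model with a random intercept per subject and summarizes the results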
    
    

Bayesian Analysis

  • Bayesian Analysis: Perform Bayesian data analysis using brms.

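    library(brms)
    # Uses lme4's sleepstudy data (lme4 must be installed); the 5-row df above has only one row per group
    bayesian_model <- brm(Reaction ~ Days + (1 | Subject), data = lme4::sleepstudy, family = gaussian())
    summary(bayesian_model)  # Fits a Bayesian mixed-effects model and summarizes the results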
    
    

Advanced Machine Learning

XGBoost

  • XGBoost: Implement extreme gradient boosting for predictive modeling.

    library(xgboost)
    data_matrix <- as.matrix(df[, "id", drop = FALSE])  # Numeric feature matrix; excludes the label (score) and the character column (name)
    labels <- df$score
    dtrain <- xgb.DMatrix(data = data_matrix, label = labels)
    params <- list(objective = "reg:squarederror", eta = 0.1, max_depth = 3)
    xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100)
    print(xgb_model)  # Trains an XGBoost model
    
    

Neural Networks

  • Neural Networks: Implement neural networks using keras.

    library(keras)
    model <- keras_model_sequential() %>%
      layer_dense(units = 32, activation = 'relu', input_shape = c(ncol(data_matrix))) %>%
      layer_dense(units = 1)
    model %>% compile(
      loss = 'mse',
      optimizer = optimizer_rmsprop()
    )
    model %>% fit(data_matrix, labels, epochs = 50, batch_size = 10)  # Trains for 50 epochs; data_matrix and labels come from the XGBoost example
    
    

Natural Language Processing (NLP)

Text Mining with tm and quanteda

  • Text Mining: Process and analyze text data.

    library(tm)
    text <- c("This is a sample text", "Another example of text data")
    corpus <- Corpus(VectorSource(text))
    corpus <- tm_map(corpus, content_transformer(tolower))
    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm)  # Creates a document-term matrix and inspects it
    
    
  • quanteda: Advanced text analysis.

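    library(quanteda)
    toks <- tokens(text)  # Tokenizes the character vector defined in the tm example
    dfm_text <- dfm(toks)  # Builds a document-feature matrix (named dfm_text to avoid masking dfm())
    topfeatures(dfm_text, n = 5)  # Displays the top 5 most frequent terms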
    
    

Topic Modeling with topicmodels

  • Topic Modeling: Identify topics in text data.

    library(topicmodels)
    lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))
    topics <- terms(lda_model, 5)
    print(topics)  # Fits an LDA model and prints the top 5 terms in each topic
    
    

Time Series Analysis

forecast Advanced

  • Advanced Time Series Forecasting: Use the forecast package for advanced time series analysis.

    library(forecast)
    ts_data <- ts(df$score, frequency = 12)
    arima_model <- auto.arima(ts_data)
    forecasted_values <- forecast(arima_model, h = 12)
    plot(forecasted_values)  # Fits an ARIMA model and plots the forecasted values
    
    

prophet

  • prophet: Perform time series forecasting with prophet.

    library(prophet)
    df_prophet <- data.frame(ds = as.Date('2000-01-01') + 0:29, y = rnorm(30))
    prophet_model <- prophet(df_prophet)
    future <- make_future_dataframe(prophet_model, periods = 365)
    forecast <- predict(prophet_model, future)
    plot(prophet_model, forecast)  # Fits a prophet model and plots the forecast
    
    

Big Data and Parallel Computing

sparklyr

  • sparklyr: Interface for Apache Spark.

    library(sparklyr)
    spark_install(version = "3.0.0")  # One-time local Spark installation; skip if already installed
    sc <- spark_connect(master = "local")
    sdf <- copy_to(sc, df, "df_spark", overwrite = TRUE)
    spark_df <- sdf %>%
      filter(score > 90) %>%
      group_by(name) %>%
      summarise(mean_score = mean(score)) %>%
      collect()
    print(spark_df)  # Connects to Spark, processes data, and collects results
    spark_disconnect(sc)  # Closes the Spark connection when finished
    
    

parallel

  • parallel: Use parallel processing to speed up computations.

    library(parallel)
    cl <- makeCluster(detectCores() - 1)
    clusterExport(cl, "df")
    results <- parLapply(cl, 1:10, function(x) mean(df$score) + x)
    stopCluster(cl)
    print(results)  # Uses parallel processing to compute results
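
Parallelism pays off only when each task is expensive enough to outweigh the cost of starting workers and sending data to them. The sketch below compares `lapply()` and `parLapply()` on an artificially slow function; `slow_square` and the choice of two workers are assumptions made for the illustration.

```r
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

cl <- makeCluster(2)  # two workers; adjust to your machine
seq_time <- system.time(seq_res <- lapply(1:8, slow_square))
par_time <- system.time(par_res <- parLapply(cl, 1:8, slow_square))
stopCluster(cl)       # always release the workers

identical(seq_res, par_res)   # the results match either way
seq_time["elapsed"]           # wall-clock time of the sequential run
par_time["elapsed"]           # wall-clock time of the parallel run
```

For very cheap tasks the parallel version can be slower than the sequential one, so it is worth timing both before committing to a cluster.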
    
    

Interactive Applications

shiny Advanced

  • Advanced Shiny: Create interactive web applications.

    library(shiny)
    ui <- fluidPage(
      titlePanel("Advanced Shiny App"),
      sidebarLayout(
        sidebarPanel(
          sliderInput("bins", "Number of bins:", 1, 50, 30),
          selectInput("variable", "Variable:", choices = names(df)[sapply(df, is.numeric)])  # Only numeric columns can be binned into a histogram
        ),
        mainPanel(
          plotOutput("distPlot"),
          tableOutput("summaryTable")
        )
      )
    )
    
    server <- function(input, output) {
      output$distPlot <- renderPlot({
        x <- df[[input$variable]]
        bins <- seq(min(x), max(x), length.out = input$bins + 1)
        hist(x, breaks = bins, col = 'darkgray', border = 'white')
      })
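      output$summaryTable <- renderTable({
        s <- summary(df[[input$variable]])
        data.frame(statistic = names(s), value = as.numeric(s))  # Converts the summary to a data frame for display
      })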
    }
    
    shinyApp(ui = ui, server = server)  # Sets up an advanced Shiny app with interactive plots and tables
    
    

Cloud Computing for Data Science

AWS and Google Cloud Integration

  • AWS S3: Interact with AWS S3 for data storage.

    library(aws.s3)
    bucketlist()  # Lists all buckets in S3
    s3write_using(df, FUN = write.csv, object = "s3://my-bucket/df.csv")  # Writes a dataframe to S3
    
    
  • Google Cloud Storage: Interact with Google Cloud Storage for data storage.

    library(googleCloudStorageR)
    gcs_auth("my-auth-file.json")
    gcs_upload(file = "local_file.csv", bucket = "my-bucket")  # Uploads a local file to a Google Cloud Storage bucket
    
    