R Packages - Statistics-and-Machine-Learning-with-R/Statistical-Methods-and-Machine-Learning-in-R GitHub Wiki

R Packages

  • Many useful R functions come in packages
  • They increase the power of R by improving existing base R functionalities
  • They bundle together code, data, documentation, and tests
  • There are thousands of packages available on the Comprehensive R Archive Network, or CRAN
  • Commonly used R Packages are: dplyr, tidyr, ggplot2, shiny etc.

Installing R Packages

To install an R Package, type the following in the command line

R package 1

After it is installed, you can make its contents available by running:

R package 2

You can also get help on them by

help(package =“<the package’s name>”)

Example: tidyr

Click for RScript

What is Tidy Data?

A data set is called tidy when:

  • Each column represents a variable, and
  • Each row represents an observation

The opposite of tidy data is messy data, which corresponds to any other arrangement of the data.

Package to Manipulate Tidy-Data: tidyr()

Organize (or reshape) your data in order to make the analysis easier. This process is called tidying your data.

  • gather(): gather (collapse) columns into rows
  • spread(): spread rows into columns.
  • separate(): separate one column into multiple
  • unite(): unite multiple columns into one

Example of tidyr functions

Let us consider the following dataset

Dataset

gather()

Gather

spread() : reversing the gather() function

Spread

unite()

Unite

separate() : reversing the unite() function

Separate

Implementing tidyr()

  • Installing : install.packages("tidyr")

  • Loading : library("tidyr")

  • Load-Dataset : my-data <- USArrests[c(1, 10, 20, 30), ] # [c(1,10,20,30), ] means specific rows and all columns


  • Row names are states, so let’s use the function cbind() to add a column named “state”
  • In the data. This will make the data tidy and the analysis easier.

My_data <- cbind(state = rownames(my_data), my_data)

my_data


gather() : collapse columns into rows [Replacement of MELT(reshape2)]

Simplified format:

	            gather(data, key, value, ...)

                    data: A data frame
                    key: Names of key
                    value: Value columns to create in output
                    …: Specification of columns to gather. 
                    Allowed values are: variable names
  • if you want to select all variables between a and e, use a:e
  • if you want to exclude a column name y use -y
  • for more options, see: dplyr::select()

spread() : spread two columns into multiple columns [Replacement of CAST(reshape2)]

Features:

  • The function spread() does the reverse of gather().
  • It takes two columns (key and value) and Spreads into multiple columns. It produces a “wide” data format from a “long” one.

Simplified format:

	              spread(data, key, value, ...)

                      data: A data frame
                      key: The (unquoted) name of the column whose values will be used as column headings
                      value: The (unquoted) names of the column whose values will populate the cells
                      …: Specification of columns to gather. 
                      Allowed values are: variable names

unite() : The function unite() takes multiple columns and paste them together into one.

Simplified format:

                      unite(data, col, ..., sep = "_")

                      data: A data frame
                      col: The new (unquoted) name of column to add.
                      sep: Separator to use between values
                      …: Specification of columns to gather. 
                      Allowed values are: variable names

separate() : separate one column into multiple : The function sperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.

Simplified format:

            separate(data, col, into, sep = "_")

            data: A data frame
            col: Unquoted column names
            into: Character vector specifying the names of new variables to be created.
            sep: Separator between columns: