R_oldnotes - JasonLocklin/jasonlocklin.github.com GitHub Wiki

R is a free implementation of the S language. R is a complete environment for statistical analysis and includes packages covering nearly every statistical procedure in common use. In fact, many new statistical methods are implemented first in R and only later find their way into packages like SPSS. An excellent list of resources is on the R resources page. This page is a work in progress, but it includes various notes Jason has taken while using R for data analysis.


Contents

- 1 Basics
- 2 Data Management
  - 2.1 Data Import
  - 2.2 Data Formatting
  - 2.3 Useful Tricks
  - 2.4 Common Problems
- 3 Statistics
  - 3.1 Descriptives
  - 3.2 Graphics
  - 3.3 T-tests
    - 3.3.1 Distributions
  - 3.4 Regression and ANOVA
    - 3.4.1 Testing Assumptions
      - 3.4.1.1 Equality of Variance
      - 3.4.1.2 Normality
      - 3.4.1.3 Influential Points (Cook's Distance)
    - 3.4.2 Comparing Models
    - 3.4.3 One-way ANOVA
    - 3.4.4 Linear Contrasts
    - 3.4.5 Factorial ANOVA
      - 3.4.5.1 Plots of Factorial Data
    - 3.4.6 Post-hoc Tests
    - 3.4.7 Randomized Block Design (Repeated Measures ANOVA)
    - 3.4.8 Split-Plot (Mixed within- and between-subjects ANOVA)
    - 3.4.9 ANCOVA
    - 3.4.10 MANOVA

Basics

R can be run by writing a syntax file and executing it (as with SPSS), but it can also be run interactively. When run interactively, a prompt appears:

>_

Commands can be typed at this prompt one by one, and the output can be printed to the screen, written to a file, or saved as a new object or variable. Try the following example:

> ls()
character(0)
> x = 5
> y = 3
> x + y
[1] 8
> z = x + y
> ls()
[1] "x" "y" "z"
> z
[1] 8
> 10*x + y*z
[1] 74

This is basically acting as an overgrown calculator (note: that is the last time I will include the prompt ">" character in examples). Notice how the result of an operation can be saved to a new variable and used again. This powerful principle is part of everything in R, not simply tacked on like the "compute variable" in SPSS.


R contains built-in functions and data sets. The standard install also contains packages with additional functions and example (but real) data sets, which can be loaded into the current working environment with the command:

library(package)

On many systems the R install also includes a graphical browser for packages. The on-line archives at CRAN[1] contain literally thousands more packages for virtually every statistical operation ever conceived. If you need a package that is not already installed on your system, it can be installed by running

install.packages("package")

R will then download, potentially compile, and install everything you need to load the package. I suggest installing the R Commander package:

install.packages("Rcmdr")
library(Rcmdr)

R commander includes a simple graphical user interface (GUI) with menus and wizards for doing routine things like loading data, editing data, running simple tests and making graphics. Every operation is added to a syntax file, so you can learn as you use it. R commander was created for use in statistics courses so it is useful for learning many important statistical procedures in R.


The current working environment in R holds variables, which contain data, and functions, which do things (usually) to data. The simplest function is

ls()

It lists all of the variables and loaded functions/datasets in the current workspace. Try it, then type

x = 5

Then try it again. You will see that a new variable (x) is in the list; you can print it by typing x, or remove it with

rm(x)

With any function, adding a leading question mark will open help information on the topic

?ls

These help pages should be the first place to look when learning to use a function. If interested, type the function name without brackets (e.g., ls rather than ls()) and R will often print out the function's definition. I suggest printing out the R reference card[2] and using the "?" trick to investigate whatever function seems applicable to the problem at hand.

Data Management

The basic variable that holds data in R is the vector. A simple vector can be created with the c() function

x <- c(1,2,3)

(Note: many references use "=" instead of "<-" for assigning a variable; it works, but it is a bad habit that will cause problems later.) Vectors like (1,2,3) are considered by R to be numeric, but can represent factors, such as three distinct groups, with:

y = factor(c(1,2,3))

Vectors of text strings (e.g., "pos", "neg", "neu") and logical values (e.g., TRUE, FALSE, FALSE) are not numeric; character vectors are converted to factors by default when placed into a data frame (in older versions of R), and either type can be converted explicitly with factor().

Typing the variable name at the prompt will cause R to print out the vector.

mean(x)
var(x)
sqrt(var(x))
summary(x)

will print the mean, variance, standard deviation, and quartile summary of x respectively.

x*x

The above squares each value in x and prints it to the screen. When one vector is shorter (such as a single value), that vector is recycled to match the length of the longest vector in the expression. For example, the following multiplies each value in x by 5:

x*5

Inequalities and the double "==" create logical (boolean) vectors. The following two commands will print the vectors "TRUE FALSE FALSE" and "FALSE TRUE FALSE" respectively:

x < 2
x == 2

One of the biggest advantages of R is that any time you get a new value, you can print it (as done above), or you can save it in a new variable. For example, the following prints x centred on its mean, and creates a new centred vector, respectively:

x - mean(x)
x.centre = x - mean(x)

At this point, there are two variables in the working environment, x, and the centred form of the x data (x.centre). Even complex output like ANOVA tables can be saved to a variable and used later.

Each vector can be thought of as analogous to a column in an SPSS data table. Of course, it can get messy to have many unrelated variables strewn about, so another type of R variable, the "data frame," is basically a collection of related variables composing a table of data (like an SPSS data set). When working in R, you may have several variables that are related to each other (x, y, z, subject). They can be collected together in a single data frame as

df <- data.frame(subject, y, x, z)
df$x

When data is imported from a file, it is usually imported in this form. Using several data frames allows you to keep data from separate experiments or analyses separate, without having to load and unload data.

"An Introduction to R" ([#References|references]) is a good place to get a good walk-through of manipulating variables and data frames. For a good reference see "Introductory Statistics with R ([#References|references])".

Data Import

To read in a .csv data file, use read.table():

df <- read.table("data.csv", header=TRUE, sep=',', na.strings="NA")
# Or simply:
df <- read.csv("data.csv")

Because CSV files can be formatted a variety of ways, read.table() and its relatives have a lot of options. Remember to run ?read.table to see the help file.

To open SPSS data files: (potentially) install and then load the package foreign:

install.packages("foreign")
library(foreign)
?read.spss

Though, it's usually best to export data files from SPSS or other programs as comma separated value (CSV) format. CSV is simple text, and therefore a relatively universal format.
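
A minimal sketch (the file name "study.sav" is hypothetical):

library(foreign)
df <- read.spss("study.sav", to.data.frame=TRUE)   # returns a data frame instead of a list
str(df)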

Data Formatting

Analysis is sometimes easier with one subject per line (wide format), and sometimes with one observation per line (i.e., many lines for each subject; long format). R has several built-in functions for this type of manipulation, but they can be difficult and inconsistent. The "reshape" package can always give you a simplified, properly formatted data frame for a given analysis in two lines; a short sketch follows. Read the introduction[3] to get a handle on it, or watch the video on the author's web site: http://had.co.nz/reshape. Quick-R[4] also has a quick walk-through.
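
A minimal sketch with the reshape package (the data frame and its column names subject, t1-t3 are hypothetical):

library(reshape)
wide <- data.frame(subject=1:3, t1=rnorm(3), t2=rnorm(3), t3=rnorm(3))
long <- melt(wide, id.vars="subject")   # long format: one row per subject x measurement
cast(long, subject ~ variable)          # back to wide format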

Useful Tricks

scale(y) standardizes the vector y (useful for calling within a linear model). R doesn't include a function that calculates the standard error of the mean of a vector; to create one, enter the following:

se <- function(x) { 
    y <- x[!is.na(x)] # remove the missing values
    sqrt(var(as.vector(y))/length(y))
    }
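
A quick usage check (values are illustrative):

se(c(2, 4, 6, NA))   # standard error of the mean, ignoring the NA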

Common Problems

attach() and detach() allow you to access variables from a data frame without naming the data frame each time, thus saving typing (i.e., var instead of dataframe$var). However, if you modify an attached variable, R does weird and unpredictable things. Just don't do it. Only use attach() if you want quick READ ONLY access to variables in a data frame, as in the sketch below.
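
A minimal read-only sketch (df and rt are hypothetical names):

attach(df)
mean(rt)    # read-only access; equivalent to mean(df$rt)
detach(df)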

Statistics

Descriptives

Any variable passed to summary() will produce a set of appropriate summary statistics; data frames produce summary statistics for each column. mean(x), sd(x), var(x), and range(x) also produce self-explanatory statistics. To create a table of means broken down by two factors (x1 and x2), use:

tapply(y, list(x1,x2), mean)                                                                                    
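
For example, with the built-in warpbreaks data set (used here purely as an illustration):

tapply(warpbreaks$breaks, list(warpbreaks$wool, warpbreaks$tension), mean)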

Graphics

Look at plot() and boxplot() for quick plots of vectors. R also has a smoother, lowess(), that produces a smooth "line of best fit", which can be handy for visualizing data over time (a short sketch follows the rgl demo below). R has a very powerful graphics engine (actually several), so graphics are a "rabbit hole": you can go as far as you want. Take a look at the R graph gallery for some pretty pictures. I haven't yet found a really good reference on graphics; I have mainly worked out of the help files. Just for kicks, run:

library(rgl)
demo(rgl)

3D, dynamic graphics! Make sure to click and drag, scroll, and say "aaah."
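
Back in two dimensions, here is a minimal lowess sketch (it uses the built-in cars data set purely as an illustration; any pair of numeric vectors works):

plot(cars$speed, cars$dist)            # scatterplot
lines(lowess(cars$speed, cars$dist))   # overlay the lowess smooth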

See also: CRAN Task View: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization

T-tests

To test whether a pair of related variables are different: t.test(x, y, paired=TRUE). For independent samples, the default includes the Welch correction for inequality of variance, so for a standard (pooled-variance) t-test:

t.test(x, y, var.equal=TRUE)

Of course, if the data are not dependent, you can leave paired=TRUE out, or set it to FALSE. More commonly, though, unrelated measures are in the form of a dependent variable and a grouping factor. Such a test is performed as:

t.test(dv ~ factor, var.equal=TRUE)

The ~ indicates that the dependent variable dv is broken up, or described, by factor. This is the type of setup and notation used in regression and ANOVA, even for within-subjects manipulations. The books "Introductory Statistics with R" (see References) and, of course, "An Introduction to R" (see References) are valuable if any of the above is not immediately apparent.

Use Levene's test to check for inequality of variance. If it is significant, use the Welch-corrected t-test; a sketch follows.
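
A minimal sketch, assuming a recent version of the car package (dv, group, and df are hypothetical names):

library(car)
leveneTest(dv ~ group, data=df)   # significant result suggests unequal variances
t.test(dv ~ group, data=df)       # the Welch correction is the default for independent samples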

Distributions

To avoid t-tables when doing manual calculations, use qt(0.95, 10) to return the critical t for 10 degrees of freedom and a one-tailed alpha of .05 (use qt(0.975, 10) for a two-tailed test). To get an exact p value for a two-tailed test with a t value of 1.67 and 10 degrees of freedom:

2 * (1 - pt(1.67, df = 10))

# More generally:
p_value <- 2 * pt(abs(tvalue), df, lower.tail=FALSE)   # two-tailed t test
p_value <- pf(fvalue, df1, df2, lower.tail=FALSE)      # p value for an F test (upper tail)

Similar functions are available for all the common distributions, type help(qt) for more information.

Regression and ANOVA

Regression, and particularly ANOVA, are done very differently in R than in SPSS. R's method is both more flexible and potentially daunting to new users. A nice quick overview is at Quick-R (see References).

lm(y ~ x1 + x2) runs a simple regression where x1 and x2 are the IVs (predictor variables) and y is the dependent variable. If the IVs are categorical, they must be converted to factors (e.g., x1 <- factor(x1)); otherwise, R will treat them as numerical variables. lm() returns a model object, not the typical "output" that most packages print. If you want a regression output, print the summary of the lm() object with summary(lm(...)). If you want an ANOVA table, use anova(lm(...)). Two models can also be compared (i.e., multiple regression):
model1 <- lm(y1 ~ x1 + x2)
model2 <- lm(y1 ~ x1 + x2 + x3 + x4)
anova(model1, model2)
anova(model1, model2) reports an F test (with a p value) for the difference between the two nested models. The usual way to simplify models in R, however, is based on the AIC (see stepAIC() and the chapter in MASS (see References)).

A much more fluidly progressing explanation of ANOVA than what is present here, with full examples, can be found in the excellent "Using R for Psychological Research" (see References).

After a linear regression, plot the residuals against the predictor variable. The plot should show an even scatter of points along the predictor's range; if the residuals are not evenly scattered, it may indicate non-linearity. fitted(model) gives a vector of the fitted (aka predicted, estimated, expected) values, and residuals(model) gives a vector of residuals.
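
A minimal sketch (model is a fitted lm object; x1 is a hypothetical predictor from the same data frame df):

plot(df$x1, residuals(model))           # residuals against a predictor
abline(h=0, lty=2)
plot(fitted(model), residuals(model))   # residuals against fitted values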

Testing Assumptions

Linear models assume that residuals are independent, normally distributed and have equal variance. Independence is satisfied with proper experimental design (random sampling), while normality and equality of variance can be checked and corrected if necessary. After fitting a model, run plot(model) to see a series of diagnostic plots.

Equality of Variance

One of the diagnostic plots from plot(model) is residuals against fitted values. This plot should show evenly scattered data. Clustering, particularly at one or both ends, indicates inequality of variance. A transformation of the data may be required to remedy this; re-run the diagnostic plots to check whether the transformation solved the issue.
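
A minimal sketch of one common remedy, a log transformation (assuming a positive response; y, x, and df are hypothetical names):

model2 <- lm(log(y) ~ x, data=df)
plot(model2)   # re-run the diagnostic plots on the transformed model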

Normality

Another plot created by plot(model) is the normal Q-Q plot. If the residuals are normally distributed, the points will generally fall along the reference line. Some variation at the tails is typical. Be sure to re-check normality after any transformations.

Influential Points (Cook's Distance)

Highly influential points, or outliers, may be accidents or errors in data collection, or they may simply be unusual points from the tails of the population distribution. In the first case they should be removed; in the second case, it may be best to run the analysis with and without these points. Technically, if they are real data and belong to the population, removing them biases your estimates of the population; however, it is never a good idea to have results determined by only a few data points.

Running plot(model1, which=1:4) will produce several plots. The fourth plot graphs Cook's distance for each data point. Highly influential data points, or outliers, will fall strikingly outside the range of most of the data. A rough rule of thumb is that a Cook's distance higher than F(0.50, k, n-k), where n = number of observations and k = number of covariates plus 1, indicates an outlier. Generally though, if you see one or two points far above the rest of the data, check out those points.


Alternatively, Cook's distance can be computed for any lm or glm model using cooks.distance() and then plotted (see ?cooks.distance for the full arguments of the lm and glm methods):

cooks.distance(model)   # one value per observation

Also, for an lm model you can use:

plot(model, which=4)    # plot of Cook's distance

As a last resort, converting data to rank data will remove the influence of outliers and may also alleviate problems of non-constant variance and non-normality. Consider ranking when data is very problematic.

Comparing Models

Sometimes it is prudent to compare two linear models, for example when deciding whether a reduced model satisfactorily describes a set of data in multiple regression. Create two linear models (e.g., model_full and model_reduced) and use anova() to compare them:

anova(model_full, model_reduced)

One-way ANOVA

A one-way analysis of variance can be performed as

anova(lm(y ~ x, data=df))   # where df is the data frame containing y and x
tapply(y, x, mean)          # table of means

or alternatively, using the more powerful aov():

model <- aov(y ~ x, data=df)
summary(model)                                   # show the summary table
print(model.tables(model, "means"), digits=3)    # report means

Note: where a predictor variable (x) contains numbers (e.g., 1, 2, 3), R will automatically treat x as a quantitative variable unless otherwise specified. If the numbers represent categorical levels, you must change the variable from numeric to factor before analysing:

x <- c(1, 2, 3, 4)
class(x)         # [1] "numeric"
x <- factor(x)
class(x)         # [1] "factor"

Ignoring this distinction for factors with more than 2 levels will lead to incorrect df in your analyses!

Using the Welch correction for inequality of variance, the test can be performed as oneway.test(y ~ x). To run pairwise post-hoc tests, use pairwise.t.test(y, x); a sketch follows. Run ?pairwise.t.test to learn how to select the type of family-wise error correction; the default is Holm's method, but several are available.
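
A minimal sketch (rt, size, and df are hypothetical names):

oneway.test(rt ~ size, data=df)                                  # Welch-corrected one-way ANOVA
pairwise.t.test(df$rt, df$size, p.adjust.method="bonferroni")    # override the Holm default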

Graphics:

From Quick-R (see References), plot means with error bars:

library(gplots)
attach(mtcars)
cyl <- factor(cyl)
plotmeans(mpg ~ cyl, xlab="Number of Cylinders", ylab="Miles Per Gallon",
          main="Mean Plot\nwith 95% CI")

Linear Contrasts

The aov model can be partitioned into orthogonal linear contrasts. Consider a model where rt is measured, and target size manipulated [ aov(rt~size) ]. Run contrasts(size) to see the default (treatment) contrast matrix used in the ANOVA. Run summary.lm(model) to see the results of the treatment contrasts.

If size has three levels, there can be k - 1 = 2 orthogonal contrasts. To compare the smaller two and the larger two levels, create a new contrast matrix:

contrastmatrix <- cbind(c(0, 1, -1), c(-1, 1, 0))
contrastmatrix
#      [,1] [,2]
# [1,]    0   -1
# [2,]    1    1
# [3,]   -1    0
contrasts(size) <- contrastmatrix
model <- aov(rt ~ size)
summary.lm(model)

The last command prints a coefficient table which includes the tests of the linear contrasts.

Factorial ANOVA

A factorial ANOVA has more than one predictor. Sometimes a predictor is uninteresting and is just there to "absorb" some of the variance (differences between experimenters, data collected over different time periods, or some other variable that may influence results). This type of experiment is called a randomized block design, and the treatment should be randomized within each block if possible (e.g., each experimenter runs the same number of participants in each condition). In this case, we don't care about interactions with block.

For example, consider the collection of reaction time data from a task involving stimuli of various colours, in a data frame called exp, with the data collected in blocks. To test the effect of colour while accounting for block, use

model <- aov(rt ~ block + colour, data=exp)
summary(model)

Note: in R, aov() and anova() are different. aov() fits an analysis of variance model, while anova() computes an ANOVA table from a fitted linear model (lm()). anova() can also be used to compare two linear models.

More commonly in experimental work, more than one factor is manipulated and all of them are of interest. When both factors are randomized independently, this is called a completely randomized design, and interactions are usually considered important. Consider an example where colour and size of stimuli are manipulated:

model <- aov(rt ~ colour * size, data=exp)
summary(model)

The only difference from the randomized block design is the use of a * instead of a +. This shorthand includes the interaction in the ANOVA table.

Plots of Factorial Data

Visualizing factorial data can be difficult. For two-way factorial data, a reasonable scatterplot can be achieved with scatterplot() from the car package. For a two-way factorial with DV y and factors a and b:

library(car)
scatterplot(y ~ a | b, data=dataframe, boxplots=FALSE, reg.line=FALSE)

A simple interaction plot can be created with interaction.plot(a,b,y).

Post-hoc Tests

Fisher's LSD.test() (use only when the omnibus ANOVA is significant) and Tukey's HSD.test() are available in the agricolae package:

library(agricolae)
help(LSD.test)

Tukey's studentized range can be used to calculate the smallest significant difference between means:

qtukey(0.95, numMeans, df) * sqrt(MSE * 1/n)                   # equal cell sizes
qtukey(0.95, numMeans, df) * sqrt(MSE * 1/2 * (1/n1 + 1/n2))   # unequal cell sizes

where the MSE and its df come from the ANOVA table and n is the cell size.

Randomized Block Design (Repeated Measures ANOVA)

Beyond two-way ANOVA, the more powerful aov() is used instead of anova(lm(...)). It is important to note that R requires the DV to be a single vector/column (i.e., a row in the data frame for every observation), so there will be multiple rows for each subject, and a factor variable indicating the within-subject manipulation or time point. A basic example using two fixed, within-subject factors (x1 and x2) and one DV (rt), with all variables in the data frame df, can be run as:

summary( aov(rt ~ x1 * x2 + Error(subject/(x1*x2)), data=df) )

A complete worked example with a single within-subject factor:

# Create data
iv <- c(rep(1, 10), rep(2, 10), rep(3, 10), rep(4, 10))
dv <- rnorm(40)
sj <- rep(1:10, 4)
data <- data.frame(sj, iv, dv)
data$sj <- factor(data$sj)
data$iv <- factor(data$iv)
summary(data)

# aov-style ANOVA
aovModel <- aov(dv ~ iv + Error(sj/iv), data=data)
summary(aovModel)
summary.lm(aovModel)   # error: this cannot be done with within-subject designs

The Error term is important here, as leaving it out results in an analysis as if the factors were between-subject factors. If the within-subject factors are random (i.e., represent a sample from a population), then they should not be treated the same way in the error term, although I'm not sure how yet. See "Notes on use of R" (see References), page 32, for a better explanation.

The nlme package provides more functionality and is robust to unbalanced designs:

# Same analysis with nlme
library(nlme)
library(lattice)
gd <- groupedData(dv ~ iv | sj, data=data)
summary(gd)
lmeModel <- lme(dv ~ iv, random= ~1|sj, data=gd)
anova(lmeModel)
contrasts(gd$iv)       # default contrasts
summary(lmeModel)

# Define new orthogonal contrasts:
contrasts(gd$iv) <- cbind(c(1,1,-1,-1), c(1,-1,1,-1), c(1,-1,-1,1))
contrasts(gd$iv)
lmeModel <- lme(dv ~ iv, random= ~1|sj, data=gd)
summary(lmeModel)

Post-hoc tests: For now, just an example...

require(MASS)      # for the oats data set
require(nlme)      # for lme()
require(multcomp)  # for multiple comparison stuff

Aov.mod <- aov(Y ~ N + V + Error(B/V), data=oats)
Lme.mod <- lme(Y ~ N + V, random= ~1 | B/V, data=oats)

summary(Aov.mod)
anova(Lme.mod)

summary(Lme.mod)
summary(glht(Lme.mod, linfct=mcp(V="Tukey")))

lme() can actually replace aov(), as it can do the equivalent calculations (see MASS (see References), pages 283-286).

Warning: R defaults to Type I (sequential) sums of squares, whereas SPSS defaults to Type III (marginal) sums of squares. I don't fully understand the issue, but the problem seems to arise when SPSS reports a main effect in the presence of an interaction (i.e., the main effect is actually driven by an interaction and makes no sense). This is a violation of the marginality principle. See "Obtaining the same ANOVA results in R as in SPSS" (see References) for an example, and Quick-R (see References) for a bit more information. The problem is explained thoroughly in "Exegeses on Linear Models" (see References).

Graphics: interaction.plot(time, subject, y) gives a decent line graph for repeated measures over time.

Manual calculations: When doing manual calculations, a handy function, qf(.95, 2, 27), returns the critical F statistic for an alpha of 0.05 with 2 and 27 degrees of freedom. Also, 1 - pf(4.05, 2, 27) will return the p value for an F of 4.05 with 2 and 27 degrees of freedom.

Split-Plot (Mixed within- and between-subjects ANOVA)

Poor example: three pesticides are randomly assigned to fields, and half of each field is randomly selected for irrigation. rep is a variable that indicates the field number for each repetition of the pesticide (the subject analogue would be subject number within a given group).

model <- aov(yield ~ pesticide + irrigation + irrigation*pesticide +
             Error(rep/pesticide/irrigation), data=pesticideDataFrame)
summary(model)

rep is the random effect of field (like subject), pesticide is a "between-field" fixed effect, and irrigation is a "within-field" fixed effect.


Visualizing this type of "nested" data can be difficult. The library "car" has a function, scatterplot(), which does a decent job of plotting the raw data:

scatterplot(y ~ as.numeric(WX) | BX, data=dataframe, boxplots=FALSE, reg.line=FALSE)

where WX is the within-subjects factor and BX is the between-subjects factor. I use as.numeric() so that it plots the data points instead of boxplots of the data.



I haven't found an extremely informative resource on mixed designs, but try "Using R for Psychological Research" (see References). A mixed design can be calculated in one shot and produces rather extensive output:

model <- aov(y ~ (W1*W2*B1*B2) + Error(Subject/(W1*W2)) + (B1*B2),
             data=mydataframe)
summary(model)                                   # ANOVA tables
print(model.tables(model, "means"), digits=3)    # report means

with two within-subjects factors (W1, W2) and two between-subjects factors (B1, B2).



Diagnostic plots: run checks for heteroscedasticity, normality, and influential observations:

layout(matrix(c(1,2,3,4), 2, 2))  # optional 2x2 layout
plot(model)                       # diagnostic plots

ANCOVA

Analysis of covariance is calculated similarly to ANOVA: fit <- aov(y ~ A + x, data=mydataframe), where x is the covariate.

MANOVA

With more than one dependent variable (say three: y1, y2, y3), the columns first need to be "bound" into a matrix; then the MANOVA can be calculated similarly to an ANOVA:

# 2x2 factorial MANOVA with 3 dependent variables
Y <- cbind(y1, y2, y3)
fit <- manova(Y ~ A*B)
summary(fit, test="Pillai")

From Quick-R (see References):

Other test options are "Wilks", "Hotelling-Lawley", and "Roy". Use summary.aov() to get univariate statistics. TukeyHSD() and plot() will not work with a MANOVA fit. Run each dependent variable separately to obtain them. Like ANOVA, MANOVA results in R are based on Type I SS. To obtain Type III SS, vary the order of variables in the model and rerun the analysis. For example, fit y ~ A*B for the Type III B effect and y ~ B*A for the Type III A effect.
