R IV: Statistics in R - BDC-training/VT25 GitHub Wiki
Course: VT25 R programming (SC00035)
Here we will use the built-in R dataset `ToothGrowth`, which contains data from an experiment investigating whether the vitamin C delivery method has an effect on tooth growth. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).
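Before starting the exercises it can help to take a quick look at the data; a minimal sketch using base R:

```r
# Optional first look at the built-in ToothGrowth data
head(ToothGrowth)                          # first six rows: len, supp, dose
str(ToothGrowth)                           # 60 observations of 3 variables
table(ToothGrowth$supp, ToothGrowth$dose)  # 10 animals per supp/dose combination
```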
- Start by assigning the object `ToothGrowth` to a new variable called `my_data`
Solution
my_data <- ToothGrowth
- Some statistical tests, e.g. the t-test and ANOVA, assume that the data is normally distributed. To check this visually you can create a histogram and/or a QQ plot. First, create a histogram of the length variable `len` in the data frame `my_data` to check if the data follows a normal distribution or not. Another way to check if the data is normally distributed is to use the `qqnorm` function, comparing the variable `len` with a normal distribution. HINT: Use the functions `qqnorm` followed by `qqline` on the variable `my_data$len`
Solution
hist(my_data$len)
qqnorm(my_data$len)
qqline(my_data$len)
- We can test the normality assumption using a statistical test. Now use the `shapiro.test()` function to test if the variable `my_data$len` is normally distributed. The test rejects the null hypothesis of normality when the p-value is less than or equal to the significance level of 0.05. Is it suitable to assume normal distribution when you analyze this dataset?
Solution
shapiro.test(my_data$len)
# p-value = 0.1091 which is above the significance level of 0.05 and therefore one can assume that the data follows a normal distribution.
We are going to compare the length of odontoblasts in guinea pigs which had their vitamin C delivered by orange juice (OJ) to those which had their vitamin C from ascorbic acid (VC). That is, we want to see if the variable `my_data$len` depends on the variable `my_data$supp`. We can write the relation as an R formula: `my_data$len ~ my_data$supp`.
- Using the formula above and the `boxplot()` function, create a boxplot of the length variable `my_data$len` split into groups according to the supplement type `my_data$supp`
Solution
boxplot(my_data$len ~ my_data$supp)
- Based on the result of your previous check of normality, choose to make either a t-test or a Wilcoxon test. Test if the mean length of odontoblasts is equal for both supplements OJ and VC. HINT: You can use the formula `my_data$len ~ my_data$supp` within the test function.
Solution
t.test(my_data$len ~ my_data$supp)
# p-value = 0.06063, hence the difference in means is not statistically significant (at the significance level 0.05)
Here we will demonstrate how to conduct a power calculation for a two-sample t-test. We will use the `pwr.t.test()` function from the `pwr` package. See the vignette to learn more and for examples of how to use the package. Read the first paragraph, in which the author describes the basic idea of how all of the functions in the `pwr` package work. Without diving too much into details, we will calculate what sample size is needed to detect a medium effect size of d = 0.5 at the significance level 0.05 and with power 0.80.
install.packages("pwr")
library(pwr)
pwr.t.test(
d = 0.5,
power = 0.8,
sig.level = 0.05,
type = "two.sample",
alternative = "two.sided"
)
The function tells us we should have 63.76561 samples in each group, which we round up to 64.
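All functions in the pwr package follow the same idea: supply all but one of the quantities (sample size, effect size, significance level, power) and the omitted one is computed. As a sketch of the reverse calculation, leaving out `power` solves for the power achieved with a fixed sample size:

```r
library(pwr)

# Leaving out `power` makes pwr.t.test() solve for it:
# with n = 64 per group and d = 0.5, the achieved power is just above 0.80.
res <- pwr.t.test(n = 64, d = 0.5, sig.level = 0.05,
                  type = "two.sample", alternative = "two.sided")
res$power
```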
- Now create a sample size vector containing all numbers that are multiples of 5 between 5 and 100 (i.e. 5, 10, 15, ..., 100) by defining the object `sample_size <- seq(5, 100, by = 5)`. Then call the `pwr.t.test()` function with the arguments `d = 0.5`, `n = sample_size`, `sig.level = 0.05`, `type = "two.sample"`, `alternative = "two.sided"` and store the result in an object named `power_data`
Solution
sample_size <- seq(5, 100, by = 5)
power_data <- pwr.t.test(
d = 0.5,
n = sample_size,
sig.level = 0.05,
type = "two.sample",
alternative = "two.sided"
)
power_data # contains the power for all sample sizes in the sample_size vector
# An alternative (shorter) way is to specify n = seq(5, 100, by = 5) directly
# within the pwr.t.test call:
power_data <- pwr.t.test(
d = 0.5,
n = seq(5, 100, by = 5),
sig.level = 0.05,
type = "two.sample",
alternative = "two.sided"
)
- Now call the `plot()` function with `x = power_data$n` and `y = power_data$power` as arguments together with the third argument `type = "b"` to generate a power curve for the sample sizes.
Solution
plot(
x = power_data$n,
y = power_data$power,
type = "b"
)
grid() # this function is called to add grid lines to the plot
Here we continue with the same dataset as in the previous part. Above we compared the two types of supplements. Here we will instead compare the length of odontoblasts between the three different doses (0.5, 1.0 and 2.0 mg/d). Which dose has the highest effect on the growth of odontoblasts?
- Make a boxplot of the variable `len` in the data frame `my_data`, using the variable `dose` as grouping variable.
Solution
boxplot(my_data$len ~ my_data$dose)
# there seems to be a dose response as the median increases with the dose.
# 2.0 mg/d has the highest effect on the growth
- We are going to make a one-way ANOVA with `dose` as independent variable. Since `dose` is a numerical variable we need to tell the ANOVA function that it should be treated as a factor variable. This can be done within the formula: `len ~ factor(dose)`. Use the `aov()` function to do the ANOVA and pipe it to the `summary()` function to look at the results. Is there a significant difference between the tooth lengths based on the dose?
Solution
aov(len ~ factor(dose), data = my_data) %>%
summary
- Now add the `supp` variable to the model in the ANOVA to make it a 2-way ANOVA. What do you observe now?
Solution
aov(len ~ factor(dose) + supp, data = my_data) %>%
summary
In the one-way ANOVA above we saw that there is a difference in odontoblast length based on the dose. From the output of one-way ANOVA we can't determine where the possible difference occurs. To identify between which doses there is a significant difference we need to perform a post-hoc test.
- Make a post-hoc test using Tukey's honest significant difference test `TukeyHSD()`. Between which doses can we see a significant difference?
Solution
aov(len ~ factor(dose), data = my_data) %>%
TukeyHSD
# all pairwise comparisons are significant since p < 0.05 for all of them
Here we will use the dataset `survey`, which is found in the package `MASS`. This dataset contains the responses of 237 Statistics I students at the University of Adelaide to a number of questions.
We will examine if there is any association between frequency of exercise (the column `Exer`) and gender (the column `Sex`).
- Load the `survey` data from the `MASS` package and store it in an object called `survey` by running the following code
survey <- MASS::survey
- To get an overview of the data, create a cross table with the two variables, survey$Exer and survey$Sex, using the
table()
function. You should get a table similar to the one below (with the number of observations filled in):
|      | Female | Male |
|------|--------|------|
| Freq |        |      |
| None |        |      |
| Some |        |      |
Solution
table(survey$Exer, survey$Sex)
- Test if there is an association between exercise and gender using a Chi-square test by piping the table from above to
chisq.test
. The test rejects the null hypothesis of independence when the p-value is less than or equal to the significance level of 0.05.
Solution
table(survey$Exer, survey$Sex) %>%
chisq.test
- Historically, the Chi-square test has been preferred since it is easier to compute than the Fisher exact test, because the Chi-square test relies on approximations. Today, when computing power is no longer an issue, the Fisher exact test is preferred, especially for small sample sizes. Make a Fisher exact test with the same variables by piping the same table to `fisher.test`. Compare the p-value with the result from the Chi-square test.
Solution
table(survey$Exer, survey$Sex) %>%
fisher.test
# no big difference as the sample size is not that small
Linear and logistic regression are used to describe data and to explain the relationship between the independent and dependent variables. In linear regression, the dependent variable is continuous (e.g. height or weight) while in logistic regression the dependent variable is binary (e.g. yes/no, case/control, male/female).
Here we will use the built-in dataset mtcars. The dataset consists of data from 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
- Get to know the dataset `mtcars` by typing `?mtcars` and `head(mtcars)`
- Make a scatterplot to visualize the relation between car weight (1000 lbs) and fuel consumption (miles/(US) gallon) using the variables `wt` and `mpg`. How does an increase in weight influence the fuel consumption?
Solution
plot(mtcars$wt, mtcars$mpg)
# An increase in weight leads to fewer miles per gallon.
- Now make a linear regression using the function `lm()` with `mpg` as response and `wt` as predictor, and pipe it to the `summary` function to get coefficient estimates and p-values. What is the slope? Is there a significant effect of weight on fuel consumption?
Solution
lm(mpg ~ wt, data = mtcars) %>%
summary
# The slope is -5.3445 which suggests that, on average, a one-unit increase in wt is associated with a 5.3445-unit decrease in mpg, assuming a linear relationship
# Yes, there is a significant effect of weight on fuel consumption since p = 1.29e-10 is less than the significance level of 0.05
In the previous part we had a continuous response variable. Here we will instead consider a binary outcome variable.
For a binary outcome we can use logistic regression, which can be performed using the `glm()` function in R. The abbreviation glm stands for generalized linear models and, as the name indicates, this function can do many different regressions. You need to specify the `family` argument to tell R which type of data you have. The default is `family = "gaussian"`, which is the same as a linear regression using `lm()`. To make a logistic regression you need to set `family = "binomial"`. Have a look at the help file for `glm()` to see other options for `family`.
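As a quick sanity check of the claim that the default `family = "gaussian"` reproduces `lm()`, one can fit the same model both ways (here using the mtcars variables from the previous part):

```r
# With the default family = "gaussian", glm() gives the same fit as lm()
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, data = mtcars, family = "gaussian")
coef(fit_lm)
coef(fit_glm)  # identical coefficient estimates
```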
- Now, using the built-in dataset mtcars, we will perform a logistic regression to evaluate if the fuel consumption (`mpg`) is associated with whether a car has automatic or manual transmission. To do this we use the `glm` command with the argument `family = "binomial"` to perform the logistic regression. Start by piping mtcars to the `glm` function, using `am` as the dependent variable and `mpg` as the independent variable, and pipe it further to the `summary` function to get coefficient estimates and p-values. HINT: `data = .` within the `glm` command tells it to use the data on the left side of the pipe.
Solution
mtcars %>%
glm(am ~ mpg,
data = .,
family = "binomial") %>%
summary()
- Does `mpg` have a significant association with whether a car is automatic?
Solution
Yes, there is a significant association between mpg and the dependent variable am (whether a car is automatic), since
p = 0.00751 is less than the significance level of 0.05
- Now make a new logistic regression where you also include the variable weight (
wt
) to the previous model. Is mpg still significant and what is the p-value?
Solution
mtcars %>%
glm(am ~ mpg + wt,
data = .,
family = "binomial") %>%
summary()
# mpg is not significant anymore since p > 0.05 (=0.176)
The output from the `glm()` function gives effect estimates as log odds. Often it is preferred to present odds ratios. Log odds can be translated to odds ratios by taking the exponent of the log odds. Doing this in base R is tedious, therefore we will use the convenient helper function `tidy` from the `broom` package. To translate the output from the `glm()` function into odds ratios and confidence intervals we set the arguments `exponentiate = TRUE` and `conf.int = TRUE` within the `tidy` function. HINT: If you want extra information about how to use tidy in conjunction with glm, use `?tidy.glm`
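For comparison, a sketch of what the base-R route looks like, assuming the exercise 4 model (`am ~ mpg + wt`); `tidy(exponentiate = TRUE)` applies the same `exp()` transformation for you:

```r
# Base-R alternative to broom::tidy(exponentiate = TRUE, conf.int = TRUE)
fit <- glm(am ~ mpg + wt, data = mtcars, family = "binomial")
exp(coef(fit))      # odds ratios (exponentiated log odds)
exp(confint(fit))   # matching profile-likelihood confidence intervals
```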
- Install and load the
broom
package.
install.packages("broom")
library(broom)
- Now, extract odds ratios and confidence intervals for the model in exercise 4.
Solution
mtcars %>%
glm(am ~ mpg + wt,
data = .,
family = "binomial") %>%
tidy(exponentiate = TRUE, conf.int = TRUE)
One can use a likelihood ratio test to test if a more complex model is significantly better than a simpler model. This is done by specifying the two fitted models as two separate arguments to the `anova()` function together with the third argument `test = "LRT"`, i.e. `anova(mod0, mod1, test = "LRT")`.
- Use the `anova()` function to test if the more complex model from exercise 4 is significantly better than the simpler model from exercise 1. HINT: Start by storing the models in objects called `mod0` and `mod1` and then call the `anova()` function with them as two separate arguments together with the third argument `test = "LRT"`.
Solution
mod0 <- mtcars %>%
glm(am ~ mpg,
data = .,
family = "binomial")
mod1 <- mtcars %>%
glm(am ~ mpg + wt,
data = .,
family = "binomial")
anova(mod0, mod1, test = "LRT")
# p=0.0004089, so inclusion of the variable wt provides a significantly better fit
# to the data compared to the simpler model.
In survival analysis we want to find variables associated with the time to some event (e.g. death, disease occurrence, etc.). We can use Kaplan-Meier curves to compare survival between groups, and we can use proportional hazards Cox regression to test for association with survival time.
Here we will be using the R package survival
, and the dataset ovarian
that comes with the package. The dataset contains survival data from a randomized trial comparing two treatments for ovarian cancer. We are going to study if the two treatments have different effect on survival.
- Install and load the
survival
package.
install.packages("survival")
library(survival)
- Get to know the dataset by typing `?ovarian` and use the function `head` on the dataset `ovarian`.
- Find out how many patients are included in the data.
Solution
nrow(ovarian)
- The variables `resid.ds`, `rx` and `ecog.ps` are binary variables, but they are stored as numeric (1s and 2s). For the values to make sense we will recode them to factors. `rx` contains information about which treatment the patient had. Create a new variable, `rx_cat`, within the dataset by changing it to a factor with `ovarian$rx_cat <- factor(ovarian$rx, levels = c("1", "2"), labels = c("A", "B"))`
Similarly recode resid.ds to `resid_cat` and ecog.ps to `ecog_cat` according to the specification below:
  - `resid.ds` (residual disease) 1 -> no and 2 -> yes
  - `ecog.ps` (patients' performance status) 1 -> good and 2 -> bad
Solution
ovarian$resid_cat <- factor(ovarian$resid.ds,
levels = c("1", "2"),
labels = c("no", "yes"))
ovarian$ecog_cat <- factor(ovarian$ecog.ps,
levels = c("1", "2"),
labels = c("good", "bad"))
The variable `futime` contains the follow-up time for the patients and `fustat` contains the event and censoring status (1 = event, 0 = censored). In survival analysis the survival time together with the event status is the response variable. To do the analyses we need to create a survival object that we can use as a response variable in our analyses. This is done by using the `Surv()` function with the time and event variables as arguments.
- Use the `Surv()` function with follow-up time (`futime`) and event (`fustat`) as arguments. Store this object in a variable named `surv`.
Solution
surv <- Surv(ovarian$futime, ovarian$fustat)
- How many patients are censored? HINT: Use the `table` function on the variable `fustat` and look for the number of zeroes.
Solution
table(ovarian$fustat)
# 14 patients are censored
A Kaplan-Meier plot is a good visualization of the survival in the two treatment groups. The Kaplan-Meier plot function needs a fitted model to create the plot, and we will use a function called `survfit()` to fit the model. Check the help on the `survfit()` function for details.
- Fit a model with your `surv` object as response and the group variable `rx_cat` from the `ovarian` data as predictor
Solution
survfit(surv ~ rx_cat, data=ovarian)
# Note that a shorter way is to specify the Surv object within the survfit call:
# survfit(Surv(futime, fustat) ~ rx_cat, data=ovarian)
- Create a simple Kaplan-Meier plot by piping the previous code from exercise 9 to the `plot` function
Solution
survfit(surv ~ rx_cat, data=ovarian) %>%
plot
- The plot in exercise 10 is not visually appealing and not very informative. To make better Kaplan-Meier plots, install and load the `survminer` package. This package contains the function `ggsurvplot` to plot Kaplan-Meier curves.
install.packages("survminer")
library(survminer)
- Create a Kaplan-Meier plot by piping the code from exercise 9 to the `ggsurvplot` function. To display a p-value within the plot, add the argument `pval = TRUE` within the `ggsurvplot` function. Which treatment is better, and is the difference statistically significant?
Solution
survfit(surv ~ rx_cat, data=ovarian) %>%
ggsurvplot(pval=TRUE)
# Treatment B is better but not significantly better since p > 0.05 (p = 0.3)
Kaplan-Meier is a non-parametric method that is limited to studying one factor variable at a time. An alternative is the Cox proportional hazards regression, a semi-parametric method in which both factor and continuous variables can be included in the model. This makes it possible to adjust for certain (confounding) variables.
- Use the `coxph()` function to make a Cox regression. Let your `surv` object be the response, add the predictors `rx_cat` and `ecog_cat`, and pipe the result to the function `tidy` from the `broom` package, similar to what we did in exercise 6 under Logistic regression. Set the arguments `exponentiate = TRUE` and `conf.int = TRUE` so that it outputs hazard ratios and matching confidence intervals. What is the p-value for `rx_cat` after adjusting for `ecog_cat`?
Solution
coxph(surv ~ rx_cat + ecog_cat, data=ovarian) %>%
tidy(exponentiate=TRUE, conf.int=TRUE)
# The rx_cat variable is not statistically significant at the significance level 0.05 (p = 0.325)
- One way to display the results from a Cox proportional hazards regression is to visualize them in a forest plot. This can be done using the function `ggforest` from the package `survminer`. Replace the code after the last pipe in the previous exercise with `ggforest(data = ovarian)`
Solution
coxph(surv ~ rx_cat + ecog_cat, data=ovarian) %>%
ggforest(data=ovarian)
Home: R programming
Developed by Malin Östensson, 2017. Modified by Maria Nethander, 2018. Modified by Björn Andersson and Jari Martikainen, 2024.