Time Series

The following code assumes AP is an object of type ts (i.e., class(AP) == "ts"):

data(AirPassengers)
AP <- AirPassengers
start(AP); end(AP); frequency(AP)

Printing AP displays the monthly counts; start(AP), end(AP), and frequency(AP) then return:

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
...

[1] 1949 1
[1] 1960 12
[1] 12

A simple plot of passengers over time can be obtained by:

plot(AP, ylab = "Passengers (1000's)")

The aggregate function can be used to remove seasonal effects:

plot(aggregate(AP))

To read a table from the web and convert it to a ts object:

www <- "http://www.massey.ac.nz/~pscowper/ts/Maine.dat"
Maine.month <- read.table(www, header = TRUE)
attach(Maine.month)
Maine.month.ts <- ts(unemploy, start = c(1996, 1), freq = 12)
Maine.annual.ts <- aggregate(Maine.month.ts)/12

layout(1:2)
plot(Maine.month.ts, ylab = "unemployed (%)")
plot(Maine.annual.ts, ylab = "unemployed (%)")

We can calculate monthly variations and compare them to the average using the window command:

Maine.Feb <- window(Maine.month.ts, start = c(1996,2), freq = TRUE)
Maine.Aug <- window(Maine.month.ts, start = c(1996,8), freq = TRUE)
Feb.ratio <- mean(Maine.Feb) / mean(Maine.month.ts)
Aug.ratio <- mean(Maine.Aug) / mean(Maine.month.ts)

Printing Feb.ratio or Aug.ratio shows the ratio; a value of 1.223, for example, means unemployment in that month is about 22% higher than the monthly average.

You can create multiple time series from data:

www <- "http://www.massey.ac.nz/~pscowper/ts/cbe.dat"
CBE <- read.table(www, header = T)

> CBE[1:4, ]
choc beer elec
1 1451 96.3 1497
2 2037 84.4 1463
3 2477 91.2 1648
4 2785 81.9 1595
...

The data above contains three series plotted below:

Elec.ts <- ts(CBE[, 3], start = 1958, freq = 12)
Beer.ts <- ts(CBE[, 2], start = 1958, freq = 12)
Choc.ts <- ts(CBE[, 1], start = 1958, freq = 12)
plot(cbind(Elec.ts, Beer.ts, Choc.ts))

You can restrict two time series to their common (overlapping) period with ts.intersect and plot them together as follows:

AP.elec <- ts.intersect(AP, Elec.ts)
start(AP.elec)
[1] 1958 1

end(AP.elec)
[1] 1960 12

AP <- AP.elec[,1]; Elec <- AP.elec[,2]

layout(1:2)
plot(AP, main = "", ylab = "Air passengers / 1000's")
plot(Elec, main = "", ylab = "Electricity production / MkWh")
plot(as.vector(AP), as.vector(Elec),
xlab = "Air passengers / 1000's",
ylab = "Electricity production / MWh")
abline(reg = lm(Elec ~ AP))

And find the correlation between the two using:

cor(AP, Elec) 
# remember that correlation does not imply causation
# remember to remove trends and seasonal effects first, otherwise two series
# may appear correlated simply because both contain a trend
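
As a small illustration of the second point (an added sketch, not from the original): two unrelated series that both contain a trend show a large correlation, which largely disappears once the trends are removed.

set.seed(1)
tm <- 1:100
a <- 0.1 * tm + rnorm(100)   # two independent series, both with a linear trend
b <- 0.1 * tm + rnorm(100)
cor(a, b)                    # large, even though a and b are unrelated
cor(diff(a), diff(b))        # close to zero once the trends are removed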

Financial data is often reported in quarterly periods of three months, with the first quarter being January to March and the last quarter being October to December. The data can be read into R from the book website and converted to a quarterly time series as follows:

www <- "http://www.massey.ac.nz/~pscowper/ts/pounds_nz.dat"
Z <- read.table(www, header = T)
Z[1:4, ]
[1] 2.92 2.94 3.17 3.25

Z.ts <- ts(Z, st = 1991, fr = 4)
plot(Z.ts, xlab = "time / years", ylab = "Quarterly exchange rate in $NZ / pound")

The window function can be used to extract subseries:

Z.92.96 <- window(Z.ts, start = c(1992, 1), end = c(1996, 1))
Z.96.98 <- window(Z.ts, start = c(1996, 1), end = c(1998, 1))

layout (1:2)
plot(Z.92.96, ylab = "Exchange rate in $NZ/pound", xlab = "Time (years)" )
plot(Z.96.98, ylab = "Exchange rate in $NZ/pound", xlab = "Time (years)" )

Another useful dataset is the global temperature series:

www <- "http://www.massey.ac.nz/~pscowper/ts/global.dat"
Global <- scan(www)
Global.ts <- ts(Global, st = c(1856, 1), end = c(2005, 12), fr = 12)
# The trend is of most concern, so aggregate is used to remove any seasonal
# effects within each year and produce an annual series of mean temperatures
# for the period 1856 to 2005. Specifying FUN = mean avoids dividing by 12.
Global.annual <- aggregate(Global.ts, FUN = mean)
plot(Global.ts)
plot(Global.annual)

New.series <- window(Global.ts, start=c(1970, 1), end=c(2005, 12))
New.time <- time(New.series)
plot(New.series); abline(reg=lm(New.series ~ New.time))

Note that we cannot simply read meaning into such correlations: two unrelated time series will appear correlated if they both contain a trend.

We can estimate trends and seasonal effects using a moving average method with decompose.

plot(decompose(Elec.ts))
Elec.decom <- decompose(Elec.ts, type = "mult")
plot(Elec.decom)
Trend <- Elec.decom$trend
Seasonal <- Elec.decom$seasonal
ts.plot(cbind(Trend, Trend * Seasonal), lty = 1:2)

If the original series shows variance that increases with its level, a multiplicative decomposition is appropriate. If the random component also grows with the level, an additive decomposition of the logged series may be better.
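
As a minimal sketch of the log-based alternative, using the Elec.ts series defined above:

# Additive decomposition of the logged series; an alternative when the random
# component also grows with the level of the series.
plot(decompose(log(Elec.ts)))   # type = "additive" is the default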

The cycle function can be used to extract the season for each item of data. The plots can be put in a single graphics window using the layout function, which takes as input a vector (or matrix) giving the location of each plot in the display window. The code below produces the annual series and a boxplot of the monthly values:

layout(1:2)
plot(aggregate(AP))
boxplot(AP ~ cycle(AP))

The autocorrelations of x are stored in the vector acf(x)$acf, with the lag k autocorrelation located in acf(x)$acf[k+1]. For example, the lag 1 autocorrelation for waveht is

> acf(waveht)$acf[2]

The first entry, acf(waveht)$acf[1], is r0 and equals 1.

Although we want to know about trends and seasonal patterns in a time series, we do not necessarily rely on the correlogram to identify them. The main use of the correlogram is to detect autocorrelations in the time series after we have removed an estimate of the trend and seasonal variation. In the code below, the air passenger series is seasonally adjusted and the trend removed using decompose. To plot the random component and draw the correlogram, we need to remember that a consequence of using a centred moving average of 12 months to smooth the time series, and thereby estimate the trend, is that the first six and last six terms in the random component cannot be calculated and are thus stored in R as NA.

> data(AirPassengers)
> AP <- AirPassengers
> AP.decom <- decompose(AP, "multiplicative")
> plot(ts(AP.decom$random[7:138]))
> acf(AP.decom$random[7:138])

The correlogram suggests either a damped cosine shape that is characteristic of an autoregressive model of order 2 or that the seasonal adjustment has not been entirely effective.

The latter explanation is unlikely because the decomposition does estimate twelve independent monthly indices. If we investigate further, we see that the standard deviation of the original series from July until June is 109, the standard deviation of the series after subtracting the trend estimate is 41, and the standard deviation after seasonal adjustment is just 0.03. The reduction in the standard deviation shows that the seasonal adjustment has been very effective.

sd(AP[7:138])
[1] 109
> sd(AP[7:138] - AP.decom$trend[7:138])
[1] 41.1
> sd(AP.decom$random[7:138])
[1] 0.0335

Monthly effective inflows (m³ s⁻¹) to the Font Reservoir in Northumberland for the period from January 1909 until December 1980 have been provided by Northumbrian Water PLC. There is a statistically significant correlation at lag 1. The physical interpretation is that the inflow next month is more likely than not to be above average if the inflow this month is above average. Similarly, if the inflow this month is below average, it is more likely than not that next month's inflow will be below average. The explanation is that the groundwater supply can be thought of as a slowly discharging reservoir: if groundwater is high one month, it will augment inflows and is likely to do so next month as well.

Given this explanation, you may be surprised that the lag 1 correlation is not higher. The reason is that most of the inflow is runoff following rainfall, and in Northumberland there is little correlation between seasonally adjusted rainfall in consecutive months. An exponential decay in the correlogram is typical of a first-order autoregressive model, and the correlogram of the adjusted inflows is consistent with an exponential decay. However, given the sampling errors for a time series of this length, estimates of autocorrelation at higher lags are unlikely to be statistically significant. This is not a practical limitation because such low correlations are inconsequential.

When we come to identify suitable models, we should remember that there is no one correct model and that there will often be a choice of suitable models. We may make use of a specific statistical criterion such as Akaike's information criterion (AIC).

www <- "http://www.massey.ac.nz/~pscowper/ts/Fontdsdt.dat"
> Fontdsdt.dat <- read.table(www, header=T)
> attach(Fontdsdt.dat)
> plot(ts(adflow), ylab = 'adflow')
> acf(adflow, xlab = 'lag (months)', main="")

When R complains about NAs (introduced, for example, by the centred moving average in decompose), you can omit them from the calculation:

acf(na.omit(Font.deco$random))

The ts.union function binds time series with a common frequency, padding with NAs to the union of their time coverages. If ts.union is used within the acf command, R returns the correlograms for the two variables and the cross-correlograms in a single figure. (App.ts and Act.ts here are the quarterly building approvals and building activity series from the book.)

acf(ts.union(App.ts, Act.ts))

The ccf can be calculated for any two time series that overlap, but if they both have trends or similar seasonal effects, these will dominate. Here we remove the trend using decompose, which uses a centred moving average of the four quarters:

app.ran <- decompose(App.ts)$random
> app.ran.ts <- window (app.ran, start = c(1996, 3) )
> act.ran <- decompose (Act.ts)$random
> act.ran.ts <- window (act.ran, start = c(1996, 3) )
> acf (ts.union(app.ran.ts, act.ran.ts))
> ccf (app.ran.ts, act.ran.ts)
print(acf(ts.union(app.ran.ts, act.ran.ts)))

Fitting a Bass model to yearly sales of VCRs in the US home market between 1980 and 1989 (Bass website) using the R non-linear least squares function nls. The variable T79 is the year from 1979, and the variable Tdelt is the time from 1979 at a finer resolution of 0.1 year for plotting the Bass curves. The cumulative sum function cumsum is useful for monitoring changes in the mean level of the process.

T79 <- 1:10
> Tdelt <- (1:100) / 10
> Sales <- c(840,1470,2110,4000, 7590, 10950, 10530, 9470, 7790, 5890)
> Cusales <- cumsum(Sales)
> Bass.nls <- nls(Sales ~ M * ( ((P+Q)^2 / P) * exp(-(P+Q) * T79) ) /
(1+(Q/P)*exp(-(P+Q)*T79))^2, start = list(M=60630, P=0.03, Q=0.38))
> summary(Bass.nls)

Bcoef <- coef(Bass.nls)
m <- Bcoef[1]
p <- Bcoef[2]
q <- Bcoef[3]
ngete <- exp(-(p+q) * Tdelt)
Bpdf <- m * ( (p+q)^2 / p ) * ngete / (1 + (q/p) * ngete)^2
plot(Tdelt, Bpdf, xlab = "Year from 1979",
ylab = "Sales per year", type='l')
points(T79, Sales)
Bcdf <- m * (1 - ngete)/(1 + (q/p)*ngete)
plot(Tdelt, Bcdf, xlab = "Year from 1979",
ylab = "Cumulative sales", type='l')
points(T79, Cusales)

In R, simulation is usually straightforward, and most standard statistical distributions are simulated using a function that has an abbreviated name for the distribution prefixed with an 'r' (for 'random'). For example, rnorm(100) is used to simulate 100 independent standard normal variables, which is equivalent to simulating a Gaussian white noise series of length 100.

Other prefixes are also available to calculate properties for standard distributions; e.g., the prefix ‘d’ is used to calculate the probability (density) function.
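
A quick illustration of these prefixes for the normal distribution (the specific calls are just examples):

rnorm(5)       # 'r': five simulated standard normal values
dnorm(0)       # 'd': density of N(0, 1) at zero, about 0.399
pnorm(1.96)    # 'p': distribution function, P(Z <= 1.96), about 0.975
qnorm(0.975)   # 'q': quantile function, about 1.96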

The function set.seed is used to provide a starting point (or seed) in the simulations, thus ensuring that the simulations can be reproduced.

set.seed(1)
w <- rnorm(100)
plot(w, type = "l")

A histogram of a Gaussian white noise series, with the standard normal density curve superimposed:

x <- seq(-3,3, length = 1000)
> hist(rnorm(100), prob = T); points(x, dnorm(x), type = "l")

The following commands can be used to simulate random walk data for x. The first command places a white noise series into w and uses this series to initialise x; the for loop then generates the random walk:

> x <- w <- rnorm(1000)
> for (t in 2:1000) x[t] <- x[t - 1] + w[t]
> plot(x, type = "l"

The first-order differences of a random walk are a white noise series, so the correlogram of the series of differences can be used to assess whether a given series is reasonably modelled as a random walk.

acf(diff(x))

The pacf function can be used to plot the partial autocorrelations and identify the order of an AR process (an AR(p) process will have no significant partial autocorrelation beyond lag p):

> set.seed(1)
> x <- w <- rnorm(100)
> for (t in 2:100) x[t] <- 0.7 * x[t - 1] + w[t]
> plot(x, type = "l")
> acf(x)
> pacf(x)

An AR(p) model can be fitted to data in R using the ar function. In the code below, the autoregressive model x.ar is fitted to the simulated series of the last section, and an approximate 95% confidence interval for the underlying parameter is given, where the (asymptotic) variance of the parameter estimate is extracted using x.ar$asy.var:

x.ar <- ar(x, method = "mle")
x.ar$order
[1] 1
x.ar$ar
[1] 0.601
x.ar$ar + c(-2, 2) * sqrt(x.ar$asy.var)
[1] 0.4404 0.7615

The method "mle" used in the fitting procedure above is based on maximising the likelihood function (the probability of obtaining the data given the model) with respect to the unknown parameters. The order p of the process is chosen using the Akaike Information Criterion (AIC), which penalises models with too many parameters.

In the function ar, the model with the smallest AIC is selected as the best- fitting AR model.

Z.ar <- ar(Z.ts)

mean(Z.ts)
[1] 2.823
Z.ar$order
[1] 1
Z.ar$ar
[1] 0.8903
Z.ar$ar + c(-2, 2) * sqrt(Z.ar$asy.var)
[1] 0.7405 1.0400
acf(Z.ar$res[-1])

In the code above, a -1 is used in the vector of residuals to remove the first item from the residual series. (For a fitted AR(1) model, the first item has no predicted value because there is no observation at t = 0; in general, the first p values will be 'not available' (NA) in the residual series of a fitted AR(p) model.) By default, the mean is subtracted before the parameters are estimated, so a predicted value ẑ_t at time t based on the output above is given by ẑ_t = 2.8 + 0.89(z_{t−1} − 2.8).
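
A minimal sketch of that prediction computed by hand from the fitted object (Z.ar and Z.ts as above; ar stores the estimated mean in the component x.mean):

z <- as.vector(Z.ts)
z.hat <- Z.ar$x.mean + Z.ar$ar * (z[length(z)] - Z.ar$x.mean)  # one-step-ahead forecast
z.hat
predict(Z.ar, n.ahead = 1)   # gives the same prediction together with a standard error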

Consider the following AR model fitted to the mean annual temperature series:

www = "http://www.massey.ac.nz/~pscowper/ts/global.dat" Global = scan(www) Global.ts = ts(Global, st = c(1856, 1), end = c(2005, 12), fr = 12) 86 4 Basic Stochastic Models Global.ar <- ar(aggregate(Global.ts, FUN = mean), method = "mle") mean(aggregate(Global.ts, FUN = mean)) [1] -0.1383 Global.ar$order [1] 4 Global.ar$ar [1] 0.58762 0.01260 0.11117 0.26764 acf(Global.ar$res[-(1:Global.ar$order)], lag = 50)

Based on the output above, a predicted mean annual temperature x̂_t at time t is given by

x̂_t = −0.14 + 0.59(x_{t−1} + 0.14) + 0.013(x_{t−2} + 0.14) + 0.11(x_{t−3} + 0.14) + 0.27(x_{t−4} + 0.14)

The correlogram of the residuals has only one (marginally) significant value, at lag 27, so the underlying residual series could be white noise. Thus the fitted AR(4) model provides a good fit to the data. As the AR model has no deterministic trend component, the trends in the data can be explained by serial correlation and random variation, implying that it is possible that these trends are stochastic (or could arise from a purely stochastic process).

In time series regression, it is common for the error series {z_t} to be autocorrelated. In the code below, a time series with an increasing straight-line trend (50 + 3t) and autocorrelated errors is simulated and plotted:

set.seed(1)
z <- w <- rnorm(100, sd = 20)
for (t in 2:100) z[t] <- 0.8 * z[t - 1] + w[t]
Time <- 1:100
x <- 50 + 3 * Time + z
plot(x, xlab = "time", type = "l")

Linear models are usually fitted by minimising the sum of squared errors, Σ z_t² = Σ (x_t − α_0 − α_1 u_{1,t} − … − α_m u_{m,t})², which is achieved in R using the function lm:

x.lm <- lm(x ~ Time)
coef(x.lm)
(Intercept)        Time
      58.55        3.06
sqrt(diag(vcov(x.lm)))
(Intercept)        Time
     4.8801      0.0839

In the code above, the estimated parameters of the linear model are extracted using coef. The standard errors are obtained as the square root of the diagonal elements of vcov, although these standard errors are likely to be underestimated because of autocorrelation in the residuals. The function summary can also be used to obtain this information, but it tends to give additional output, for example t-tests, which may be incorrect for a time series regression analysis due to autocorrelation in the residuals.

After fitting a regression model, we should consider various diagnostic plots. In the case of time series regression, an important diagnostic plot is the correlogram of the residuals:

acf(resid(x.lm))
pacf(resid(x.lm))

As expected, the residual time series is autocorrelated. Only the lag 1 partial autocorrelation is significant, which suggests that the residual series follows an AR(1) process. Again this should be expected, given that an AR(1) process was used to simulate these residuals.

The following regression model is fitted to the global temperature series from 1970 onwards, and approximate 95% confidence intervals are given for the parameters using confint. The explanatory variable is the time, so the function time is used to extract the 'times' from the ts temperature object.

www <- "http://www.massey.ac.nz/~pscowper/ts/global.dat" Global <- scan(www) Global.ts <- ts(Global, st = c(1856, 1), end = c(2005, 12), fr = 12) temp <- window(Global.ts, start = 1970) temp.lm <- lm(temp ~ time(temp)) coef(temp.lm) (Intercept) time(temp) -34.9204 0.0177 confint(temp.lm) 2.5 % 97.5 % (Intercept) -37.2100 -32.6308 time(temp) 0.0165 0.0188 acf(resid(lm(temp ~ time(temp))))

The confidence interval for the slope does not contain zero, which would provide statistical evidence of an increasing trend in global temperatures if the autocorrelation in the residuals were negligible. However, the residual series is positively autocorrelated at shorter lags, leading to an underestimate of the standard error and too narrow a confidence interval for the slope. Intuitively, the positive correlation between consecutive values reduces the effective record length because similar values will tend to occur together.

The following example illustrates how to fit a regression model to the simulated series above using generalised least squares:

library(nlme)
x.gls <- gls(x ~ Time, cor = corAR1(0.8))
coef(x.gls)
(Intercept)        Time
      58.23        3.04
sqrt(diag(vcov(x.gls)))
(Intercept)        Time
     11.925       0.202

A lag 1 autocorrelation of 0.8 is used above because this value was used to simulate the data. For historical series, the lag 1 autocorrelation would need to be estimated from the correlogram of the residuals of a fitted linear model; i.e., a linear model should first be fitted by ordinary least squares (OLS) and the lag 1 autocorrelation read off from a correlogram plot of the residuals of the fitted model.

In the example above, the standard errors of the parameters are considerably greater than those obtained from OLS using lm and are more accurate as they take the autocorrelation into account. The parameter estimates from GLS will generally be slightly different from those obtained with OLS, because of the weighting. For example, the slope is estimated as 3.06 using lm but 3.04 using gls. In principle, the GLS estimators are preferable because they have smaller standard errors.
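
Putting the two previous paragraphs together, a minimal sketch of the OLS-then-GLS workflow on the simulated x and Time from above (x.ols and x.gls.est are hypothetical names):

x.ols <- lm(x ~ Time)                           # step 1: ordinary least squares
r1 <- acf(resid(x.ols), plot = FALSE)$acf[2]    # step 2: lag 1 autocorrelation of the residuals
r1
x.gls.est <- gls(x ~ Time, cor = corAR1(r1))    # step 3: refit by GLS using that estimate (nlme already loaded)
coef(x.gls.est)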

To calculate an approximate 95% confidence interval for the trend in the global temperature series (1970–2005), GLS is used to estimate the standard error accounting for the autocorrelation in the residual series. In the gls function, the residual series is approximated as an AR(1) process with a lag 1 autocorrelation of 0.7, which is used as a parameter in the gls function:

> temp.gls <- gls(temp ~ time(temp), cor = corAR1(0.7))
> confint(temp.gls)
2.5 % 97.5 %
(Intercept) -39.8057 -28.4966
time(temp) 0.0144 0.0201

Although the confidence intervals above are now wider, zero is not contained in the intervals, which implies that the estimates are statistically significant, and, in particular, that the trend is significant. Thus, there is statistical evidence of an increasing trend in global temperatures over the period 1970–2005, so that, if current conditions persist, temperatures may be expected to continue to rise in the future.

The parameters of a straight-line trend with additive seasonal indices can be estimated for the temperature series (1970–2005) as follows:

> Seas <- cycle(temp)
> Time <- time(temp)
> temp.lm <- lm(temp ~ 0 + Time + factor(Seas))
> coef(temp.lm)
Time factor(Seas)1 factor(Seas)2 factor(Seas)3
0.0177 -34.9973    -34.9880      -35.0100

factor(Seas)4 factor(Seas)5 factor(Seas)6 factor(Seas)7
-35.0123      -35.0337      -35.0251      -35.0269

factor(Seas)8 factor(Seas)9 factor(Seas)10 factor(Seas)11
-35.0248      -35.0383      -35.0525       -35.0656

factor(Seas)12 
-35.0487

A zero is used within the formula to ensure that the model does not have an intercept. If the intercept is included in the formula, one of the seasonal terms will be dropped and an estimate for the intercept will appear in the output. However, the fitted models, with or without an intercept, would be equivalent, as can be easily verified by rerunning the algorithm above without the zero in the formula. The parameters can also be estimated by GLS by replacing lm with gls in the code above. Using the above fitted model, a two-year-ahead future prediction for the temperature series is obtained as follows:

> new.t <- seq(2006, len = 2 * 12, by = 1/12)
> alpha <- coef(temp.lm)[1]
> beta <- rep(coef(temp.lm)[2:13], 2)
> (alpha * new.t + beta)[1:4]
factor(Seas)1 factor(Seas)2 factor(Seas)3 factor(Seas)4
0.524         0.535         0.514         0.514

Alternatively, the predict function can be used to make forecasts provided the new data are correctly labelled within a data.frame:

> new.dat <- data.frame(Time = new.t, Seas = rep(1:12, 2))
> predict(temp.lm, new.dat)[1:24]
1     2     3     4     5     6     7     8     9     10    11    12
0.524 0.535 0.514 0.514 0.494 0.504 0.503 0.507 0.495 0.482 0.471 0.489
13    14    15    16    17    18    19    20    21    22    23    24
0.542 0.553 0.532 0.531 0.511 0.521 0.521 0.525 0.513 0.500 0.488 0.507

The code below illustrates just one of many possible combinations of harmonics that could be used to model a wide range of underlying seasonal patterns. The second plot below might be a more realistic underlying seasonal pattern than the first, as it perturbs the standard sine wave by adding another two harmonic terms of frequencies 2/12 and 4/12:

TIME <- seq(1, 12, len = 1000)
plot(TIME, sin(2 * pi * TIME/12), type = "l")
plot(TIME, sin(2 * pi * TIME/12) + 0.2 * sin(2 * pi * 2 * TIME/12) +
     0.1 * sin(2 * pi * 4 * TIME/12) + 0.1 * cos(2 * pi * 4 * TIME/12), type = "l")

It is straightforward to simulate a series based on a harmonic model. For example, suppose the underlying model is

x_t = 0.1 + 0.005t + 0.001t² + sin(2πt/12) + 0.2 sin(4πt/12) + 0.1 sin(8πt/12) + 0.1 cos(8πt/12) + w_t

where {w_t} is Gaussian white noise with standard deviation 0.5. This model has the same seasonal harmonic components as the second plot above but also contains an underlying quadratic trend. Using the code below, a series of length 10 years is simulated.

set.seed(1)
TIME <- 1:(10 * 12)
w <- rnorm(10 * 12, sd = 0.5)
Trend <- 0.1 + 0.005 * TIME + 0.001 * TIME^2
Seasonal <- sin(2 * pi * TIME/12) + 0.2 * sin(2 * pi * 2 * TIME/12) +
            0.1 * sin(2 * pi * 4 * TIME/12) + 0.1 * cos(2 * pi * 4 * TIME/12)
x <- Trend + Seasonal + w
plot(x, type = "l")

When fitting models with many harmonic terms, it is convenient to place the harmonic variables in matrices, which can be achieved as follows:

SIN <- COS <- matrix(nr = length(TIME), nc = 6)
for (i in 1:6) {
  COS[, i] <- cos(2 * pi * i * TIME/12)
  SIN[, i] <- sin(2 * pi * i * TIME/12)
}

In most cases, the order of the harmonics and polynomial trend will be unknown. However, the harmonic coefficients are known to be independent, which means that all harmonic coefficients that are not statistically significant can be dropped. It is largely a subjective decision on the part of the statistician to decide what constitutes a significant variable. An approximate t-ratio of magnitude 2 is a common choice and corresponds to an approximate 5% significance level. This t-ratio can be obtained by dividing the estimated coefficient by the standard error of the estimate. The following example illustrates the procedure applied to the simulated series of the last section:

x.lm1 <- lm(x ~ TIME + I(TIME^2) + COS[, 1] + SIN[, 1] + COS[, 2] + SIN[, 2] +
            COS[, 3] + SIN[, 3] + COS[, 4] + SIN[, 4] + COS[, 5] + SIN[, 5] +
            COS[, 6] + SIN[, 6])

coef(x.lm1)/sqrt(diag(vcov(x.lm1)))
(Intercept)        TIME   I(TIME^2)    COS[, 1]    SIN[, 1]    COS[, 2]
      1.239       1.125      25.933       0.328      15.442      -0.515
   SIN[, 2]    COS[, 3]    SIN[, 3]    COS[, 4]    SIN[, 4]    COS[, 5]
      3.447       0.232      -0.703       0.228       1.053      -1.150
   SIN[, 5]    COS[, 6]    SIN[, 6]
      0.857      -0.310       0.382

The preceding output has three significant coefficients. These are used in the following model:

x.lm2 <- lm(x ~ I(TIME^2) + SIN[, 1] + SIN[, 2])
coef(x.lm2)/sqrt(diag(vcov(x.lm2)))
(Intercept)   I(TIME^2)    SIN[, 1]    SIN[, 2]
       4.63      111.14       15.79        3.49

As can be seen in the output from the last command, the coefficients are all significant. The estimated coefficients of the best-fitting model are given by:

coef(x.lm2)
(Intercept)   I(TIME^2)    SIN[, 1]    SIN[, 2]
    0.28040     0.00104     0.90021     0.19886

The coefficients above give the following model for predictions at time t:

x̂_t = 0.280 + 0.00104t² + 0.900 sin(2πt/12) + 0.199 sin(4πt/12)

The AIC can be used to compare the two fitted models:

AIC(x.lm1)
[1] 165
AIC(x.lm2)
[1] 150

As expected, the last model has the smallest AIC and therefore provides the best fit to the data. Due to sampling variation, the best-fitting model is not identical to the model used to simulate the data, as can easily be verified by taking the AIC of the known underlying model:

AIC(lm(x ~ TIME + I(TIME^2) + SIN[,1] + SIN[,2] + SIN[,4] + COS[,4]))
[1] 153

In R, the algorithm step can be used to automate the selection of the best-fitting model by the AIC. For the example above, the appropriate command is step(x.lm1), which contains all the predictor variables in the form of the first model.
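
As a minimal sketch of that automated selection (x.lm.step is a hypothetical name; its output is not shown in the original):

x.lm.step <- step(x.lm1, trace = 0)  # backward AIC-based selection starting from the full model
AIC(x.lm.step)                       # compare with the AIC values reported above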

A best fit can equally well be based on choosing the model that leads to the smallest estimated standard deviations of the errors, provided the degrees of freedom are taken into account.
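
For example, the residual standard errors of the two fitted models (which already account for the degrees of freedom used by each model) could be compared as a sketch:

summary(x.lm1)$sigma  # residual standard error of the full harmonic model
summary(x.lm2)$sigma  # residual standard error of the reduced model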

Harmonic model fitted to temperature series (1970–2005)

In the code below, a harmonic model with a quadratic trend is fitted to the temperature series (1970–2005). The units for the 'time' variable are 'years', so the divisor of 12 is not needed when creating the harmonic variables. To reduce computation error in the OLS procedure due to large numbers, the TIME variable is standardised after the COS and SIN predictors have been calculated.

SIN <- COS <- matrix(nr = length(temp), nc = 6)
for (i in 1:6) {
  COS[, i] <- cos(2 * pi * i * time(temp))
  SIN[, i] <- sin(2 * pi * i * time(temp))
}
TIME <- (time(temp) - mean(time(temp)))/sd(time(temp))
mean(time(temp))
[1] 1988
sd(time(temp))
[1] 10.4
temp.lm1 <- lm(temp ~ TIME + I(TIME^2) + COS[,1] + SIN[,1] + COS[,2] + SIN[,2] +
               COS[,3] + SIN[,3] + COS[,4] + SIN[,4] + COS[,5] + SIN[,5] +
               COS[,6] + SIN[,6])
coef(temp.lm1)/sqrt(diag(vcov(temp.lm1)))
(Intercept)        TIME   I(TIME^2)    COS[, 1]    SIN[, 1]    COS[, 2]
     18.245      30.271       1.281       0.747       2.383       1.260
   SIN[, 2]    COS[, 3]    SIN[, 3]    COS[, 4]    SIN[, 4]    COS[, 5]
      1.919       0.640       0.391       0.551       0.168       0.324
   SIN[, 5]    COS[, 6]    SIN[, 6]
      0.345      -0.409      -0.457
temp.lm2 <- lm(temp ~ TIME + SIN[, 1] + SIN[, 2])
coef(temp.lm2)
(Intercept)        TIME    SIN[, 1]    SIN[, 2]
     0.1750      0.1841      0.0204      0.0162
AIC(temp.lm)
[1] -547
AIC(temp.lm1)
[1] -545
AIC(temp.lm2)
[1] -561

Again, the AIC is used to compare the fitted models, and only statistically significant terms are included in the final model. To check the adequacy of the fitted model, it is appropriate to create a time plot and correlogram of the residuals, because the residuals form a time series.

The time plot is used to detect patterns in the series. For example, if a higher-order polynomial is required, this would show up as a curve in the time plot. The purpose of the correlogram is to determine whether there is autocorrelation in the series, which would require a further model.

plot(time(temp), resid(temp.lm2), type = "l")
abline(0, 0, col = "red")
acf(resid(temp.lm2))
pacf(resid(temp.lm2))

There is no discernible curve in the residual series, which implies that a straight line is an adequate description of the trend. A tendency for the series to persist above or below the x-axis implies that the series is positively autocorrelated. This is verified in the correlogram of the residuals, which shows clear positive autocorrelation at lags 1–10.

The correlogram of the residuals is similar to that expected of an AR(p) process. This is verified by the plot of the partial autocorrelations, in which only the lag 1 and lag 2 partial autocorrelations are statistically significant. In the code below, an AR(2) model is fitted to the residual series:

res.ar <- ar(resid(temp.lm2), method = "mle")
res.ar$ar
[1] 0.494 0.307
sd(res.ar$res[-(1:2)])
[1] 0.0837
acf(res.ar$res[-(1:2)])

The correlogram of the residuals of the fitted AR(2) model shows that the residuals are approximately white noise, so the final form of the model provides a good fit to the data. The fitted model for the monthly temperature series can be written as

x_t = 0.175 + 0.184(t − 1988)/10.4 + 0.0204 sin(2πt) + 0.0162 sin(4πt) + z_t

where t is 'time' measured in units of 'years', the residual series {z_t} follows an AR(2) process given by

z_t = 0.494 z_{t−1} + 0.307 z_{t−2} + w_t

and {w_t} is white noise with mean zero and standard deviation 0.0837. If we require an accurate assessment of the standard error, we should refit the model using gls, allowing for an AR(2) structure for the errors.
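
A minimal sketch of that gls refit, reusing temp, TIME, and SIN from the code above; corARMA with the AR(2) estimates as starting values and the name temp.gls2 are assumptions, not part of the original:

library(nlme)
# GLS fit of the same mean model with an AR(2) error structure; the starting
# values 0.494 and 0.307 are the AR coefficients estimated above.
temp.gls2 <- gls(temp ~ TIME + SIN[, 1] + SIN[, 2],
                 cor = corARMA(c(0.494, 0.307), p = 2))
coef(temp.gls2)
sqrt(diag(vcov(temp.gls2)))  # standard errors allowing for the AR(2) errors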

Consider the air passenger series again. Time plots of the original series and the natural logarithm of the series can be obtained using the code below.

data(AirPassengers)
AP <- AirPassengers
plot(AP)
plot(log(AP))

In the first plot, the variance can be seen to increase as t increases, whilst after the logarithm is taken the variance is approximately constant over the period of the record. Therefore, as the number of people using the airline can only be positive, the logarithm is appropriate in the model formulation for this time series. In the following code, a harmonic model with polynomial trend is fitted to the air passenger series. The function time is used to extract the time and create a standardised time variable TIME.

SIN <- COS <- matrix(nr = length(AP), nc = 6)
for (i in 1:6) {
  SIN[, i] <- sin(2 * pi * i * time(AP))
  COS[, i] <- cos(2 * pi * i * time(AP))
}
TIME <- (time(AP) - mean(time(AP)))/sd(time(AP))
mean(time(AP))
[1] 1955
sd(time(AP))
[1] 3.48
AP.lm1 <- lm(log(AP) ~ TIME + I(TIME^2) + I(TIME^3) + I(TIME^4) +
             SIN[,1] + COS[,1] + SIN[,2] + COS[,2] + SIN[,3] + COS[,3] +
             SIN[,4] + COS[,4] + SIN[,5] + COS[,5] + SIN[,6] + COS[,6])
coef(AP.lm1)/sqrt(diag(vcov(AP.lm1)))
(Intercept)        TIME   I(TIME^2)   I(TIME^3)   I(TIME^4)    SIN[, 1]
    744.685      42.382      -4.162      -0.751       1.873       4.868
   COS[, 1]    SIN[, 2]    COS[, 2]    SIN[, 3]    COS[, 3]    SIN[, 4]
    -26.055      10.395      10.004      -4.844      -1.560      -5.666
   COS[, 4]    SIN[, 5]    COS[, 5]    SIN[, 6]    COS[, 6]
      1.946      -3.766       1.026       0.150      -0.521
AP.lm2 <- lm(log(AP) ~ TIME + I(TIME^2) + SIN[,1] + COS[,1] + SIN[,2] +
             COS[,2] + SIN[,3] + SIN[,4] + COS[,4] + SIN[,5])
coef(AP.lm2)/sqrt(diag(vcov(AP.lm2)))
(Intercept)        TIME   I(TIME^2)    SIN[, 1]    COS[, 1]    SIN[, 2]
     922.63      103.52       -8.24        4.92      -25.81       10.36
   COS[, 2]    SIN[, 3]    SIN[, 4]    COS[, 4]    SIN[, 5]
       9.96       -4.79       -5.61        1.95       -3.73

AIC(AP.lm1)
[1] -448
AIC(AP.lm2)
[1] -451
acf(resid(AP.lm2))

The residual correlogram indicates that the data are positively autocorrelated. As mentioned earlier, the standard errors of the parameter estimates are likely to be under-estimated if there is positive serial correlation in the data. This implies that predictor variables may falsely appear 'significant' in the fitted model. In the code below, GLS is used to check the significance of the variables in the fitted model, using the lag 1 autocorrelation (approximately 0.6) read from the residual correlogram:

AP.gls <- gls(log(AP) ~ TIME + I(TIME^2) + SIN[,1] + COS[,1] + SIN[,2] +
              COS[,2] + SIN[,3] + SIN[,4] + COS[,4] + SIN[,5],
              cor = corAR1(0.6))
coef(AP.gls)/sqrt(diag(vcov(AP.gls)))
(Intercept)        TIME   I(TIME^2)    SIN[, 1]    COS[, 1]    SIN[, 2]
     398.84       45.85       -3.65        3.30      -18.18       11.77
   COS[, 2]    SIN[, 3]    SIN[, 4]    COS[, 4]    SIN[, 5]
      11.43       -7.63      -10.75        3.57       -7.92

The partial autocorrelation plot suggests that the residual series follows an AR(1) process, which is fitted to the series below:

AP.ar <- ar(resid(AP.lm2), order = 1, method = "mle")
AP.ar$ar
[1] 0.641
acf(AP.ar$res[-1])

The correlogram of the residuals of the fitted AR(1) model might be taken for white noise given that only one autocorrelation is significant. However, the lag of this significant value corresponds to the seasonal lag (12) in the original series, which implies that the fitted model has failed to fully account for the seasonal variation in the data. Understandably, the reader might regard this as curious, given that the data were fitted using the full seasonal harmonic model. However, seasonal effects can be stochastic just as trends can, and the harmonic model we have used is deterministic.

Example of a simulated and fitted non-linear series

As non-linear models are generally fitted when the underlying non-linear function is known, we will simulate a non-linear series based on the exponential model x_t = exp(α0 + α1 t) + z_t (i.e., with the constant term c0 = 0) and compare parameters estimated using nls with those of the known underlying function. Below, a non-linear series with AR(1) residuals is simulated and plotted:

set.seed(1)
w <- rnorm(100, sd = 10)
z <- rep(0, 100)
for (t in 2:100) z[t] <- 0.7 * z[t - 1] + w[t]
Time <- 1:100
f <- function(x) exp(1 + 0.05 * x)
x <- f(Time) + z
plot(x, type = "l")
abline(0, 0)

The series plotted above has an apparent increasing exponential trend but also contains negative values, so a direct log-transformation cannot be used and a non-linear model is needed. In R, a non-linear model is fitted by specifying a formula with the parameters and their starting values contained in a list:

x.nls <- nls(x ~ exp(alp0 + alp1 * Time), start = list(alp0 = 0.1, alp1 = 0.5))
summary(x.nls)$parameters
     Estimate Std. Error t value Pr(>|t|)
alp0   1.1764   0.074295    15.8 9.20e-29
alp1   0.0483   0.000819    59.0 2.35e-78

The estimates for α0 and α1 are close to the underlying values that were used to simulate the data, although the standard errors of these estimates are likely to be underestimated because of the autocorrelation in the residuals. The generalised non-linear least squares function gnls (in the nlme package) can be used to fit non-linear models with autocorrelated residuals; however, in practice, computational difficulties often arise when using it.

The generic function for making predictions in R is predict. The function essentially takes a fitted model and new data as parameters. The key to using this function with a regression model is to ensure that the new data are properly defined and labelled in a data.frame. In the code below, we use this function with the fitted regression model AP.lm2 above to forecast the number of air passengers for the 10-year period that follows the record. The forecast is obtained by applying the exponential function (anti-log) to predict, because the regression model was fitted to the logarithm of the series:

new.t <- time(ts(start = 1961, end = c(1970, 12), fr = 12))
TIME <- (new.t - mean(time(AP)))/sd(time(AP))
SIN <- COS <- matrix(nr = length(new.t), nc = 6)
for (i in 1:6) {
  COS[, i] <- cos(2 * pi * i * new.t)
  SIN[, i] <- sin(2 * pi * i * new.t)
}
SIN <- SIN[, -6]
new.dat <- data.frame(TIME = as.vector(TIME), SIN = SIN, COS = COS)
AP.pred.ts <- exp(ts(predict(AP.lm2, new.dat), st = 1961, fr = 12))
ts.plot(log(AP), log(AP.pred.ts), lty = 1:2)
ts.plot(AP, AP.pred.ts, lty = 1:2)

The bias in the means arises as a result of applying the inverse transform to a residual series. For example, if a time series is Gaussian white noise {w_t}, with mean zero and standard deviation σ, then the distribution of the inverse-transform (the anti-log) of the series is log-normal with mean e^(σ²/2). This can be verified theoretically, or empirically by simulation as in the code below:

set.seed(1)
sigma <- 1
w <- rnorm(1e+06, sd = sigma)
mean(w)
[1] 4.69e-05
mean(exp(w))
[1] 1.65
exp(sigma^2/2)
[1] 1.65

The code above indicates that the mean of the anti-log of the Gaussian white noise and the expected mean from a log-normal distribution are equal. Hence, for a Gaussian white noise residual series, a correction factor of e^(σ²/2) should be applied to the forecasts of means. The importance of this correction factor depends on the value of σ². If σ² is very small, the correction factor will hardly change the forecasts at all and so could be neglected without major concern, especially as errors from other sources are likely to be significantly greater.

For the airline series, the forecasts can be adjusted by multiplying the predictions by e^(σ²/2), where σ is the standard deviation of the residuals, or by using an empirical correction factor as follows:

summary(AP.lm2)$r.sq
[1] 0.989
sigma <- summary(AP.lm2)$sigma
lognorm.correction.factor <- exp((1/2) * sigma^2)
empirical.correction.factor <- mean(exp(resid(AP.lm2)))
lognorm.correction.factor
[1] 1.001171
empirical.correction.factor
[1] 1.001080
AP.pred.ts <- AP.pred.ts * empirical.correction.factor

The adjusted forecasts in AP.pred.ts allow for the bias in taking the anti-log of the predictions. However, the small σ (and R² = 0.99) results in a small correction factor (of the order of 0.1%), which is probably negligible compared with other sources of error in the forecasts. Whilst in this example the correction factor is small, there is no reason why it will be small in general.

Moving Average Models

An MA(q) model can be fitted to data in R using the arima function with the order parameter set to c(0, 0, q). Unlike the function ar, the function arima does not subtract the mean by default and estimates an intercept term. MA models cannot be expressed in a multiple regression form, and, in general, the parameters are estimated with a numerical algorithm. The function arima minimises the conditional sum of squares to estimate values of the parameters and will either return these if method=c("CSS") is specified or use them as initial values for maximum likelihood estimation.
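
The series x fitted below is not simulated anywhere on this page; in the book it is a simulated MA(3) series. A plausible reconstruction is sketched here, with the coefficients 0.8, 0.6, and 0.4 assumed so that they are consistent with the estimates shown below:

set.seed(1)
b <- c(0.8, 0.6, 0.4)   # assumed MA(3) coefficients (not given on this page)
x <- w <- rnorm(1000)
for (t in 4:1000) x[t] <- w[t] + b[1] * w[t - 1] + b[2] * w[t - 2] + b[3] * w[t - 3]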

x.ma <- arima(x, order = c(0, 0, 3))
x.ma

Call:
arima(x = x, order = c(0, 0, 3))

Coefficients:
        ma1    ma2    ma3  intercept
      0.790  0.566  0.396     -0.032
s.e.  0.031  0.035  0.032      0.090

sigma^2 estimated as 1.07:  log likelihood = -1452,  aic = 2915

It is possible to set the value for the mean to zero, rather than estimate the intercept, by using include.mean=FALSE within the arima function. This option should be used with caution and would only be appropriate if you wanted {xt} to represent displacement from some known fixed mean.


In the code below, an MA(1) model is fitted to the exchange rate series.

> www <- "http://www.massey.ac.nz/~pscowper/ts/pounds_nz.dat"
> x <- read.table(www, header = T)
> x.ts <- ts(x, st = 1991, fr = 4)
> x.ma <- arima(x.ts, order = c(0, 0, 1))
> x.ma

acf(x.ma$res[-1])

The ARMA process, and the more general ARIMA processes discussed in the next chapter, can be simulated using the R function arima.sim, which takes a list of coefficients representing the AR and MA parameters. An ARMA(p, q) model can be fitted using the arima function with the order parameter set to c(p, 0, q). The fitting algorithm proceeds similarly to that for an MA process. Below, data from an ARMA(1, 1) process are simulated for α = −0.6 and β = 0.5, and an ARMA(1, 1) model is fitted to the simulated series. As expected, the sample estimates of α and β are close to the underlying model parameters.

set.seed(1)
x <- arima.sim(n = 10000, list(ar = -0.6, ma = 0.5))
coef(arima(x, order = c(1, 0, 1)))
      ar1       ma1  intercept
 -0.59697   0.50270   -0.00657

Above, a simple MA(1) model failed to provide an adequate fit to the exchange rate series. In the code below, fitted MA(1), AR(1) and ARMA(1, 1) models are compared using the AIC. The ARMA(1, 1) model provides the best fit to the data, followed by the AR(1) model, with the MA(1) model providing the poorest fit. The correlogram of the residuals of the fitted ARMA(1, 1) model shows small autocorrelations, which is consistent with a realisation of white noise and supports the use of the model.

x.ma <- arima(x.ts, order = c(0, 0, 1))
x.ar <- arima(x.ts, order = c(1, 0, 0))
x.arma <- arima(x.ts, order = c(1, 0, 1))
AIC(x.ma)
[1] -3.53
AIC(x.ar)
[1] -37.4
AIC(x.arma)
[1] -42.3
x.arma

Call:
arima(x = x.ts, order = c(1, 0, 1))

Coefficients:
        ar1    ma1  intercept
      0.892  0.532      2.960
s.e.  0.076  0.202      0.244

sigma^2 estimated as 0.0151:  log likelihood = 25.1,  aic = -42.3

acf(resid(x.arma))

For the electricity production series, the variance increases with time, which suggests a log-transformation may be appropriate. A regression model is fitted to the logarithms of the original series in the code below.

www <- "http://www.massey.ac.nz/~pscowper/ts/cbe.dat" CBE <- read.table(www, header = T) Elec.ts <- ts(CBE[, 3], start = 1958, freq = 12) Time <- 1:length(Elec.ts) Imth <- cycle(Elec.ts) Elec.lm <- lm(log(Elec.ts) ~ Time + I(Time^2) + factor(Imth)) acf(resid(Elec.lm)) The correlogram of the residuals appears to cycle with a period of 12 months suggesting that the monthly indicator variables are not sufficient to account for the seasonality in the series this can be accounted for using a non-stationary model with a stochastic seasonal component.

The best fitting ARMA(p, q) model can be chosen using the smallest AIC, either by trying a range of combinations of p and q in the arima function or by using a for loop with upper bounds on p and q, taken as 2 in the code shown below. To start with, best.aic is initialised to infinity (Inf). After the loop is complete, the best model can be found in best.order; in this case the best model turns out to be an AR(2) model.

best.order <- c(0, 0, 0)
best.aic <- Inf
for (i in 0:2) for (j in 0:2) {
  fit.aic <- AIC(arima(resid(Elec.lm), order = c(i, 0, j)))
  if (fit.aic < best.aic) {
    best.order <- c(i, 0, j)
    best.arma <- arima(resid(Elec.lm), order = best.order)
    best.aic <- fit.aic
  }
}
best.order
[1] 2 0 0
acf(resid(best.arma))

The predict function can be used both to forecast future values from the fitted regression model and to forecast the future errors associated with the regression model using the ARMA model fitted to the residuals from the regression. These two forecasts can then be summed to give a forecasted value of the logarithm of electricity production, which would then need to be anti-logged and perhaps adjusted using a bias correction factor. As predict is a generic R function, it works in different ways for different input objects and classes. For a fitted regression model of class lm, the predict function requires the new set of data to be in the form of a data frame.

For a fitted ARMA model of class arima, the predict function requires just the number of time steps ahead for the desired forecast. In the latter case, predict produces an object that has both the predicted values and their standard errors, which can be extracted using pred and se, respectively. In the code below, the electricity production for each month of the next three years is predicted.

new.time <- seq(length(Elec.ts), length = 36)
new.data <- data.frame(Time = new.time, Imth = rep(1:12, 3))
predict.lm <- predict(Elec.lm, new.data)
predict.arma <- predict(best.arma, n.ahead = 36)
elec.pred <- ts(exp(predict.lm + predict.arma$pred), start = 1991, freq = 12)
ts.plot(cbind(Elec.ts, elec.pred), lty = 1:2)

The plot of the forecasted values suggests that the predicted values for winter may be underestimated by the fitted model.

The data in the file wave.dat are the surface height of water (mm), relative to the still water level, measured using a capacitance probe positioned at the centre of a wave tank. The continuous voltage signal from this capacitance probe was sampled every 0.1 second over a 39.6-second period. The objective is to fit a suitable ARMA(p, q) model that can be used to generate a realistic wave input to a mathematical model for an ocean-going tugboat in a computer simulation. The results of the computer simulation will be compared with tests using a physical model of the tugboat in the wave tank. The pacf suggests that p should be at least 2 (Fig. 6.8). The best-fitting ARMA(p, q) model, based on a minimum variance of residuals, was obtained with both p and q equal to 4. The acf and pacf of the residuals from this model are consistent with the residuals being a realisation of white noise (Fig. 6.9).

www <- "http://www.massey.ac.nz/~pscowper/ts/wave.dat" wave.dat <- read.table(www, header = T) attach (wave.dat) layout(1:3) plot (as.ts(waveht), ylab = 'Wave height') acf (waveht) pacf (waveht) wave.arma <- arima(waveht, order = c(4,0,4)) acf (wave.arma$res[-(1:4)]) pacf (wave.arma$res[-(1:4)]) hist(wave.arma$res[-(1:4)], xlab='height / mm', main='')