Aggregate Time Series Data

This is complicated since we are talking about time-series-related aggregation. Before going into loops, please explore whether you can use aggregate, apply, lapply or sapply. By the way, if you are proficient with SQL, then sqldf may be your friend too!
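For a plain (non-time-based) grouping, aggregate alone often does the job. A minimal sketch, assuming a data frame dat with the usage and meterID columns used later on this page; the sqldf version is shown as a commented alternative:

# total usage per meter, without a loop
totals <- aggregate(usage ~ meterID, data = dat, FUN = sum)

# the same grouping via sqldf (requires the sqldf package):
# library(sqldf)
# totals <- sqldf("SELECT meterID, SUM(usage) AS usage FROM dat GROUP BY meterID")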

An interesting article on loops: https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r

Example showing aggregation of data on the time dimension.

> names(dat) <- c("utcTimestamp","start", "end", "usage","reading_interval", "status", "meterID", "alarm","is_imputed" ,"localAcquisitionTime")
> head(dat)
          utcTimestamp  start end usage reading_interval status meterID alarm is_imputed localAcquisitionTime
1 2015-11-12T00:00:00Z +00:00   0  3558               15    192 334_101    NA      false 2015-11-12T00:00:00Z
2 2015-11-12T00:15:00Z +00:00   0  4048               15    192 334_101    NA      false 2015-11-12T00:15:00Z
3 2015-11-12T00:30:00Z +00:00   0  3609               15    192 334_101    NA      false 2015-11-12T00:30:00Z
4 2015-11-12T00:45:00Z +00:00   0  3706               15    192 334_101    NA      false 2015-11-12T00:45:00Z
5 2015-11-12T01:00:00Z +00:00   0  3867               15    192 334_101    NA      false 2015-11-12T01:00:00Z
6 2015-11-12T01:15:00Z +00:00   0  3000               15    192 334_101    NA      false 2015-11-12T01:15:00Z
> 
> dat <- subset(dat,meterID=="334_4")
> 
> ################## aggregating data to the 30-minute level to reduce the data volume ##################
> dat$localAcquisitionTime <- strptime(dat$localAcquisitionTime,format = "%Y-%m-%dT%H:%M:%SZ")
> library(xts)
> dat.xts <- xts(x = dat[, c(4)], order.by = dat[, "localAcquisitionTime"])
> head(dat.xts)
                    [,1]
2014-10-27 00:00:00 5153
2014-10-27 00:10:00 9608
2014-10-27 00:20:00 8955
2014-10-27 00:30:00 9522
2014-10-27 00:40:00 9446
2014-10-27 00:50:00 9653
> names(dat.xts) <- c("usage")
> #defining the level to aggregate - 30 minutes in our case
> ep <- endpoints(dat.xts, "minutes", 30)
> #finally performing the aggregation
> dat.xts30mins <- period.apply(dat.xts,ep,sum)
> head(dat.xts30mins)
                    usage
2014-10-27 00:20:00 23716
2014-10-27 00:50:00 28621
2014-10-27 01:20:00 28994
2014-10-27 01:50:00 28176
2014-10-27 02:20:00 27271
2014-10-27 02:50:00 27560
> dat30min <- data.frame(localDateTime = index(dat.xts30mins), coredata(dat.xts30mins))
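As a side note on the mechanics: endpoints returns the row positions of the last observation in each period (the vector starts with 0), and period.apply applies the function over the rows between consecutive endpoints. A self-contained toy example with made-up numbers:

library(xts)
# eight 15-minute readings over two hours (values are invented)
idx <- seq(as.POSIXct("2015-11-12 00:00", tz = "UTC"), by = "15 min", length.out = 8)
toy <- xts(c(10, 20, 30, 40, 50, 60, 70, 80), order.by = idx)
# row positions of the last reading in each 30-minute bucket: 0 2 4 6 8
ep <- endpoints(toy, "minutes", 30)
# sums each pair of readings; each sum is stamped with the last timestamp
# of its bucket (00:15, 00:45, ...), just like the :20/:50 stamps above
period.apply(toy, ep, sum)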

Aggregating on two dimensions

I ran into another issue when the logic of the algorithm changed and required aggregation for every meter and date. The straightforward fix was a loop over the meters:

library(xts)

# meterIDList is a one-column data frame of meter IDs;
# aggregated_dat collects the results for all meters
aggregated_dat <- data.frame()

for (i in 1:nrow(meterIDList))
{
  subsetDf <- subset(dat[, c("meterID", "localDateTime", "usage")], meterID == meterIDList[i, ])
  subsetDf <- subsetDf[, c("localDateTime", "usage")]
  # as.POSIXct because xts wants a POSIXct index, not the POSIXlt that strptime returns
  subsetDf$localDateTime <- as.POSIXct(strptime(subsetDf$localDateTime, format = "%Y-%m-%dT%H:%M:%SZ"))
  # converting to an xts object
  subset_xts <- xts(x = subsetDf[, "usage"], order.by = subsetDf[, "localDateTime"])
  names(subset_xts) <- c("usage")
  # defining the level to aggregate - 30 minutes in our case
  ep <- endpoints(subset_xts, "minutes", 30)
  # finally performing the aggregation
  subset_xts_30agg <- period.apply(subset_xts, ep, sum, na.rm = TRUE)

  # converting back to a data frame from the time series
  subset_dat30min <- data.frame(localDateTime = index(subset_xts_30agg), coredata(subset_xts_30agg))
  subset_dat30min$meterID <- meterIDList[i, ]

  # appending to the data frame with data for all meters -> contains meterID, usage and localDateTime
  aggregated_dat <- rbind(aggregated_dat, subset_dat30min)
}

Sample input data

localDateTime        usage meterID
2015-10-01T00:00:00Z     4  334_20
2015-10-01T00:15:00Z  4330  334_20
2015-10-01T00:30:00Z  4465  334_20
2015-10-01T00:45:00Z  5404  334_20
2015-10-01T01:00:00Z  5577  334_20
2015-10-01T01:15:00Z  5134  334_20
2015-10-01T01:30:00Z  4857  334_20
2015-10-01T01:45:00Z  4859  334_20
2015-10-01T02:00:00Z  4946  334_20
2015-10-01T02:15:00Z  5182  334_20
2015-10-01T02:30:00Z  5388  334_20

Sample output data

localDateTime       usage
2015-10-01 00:15:00  4334
2015-10-01 00:45:00  9869
2015-10-01 01:15:00 10711
2015-10-01 01:45:00  9716
2015-10-01 02:15:00 10128

Not a very clean solution, but it works. There is certainly a better way of doing this that is faster and more efficient.
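For instance, the whole loop can be folded into lapply, as suggested at the top of this page. A sketch under the same assumptions (dat holds the localDateTime, usage and meterID columns from the sample input; agg_one_meter is a hypothetical helper name, not an existing function):

library(xts)

# hypothetical helper: aggregate a single meter's readings to 30 minutes
agg_one_meter <- function(id) {
  m <- subset(dat, meterID == id)
  tm <- as.POSIXct(strptime(m$localDateTime, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"))
  m.xts <- xts(m$usage, order.by = tm)
  agg <- period.apply(m.xts, endpoints(m.xts, "minutes", 30), sum, na.rm = TRUE)
  data.frame(localDateTime = index(agg), usage = coredata(agg), meterID = id)
}

# one data frame for all meters, no explicit loop and no growing rbind
aggregated_dat <- do.call(rbind, lapply(unique(dat$meterID), agg_one_meter))

Building the result with do.call(rbind, ...) in one go also avoids growing aggregated_dat block by block inside the loop, which is where much of the slowness comes from.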