Greatly varying processing times of aggregating data in R
While exploring the Kaggle Rossmann competition (https://www.kaggle.com/c/rossmann-store-sales) with our class, I stumbled upon this script: https://www.kaggle.com/shearerp/rossmann-store-sales/interactive-sales-visualization
It is a great introduction to the syntax of the dplyr package.
The “model” proposed is simply the average sales conditioned on three variables: ‘Store’, ‘DayOfWeek’, ‘Promo’.
What I found remarkable is that this aggregation step runs much faster with dplyr than with the base function aggregate, which led me to benchmark five different ways of computing conditional means (illustrated with a toy example right after the list):
- base R: aggregate, by, tapply
- dplyr package: group_by_ %>% summarise
- data.table package: [, by=]
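As a quick, self-contained illustration (my own toy example on the built-in mtcars data, not part of the original script), each of the following lines computes the mean of mpg conditioned on cyl and am with one of the five approaches:
library(dplyr)
library(data.table)
a1 = aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)        #base R: aggregate
a2 = by(mtcars$mpg, list(mtcars$cyl, mtcars$am), FUN = mean)     #base R: by
a3 = tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), FUN = mean) #base R: tapply
a4 = mtcars %>% group_by(cyl, am) %>% summarise(m = mean(mpg))   #dplyr
a5 = data.table(mtcars)[, mean(mpg), by = "cyl,am"]              #data.table
All five return the same six conditional means, just wrapped in different output structures.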
We use the microbenchmark package to time the computation of average sales conditional on three factors in the Rossmann data set.
require(dplyr)
#read train and test data
train = read.csv('D:/kaggle/Rossmann/input/train.csv.gz',as.is=T)
#only keep those rows for which sales > 0
train = train[train$Sales>0,]
dim(train)
## [1] 844338 9
#define the variables we want to condition on
preds=c('Store','DayOfWeek','Promo')
library(microbenchmark)
In summary (using the benchmark objects defined further below), the differences are staggering:
#microbenchmark stores times in nanoseconds; dividing by 10^6 converts to milliseconds
Benchmark = c(aggregate = mean(AggFormula[,2]), by = mean(BY[,2]),
              tapply = mean(TA[,2]), group_by = mean(GrpByPipes[,2]),
              data.table = mean(DTgroup[,2]))/10^6
round(Benchmark)
##  aggregate         by     tapply   group_by data.table
##      13169        835        319        123         16
barplot(Benchmark, log="y", ylab = "CPU time [ms]", main = "Benchmarks")
The logarithmic scale of the plot hides the enormous speed differences: the data.table grouping is about 835 times faster than aggregate!
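To make the ratios explicit, we can divide all mean times by the fastest one (a small addition of my own, not in the original script):
#speed-up factor of each method relative to the fastest (data.table)
round(Benchmark / Benchmark["data.table"])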
GrpByPipes=microbenchmark(
"mdl" = train %>% group_by_(.dots=preds) %>% summarise(PredSales=mean(Sales)) %>% ungroup(), times = 20)
print(GrpByPipes)
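As an aside, that summarised table is the whole “model” mentioned at the start; predicting for the test set then amounts to a left join on the three conditioning variables. A sketch of my own (the test file path merely mirrors the train path and is an assumption):
mdl = train %>% group_by_(.dots=preds) %>% summarise(PredSales=mean(Sales)) %>% ungroup()
test = read.csv('D:/kaggle/Rossmann/input/test.csv.gz', as.is=T)  #path is an assumption
pred = test %>% left_join(mdl, by=preds)                          #PredSales holds the predicted sales per test row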
AggFormula =microbenchmark(
"mdl2" = aggregate(Sales ~ Store + DayOfWeek+Promo, data=train,FUN=mean), times = 20)
print(AggFormula)
Does it help to bypass the formula overhead?
AggNoFormula =microbenchmark(
"mdl3" = aggregate(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(AggNoFormula)
Hardly any difference.
Are the alternatives to aggregate, such as by and tapply, faster?
TA =microbenchmark(
"mdl3b" = tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
BY =microbenchmark(
"mdl3c" = by(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(TA)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  mdl3b 293.1418 305.0216 318.9019 310.9083 314.4047 415.5317    20
How does data.table compare to this? (It is slightly unfair to leave the key-setting step outside the benchmark; a sketch that includes it follows below.)
library(data.table)
DT= data.table(train)
system.time(setkey(DT,Store,DayOfWeek,Promo))
DTgroup =microbenchmark(
"mdl4" = DT[, mean(Sales), by="Store,DayOfWeek,Promo"], times = 20)
print(DTgroup)
## Unit: milliseconds
##  expr      min       lq     mean   median       uq      max neval
##  mdl4 15.68308 15.69807 15.77953 15.70956 15.84216 16.13794    20
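To gauge how much of the advantage comes from the pre-set key, one could time the copy and the key-setting together with the grouping (my own sketch, not part of the original script; DTfull, mdl4b and DT2 are names I made up):
DTfull = microbenchmark(
  "mdl4b" = { DT2 = data.table(train)              #fresh copy of the data
              setkey(DT2, Store, DayOfWeek, Promo) #sort by the grouping columns
              DT2[, mean(Sales), by="Store,DayOfWeek,Promo"] },
  times = 20)
print(DTfull)
Note that data.table does not require a key for by= grouping, so the extra time measured here is essentially the cost of copying and sorting.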
And a linear model that contains nothing but the three-way interaction is not competitive at all:
train$Store=factor(train$Store)
train$DayOfWeek=factor(train$DayOfWeek)
train$Promo=factor(train$Promo)
system.time(lm(Sales ~ Store:DayOfWeek:Promo, data=train))
#the lm() call aborts; the error message is reproduced via print() so it shows up in the rendered document
print("Error: cannot allocate vector of size 98.2 Gb")
## [1] "Error: cannot allocate vector of size 98.2 Gb"
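The size of the failed allocation is easy to reconstruct: the model matrix holds one dense double column per Store × DayOfWeek × Promo combination. A back-of-the-envelope sketch of my own (the count of 1115 stores is an assumption about the Rossmann data):
#rough size of the dense model matrix
n_rows = nrow(train)        #844338 rows with Sales > 0
n_cols = 1115 * 7 * 2       #one column per Store x DayOfWeek x Promo combination
n_rows * n_cols * 8 / 2^30  #8 bytes per double, in Gb -> roughly 98.2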