#### Greatly varying processing times of aggregating data in R

While exploring the Kaggle Rossmann competition (https://www.kaggle.com/c/rossmann-store-sales) with our class, I stumbled upon this script: https://www.kaggle.com/shearerp/rossmann-store-sales/interactive-sales-visualization

It is a great introduction to the syntax that comes with the package *dplyr*.

The “model” proposed is simply the average sales conditioned on three variables: ‘Store’,‘DayOfWeek’,‘Promo’.
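
In other words, the prediction is a lookup table of group means. A minimal sketch of how such a model could be built and applied with *dplyr* (assuming `train` and `test` are already loaded as below; this mirrors the idea of the linked script rather than reproducing it verbatim):

```
library(dplyr)

# group means of Sales per (Store, DayOfWeek, Promo) act as the predictions
lookup = train %>%
  group_by(Store, DayOfWeek, Promo) %>%
  summarise(PredSales = mean(Sales)) %>%
  ungroup()

# attach the predictions to the test rows by matching on the three key columns
pred = test %>% left_join(lookup, by = c("Store", "DayOfWeek", "Promo"))
```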

What I found remarkable is that this aggregation step runs much faster there than with the base function *aggregate*, which led me to benchmark five different ways of computing a function on grouped data:

- base R
  - *aggregate*
  - *by*
  - *tapply*
- *dplyr* package: `group_by_ %>% summarise`
- *data.table* package: `[, by = ]`

We use the library *microbenchmark* to time the computation of average sales conditional on three factors in the Rossmann data set.
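
For readers who have not used it before: *microbenchmark* evaluates each supplied expression repeatedly (controlled by `times`) and records the distribution of the raw timings in nanoseconds. A tiny illustrative example, unrelated to the Rossmann data:

```
library(microbenchmark)

# each expression is run 10 times; print() summarises min/median/mean/max
mb = microbenchmark(
  vectorised = sum(as.numeric(1:1e6)),
  loop       = {s = 0; for (i in 1:1e6) s = s + i; s},
  times = 10)
print(mb)
```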

```
require(dplyr)
#read train and test data
train = read.csv('D:/kaggle/Rossmann/input/train.csv.gz',as.is=T)
#only keep those rows for which sales > 0
train = train[train$Sales>0,]
dim(train)
```

`## [1] 844338 9`

```
#define the variables we want to condition on
preds=c('Store','DayOfWeek','Promo')
library(microbenchmark)
```

In summary, the differences are staggering:

```
# microbenchmark stores its timings in nanoseconds; dividing by 10^6 gives milliseconds
Benchmark = c(aggregate = mean(AggFormula[,2]), by = mean(BY[,2]),
              tapply = mean(TA[,2]), group_by = mean(GrpByPipes[,2]),
              data.table = mean(DTgroup[,2])) / 10^6
round(Benchmark)
```

```
## aggregate by tapply group_by data.table
## 13169 835 319 123 16
```

`barplot(Benchmark, log="y", ylab = "CPU time [ms]", main = "Benchmarks")`

The logarithmic scale of the plot hides the enormous speed differences: the *data.table* grouping is **835 times faster** than *aggregate*!
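
That factor comes from the unrounded means stored in `Benchmark`; with the rounded values from the table above (13169/16) you would get roughly 823.

```
# speed-up of data.table over aggregate, using the unrounded mean timings
round(Benchmark["aggregate"] / Benchmark["data.table"])
```

The individual benchmarks behind these numbers follow below.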

```
GrpByPipes = microbenchmark(
  "mdl" = train %>%
    group_by_(.dots = preds) %>%
    summarise(PredSales = mean(Sales)) %>%
    ungroup(),
  times = 20)
```

`print(GrpByPipes)`
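
Note that `group_by_()` with the `.dots` argument is dplyr's old standard-evaluation interface and has since been deprecated; on current dplyr versions (>= 1.0) the same grouping over the column names in `preds` can be written with `across(all_of())`. An untimed sketch, not part of the original benchmark:

```
# equivalent grouping with the current dplyr interface
mdl = train %>%
  group_by(across(all_of(preds))) %>%
  summarise(PredSales = mean(Sales)) %>%
  ungroup()
```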

```
AggFormula =microbenchmark(
"mdl2" = aggregate(Sales ~ Store + DayOfWeek+Promo, data=train,FUN=mean), times = 20)
```

`print(AggFormula)`

Does it help to bypass the formula overhead?

```
AggNoFormula =microbenchmark(
"mdl3" = aggregate(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
```

`print(AggNoFormula)`

Hardly any difference.

Are the base alternatives to *aggregate*, such as *by* and *tapply*, any faster?

```
TA =microbenchmark(
"mdl3b" = tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
BY =microbenchmark(
"mdl3c" = by(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
```

`print(TA)`

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## mdl3b 293.1418 305.0216 318.9019 310.9083 314.4047 415.5317 20
```
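
One caveat when comparing these numbers: *aggregate* and the dplyr/data.table pipelines return a data frame (or data table) of group means, whereas *tapply* returns a multi-dimensional array and *by* a list-like object of class `by`, so in practice some reshaping is still needed to get comparable output. A possible conversion, not included in the timings above:

```
# reshape the 3-d array from tapply into a long data frame of group means;
# combinations without any observations show up as NA rows
ta = tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo), FUN = mean)
ta_df = as.data.frame(as.table(ta), responseName = "PredSales")
names(ta_df)[1:3] = c("Store", "DayOfWeek", "Promo")
```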

How does *data.table* compare to this? (It is slightly unfair to leave the key setting outside the benchmark.)

```
library(data.table)
DT= data.table(train)
system.time(setkey(DT,Store,DayOfWeek,Promo))
DTgroup =microbenchmark(
"mdl4" = DT[, mean(Sales), by="Store,DayOfWeek,Promo"], times = 20)
```

`print(DTgroup)`

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## mdl4 15.68308 15.69807 15.77953 15.70956 15.84216 16.13794 20
```
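
To address the caveat about the key, the same grouping can also be timed on a fresh, unkeyed copy of the data. A quick sketch along those lines:

```
DT2 = data.table(train)   # fresh copy without a key set
DTgroupNoKey = microbenchmark(
  "mdl4b" = DT2[, mean(Sales), by = .(Store, DayOfWeek, Promo)], times = 20)
print(DTgroupNoKey)
```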

A linear model, finally, is not competitive at all when only the interaction terms are needed:

```
# convert the predictors to factors so that lm() builds interaction dummies
train$Store = factor(train$Store)
train$DayOfWeek = factor(train$DayOfWeek)
train$Promo = factor(train$Promo)
system.time(lm(Sales ~ Store:DayOfWeek:Promo, data = train))
```

`print("Error: cannot allocate vector of size 98.2 Gb")`

`## [1] "Error: cannot allocate vector of size 98.2 Gb"`
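
The size of the failed allocation is easy to reconstruct: the Rossmann training data contains 1,115 stores, 7 days of the week and 2 promo levels, so the dense model matrix for the three-way interaction needs roughly 1115 × 7 × 2 = 15,610 dummy columns for 844,338 rows of doubles. A back-of-the-envelope check, assuming 8 bytes per entry:

```
# approximate size of the dense model matrix lm() tries to allocate
rows = 844338
cols = 1115 * 7 * 2        # Store x DayOfWeek x Promo interaction dummies
rows * cols * 8 / 2^30     # bytes -> GiB, roughly 98 Gb
```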
