Aggregating data in R, benchmarks

Greatly varying processing times of aggregating data in R
While exploring the Kaggle Rossmann competition (https://www.kaggle.com/c/rossmann-store-sales) with our class, I stumbled upon this script: https://www.kaggle.com/shearerp/rossmann-store-sales/interactive-sales-visualization

It is a great introduction to the syntax of the dplyr package.

The “model” proposed is simply the average sales conditioned on three variables: ‘Store’, ‘DayOfWeek’ and ‘Promo’.

What I found remarkable is that this aggregation step runs much faster than with the base function aggregate, which led me to benchmark five different ways of computing a conditional (grouped) aggregate:

  1. base R
    • aggregate
    • by
    • tapply
  2. dplyr package: group_by_ %>% summarise
  3. data.table package: [,by=]

We use the microbenchmark package to time the computation of average sales conditional on three factors in the Rossmann data set.

require(dplyr)
#read train and test data
train = read.csv('D:/kaggle/Rossmann/input/train.csv.gz',as.is=T)

#only keep those rows for which sales > 0 
train = train[train$Sales>0,]
dim(train)
## [1] 844338      9
#define the variables we want to condition on
preds=c('Store','DayOfWeek','Promo')

library(microbenchmark)

In summary, the differences are staggering:

Benchmark = c(aggregate=mean(AggFormula[,2]),  by=mean(BY[,2]),  
              tapply=mean(TA[,2]), group_by=mean(GrpByPipes[,2]), 
              data.table=mean(DTgroup[,2]))/10^6
round(Benchmark)
##  aggregate         by     tapply   group_by data.table 
##      13169        835        319        123         16
barplot(Benchmark, log="y", ylab = "CPU time [ms]", main = "Benchmarks")

The logarithmic plot hides the enormous speed differences: the data.table grouping is about 835 times faster than aggregate!
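One way to make the relative factors explicit (a small sketch using the Benchmark vector computed above):

#how many times slower is each method than data.table grouping?
round(Benchmark / Benchmark["data.table"])

The detailed benchmarks follow below.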

GrpByPipes=microbenchmark(
  "mdl" = train %>% group_by_(.dots=preds) %>% summarise(PredSales=mean(Sales))  %>% ungroup(), times = 20)
print(GrpByPipes)
AggFormula =microbenchmark(
  "mdl2" =  aggregate(Sales ~ Store + DayOfWeek+Promo, data=train,FUN=mean), times = 20)
print(AggFormula)

Does it help to bypass the formula overhead?

AggNoFormula =microbenchmark(
  "mdl3" =  aggregate(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(AggNoFormula)

Hardly any difference.

Are the alternatives to aggregate, such as by and tapply, any faster?

TA =microbenchmark(
  "mdl3b" =  tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)

BY =microbenchmark(
  "mdl3c" =  by(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(TA)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  mdl3b 293.1418 305.0216 318.9019 310.9083 314.4047 415.5317    20

How does data.table compare to this? (It is slightly unfair to leave the key setting outside the benchmark.)

library(data.table)
DT= data.table(train)
system.time(setkey(DT,Store,DayOfWeek,Promo))
DTgroup =microbenchmark(
  "mdl4" =  DT[, mean(Sales), by="Store,DayOfWeek,Promo"], times = 20)
print(DTgroup)
## Unit: milliseconds
##  expr      min       lq     mean   median       uq      max neval
##  mdl4 15.68308 15.69807 15.77953 15.70956 15.84216 16.13794    20

And a linear model is not competitive at all when it is used merely to compute the interaction (cell) means:

train$Store=factor(train$Store)
train$DayOfWeek=factor(train$DayOfWeek)
train$Promo=factor(train$Promo)

system.time(lm(Sales ~ Store:DayOfWeek:Promo, data=train))
print("Error: cannot allocate vector of size 98.2 Gb")
## [1] "Error: cannot allocate vector of size 98.2 Gb"



Interpretation of the coefficients in logistic regression

Fitting a logistic regression (LR) model (with Age, Sex and Pclass as predictors) to the survival outcome in the Titanic data yields a summary such as this one:
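(For reference, a call of the following form reproduces this fit; NoMissingAge denotes the Titanic data after rows with missing Age have been removed.)

fit = glm(Survived ~ Age + Sex + Pclass, family = binomial(link = logit), data = NoMissingAge)
summary(fit)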

## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass, family = binomial(link = logit), 
##     data = NoMissingAge)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7303  -0.6780  -0.3953   0.6485   2.4657  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.777013   0.401123   9.416  < 2e-16 ***
## Age         -0.036985   0.007656  -4.831 1.36e-06 ***
## Sexmale     -2.522781   0.207391 -12.164  < 2e-16 ***
## Pclass2     -1.309799   0.278066  -4.710 2.47e-06 ***
## Pclass3     -2.580625   0.281442  -9.169  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.28  on 709  degrees of freedom
## AIC: 657.28
## 
## Number of Fisher Scoring iterations: 5

For “normal” linear regression we know that the value of \(\beta_j\) simply gives us the change \(\Delta y\) when \(x_j\) is increased by one unit.

In order to fully understand the exact meaning of the coefficients of an LR model, we first need to warm up to the definition of a link function and the concept of probability odds.

Using linear regression as a starting point \[
y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \ldots +\beta_k x_{k,i} + \epsilon_i
\]
we modify the right hand side such that (i) the model is still basically a linear combination of the \(x_j\)s but (ii) the output is – like a probability – bounded between 0 and 1. This is achieved by “wrapping” a sigmoid function \(s(z) = 1/(1+\exp(-z))\) around the weighted sum of the \(x_j\)s: \[
y_i = s(\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \ldots +\beta_k x_{k,i} + \epsilon_i)
\]
The sigmoid function, depicted below to the left, maps the real axis to the interval \((0,1)\), so its output can be interpreted as a probability.
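The two curves can be drawn with a few lines of base R:

#sigmoid (inverse logit) on the left, logit on the right
par(mfrow = c(1, 2))
z = seq(-6, 6, length.out = 200)
plot(z, 1/(1 + exp(-z)), type = "l", ylab = "s(z)", main = "sigmoid")
p = seq(0.001, 0.999, length.out = 200)
plot(p, log(p/(1 - p)), type = "l", ylab = "logit(p)", main = "logit")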

The inverse of the sigmoid is the logit (depicted above to the right), which is defined as \(log(p/(1-p))\). For the case where p is a probability we call the ratio \(p/(1-p)\) the probability odds. Thus, the logit is the log of the odds and logistic regression models these log-odds as a linear combination of the values of x.

Finally, we can interpret the coefficients directly: the odds of a positive outcome are multiplied by a factor of \(\exp(\beta_j)\) for every unit change in \(x_j\). (In that light, logistic regression is reminiscent of linear regression with a logarithmically transformed dependent variable, which also leads to multiplicative rather than additive effects.)
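To see why, note that the model states \(\log(odds) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k\), so increasing \(x_j\) by one unit while holding all other predictors fixed gives \[
\frac{odds(x_j+1)}{odds(x_j)} = \frac{\exp(\beta_0 + \ldots + \beta_j (x_j+1) + \ldots + \beta_k x_k)}{\exp(\beta_0 + \ldots + \beta_j x_j + \ldots + \beta_k x_k)} = \exp(\beta_j).
\]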

As an example, the coefficient for Pclass 3 is -2.5806, which means that the odds of survival compared to the reference level Pclass 1 are multiplied by a factor of \(\exp(-2.5806) = 0.0757\) (i.e. reduced to about 7.6% of the reference odds), with all other input variables unchanged. How does this relative change in odds translate into probabilities? It is relatively easy to memorize the back-and-forth relationship between odds and probability \(p\): \[
odds = \frac{p}{1-p} \Leftrightarrow p = \frac{odds}{1 + odds}
\]
So the intercept 3.777 (= log-odds!) translates into odds of \(\exp(3.777) = 43.685\), which yields a base probability of survival (for a female in Pclass 1 of age 0) of \(43.685/(1 + 43.685) = 0.978\).
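In R these steps are one-liners (a small sketch, assuming fit holds the glm object whose summary is shown above):

#multiplicative effects on the odds
exp(coef(fit))
#base odds and base probability for a female in Pclass 1 of age 0
base.odds = exp(coef(fit)["(Intercept)"])
base.odds / (1 + base.odds)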

Why make life so complicated?

I am often asked by students why we have to go through the constant trouble of (i) exponentiating the coefficients, (ii) multiplying the odds and finally (iii) computing the resulting probabilities. Can we not simply transform the coefficients from the summary table such that we can read off their effects directly – just like in linear regression?

The trouble is that there is no simple way to translate the coefficients into an additive or even a multiplicative adjustment of the probabilities. (One simple way to see this: such an adjustment could not keep the output bounded between 0 and 1.) The following graph shows the effect of various coefficient values on a base/reference probability.
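A sketch of how such a graph can be produced (base R; the coefficient values are arbitrary illustrative choices):

#new probability after a one-unit change in x, for several values of beta
p0   = seq(0.01, 0.99, by = 0.01)
beta = c(-2, -1, -0.5, 0.5, 1, 2)
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "reference probability", ylab = "probability after a unit change in x")
for (b in beta) {
  odds = p0/(1 - p0) * exp(b)   #multiply the odds by exp(beta)
  lines(p0, odds/(1 + odds))
}
abline(0, 1, lty = 2)           #no-change reference line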

We immediately see that there is no straightforward additive or multiplicative effect that could be quantified.



Is logistic regression resistant to outliers?

We speculate that the S-shaped sigmoid function is forgiving of outliers in \(x\) as long as they are “on the right side”, i.e. if the class label does not contradict the general trend of the variable. For example, in the Titanic data we have seen that survival probability tended to decline with increasing age. What if we added a \(500\)-year-old person to the data set who did not survive? For linear regression such an outlier would likely distort the coefficient estimates significantly.

fit = glm(Survived ~ Age , data= NoMissingAge, family = binomial(link=logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = NoMissingAge)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1488  -1.0361  -0.9544   1.3159   1.5908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.05672    0.17358  -0.327   0.7438  
## Age         -0.01096    0.00533  -2.057   0.0397 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 960.23  on 712  degrees of freedom
## AIC: 964.23
## 
## Number of Fisher Scoring iterations: 4
dataWithOutlier = NoMissingAge
dataWithOutlier[1,c("Age", "Survived")] = c(500,0) 
fit = glm(Survived ~ Age , data= dataWithOutlier, family = binomial(link=logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = dataWithOutlier)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1515  -1.0373  -0.9545   1.3145   1.5928  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.050245   0.172124  -0.292   0.7704  
## Age         -0.011100   0.005271  -2.106   0.0352 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 959.12  on 712  degrees of freedom
## AIC: 963.12
## 
## Number of Fisher Scoring iterations: 4
#what do the diagnostic plots tell us?
#plot(fit)
#and now we change the label which leads to a large residual:
dataWithOutlier[1,"Survived"] = 1 
fit = glm(Survived ~ Age , data= dataWithOutlier, family = binomial(link=logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = dataWithOutlier)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.039  -1.026  -1.015   1.336   1.631  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.332154   0.131543  -2.525   0.0116 *
## Age         -0.001383   0.003553  -0.389   0.6970  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 965.27  on 713  degrees of freedom
## Residual deviance: 965.11  on 712  degrees of freedom
## AIC: 969.11
## 
## Number of Fisher Scoring iterations: 4

Wow: a single data point changed the entire fit significantly in the second case but left it virtually untouched in the first.