Categorical Variables in Trees I








I find it remarkable that very few of the current implementations of tree algorithms in R exploit an important “trick” that was already mentioned in the seminal 1984 paper!

“For a two-class problem, the search for the best categorical split can be reduced to M steps using the Gini criterion”

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.

See also http://www.math.ccu.edu.tw/~yshih/papers/spl.pdf

and p. 11 in http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Gehrke_Loh_DecTree.pdf

(The effect of dummifying categorical variables on performance is nicely elaborated upon in the post “Are categorical variables getting lost in your random forests?”.)

This implies that instead of having to search all \(2^{k-1}-1\) possible partitions of the \(k\) levels into two groups, it is sufficient to try just \(k-1\) of them: order the levels by their mean outcome and consider only splits along that ordering!

The practical consequences of this exponential vs. linear scaling are severe, especially with modern datasets that often contain many categorical variables with a large number of levels (it is not rare to see \(k > 1000\)).
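
To make the trick concrete, here is a minimal sketch for the regression case with the squared-error criterion (the function best_categorical_split and its details are my own illustration, not code from the paper or from any of the packages discussed below):

# A minimal sketch of the ordering trick for a regression split (squared error).
# Instead of enumerating all 2^(k-1)-1 subsets of levels, order the levels by
# their mean response and evaluate only the k-1 contiguous splits.
best_categorical_split = function(x, y) {
  lvl_means    = tapply(y, x, mean)           # mean response per level
  ordered_lvls = names(sort(lvl_means))       # levels ordered by mean response
  best = list(sse = Inf, left = NULL)
  for (j in seq_len(length(ordered_lvls) - 1)) {
    left    = ordered_lvls[1:j]               # candidate "left" group of levels
    in_left = x %in% left
    sse = sum((y[in_left]  - mean(y[in_left]))^2) +
          sum((y[!in_left] - mean(y[!in_left]))^2)
    if (sse < best$sse) best = list(sse = sse, left = left)
  }
  best
}
# e.g., with the fake 'train' data constructed below:
#   best_categorical_split(train$Store, train$Sales)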

Of the popular packages that I tried, only rpart, gbm and RWeka avoid the “exponential trap” and pose (almost) no limit on the number of levels:

#read in the Rossman data set from this kaggle competition:
#  https://www.kaggle.com/c/rossmann-store-sales
#train = read.csv('H:/kaggle/Rossmann/input/train.csv.gz',as.is=T)

#  oh well, fake data serve to illustrate the point just as well:
N=5*10^5
train = cbind.data.frame(Sales= 0, Store=sample(1:1000,N,replace=TRUE))
train$Sales = 0.1*train$Store +rnorm(N)
train$Store = factor(train$Store)

cat ("There are ", length(levels(train$Store)),"stores/levels in this dataset.\n")
## There are  1000 stores/levels in this dataset.

Libraries that fail to exploit the linear search strategy:

The tree package

library(tree, quietly=TRUE)
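# tree() allows factor predictors with at most 32 levels, so this call is expected to fail: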
try({fit = tree(Sales ~ Store, data = train)})

The party package

library(party, quietly=TRUE)
try({fit = ctree(Sales ~ Store, data = train)})
detach("package:party", unload=TRUE)
## Warning: Error in matrix(0, nrow = mi, ncol = nl) : 
##   invalid 'nrow' value (too large or NA)
## In addition: Warning message:
## In matrix(0, nrow = mi, ncol = nl) :
##   NAs introduced by coercion to integer range

The partykit package

library(partykit, quietly=TRUE)
try({fit = lmtree(Sales ~ Store, data = train)})
detach("package:partykit", unload=TRUE)
## Warning: Error in matrix(0, nrow = mi, ncol = nl) : 
##   invalid 'nrow' value (too large or NA)
## In addition: Warning message:
## In matrix(0, nrow = mi, ncol = nl) :
##   NAs introduced by coercion to integer range

The randomForest package

library(randomForest, quietly=TRUE)
try({fit = randomForest(Sales ~ Store, data = train,  ntree=1)})
## Warning: Error in randomForest.default(m, y, ...) : 
##   Can not handle categorical predictors with more than 53 categories.

Base lm runs out of memory (the dense model matrix alone, \(5 \times 10^5\) rows by 1,000 dummy columns of doubles, takes roughly 4 GB):

try({fit0 = lm(Sales ~ Store, data = train)})
## Warning: Cannot allocate vector of size...

Libraries that shine:

The rpart package

library(rpart, quietly=TRUE)
fit = rpart(Sales ~ Store, data = train)

The gbm package

Well, almost. At least the max number of levels is very high (1024):

library(gbm, quietly=TRUE)
fit = gbm(Sales ~ Store, data = train, interaction.depth = 8, n.trees=1)
## Distribution not specified, assuming gaussian ...

(It is not clear to me why there is a limit at all.)

The RWeka package

library(RWeka, quietly=TRUE)
# RWeka's J48 handles only classification problems, so binarize the response:
train$Sales2 = factor(train$Sales > mean(train$Sales))
fit = J48(Sales2 ~ Store, data = train)

The h2o package



Grouping Duplicates






Both base R and data.table offer a duplicated() function, which is useful if your main goal is to identify and possibly delete duplicates.

For example

BirthDay <- as.Date(c(rep("1970-08-08",2), "1950-03-01", "1963-12-10",rep("1968-10-18",2)))
HomeTown = c(rep("New Brunswick,NJ",2),"Springfield,OH","Buffalo,NY",rep("Portland,OR",2))

df <-data.frame(BirthDay,HomeTown)

duplicated(df)
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
df[duplicated(df), ]
##     BirthDay         HomeTown
## 2 1970-08-08 New Brunswick,NJ
## 6 1968-10-18      Portland,OR
df[!duplicated(df), ]
##     BirthDay         HomeTown
## 1 1970-08-08 New Brunswick,NJ
## 3 1950-03-01   Springfield,OH
## 4 1963-12-10       Buffalo,NY
## 5 1968-10-18      Portland,OR

However, in my work this is most of the time not sufficient for what I would like to achieve. More often than not the duplicates of interest pertain only to a subset of the columns (“the key”), and I need to group the rows with duplicate keys, typically in order to investigate them further.

Instead of obtaining Booleans that express the non-specific notion “I have seen this key/row before, so it is a duplicate, but I will not tell you where its first occurrence is”, I would like a function that groups the rows/keys and thus includes the “first row/key”. Here is an example of a data set with “noisy duplicates” in the name field, where the first two columns serve as an identifier:

SurName = c("Levine","Levin","Surflaw","Smith","Blandford","Jackson")

df <-data.frame(BirthDay,HomeTown,SurName)

Instead of

duplicated(df[,1:2])
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

I would like to return the groups of rows that are duplicates. This can be achieved in base R and with various packages, but I first want to propose an elegant solution offered by the ingenious data.table library:

library(data.table)
DT = data.table(df, key=c("BirthDay","HomeTown"))

(DTgrouped = DT[, list(list(.I)) ,by=key(DT)])
##      BirthDay         HomeTown  V1
## 1: 1950-03-01   Springfield,OH   1
## 2: 1963-12-10       Buffalo,NY   2
## 3: 1968-10-18      Portland,OR 3,4
## 4: 1970-08-08 New Brunswick,NJ 5,6
(DTrows = DTgrouped[sapply(DTgrouped[,V1],length)>1,V1])
## [[1]]
## [1] 3 4
## 
## [[2]]
## [1] 5 6

And if you want not just the row numbers but the actual data, note that .I holds row numbers of the keyed (and therefore sorted) DT, so we index DT rather than the original df:

lapply(DTrows, function(i) DT[i])
## [[1]]
##      BirthDay    HomeTown   SurName
## 1: 1968-10-18 Portland,OR Blandford
## 2: 1968-10-18 Portland,OR   Jackson
## 
## [[2]]
##      BirthDay         HomeTown SurName
## 1: 1970-08-08 New Brunswick,NJ  Levine
## 2: 1970-08-08 New Brunswick,NJ   Levin
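
As an aside (this variant is my own addition, not part of the original workflow above), the grouping and the filtering for keys that occur more than once can be done in a single data.table step using the .N and .SD special symbols:

# all rows whose (BirthDay, HomeTown) key occurs more than once, in one step:
DT[, if (.N > 1) .SD, by = key(DT)]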

base R

Here is a base R solution using by():

tmp=by(df, list(df$BirthDay,df$HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]
## [[1]]
##     BirthDay         HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ  Levine
## 2 1970-08-08 New Brunswick,NJ   Levin
## 
## [[2]]
##     BirthDay    HomeTown   SurName
## 5 1968-10-18 Portland,OR Blandford
## 6 1968-10-18 Portland,OR   Jackson

plyr package

And here is the plyr way:

library(plyr)
tmp=dlply(df, .(BirthDay,HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]
## $`1968-10-18.Portland,OR`
##     BirthDay    HomeTown   SurName
## 1 1968-10-18 Portland,OR Blandford
## 2 1968-10-18 Portland,OR   Jackson
## 
## $`1970-08-08.New Brunswick,NJ`
##     BirthDay         HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ  Levine
## 2 1970-08-08 New Brunswick,NJ   Levin

Benchmarks will come in a separate post.
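
In the meantime, here is a minimal sketch of how such a benchmark could be set up with microbenchmark on a larger synthetic data set (the data and the one-step data.table variant are my own illustration, not the benchmark from the follow-up post):

library(microbenchmark)
library(data.table)
library(plyr)

# synthetic data with many duplicate (BirthDay, HomeTown) keys
n   = 10^4
big = data.frame(
  BirthDay = sample(seq(as.Date("1960-01-01"), by = "day", length.out = 200), n, replace = TRUE),
  HomeTown = sample(state.name, n, replace = TRUE),
  SurName  = sample(LETTERS, n, replace = TRUE))
bigDT = data.table(big, key = c("BirthDay", "HomeTown"))

microbenchmark(
  data.table = bigDT[, if (.N > 1) .SD, by = key(bigDT)],
  base_by    = { tmp = by(big, list(big$BirthDay, big$HomeTown),
                          function(x) if (nrow(x) > 1) x)
                 tmp[!sapply(tmp, is.null)] },
  plyr_dlply = { tmp = dlply(big, .(BirthDay, HomeTown),
                             function(x) if (nrow(x) > 1) x)
                 tmp[!sapply(tmp, is.null)] },
  times = 5)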



Aggregating data in R, benchmarks






Greatly varying processing times of aggregating data in R

While exploring the Kaggle Rossmann competition (https://www.kaggle.com/c/rossmann-store-sales) with our class, I stumbled upon this script: https://www.kaggle.com/shearerp/rossmann-store-sales/interactive-sales-visualization

It is a great introduction to the syntax that comes with the package dplyr.

The “model” proposed is simply the average sales conditioned on three variables: Store, DayOfWeek and Promo.

What I found remarkable is that this aggregation step runs much faster than with the base function aggregate, which led me to benchmark five different ways of computing a grouped summary:

  1. base R
    • aggregate
    • by
    • tapply
  2. dplyr package: group_by_ %>% summarise
  3. data.table package: [,by=]

We use the library microbenchmark to time the computation of average sales conditional on three factors in the Rossmann data set.

require(dplyr)
#read train and test data
train = read.csv('D:/kaggle/Rossmann/input/train.csv.gz',as.is=T)

#only keep those rows for which sales > 0 
train = train[train$Sales>0,]
dim(train)
## [1] 844338      9
#define the variables we want to condition on
preds=c('Store','DayOfWeek','Promo')

library(microbenchmark)

In summary (the benchmark objects used here are created further below), the differences are staggering:

Benchmark = c(aggregate=mean(AggFormula[,2]),  by=mean(BY[,2]),  
              tapply=mean(TA[,2]), group_by=mean(GrpByPipes[,2]), 
              data.table=mean(DTgroup[,2]))/10^6
round(Benchmark)
##  aggregate         by     tapply   group_by data.table 
##      13169        835        319        123         16
barplot(Benchmark, log="y", ylab = "CPU time [ms]", main = "Benchmarks")

The logarithmic scale hides the enormous speed differences: the data.table grouping is about 835 times faster than aggregate!
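
That factor can be computed directly from the (unrounded) Benchmark vector above; this one-liner is not part of the original post:

# slowdown of each method relative to the data.table grouping
round(Benchmark / Benchmark["data.table"])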

GrpByPipes=microbenchmark(
  "mdl" = train %>% group_by_(.dots=preds) %>% summarise(PredSales=mean(Sales))  %>% ungroup(), times = 20)
print(GrpByPipes)
AggFormula =microbenchmark(
  "mdl2" =  aggregate(Sales ~ Store + DayOfWeek+Promo, data=train,FUN=mean), times = 20)
print(AggFormula)

Does it help to bypass the formula overhead?

AggNoFormula =microbenchmark(
  "mdl3" =  aggregate(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(AggNoFormula)

Hardly any difference.

Are aggregate's alternatives by and tapply any faster?

TA =microbenchmark(
  "mdl3b" =  tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)

BY =microbenchmark(
  "mdl3c" =  by(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(TA, eval=F)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  mdl3b 293.1418 305.0216 318.9019 310.9083 314.4047 415.5317    20

How does data.table compare to this? (It is slightly unfair to leave the key setting outside the benchmark.)

library(data.table)
DT= data.table(train)
system.time(setkey(DT,Store,DayOfWeek,Promo))
DTgroup =microbenchmark(
  "mdl4" =  DT[, mean(Sales), by="Store,DayOfWeek,Promo"], times = 20)
print(DTgroup, eval=F)
## Unit: milliseconds
##  expr      min       lq     mean   median       uq      max neval
##  mdl4 15.68308 15.69807 15.77953 15.70956 15.84216 16.13794    20

And a linear model is not competitive at all when it comes to computing the interactions only: the model matrix for the full three-way interaction Store:DayOfWeek:Promo has roughly 1115 × 7 × 2 ≈ 15,600 columns, which for 844,338 rows of doubles amounts to about 98 GB.

train$Store=factor(train$Store)
train$DayOfWeek=factor(train$DayOfWeek)
train$Promo=factor(train$Promo)

system.time(lm(Sales ~ Store:DayOfWeek:Promo, data=train))
print("Error: cannot allocate vector of size 98.2 Gb")
## [1] "Error: cannot allocate vector of size 98.2 Gb"