Grouping Duplicates

Both base R and data.table offer a duplicated() function, which is useful if your main goal is to identify and possibly delete duplicates.

For example:

BirthDay <- as.Date(c(rep("1970-08-08",2), "1950-03-01", "1963-12-10",rep("1968-10-18",2)))
HomeTown = c(rep("New Brunswick,NJ",2),"Springfield,OH","Buffalo,NY",rep("Portland,OR",2))

df <- data.frame(BirthDay, HomeTown)

duplicated(df)
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
df[duplicated(df), ]
##     BirthDay         HomeTown
## 2 1970-08-08 New Brunswick,NJ
## 6 1968-10-18      Portland,OR
df[!duplicated(df), ]
##     BirthDay         HomeTown
## 1 1970-08-08 New Brunswick,NJ
## 3 1950-03-01   Springfield,OH
## 4 1963-12-10       Buffalo,NY
## 5 1968-10-18      Portland,OR

However, in my work this is most of the time not sufficient for what I want to achieve. More often than not the duplicates of interest pertain only to a subset of the columns (“the key”), and I need to group the rows with duplicate keys, typically to investigate them further.

Instead of obtaining Booleans that express the non-specific notion “I have seen this key/row before, so it is a duplicate, but I will not tell you its first occurrence”, I would like a function that groups the rows/keys and thus includes the “first” row/key. Here is an example of a data set with “noisy duplicates” in the name field, where the first two columns serve as an identifier:

SurName = c("Levine","Levin","Surflaw","Smith","Blandford","Jackson")

df <- data.frame(BirthDay, HomeTown, SurName)

Instead of

duplicated(df[,1:2])
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

I would like to return the groups of rows that are duplicates. This can be achieved in base R and with various packages, but first I want to present an elegant solution offered by the ingenious data.table package:

library(data.table)
DT = data.table(df, key=c("BirthDay","HomeTown"))

# .I collects the row numbers of each group (within the keyed, i.e. sorted, DT)
(DTgrouped = DT[, list(list(.I)), by=key(DT)])
##      BirthDay         HomeTown  V1
## 1: 1950-03-01   Springfield,OH   1
## 2: 1963-12-10       Buffalo,NY   2
## 3: 1968-10-18      Portland,OR 3,4
## 4: 1970-08-08 New Brunswick,NJ 5,6
# keep only the keys that occur more than once
(DTrows = DTgrouped[sapply(DTgrouped[,V1], length) > 1, V1])
## [[1]]
## [1] 3 4
## 
## [[2]]
## [1] 5 6

And if you want not just the row numbers but the actual data, note that .I refers to rows of the keyed (and therefore re-sorted) DT, so we subset DT rather than the original df:

lapply(DTrows, function(i) DT[i])
## [[1]]
##      BirthDay    HomeTown   SurName
## 1: 1968-10-18 Portland,OR Blandford
## 2: 1968-10-18 Portland,OR   Jackson
## 
## [[2]]
##      BirthDay         HomeTown SurName
## 1: 1970-08-08 New Brunswick,NJ  Levine
## 2: 1970-08-08 New Brunswick,NJ   Levin

base R

Here is the base R solution:

# split df by the key and keep only the groups with more than one row
tmp = by(df, list(df$BirthDay, df$HomeTown), function(x) if (nrow(x) > 1) return(x))
tmp[!sapply(tmp, is.null)]
## [[1]]
##     BirthDay         HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ  Levine
## 2 1970-08-08 New Brunswick,NJ   Levin
## 
## [[2]]
##     BirthDay    HomeTown   SurName
## 5 1968-10-18 Portland,OR Blandford
## 6 1968-10-18 Portland,OR   Jackson
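
As an aside: if the goal is only to flag every row whose key occurs more than once (including its first occurrence), rather than to group the rows, base R can also do this by combining duplicated() with its fromLast argument. A minimal sketch on the same df (dupKey is just an illustrative name):

dupKey <- duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast=TRUE)
dupKey
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
df[dupKey, ]
##     BirthDay         HomeTown   SurName
## 1 1970-08-08 New Brunswick,NJ    Levine
## 2 1970-08-08 New Brunswick,NJ     Levin
## 5 1968-10-18      Portland,OR Blandford
## 6 1968-10-18      Portland,OR   Jackson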

plyr package

And here is the plyr way:

library(plyr)
tmp=dlply(df, .(BirthDay,HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]
## $`1968-10-18.Portland,OR`
##     BirthDay    HomeTown   SurName
## 1 1968-10-18 Portland,OR Blandford
## 2 1968-10-18 Portland,OR   Jackson
## 
## $`1970-08-08.New Brunswick,NJ`
##     BirthDay         HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ  Levine
## 2 1970-08-08 New Brunswick,NJ   Levin

Benchmarks will come in a separate post.



Aggregating data in R, benchmarks

Greatly varying processing times of aggregating data in R

While exploring the Kaggle Rossmann competition (https://www.kaggle.com/c/rossmann-store-sales) with our class, I stumbled upon this script: https://www.kaggle.com/shearerp/rossmann-store-sales/interactive-sales-visualization

It is a great introduction to the syntax of the dplyr package.

The “model” proposed is simply the average sales conditioned on three variables: ‘Store’, ‘DayOfWeek’, ‘Promo’.
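
Here is a minimal sketch of what such a “model” amounts to, assuming the train data frame read in below and a test data frame with the same Store, DayOfWeek and Promo columns (the names mdl and test_pred are just illustrative):

library(dplyr)
# "fit": the average Sales in every (Store, DayOfWeek, Promo) cell
mdl <- train %>%
  group_by(Store, DayOfWeek, Promo) %>%
  summarise(PredSales = mean(Sales)) %>%
  ungroup()
# "predict": attach the cell means to the test rows by matching the key columns
test_pred <- test %>% left_join(mdl, by = c("Store", "DayOfWeek", "Promo"))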

What I found remarkable is that this aggregation step runs much faster than with the base function aggregate, which led me to benchmark five different ways of computing a function conditional on grouping variables:

  1. base R
    • aggregate
    • by
    • tapply
  2. dplyr package: group_by_ %>% summarise
  3. data.table package: [,by=]

We use the microbenchmark package to time the computation of average sales conditional on three factors in the Rossmann data set.

require(dplyr)
#read the training data
train = read.csv('D:/kaggle/Rossmann/input/train.csv.gz',as.is=T)

#only keep those rows for which sales > 0 
train = train[train$Sales>0,]
dim(train)
## [1] 844338      9
#define the variables we want to condition on
preds=c('Store','DayOfWeek','Promo')

library(microbenchmark)

In summary, the differences are staggering:

# mean run times of the microbenchmark objects defined below, converted from ns to ms
Benchmark = c(aggregate=mean(AggFormula[,2]),  by=mean(BY[,2]),  
              tapply=mean(TA[,2]), group_by=mean(GrpByPipes[,2]), 
              data.table=mean(DTgroup[,2]))/10^6
round(Benchmark)
##  aggregate         by     tapply   group_by data.table 
##      13169        835        319        123         16
barplot(Benchmark, log="y", ylab = "CPU time [ms]", main = "Benchmarks")

The logarithmic scale hides the enormous speed differences: the data.table grouping is about 835 times faster than aggregate!

GrpByPipes=microbenchmark(
  "mdl" = train %>% group_by_(.dots=preds) %>% summarise(PredSales=mean(Sales))  %>% ungroup(), times = 20)
print(GrpByPipes)
AggFormula =microbenchmark(
  "mdl2" =  aggregate(Sales ~ Store + DayOfWeek+Promo, data=train,FUN=mean), times = 20)
print(AggFormula)

Does it help to bypass the formula overhead?

AggNoFormula =microbenchmark(
  "mdl3" =  aggregate(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(AggNoFormula)

Hardly any difference.

Are the alternatives to aggregate, such as by and tapply, faster?

TA =microbenchmark(
  "mdl3b" =  tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)

BY =microbenchmark(
  "mdl3c" =  by(train$Sales, list(train$Store, train$DayOfWeek, train$Promo),FUN=mean), times = 20)
print(TA, eval=F)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  mdl3b 293.1418 305.0216 318.9019 310.9083 314.4047 415.5317    20
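
One caveat when comparing outputs: tapply with three grouping factors returns a 3-dimensional array, not a long data frame like aggregate, so a fully comparable result still requires a reshaping step. A minimal sketch (the names ta and mdlTA are just illustrative):

ta <- tapply(train$Sales, list(train$Store, train$DayOfWeek, train$Promo), FUN=mean)
dim(ta)   # roughly (number of stores) x 7 x 2
# melt the array into a long data frame and drop the empty combinations
mdlTA <- na.omit(as.data.frame(as.table(ta)))
names(mdlTA) <- c("Store", "DayOfWeek", "Promo", "PredSales")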

How does data.table compare to this? (It is slightly unfair to leave the key setting outside the benchmark.)

library(data.table)
DT= data.table(train)
system.time(setkey(DT,Store,DayOfWeek,Promo))
DTgroup =microbenchmark(
  "mdl4" =  DT[, mean(Sales), by="Store,DayOfWeek,Promo"], times = 20)
print(DTgroup, eval=F)
## Unit: milliseconds
##  expr      min       lq     mean   median       uq      max neval
##  mdl4 15.68308 15.69807 15.77953 15.70956 15.84216 16.13794    20

And a linear model is not competitive at all when all we want are the interaction means:

train$Store=factor(train$Store)
train$DayOfWeek=factor(train$DayOfWeek)
train$Promo=factor(train$Promo)

system.time(lm(Sales ~ Store:DayOfWeek:Promo, data=train))
# lm() aborts because the dense model matrix does not fit into memory:
print("Error: cannot allocate vector of size 98.2 Gb")
## [1] "Error: cannot allocate vector of size 98.2 Gb"