Grouping Duplicates - Code and Stats

Both base R and data.table offer their duplicated() functions which are useful if your main goals is to identify and possibly delete duplicates.

For example

BirthDay <- as.Date(c(rep("1970-08-08",2), "1950-03-01", "1963-12-10",rep("1968-10-18",2)))
HomeTown = c(rep("New Brunswick,NJ",2),"Springfield,OH","Buffalo,NY",rep("Portland,OR",2))

df <-data.frame(BirthDay,HomeTown)

duplicated(df)

## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

df[duplicated(df), ]

##     BirthDay         HomeTown
## 2 1970-08-08 New Brunswick,NJ
## 6 1968-10-18      Portland,OR

df[!duplicated(df), ]

##     BirthDay         HomeTown
## 1 1970-08-08 New Brunswick,NJ
## 3 1950-03-01   Springfield,OH
## 4 1963-12-10       Buffalo,NY
## 5 1968-10-18      Portland,OR

However, in my work most of the time this is not sufficient for what I would like to achieve. More often than not the duplicates of interest pertain only to a subset of the columns (“the key”), and I need to group those rows with duplicate keys;typically to investigate them further.

Instead of obtaining Booleans that indicate the non-specific notion “I gave seen this key/row before, so it is a duplicate but I will not tell you its first occurrence”, I would like a function that groups the rows/keys and so includes the “first row/key”. Here is an example of a data set with “noisy duplicates” in the name field, with the first 2 columns serving as an identifier:

SurName = c("Levine","Levin","Surflaw","Smith","Blandford","Jackson")

df <-data.frame(BirthDay,HomeTown,SurName)

Instead of

duplicated(df[,1:2])

## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

I would like to return the groups of rows that are duplicates. This can be achieved in base R and with various packages, but I first want to propose an elegant solution offered by the ingenious data.table library:

library(data.table)
DT = data.table(df, key=c("BirthDay","HomeTown"))

(DTgrouped = DT[, list(list(.I)) ,by=key(DT)])

##      BirthDay         HomeTown  V1
## 1: 1950-03-01   Springfield,OH   1
## 2: 1963-12-10       Buffalo,NY   2
## 3: 1968-10-18      Portland,OR 3,4
## 4: 1970-08-08 New Brunswick,NJ 5,6

(DTrows = DTgrouped[sapply(DTgrouped[,V1],length)>1,V1])

## [[1]]
## [1] 3 4
## 
## [[2]]
## [1] 5 6

And if you wanted not just the row numbers but the actual data:

lapply(DTrows, function(i) return(df[i,]))

## [[1]]
##     BirthDay       HomeTown SurName
## 3 1950-03-01 Springfield,OH Surflaw
## 4 1963-12-10     Buffalo,NY   Smith
## 
## [[2]]
##     BirthDay    HomeTown   SurName
## 5 1968-10-18 Portland,OR Blandford
## 6 1968-10-18 Portland,OR   Jackson

base R

Here is the base R solution

tmp=by(df, list(df$BirthDay,df$HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]

## [[1]]
##     BirthDay         HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ  Levine
## 2 1970-08-08 New Brunswick,NJ   Levin
## 
## [[2]]
##     BirthDay    HomeTown   SurName
## 5 1968-10-18 Portland,OR Blandford
## 6 1968-10-18 Portland,OR   Jackson

plyr package

And here is the plyr way:

library(plyr)
tmp=dlply(df, .(BirthDay,HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]

## $`1968-10-18.Portland,OR`
##     BirthDay    HomeTown   SurName
## 1 1968-10-18 Portland,OR Blandford
## 2 1968-10-18 Portland,OR   Jackson
## 
## $`1970-08-08.New Brunswick,NJ`
##     BirthDay         HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ  Levine
## 2 1970-08-08 New Brunswick,NJ   Levin

Benchmarks will come in a separate post.

base R

plyr package

Leave a Reply Cancel reply