Both base R and data.table offer their duplicated() functions which are useful if your main goals is to identify and possibly delete duplicates.
For example
BirthDay <- as.Date(c(rep("1970-08-08",2), "1950-03-01", "1963-12-10",rep("1968-10-18",2)))
HomeTown = c(rep("New Brunswick,NJ",2),"Springfield,OH","Buffalo,NY",rep("Portland,OR",2))
df <-data.frame(BirthDay,HomeTown)
duplicated(df)
## [1] FALSE TRUE FALSE FALSE FALSE TRUE
df[duplicated(df), ]
## BirthDay HomeTown
## 2 1970-08-08 New Brunswick,NJ
## 6 1968-10-18 Portland,OR
df[!duplicated(df), ]
## BirthDay HomeTown
## 1 1970-08-08 New Brunswick,NJ
## 3 1950-03-01 Springfield,OH
## 4 1963-12-10 Buffalo,NY
## 5 1968-10-18 Portland,OR
However, in my work most of the time this is not sufficient for what I would like to achieve. More often than not the duplicates of interest pertain only to a subset of the columns (“the key”), and I need to group those rows with duplicate keys;typically to investigate them further.
Instead of obtaining Booleans that indicate the non-specific notion “I gave seen this key/row before, so it is a duplicate but I will not tell you its first occurrence”, I would like a function that groups the rows/keys and so includes the “first row/key”. Here is an example of a data set with “noisy duplicates” in the name field, with the first 2 columns serving as an identifier:
SurName = c("Levine","Levin","Surflaw","Smith","Blandford","Jackson")
df <-data.frame(BirthDay,HomeTown,SurName)
Instead of
duplicated(df[,1:2])
## [1] FALSE TRUE FALSE FALSE FALSE TRUE
I would like to return the groups of rows that are duplicates. This can be achieved in base R and with various packages, but I first want to propose an elegant solution offered by the ingenious data.table library:
library(data.table)
DT = data.table(df, key=c("BirthDay","HomeTown"))
(DTgrouped = DT[, list(list(.I)) ,by=key(DT)])
## BirthDay HomeTown V1
## 1: 1950-03-01 Springfield,OH 1
## 2: 1963-12-10 Buffalo,NY 2
## 3: 1968-10-18 Portland,OR 3,4
## 4: 1970-08-08 New Brunswick,NJ 5,6
(DTrows = DTgrouped[sapply(DTgrouped[,V1],length)>1,V1])
## [[1]]
## [1] 3 4
##
## [[2]]
## [1] 5 6
And if you wanted not just the row numbers but the actual data:
lapply(DTrows, function(i) return(df[i,]))
## [[1]]
## BirthDay HomeTown SurName
## 3 1950-03-01 Springfield,OH Surflaw
## 4 1963-12-10 Buffalo,NY Smith
##
## [[2]]
## BirthDay HomeTown SurName
## 5 1968-10-18 Portland,OR Blandford
## 6 1968-10-18 Portland,OR Jackson
base R
Here is the base R solution
tmp=by(df, list(df$BirthDay,df$HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]
## [[1]]
## BirthDay HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ Levine
## 2 1970-08-08 New Brunswick,NJ Levin
##
## [[2]]
## BirthDay HomeTown SurName
## 5 1968-10-18 Portland,OR Blandford
## 6 1968-10-18 Portland,OR Jackson
plyr package
And here is the plyr way:
library(plyr)
tmp=dlply(df, .(BirthDay,HomeTown), function(x) if (nrow(x)>1) return(x))
tmp[!sapply(tmp,is.null)]
## $`1968-10-18.Portland,OR`
## BirthDay HomeTown SurName
## 1 1968-10-18 Portland,OR Blandford
## 2 1968-10-18 Portland,OR Jackson
##
## $`1970-08-08.New Brunswick,NJ`
## BirthDay HomeTown SurName
## 1 1970-08-08 New Brunswick,NJ Levine
## 2 1970-08-08 New Brunswick,NJ Levin
Benchmarks will come in a separate post.
// add bootstrap table styles to pandoc tables $(document).ready(function () { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); });