Remove *all* duplicate rows, unless there's a "similar" row
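The question does not include the original data, so here is a hypothetical `dt` consistent with the outputs shown in the answers below (group 1 contains only duplicated rows and is dropped entirely; group 2 has more than one distinct `V2`, so its unique rows are kept):

```r
# Hypothetical example data (not from the OP) reconstructed from the
# printed results: group 1 is all duplicates, group 2 has V2 = 5, 5, 6, 7.
dt <- data.frame(V1 = c(1, 1, 2, 2, 2, 2),
                 V2 = c(3, 3, 5, 5, 6, 7))
# The OP's object is a data.table; convert if the package is available:
# library(data.table); setDT(dt)
```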

One option is to group by 'V1', get the row indices of the groups that have more than one unique 'V2' value, and then take the unique rows:

unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
#   V1 V2
#1:  2  5
#2:  2  6
#3:  2  7

Or, as @r2evans mentioned:

unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])

NOTE: The OP's dataset is a data.table, so data.table methods are the natural way of doing this.


If we need a tidyverse option, one comparable to the data.table option above is

library(dplyr)
dt %>%
   group_by(V1) %>% 
   filter(n_distinct(V2) > 1) %>% 
   distinct()

Another dplyr possibility:

dt %>%
 group_by(V1) %>%
 filter(n_distinct(V2) != 1 & !duplicated(V2))

     V1    V2
  <dbl> <dbl>
1     2     5
2     2     6
3     2     7

Or:

dt %>%
 group_by(V1) %>%
 filter(n_distinct(V2) != 1) %>%
 group_by(V1, V2) %>%
 slice(1)

In your case, with base R:

dt[ave(dt$V2, dt$V1, FUN = function(x) length(unique(x))) > 1 & !duplicated(dt)]
   V1 V2
1:  2  5
2:  2  6
3:  2  7

Tags:

R

Data.Table