R data.table - only keep rows with duplicate ID (most efficient solution)

We can use .I to get the row indices of the groups whose frequency count is greater than 1, extract that column, and use it to subset the data.table:

dt[dt[, .I[.N > 1], .(x, y)]$V1]

NOTE: This should be faster than the .SD approach.
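
For context, here is a minimal reproducible sketch; the toy data.table dt and its columns x, y, z are assumed for illustration and are not from the original question:

library(data.table)
dt <- data.table(x = c(1, 1, 2, 3), y = c("a", "a", "b", "c"), z = 1:4)

# .I[.N > 1] returns the row indices of every (x, y) group whose size exceeds 1;
# $V1 extracts those indices and the outer dt[...] subsets the original table
dt[dt[, .I[.N > 1], .(x, y)]$V1]
#    x y z
# 1: 1 a 1
# 2: 1 a 2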


Here is another option:

dt[dt[rowid(x, y) > 1], on=.(x, y), .SD]
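
To see how this works, here is a sketch using the same assumed toy dt as above: rowid(x, y) numbers each row within its (x, y) combination in order of appearance, so rowid(x, y) > 1 flags every repeat occurrence, and the join on (x, y) then pulls back all rows of those groups.

library(data.table)
dt <- data.table(x = c(1, 1, 2, 3), y = c("a", "a", "b", "c"), z = 1:4)

# per-group running counter; values greater than 1 mark repeated occurrences
dt[, .(x, y, z, rid = rowid(x, y))]
#    x y z rid
# 1: 1 a 1   1
# 2: 1 a 2   2
# 3: 2 b 3   1
# 4: 3 c 4   1

# the rows flagged as repeats, which the join expands back to their full groups
dt[rowid(x, y) > 1]
#    x y z
# 1: 1 a 2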

In your example, the explanation for why 0 rows are returned is correct. Because the grouping columns are, by definition, identical within each group, they can be accessed via .BY, and .SD deliberately excludes them to prevent duplication.
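
A quick illustration of that point, again with the assumed toy dt: inside a grouped j, .SD contains only the non-grouping columns, while the grouping values are reachable through .BY.

library(data.table)
dt <- data.table(x = c(1, 1, 2, 3), y = c("a", "a", "b", "c"), z = 1:4)

# .SD holds just z here; the grouping values travel in .BY instead
dt[, .(sd_cols = paste(names(.SD), collapse = ","), by_x = .BY$x), by = .(x, y)]
#    x y sd_cols by_x
# 1: 1 a       z    1
# 2: 2 b       z    2
# 3: 3 c       z    3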

By default, when by is used, the grouping columns are also returned as the leftmost columns of the output; hence in get_duplicate_id_rows2 you see x and y first, followed by the columns of .SD as specified in .SDcols.
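
get_duplicate_id_rows2 itself comes from the question and is not reproduced here, but the column-ordering behaviour can be sketched with an equivalent .SD call on the assumed toy dt:

library(data.table)
dt <- data.table(x = c(1, 1, 2, 3), y = c("a", "a", "b", "c"), z = 1:4)

# the grouping columns x and y come back leftmost, followed by the .SDcols columns
dt[, .SD[.N > 1], by = .(x, y), .SDcols = "z"]
#    x y z
# 1: 1 a 1
# 2: 1 a 2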

Lastly, regarding efficiency, you can time the various options posted here using microbenchmark on your actual dataset and share your results.
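
A sketch of such a timing run (the labels are placeholders; substitute your real dt and any of the other posted answers):

library(data.table)
library(microbenchmark)

# dt is assumed to be your actual data.table with grouping columns x and y
microbenchmark(
  index_approach = dt[dt[, .I[.N > 1], .(x, y)]$V1],
  sd_approach    = dt[, .SD[.N > 1], by = .(x, y)],
  times = 100
)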

Tags:

R

data.table