Subset data frame based on number of rows per group

Package "inops" has some useful infix operators. For this particular case the operator %in#% can select elements based on how many times they occur.

library(inops)

df[df$name %in#% 1:3,]

Which returns:

  name x
1    a 1
2    a 2
3    a 3
4    b 4
5    b 5

Here df$name %in#% 1:3 returns TRUE only for elements that occur 1, 2, or 3 times. If instead we wanted to select elements that occur 4 times we would do:

df[df$name %in#% 4,]

With the following result:

  name x
6    c 6
7    c 7
8    c 8
9    c 9

Yet another way using the dpylr package is using the count function and then doing a semi join on the original data frame:

library(dplyr)

df %>% 
  count(name) %>%
  filter(n <= 3) %>%
  semi_join(df, ., by = "name")

Using the dplyr package:

df %>%
  group_by(name) %>%
  filter(n() < 4)

# A tibble: 5 x 2
# Groups:   name [2]
  name      x
  <fct> <int>
1 a         1
2 a         2
3 a         3
4 b         4
5 b         5

n() returns the number of observations in the current group, so we can group_by name, and then keep only those rows which are part of a group where the number of rows in that group is less than 4.


First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.


1. table

tt <- table(df$name)

df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

If you want to walk it through step by step:

# count each 'name', assign result to an object 'tt'
tt <- table(df$name)

# which 'name' in 'tt' occur more than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3

# from the table, select 'name' that occur < 3 times
tt[tt < 3]

# ...their names
names(tt[tt < 3])

# rows of 'name' in the data frame that matches "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])

# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

2. ave and length

As suggested by @flodel:

df[ave(df$x, df$name, FUN = length) < 3, ]

3. data.table: .N and .SD:

library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]

4. data.table: .N and .I:

setDT(df)
df[df[, .I[.N < 3], name]$V1] 

See also the related Q&A Count number of observations/rows per group and add result to data frame.