Filter top n largest groups in data.frame

We can use table to count the rows in each group, sort the counts in decreasing order, take the names of the top two entries, and keep only the rows that belong to those groups.

library(dplyr)

example_data %>%
   filter(group %in% names(sort(table(group), decreasing = TRUE)[1:2]))


#   col1 col2 group
#1     1   16     2
#2     3   18     3
#3     4   19     2
#4     5   20     3
#5     7   22     3
#6     9   24     3
#7    11   26     2
#8    12   27     2
#9    13   28     2
#10   14   29     3
#11   15   30     3
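
To make the steps explicit, here is a minimal base R sketch of the same idea. The original example_data is not shown, so the data frame below is a hypothetical reconstruction that is consistent with the printed output; in particular, the value 1 for the dropped rows is an assumption.

```r
# Hypothetical data: reconstructed to match the output shown above.
# Rows 2, 6, 8 and 10 are assumed to belong to a third group (1).
example_data <- data.frame(
  col1  = 1:15,
  col2  = 16:30,
  group = c(2L, 1L, 3L, 2L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L)
)

# Step 1: count the rows in each group
freq <- table(example_data$group)

# Step 2: sort the counts, largest first, and take the names of the top two
top2 <- names(sort(freq, decreasing = TRUE)[1:2])

# Step 3: keep only the rows whose group is one of those names
# (%in% compares the integer column against the character names by coercion)
result <- example_data[example_data$group %in% top2, ]
```

With this assumed data, group 3 has 6 rows and group 2 has 5, so result contains 11 rows.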

The same condition also works directly in base R with subset:

subset(example_data, group %in% names(sort(table(group), decreasing = TRUE)[1:2]))
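
If you need this for an arbitrary n, the base R approach could be wrapped in a small helper. This is only a sketch: top_n_groups is a hypothetical name, and the built-in mtcars data set stands in for example_data so the snippet runs on its own. It uses head() instead of [1:n] so that asking for more groups than exist does not introduce NA names.

```r
# Hypothetical helper: keep the rows belonging to the n most frequent
# values of column `col`.
top_n_groups <- function(data, col, n = 2) {
  # Frequency of each group, largest first
  counts <- sort(table(data[[col]]), decreasing = TRUE)
  # head() returns fewer than n names if there are fewer than n groups
  keep <- names(head(counts, n))
  data[as.character(data[[col]]) %in% keep, , drop = FALSE]
}

# Usage: mtcars has 14 cars with 8 cylinders, 11 with 4 and 7 with 6,
# so the two largest groups are cyl == 8 and cyl == 4.
res <- top_n_groups(mtcars, "cyl", n = 2)
```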

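We can also use tidyverse methods for this. Create a frequency column with add_count, arrange the rows by that column in ascending order, and keep the rows whose 'group' is among the last two unique 'group' values (the two most frequent). Note that this reorders the rows by group size.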

library(dplyr)
example_data %>% 
   add_count(group) %>% 
   arrange(n) %>%
   filter(group %in% tail(unique(group), 2)) %>%
   select(-n)
# A tibble: 11 x 3
#    col1  col2 group
#  <int> <int> <int>
# 1     1    16     2
# 2     4    19     2
# 3    11    26     2
# 4    12    27     2
# 5    13    28     2
# 6     3    18     3
# 7     5    20     3
# 8     7    22     3
# 9     9    24     3
#10    14    29     3
#11    15    30     3
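
If the original row order matters, a different dplyr route is to compute the top groups separately and filter with a semi join. This is a sketch of that alternative, not the method above; it uses the built-in mtcars data in place of example_data and assumes dplyr 1.0.0 or later for slice_head.

```r
library(dplyr)

# One row per group with its count, largest first; keep the top two
top_groups <- mtcars %>%
  count(cyl, sort = TRUE) %>%
  slice_head(n = 2)

# semi_join keeps the rows of mtcars that have a match in top_groups,
# without reordering them and without adding the count column
res <- mtcars %>%
  semi_join(top_groups, by = "cyl")
```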

Or using data.table

library(data.table)
setDT(example_data)[group %in% example_data[, .N, group][order(-N), head(group, 2)]]
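
Keep in mind that setDT converts example_data to a data.table by reference. If you would rather leave the original object untouched, a variant along these lines should work; it is a sketch that uses the built-in mtcars data, and dt and top are illustrative names.

```r
library(data.table)

# as.data.table returns a copy, so mtcars itself stays a plain data.frame
dt <- as.data.table(mtcars)

# Count rows per group, order by count (largest first), take the top two
top <- dt[, .N, by = cyl][order(-N), head(cyl, 2)]

# Keep only the rows belonging to those groups
res <- dt[cyl %in% top]
```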

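With dplyr, you can also rank the counts with dense_rank. This keeps every row whose group count is among the two largest distinct counts, so groups tied on count are all kept: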

example_data %>%
 add_count(group) %>%
 filter(dense_rank(desc(n)) <= 2) %>%
 select(-n)

#    col1  col2 group
#   <int> <int> <int>
# 1     1    16     2
# 2     3    18     3
# 3     4    19     2
# 4     5    20     3
# 5     7    22     3
# 6     9    24     3
# 7    11    26     2
# 8    12    27     2
# 9    13    28     2
#10    14    29     3
#11    15    30     3

Or, equivalently, using slice with row positions:

example_data %>%
 add_count(group) %>%
 slice(which(dense_rank(desc(n)) <= 2)) %>%
 select(-n)
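
One thing to watch with the dense_rank versions is how they treat ties, which differs from the table()[1:2] approach. The toy data below is made up purely to illustrate this: groups "b" and "c" both have two rows.

```r
library(dplyr)

# Made-up data: a has 3 rows, b and c have 2 each, d has 1
ties <- data.frame(g = c("a", "a", "a", "b", "b", "c", "c", "d"))

# dense_rank: the counts 3, 2, 2, 1 get ranks 1, 2, 2, 3,
# so a, b and c all pass the <= 2 filter
by_rank <- ties %>%
  add_count(g) %>%
  filter(dense_rank(desc(n)) <= 2) %>%
  select(-n)

# table()[1:2]: exactly two group names are kept, so only one of b / c survives
by_table <- ties %>%
  filter(g %in% names(sort(table(g), decreasing = TRUE)[1:2]))

nrow(by_rank)   # 7 rows: a, b and c
nrow(by_table)  # 5 rows: a plus one of the tied groups
```

Which behaviour you want depends on whether "top 2" means exactly two groups or every group sharing the two highest counts.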
