Replace a subset of a data frame with dplyr join operations

What you describe is a join that updates some values in the original dataset. data.table handles this easily and efficiently thanks to its fast joins and its update-by-reference operator (:=).

Here's an example for your toy data:

library(data.table)
setDT(df)             # convert to data.table without copy
setDT(sub_df)         # convert to data.table without copy

# join and update "df" by reference, i.e. without copy;
# the i. prefix refers to columns of sub_df, the table being joined in
df[sub_df, on = c("id", "animal"), weight := i.weight]

The data is now updated:

#   id animal weight
#1:  1    dog   23.0
#2:  2    cat    2.2
#3:  3   duck    1.2
#4:  4  fairy    0.2
#5:  5  snake    1.3

You can use setDF(df) to convert back to an ordinary data.frame, again without copying.
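To make the round trip concrete, here is a minimal, self-contained sketch. The toy values are assumptions reconstructed from the output shown above (NA stands in for the weights that need replacing); they are not necessarily the original poster's exact data.

```r
library(data.table)

# Hypothetical toy data; NA marks the weights that need replacing
df <- data.frame(id = 1:5,
                 animal = c("dog", "cat", "duck", "fairy", "snake"),
                 weight = c(23, NA, 1.2, 0.2, NA))
sub_df <- data.frame(id = c(2L, 5L),
                     animal = c("cat", "snake"),
                     weight = c(2.2, 1.3))

setDT(df)      # both objects become data.tables, converted in place
setDT(sub_df)

# update join: rows of df that match sub_df take their weight from sub_df
df[sub_df, on = c("id", "animal"), weight := i.weight]

setDF(df)      # back to a plain data.frame, again without copying
class(df)
```

Rows of df without a match in sub_df keep their existing weight, so only ids 2 and 5 change here.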


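Remove the NAs first, then simply stack the tibbles (note that this only drops rows whose weight is NA; a row holding some other invalid value, such as "BAD", stays in the result next to its replacement):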

bind_rows(filter(df, !is.na(weight)), sub_df)
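As a rough check of how this behaves, the sketch below compares the NA filter with an anti_join on the key. The toy data is an assumption modelled on the example used elsewhere in this thread, where one weight is NA and another holds the invalid string "BAD"; the names stacked and replaced are only illustrative.

```r
library(dplyr)

# Assumed toy data: cat's weight is NA, snake's weight is the string "BAD"
df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"))
sub_df <- tibble(id = c(2L, 5L),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

# Filtering on NA only: the "BAD" snake row survives next to its replacement
stacked <- bind_rows(filter(df, !is.na(weight)), sub_df)

# anti_join drops every row of df whose id appears in sub_df before stacking
replaced <- df %>%
  anti_join(sub_df, by = "id") %>%
  bind_rows(sub_df) %>%
  arrange(id)

nrow(stacked)
nrow(replaced)
```

Under these assumptions, stacked contains snake twice, while replaced has exactly one row per id.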

For anyone looking for a solution to use in a tidyverse pipeline:

I run into this problem a lot and have written a short function, built mostly from tidyverse verbs, to get around it. It also handles the case where the original df has additional columns.

For example, if the OP's df had an additional 'height' column:

library(dplyr)

df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))

And the subset of data we wanted to join in was the same:

sub_df <- tibble(id = c(2, 5),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

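If we used the OP's method alone (anti_join %>% bind_rows), the replaced rows would end up with NA in 'height', because sub_df has no such column and bind_rows fills missing columns with NA. An extra step or two is needed to carry those values over from df.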
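A minimal sketch of that limitation, with the toy data re-created so the block runs on its own (ids are written as integers in both tibbles here, and the name naive is only illustrative):

```r
library(dplyr)

df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))
sub_df <- tibble(id = c(2L, 5L),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

# anti_join + bind_rows without completing sub_df first
naive <- df %>%
  anti_join(sub_df, by = "id") %>%
  bind_rows(sub_df)

# the replaced rows have lost their height
filter(naive, is.na(height))
```

The rows that come from sub_df carry no height, so the known heights of cat and snake are dropped along with the bad weights.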

In this case we could use the following function:

replace_subset <- function(df, df_subset, id_col_names = c()) {

  # columns of df_subset that carry "new" data (everything except the id columns)
  new_data_col_names <- setdiff(colnames(df_subset), id_col_names)

  # complete df_subset with the extra columns from df;
  # all_of() selects by a character vector without the tidyselect ambiguity warning
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -all_of(new_data_col_names)), by = id_col_names)

  # drop the rows being replaced from df, then append the completed replacements
  df_out <- df %>%
    anti_join(df_sub_to_join, by = id_col_names) %>%
    bind_rows(df_sub_to_join)

  return(df_out)

}

Now for the results:

replace_subset(df = df, df_subset = sub_df, id_col_names = c("id"))

## A tibble: 5 x 4
#     id animal weight height
#  <dbl> <chr>  <chr>  <chr> 
#1     1 dog    23     54    
#2     3 duck   1.2    21    
#3     4 fairy  0.2    50    
#4     2 cat    2.2    45    
#5     5 snake  1.3    42  
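The output above places the replaced rows at the end. If the original row order matters, one option is to sort afterwards with arrange(id). Another, assuming dplyr 1.0.0 or later, is rows_update(), which overwrites matching rows in place. This is a different technique from the function above, sketched here on the same toy data (re-created so the block runs on its own, with ids written as integers in both tibbles so the key types are identical; the name updated is only illustrative):

```r
library(dplyr)

df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))
sub_df <- tibble(id = c(2L, 5L),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

# rows_update() matches on "id" and overwrites only the columns present in
# sub_df; columns missing from sub_df (height) and the row order are kept
updated <- rows_update(df, sub_df, by = "id")
updated
```

Note that rows_update() expects every key in sub_df to exist in df and every column of sub_df to exist in df, which holds for this example.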

And here's an example using the function in a pipeline:

df %>%
  replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
  mutate(across(c(weight, height), as.numeric)) %>%
  mutate(bmi = weight / (height^2))

## A tibble: 5 x 5
#     id animal weight height      bmi
#  <dbl> <chr>   <dbl>  <dbl>    <dbl>
#1     1 dog      23       54 0.00789 
#2     3 duck      1.2     21 0.00272 
#3     4 fairy     0.2     50 0.00008 
#4     2 cat       2.2     45 0.00109 
#5     5 snake     1.3     42 0.000737

Hope this is helpful :)

Tags: R, dplyr