Replace a sequence of values by group depending on preceding values

Here's another data.table approach:

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
  • create a new column "x" holding the run-length ids of values, computed separately within each id
  • join dt on itself: take the "b" rows, shift their run-length id (x) back by one so it points at the preceding run, and set values to "a" (the specific value you want to change); rows of dt that match on id, x and values then get values updated to "b"
  • delete column x afterwards

The result is:

dt
#     id values
#  1:  1      a
#  2:  1      c
#  3:  1      b
#  4:  1      b
#  5:  1      a
#  6:  2      c
#  7:  2      c
#  8:  2      b
#  9:  2      b
# 10:  2      c
# 11:  3      c
# 12:  3      b
# 13:  3      b
# 14:  3      b
# 15:  3      b
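
To make the intermediate run ids visible, here is a minimal sketch on a made-up table. The name toy and its contents are assumptions for illustration only, not the data from the question:

```r
library(data.table)

# Hypothetical toy input (not the asker's data)
toy <- data.table(
  id     = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  values = c("a", "a", "b", "a", "c", "c", "a", "a", "b")
)

# Run-length ids restart at 1 within each id
toy[, x := rleid(values), by = .(id)]
# toy$x is now 1 1 2 3 4 1 2 2 3

# For every "b" row, describe the run just before it (x - 1),
# but only if that run consists of "a"
lookup <- toy[values == "b", .(id, x = x - 1, values = "a")]

# Update join: rows of toy matching the lookup get values set to "b"
toy[lookup, on = .(id, x, values), values := "b"][, x := NULL]

toy$values
# "b" "b" "b" "a" "c" "c" "b" "b" "b"
```

The "a" in row 4 stays as it is because the run that follows it is "c", not "b".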

And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
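
A small sketch of the generalized version, again on made-up data (toy2 is a hypothetical name and table, chosen only to exercise the "x" and "y" cases):

```r
library(data.table)

# Hypothetical toy input containing "x" and "y" runs followed by "b"
toy2 <- data.table(
  id     = c(1, 1, 1, 1, 2, 2, 2),
  values = c("x", "x", "b", "y", "y", "b", "a")
)

toy2[, x := rleid(values), by = .(id)]

# One lookup row per candidate value for each run that precedes a "b" run
lookup2 <- toy2[values == "b",
                .(values = c("a", "x", "y")),
                by = .(id, x = x - 1)]

toy2[lookup2, on = .(id, x, values), values := "b"][, x := NULL]

toy2$values
# "b" "b" "b" "y" "b" "b" "a"
```

The trailing "y" in id 1 and the trailing "a" in id 2 are untouched because no "b" follows them within their group.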

Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.

(1) Create a variable 'v2' which is NA where 'values' is "a". (2) Fill the missing values by next observation carried backward. (3) Where the original 'values' is "a" and the corresponding filled value in 'v2' is "b", update 'values' with 'v2'.

# 1
dt[values != "a" , v2 := values]

# 2
dt[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]

# 3
dt[values == "a" & v2 == "b", values := v2]

# clean-up
dt[ , v2 := NULL]

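Currently, nafill only works with numeric variables, hence the replace step in chunk # 2 (modified from @chinsoon12's code in the issue "nafill, setnafill for character, factor and other types"): the row indices are filled instead of the character values, and the filled indices are then used to subset v2.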
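
The index trick from chunk # 2 may be easier to follow on a plain vector. A minimal sketch, where v, idx and filled_idx are names made up for this illustration:

```r
library(data.table)

# Character vector with gaps; nafill is not applied to it directly
v <- c(NA, "b", NA, NA, "c", NA)

# Positions of the non-missing elements, NA elsewhere (an integer vector)
idx <- replace(seq_along(v), is.na(v), NA)
# NA 2 NA NA 5 NA

# Fill the positions by next observation carried backward
filled_idx <- nafill(idx, type = "nocb")
# 2 2 5 5 5 NA

# Subsetting by the filled positions gives the filled character vector
v[filled_idx]
# "b" "b" "c" "c" "c" NA
```

The last element stays NA because there is no later observation to carry backward.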

The NA replacement code may be slightly shortened by using zoo::na.locf:

dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]

However, note that na.locf is slower.


When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE))), it turns out that this nafill alternative actually is faster than the others.
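
For anyone who wants to repeat the comparison, here is a scaled-down sketch. The wrapper names f_rleid and f_nafill, the seed and the data size are my own assumptions; the bodies are the rleid join and the nafill steps from above. It first checks that both give the same result and then times them, so no timings are claimed here:

```r
library(data.table)

# Wrapper around the rleid + update join approach (works on a copy)
f_rleid <- function(d) {
  d <- copy(d)
  d[, x := rleid(values), by = .(id)]
  d[d[values == "b", .(id, x = x - 1, values = "a")],
    on = .(id, x, values),
    values := "b"][, x := NULL]
  d[]
}

# Wrapper around the nafill approach (works on a copy)
f_nafill <- function(d) {
  d <- copy(d)
  d[values != "a", v2 := values]
  d[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")],
    by = id]
  d[values == "a" & v2 == "b", values := v2]
  d[, v2 := NULL]
  d[]
}

# Much smaller than the 1e8 rows mentioned above, so it runs quickly
set.seed(1)
small <- data.table(
  id     = rep(1:100, each = 100),
  values = sample(c("a", "b", "c"), 1e4, replace = TRUE)
)

res_rleid  <- f_rleid(small)
res_nafill <- f_nafill(small)

# Both approaches should agree row by row
stopifnot(identical(res_rleid$values, res_nafill$values))

system.time(f_rleid(small))
system.time(f_nafill(small))
```

Increase the size of small step by step to see how the two scale on your machine.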