weird behavior when merging one non-empty data.frame with an empty one

This is a subtle one. The misstep occurs in this line of merge.data.frame, the method that base::merge dispatches to for data frames:

y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone), 
            -by.y, drop = FALSE]

When you pass df2.b as the y argument to merge, this line actually produces an invalid data frame, as you can see in the browser:

Browse[2]> y
#>        V4
#> NA   NULL
#> NA.1 <NA>
#> NA.2 <NA>
#> NA.3 <NA>
#> Warning message:
#> In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
#>  corrupt data frame: columns will be truncated or padded with NAs

Tracing the logic through, we can reproduce the problem outside the debugger by calling:

df2.b[c(1, 1, 1, 1), -c(1:2), drop = FALSE]
#>        V4
#> NA   NULL
#> NA.1 <NA>
#> NA.2 <NA>
#> NA.3 <NA>
#> Warning message:
#> In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
#>  corrupt data frame: columns will be truncated or padded with NAs
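
To see where the row index c(1, 1, 1, 1) comes from, here is a minimal sketch of the index expression from that line of merge. The values of all.x, all.y, nxx and m below are assumptions chosen to match the reproduction (a left join in which none of four rows of x finds a match in an empty y); they were not captured from the actual debugger session.

```r
# Assumed state inside merge.data.frame (hypothetical values for illustration)
all.x <- TRUE                 # keep unmatched rows of x
all.y <- FALSE
nxx   <- 4L                   # number of x rows with no match in y
m     <- list(yi = integer(0), y.alone = integer(0))  # y is empty, so no matches

# Same index expression as in the merge source line quoted above
idx <- c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone)
idx
#> [1] 1 1 1 1
```

Row 1 is used only as a placeholder for the unmatched rows. In a zero-row data frame that row does not exist, so everything depends on how each column responds to an out-of-bounds index.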

Whereas we don't get this problem for df2.a:

df2.a[c(1, 1, 1, 1), -c(1:2), drop = FALSE]
#>      V3 V4
#> NA   NA NA
#> NA.1 NA NA
#> NA.2 NA NA
#> NA.3 NA NA

So why is this? Even though df2.a and df2.b look the same when printed, they are not the same. An empty numeric vector isn't quite the same as NULL. The main difference (the one that causes the problem here) is that indexing an empty numeric vector beyond its length gives you one NA per requested index, whereas indexing NULL always gives you NULL.

df2.a$V1[1:4]
#> [1] NA NA NA NA

df2.b$V1[1:4]
#> NULL
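
The same contrast shows up without any data frame involved. This is a small standalone sketch using a bare numeric(0) and a bare NULL; the variable names are made up for illustration.

```r
empty_num <- numeric(0)   # zero-length numeric vector
nothing   <- NULL

idx <- c(1, 1, 1, 1)      # the same out-of-bounds index that merge ends up using

empty_num[idx]
#> [1] NA NA NA NA
nothing[idx]
#> NULL

length(empty_num[idx])    # 4: one NA per requested position
length(nothing[idx])      # 0: NULL stays NULL
```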

So I guess this is expected behaviour, given how NULL responds to indexing. The underlying problem is that R allows NULL as a data frame column at all. I'm surprised this kind of thing doesn't happen more often.
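
One way to guard against this before calling merge is to check for NULL columns explicitly. The helper below is a hypothetical sketch, not part of base R. How df2.b was actually constructed isn't shown here, so `bad` is only a hand-built stand-in for a data frame with NULL columns.

```r
# Hypothetical helper: TRUE if any column of df is NULL
has_null_column <- function(df) {
  any(vapply(unclass(df), is.null, logical(1)))
}

# A deliberately malformed data frame, assembled by hand for illustration
bad <- structure(list(V1 = NULL, V2 = NULL),
                 class = "data.frame", row.names = integer(0))

# A well-formed empty data frame with zero-length numeric columns
good <- data.frame(V1 = numeric(0), V2 = numeric(0))

has_null_column(bad)
#> [1] TRUE
has_null_column(good)
#> [1] FALSE
```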


I tracked down the cause of this issue and found that it arises in the following section of merge.data.frame:

y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone), 
            -by.y, drop = FALSE]

To show the problem, try the following code:

df2.b[rep(1, 4), -(1:2), drop = FALSE]
#        V4
# NA   NULL
# NA.1 <NA>
# NA.2 <NA>
# NA.3 <NA>
# Warning message:
# In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
#   corrupt data frame: columns will be truncated or padded with NAs

df2.a[rep(1, 4), -(1:2), drop = FALSE]
#    V3 V4
# 1: NA NA
# 2: NA NA
# 3: NA NA
# 4: NA NA

Therefore, this issue is caused by [.data.frame. A section of the source code of [.data.frame is:

for (j in seq_along(x)) {
    xj <- xx[[sxx[j]]]
    x[[j]] <- if (length(dim(xj)) != 2L) {
        xj[i]
    } else {
        xj[i, , drop = FALSE]
    }
}

Here, x is the data frame being built for return; at this point it has only the columns V3 and V4. xx is a copy of the input data frame (df2.b in our case). On the first iteration, the loop assigns NULL to column 1 of x, and assigning NULL to a column deletes it, so V3 disappears. On the second iteration, the loop assigns NULL to column 2 of x, but with V3 gone there is no column 2, so x is left unchanged. That is why we get the unexpected result.
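
The delete-then-skip effect can be reproduced in isolation. This sketch uses a plain list rather than a data frame, since assigning NULL through [[<- removes an element in both cases; the values 1:4 and 5:8 are arbitrary placeholders.

```r
x <- list(V3 = 1:4, V4 = 5:8)

for (j in seq_along(x)) {   # seq_along(x) is fixed at 1:2 before the loop starts
  x[[j]] <- NULL            # j = 1 deletes V3; j = 2 finds no second element
}

names(x)
#> [1] "V4"
length(x)
#> [1] 1
```

V4 survives untouched because, by the time j reaches 2, it has already shifted into position 1.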

If we convert df1 and df2.b to data.table objects, merging them throws an error instead. It seems that data.table's merge method treats such cases more strictly, and the error message helps us avoid silently getting unexpected results.

Tags: Merge, R, Dataframe