Non-standard evaluation of subset argument with mapply in R

@AllanCameron alluded to the possibility of a purrr::map solution. Here are a few options:

  1. Since we know we want to subset by the breaks column, we only need to provide the cutoff values and therefore don't have to worry about delaying evaluation of an expression. Here and in the other examples, we name the elements of the breaks list so that the output will also have names telling us what breaks cutoff value was used. Also, in all examples we take advantage of the dplyr::filter function to filter the data in the data argument, rather than the subset argument:
library(tidyverse)

map2(list(breaks.lt.15=15,
          breaks.lt.20=20),
     list(~ wool,
          ~ wool + tension),
     ~ xtabs(.y, data=filter(warpbreaks, breaks < .x))
)
#> $breaks.lt.15
#> wool
#> A B 
#> 2 2 
#> 
#> $breaks.lt.20
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5
  1. Similar to above, but we supply the whole filter expression and wrap the filter expressions in quos to delay evaluation. !!.x evaluates the expressions at the point where we filter the warpbreaks data frame inside xtabs.
map2(quos(breaks.lt.15=breaks < 15,
          breaks.lt.20=breaks < 20),
     list(~ wool,
          ~ wool + tension),
     ~ xtabs(.y, data=filter(warpbreaks, !!.x))
)
#> $breaks.lt.15
#> wool
#> A B 
#> 2 2 
#> 
#> $breaks.lt.20
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5
  1. In case you want all combinations of filters and xtabs formulas, you can use the crossing function to generate the combinations and then pass that to pmap ("parallel map"), which can take any number of arguments, all contained in a single list. In this case we use rlang::exprs instead of quos to delay evaluation. rlang::exprs would also have worked above, but quos doesn't work here. I'm not sure I really understand why, but it has to do with capturing both the expression and its environment (quos) vs. capturing just the expression (exprs), as discussed here.
# map over all four combinations of breaks and xtabs formulas
crossing(
  rlang::exprs(breaks.lt.15=breaks < 15,
               breaks.lt.20=breaks < 20),
  list(~ wool,
       ~ wool + tension)
) %>% 
  pmap(~ xtabs(.y, data=filter(warpbreaks, !!.x)))
#> $breaks.lt.15
#> wool
#> A B 
#> 2 2 
#> 
#> $breaks.lt.15
#>     tension
#> wool L M H
#>    A 0 1 1
#>    B 1 0 1
#> 
#> $breaks.lt.20
#> wool
#> A B 
#> 7 9 
#> 
#> $breaks.lt.20
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5

You could also go with tidyverse functions for the summary instead of xtabs and return a data frame. For example:

map2_df(c(15,20),
        list("wool",
             c("wool", "tension")),
        ~ warpbreaks %>% 
            filter(breaks < .x) %>% 
            group_by_at(.y) %>% 
            tally() %>% 
            bind_cols(max.breaks=.x - 1)
) %>% 
  mutate_if(is.factor, ~replace_na(fct_expand(., "All"), "All")) %>% 
  select(is.factor, everything())   # Using select this way requires development version of dplyr, soon to be released on CRAN as version 1.0.0
#> # A tibble: 7 x 4
#>   wool  tension     n max.breaks
#>   <fct> <fct>   <int>      <dbl>
#> 1 A     All         2         14
#> 2 B     All         2         14
#> 3 A     M           4         19
#> 4 A     H           3         19
#> 5 B     L           2         19
#> 6 B     M           2         19
#> 7 B     H           5         19

If you wanted to include marginal counts, you could do:

crossing(
  c(Inf,15,20),
  list(NULL, "wool", "tension", c("wool", "tension"))
) %>% 
  pmap_df(
    ~ warpbreaks %>% 
        filter(breaks < .x) %>% 
        group_by_at(.y) %>% 
        tally() %>% 
        bind_cols(max.breaks=.x - 1)
  ) %>% 
  mutate_if(is.factor, ~replace_na(fct_expand(., "All"), "All")) %>% 
  select(is.factor, everything()) 

#>    wool tension  n max.breaks
#> 1   All     All  4         14
#> 2     A     All  2         14
#> 3     B     All  2         14
#> 4   All       L  1         14
#> 5   All       M  1         14
#> 6   All       H  2         14
#> 7     A       M  1         14
#> 8     A       H  1         14
#> 9     B       L  1         14
#> 10    B       H  1         14
#> 11  All     All 16         19
#> 12    A     All  7         19
#> 13    B     All  9         19
#> 14  All       L  2         19
#> 15  All       M  6         19
#> 16  All       H  8         19
#> 17    A       M  4         19
#> 18    A       H  3         19
#> 19    B       L  2         19
#> 20    B       M  2         19
#> 21    B       H  5         19
#> 22  All     All 54        Inf
#> 23    A     All 27        Inf
#> 24    B     All 27        Inf
#> 25  All       L 18        Inf
#> 26  All       M 18        Inf
#> 27  All       H 18        Inf
#> 28    A       L  9        Inf
#> 29    A       M  9        Inf
#> 30    A       H  9        Inf
#> 31    B       L  9        Inf
#> 32    B       M  9        Inf
#> 33    B       H  9        Inf

And if we add a pivot_wider onto the end of the previous chain, we can get:

pivot_wider(names_from=max.breaks, values_from=n, 
            names_prefix="breaks<=", values_fill=list(n=0))
   wool  tension `breaks<=14` `breaks<=19` `breaks<=Inf`
 1 All   All                4           16            54
 2 A     All                2            7            27
 3 B     All                2            9            27
 4 All   L                  1            2            18
 5 All   M                  1            6            18
 6 All   H                  2            8            18
 7 A     M                  1            4             9
 8 A     H                  1            3             9
 9 B     L                  1            2             9
10 B     H                  1            5             9
11 B     M                  0            2             9
12 A     L                  0            0             9

It's an issue of NSE. One way is to inject the subset conditions in the call directly so they can be applied in the relevant context (the data, where breaks exist).

It can be done by using alist() instead of list(), to have a list of quoted expressions, then build the correct call, (using bquote() is the easiest way) and evaluate it.

mapply(
  FUN = function(formula, data, subset) 
    eval(bquote(xtabs(formula, data, .(subset)))),
  formula = list(~ wool,
                 ~ wool + tension),
  subset = alist(breaks < 15,
                 breaks < 20),
  MoreArgs = list(data = warpbreaks))
#> [[1]]
#> wool
#> A B 
#> 2 2 
#> 
#> [[2]]
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5

mapply(FUN = function(formula, data, FUN, subset)
  eval(bquote(aggregate(formula, data, FUN, subset = .(subset)))),
  formula = list(breaks ~ wool,
                 breaks ~ wool + tension),
  subset = alist(breaks < 15,
                 breaks < 20),
  MoreArgs = list(data = warpbreaks,
                  FUN = length))
#> [[1]]
#>   wool breaks
#> 1    A      2
#> 2    B      2
#> 
#> [[2]]
#>   wool tension breaks
#> 1    B       L      2
#> 2    A       M      4
#> 3    B       M      2
#> 4    A       H      3
#> 5    B       H      5

You don't really need MoreArgs anymore as you can use the arguments directly in the call, so you might want to simplify it as follows :

mapply(
  FUN = function(formula, subset) 
    eval(bquote(xtabs(formula, warpbreaks, subset = .(subset)))),
  formula = list(~ wool,
                 ~ wool + tension),
  subset = alist(breaks < 15,
                 breaks < 20))
#> [[1]]
#> wool
#> A B 
#> 2 2 
#> 
#> [[2]]
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5

mapply(FUN = function(formula, subset)
  eval(bquote(aggregate(formula, warpbreaks, length, subset = .(subset)))),
  formula = list(breaks ~ wool,
                 breaks ~ wool + tension),
  subset = alist(breaks < 15,
                 breaks < 20))
#> [[1]]
#>   wool breaks
#> 1    A      2
#> 2    B      2
#> 
#> [[2]]
#>   wool tension breaks
#> 1    B       L      2
#> 2    A       M      4
#> 3    B       M      2
#> 4    A       H      3
#> 5    B       H      5

You could also avoid the call manipulation and the adhoc FUN argument by building datasets to loop on using lapply :

mapply(
  FUN =  xtabs,
  formula = list(~ wool,
                 ~ wool + tension),
  data =  lapply(c(15, 20), function(x) subset(warpbreaks, breaks < x)))
#> [[1]]
#> wool
#> A B 
#> 2 2 
#> 
#> [[2]]
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5

mapply(
  FUN =  aggregate,
  formula = list(breaks ~ wool,
                 breaks ~ wool + tension),
  data =  lapply(c(15, 20), function(x) subset(warpbreaks, breaks < x)),
  MoreArgs = list(FUN = length))
#> [[1]]
#>   wool breaks
#> 1    A      2
#> 2    B      2
#> 
#> [[2]]
#>   wool tension breaks
#> 1    B       L      2
#> 2    A       M      4
#> 3    B       M      2
#> 4    A       H      3
#> 5    B       H      5

The short answer is that when you create a list to pass as an argument to a function, it is evaluated at the point of creation. The error you are getting is because R tries to create the list you want to pass in the calling environment.

To see this more clearly, suppose you try creating the arguments you want to pass ahead of calling mapply:

f_list <- list(~ wool, ~ wool + tension)
d_list <- list(data = warpbreaks)
mapply(FUN = xtabs, formula = f_list, MoreArgs = d_list)
#> [[1]]
#> wool
#>  A  B 
#> 27 27 
#> 
#> [[2]]
#>     tension
#> wool L M H
#>    A 9 9 9
#>    B 9 9 9

There is no problem with creating a list of formulas, because these are not evaluated until needed, and of course warpbreaks is accessible from the global environment, hence this call to mapply works.

Of course, if you try to create the following list ahead of the mapply call:

subset_list <- list(breaks < 15, breaks < 20)

Then R will tell you that breaks isn't found.

However, if you create the list with warpbreaks in the search path, then you won't have a problem:

subset_list <- with(warpbreaks, list(breaks < 15, breaks < 20))
subset_list
#> [[1]]
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [14]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#> [27] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#> [53] FALSE FALSE
#> 
#> [[2]]
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
#> [14]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
#> [27] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#> [40]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
#> [53]  TRUE FALSE

so you would think that we could just pass this to mapply and everything would be fine, but now we get a new error:

mapply(FUN = xtabs, formula = f_list, subset = subset_list, MoreArgs = d_list)
#> Error in eval(substitute(subset), data, env) : object 'dots' not found

So why are we getting this?

The problem lies in any functions passed to mapply that call eval, or that themselves call a function that uses eval.

If you look at the source code for mapply you will see that it takes the extra arguments you have passed and puts them in a list called dots, which it will then pass to an internal mapply call:

mapply
#> function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) 
#> {
#>     FUN <- match.fun(FUN)
#>     dots <- list(...)
#>     answer <- .Internal(mapply(FUN, dots, MoreArgs))
#> ...

If your FUN itself calls another function that calls eval on any of its arguments, it will therefore try to eval the object dots, which won't exist in the environment in which the eval is called. This is easy to see by doing an mapply on a match.call wrapper:

mapply(function(x) match.call(), x = list(1))
[[1]]
(function(x) match.call())(x = dots[[1L]][[1L]])

So a minimal reproducible example of our error is

mapply(function(x) eval(substitute(x)), x = list(1))
#> Error in eval(substitute(x)) : object 'dots' not found

So what's the solution? It seems like you have already hit on a perfectly good one, that is, manually subsetting the data frame you wish to pass. Others may suggest that you explore purrr::map to get a more elegant solution.

However, it is possible to get mapply to do what you want, and the secret is just to modify FUN to turn it into an anonymous wrapper of xtabs that subsets on the fly:

mapply(FUN = function(formula, subset, data) xtabs(formula, data[subset,]), 
       formula = list(~ wool, ~ wool + tension),
       subset = with(warpbreaks, list(breaks < 15, breaks < 20)),
       MoreArgs = list(data = warpbreaks))
#> [[1]]
#> wool
#> A B 
#> 2 2 
#> 
#> [[2]]
#>     tension
#> wool L M H
#>    A 0 4 3
#>    B 2 2 5