This document outlines my previous approach to non-standard evaluation (NSE). You should avoid it unless you are working with an older version of dplyr or tidyr.
There are three key ideas:
1. Instead of using substitute(), use lazyeval::lazy() to capture both expression and environment. (Or use lazyeval::lazy_dots(...) to capture promises in ...)
2. Every function that uses NSE should have a standard evaluation (SE) escape hatch that does the actual computation. The SE-function name should end with _.
3. The SE-function has a flexible input specification to make it easy for people to program with.
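As a rough illustration of the first idea, the sketch below captures several arguments with lazy_dots() and evaluates them later with lazy_eval(). The helper name capture_all() and the values of a and b are assumptions made for this example; they are not part of lazyeval.

```r
library(lazyeval)

# Hypothetical helper: capture every argument passed through ... as a lazy object
capture_all <- function(...) lazy_dots(...)

# Illustrative values (assumed for this sketch)
a <- 10
b <- 1

dots <- capture_all(total = a + b, a * b)

# Each element stores an unevaluated expression together with its environment
dots[[1]]$expr

# lazy_eval() evaluates each captured expression in turn and returns a list
vals <- lazy_eval(dots)
vals[[1]]
vals[[2]]
```

Nothing is computed until lazy_eval() is called, which is what lets an SE function decide how and where to evaluate its inputs.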
lazy()
The key tool that makes this approach possible is lazy(), an equivalent to substitute() that captures both expression and environment associated with a function argument:
library(lazyeval)
f <- function(x = a - b) {
  lazy(x)
}
f()
#> <lazy>
#> expr: a - b
#> env: <environment: 0x561651c4e348>
f(a + b)
#> <lazy>
#> expr: a + b
#> env: <environment: R_GlobalEnv>
As a complement to eval(), the lazyeval package provides lazy_eval(), which uses the environment associated with the lazy object:
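A minimal sketch of this, reusing the definition of f() from above. The values assigned to a and b are assumptions chosen for illustration:

```r
library(lazyeval)

f <- function(x = a - b) {
  lazy(x)
}

# Illustrative values (assumed for this sketch)
a <- 10
b <- 1

# The default argument a - b is evaluated in the environment captured from f()
lazy_eval(f())
#> [1] 9

# The supplied expression a + b is evaluated in the caller's environment
lazy_eval(f(a + b))
#> [1] 11
```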
The second argument to lazy_eval() is a list or data frame where names should be looked up first:
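A small sketch of the lookup order, again assuming illustrative values for a and b and reusing f() from above:

```r
library(lazyeval)

f <- function(x = a - b) lazy(x)

# Illustrative values (assumed for this sketch)
a <- 10
b <- 1

# a is found in the list first; b still comes from the captured environment
lazy_eval(f(), list(a = 1))
#> [1] 0

# A data frame works the same way: its columns are looked up before outer variables
avg_mpg <- lazy_eval(lazy(mean(mpg)), mtcars)
avg_mpg
```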
lazy_eval() also works with formulas, since they contain the same information as a lazy object: an expression (only the RHS is used by convention) and an environment:
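A minimal sketch with one-sided formulas. The function h() and the values of a and b are assumptions made for this example:

```r
library(lazyeval)

# Illustrative values (assumed for this sketch)
a <- 10
b <- 1

# A formula created at the top level carries the calling environment
lazy_eval(~ a + b)
#> [1] 11

# A formula created inside a function remembers that function's environment,
# so i is still found when the formula is evaluated later
h <- function(i) {
  ~ 10 + i
}
lazy_eval(h(1))
#> [1] 11
```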
Whenever we need a function that does non-standard evaluation, always write the standard evaluation version first. For example, let’s implement our own version of subset():
subset2_ <- function(df, condition) {
  r <- lazy_eval(condition, df)
  r <- r & !is.na(r)
  df[r, , drop = FALSE]
}
subset2_(mtcars, lazy(mpg > 31))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
lazy_eval() will always coerce its first argument into a lazy object, so a variety of specifications will work:
subset2_(mtcars, ~mpg > 31)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
subset2_(mtcars, quote(mpg > 31))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
subset2_(mtcars, "mpg > 31")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Note that quoted calls and strings don’t have environments associated with them, so as.lazy() defaults to using baseenv(). This will work if the expression is self-contained (i.e. doesn’t contain any references to variables in the local environment), and will otherwise fail quickly and robustly.
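The sketch below shows both cases. The function names fails() and works() and the variable local_cutoff are hypothetical and exist only for this illustration. It relies on as.lazy() accepting an env argument to override the baseenv() default:

```r
library(lazyeval)

# A string carries no environment, so as.lazy() falls back to baseenv()
identical(as.lazy("mpg > 31")$env, baseenv())
#> [1] TRUE

# local_cutoff exists only inside each function body
fails <- function() {
  local_cutoff <- 31
  # The expression is evaluated from baseenv(), which cannot see local_cutoff
  lazy_eval(as.lazy("mpg > local_cutoff"), mtcars)
}

works <- function() {
  local_cutoff <- 31
  # Supplying env explicitly makes the local variable visible
  lazy_eval(as.lazy("mpg > local_cutoff", env = environment()), mtcars)
}

res <- try(fails(), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE

sum(works())
#> [1] 2
```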
With the SE version in hand, writing the NSE version is easy. We just use lazy() to capture the unevaluated expression and corresponding environment:
subset2 <- function(df, condition) {
  subset2_(df, lazy(condition))
}
subset2(mtcars, mpg > 31)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
This standard evaluation escape hatch is very important because it allows us to implement different NSE approaches. For example, we could create a subsetting function that finds all rows where a variable is above a threshold:
above_threshold <- function(df, var, threshold) {
  cond <- interp(~ var > x, var = lazy(var), x = threshold)
  subset2_(df, cond)
}
above_threshold(mtcars, mpg, 31)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Here we’re using interp() to modify a formula. We use the value of threshold and the expression in var.
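To make the substitution visible, here is a small sketch that calls interp() directly, outside any function. Passing the symbol with as.name("mpg") is an assumption chosen for illustration; inside above_threshold() the same role is played by lazy(var).

```r
library(lazyeval)

# Splice a symbol and a value into a formula template
cond <- interp(~ var > x, var = as.name("mpg"), x = 31)

# The right-hand side now refers to mpg and 31 instead of var and x
cond[[2]]

# The result is an ordinary formula that lazy_eval() can evaluate against data
sum(lazy_eval(cond, mtcars))
#> [1] 2
```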
Because lazy() captures the environment associated with the function argument, we automatically avoid a subtle scoping bug present in subset():
x <- 31
f1 <- function(...) {
  x <- 30
  subset(mtcars, ...)
}
# Uses 30 instead of 31
f1(mpg > x)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
f2 <- function(...) {
  x <- 30
  subset2(mtcars, ...)
}
# Correctly uses 31
f2(mpg > x)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
lazy() has another advantage over substitute(): by default, it follows promises across function invocations. This simplifies the casual use of NSE.
x <- 31
g1 <- function(comp) {
  x <- 30
  subset(mtcars, comp)
}
g1(mpg > x)
#> Error: object 'mpg' not found
g2 <- function(comp) {
  x <- 30
  subset2(mtcars, comp)
}
g2(mpg > x)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Note that g2() doesn’t have a standard-evaluation escape hatch, so it’s not suitable for programming with in the same way that subset2_() is.
Take the following example:
library(lazyeval)
f1 <- function(x) lazy(x)
g1 <- function(y) f1(y)
g1(a + b)
#> <lazy>
#> expr: a + b
#> env: <environment: R_GlobalEnv>
lazy() returns a + b because it always tries to find the top-level promise.
In this case the process looks like this:
1. Find the object that x is bound to.
2. It’s a promise, so find the expression it’s bound to (y, a symbol) and the environment in which it should be evaluated (the environment of g1()).
3. Since x is bound to a symbol, look up its value: it’s bound to a promise.
4. That promise has the expression a + b and should be evaluated in the global environment.
5. The expression is not a symbol, so stop.
Occasionally, you want to avoid this recursive behaviour, so you can use .follow_symbols = FALSE:
f2 <- function(x) lazy(x, .follow_symbols = FALSE)
g2 <- function(y) f2(y)
g2(a + b)
#> <lazy>
#> expr: x
#> env: <environment: 0x561650aa5a70>
Either way, if you evaluate the lazy expression you’ll get the same result:
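A minimal sketch that repeats the definitions of f1(), g1(), f2() and g2() from above so that it runs on its own. The values of a and b are assumptions chosen for illustration:

```r
library(lazyeval)

f1 <- function(x) lazy(x)
g1 <- function(y) f1(y)

f2 <- function(x) lazy(x, .follow_symbols = FALSE)
g2 <- function(y) f2(y)

# Illustrative values (assumed for this sketch)
a <- 10
b <- 1

# Promise chain already resolved to a + b in the calling environment
lazy_eval(g1(a + b))
#> [1] 11

# Evaluating x forces the chain of promises at evaluation time instead
lazy_eval(g2(a + b))
#> [1] 11
```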
Note that the resolution of chained promises only works with unevaluated objects. This is because R deletes the information about the environment associated with a promise when it has been forced, so that the garbage collector is allowed to remove the environment from memory in case it is no longer used. lazy() will fail with an error in such situations.