There are many different ways that magrittr could implement the pipe. The goal of this document is to elucidate the variations, and the various pros and cons of each approach. This document is primarily aimed at the magrittr developers (so we don’t forget about important considerations), but will be of interest to anyone who wants to understand pipes better, or to create their own pipe that makes different tradeoffs
There are three main options for how we might transform a pipeline in
base R expressions. Here they are illustrated with
x %>% foo() %>% bar()
:
Nested
Eager (mask), masking environment
This is essentially how %>%
has been implemented
prior to magrittr 2.0:
Eager (mask-num): masking environment, numbered placeholder
Eager (lexical): lexical environment
This variant assigns pipe expressions to the placeholder
.
in the current environment. This assignment is temporary:
once the pipe has returned, the placeholder binding is reset to its
previous state.
Lazy (mask): masking environments
Lazy (mask-num): masking environment, numbered placeholder
Lazy (lexical-num): lexical environment, numbered placeholder
We’ll first explore the desired properties we might want a pipe to possess and then see how each of the three variants does.
These are the properties that we might want a pipe to possess, roughly ordered from most important to least important.
Visibility: the visibility of the final function in the pipe should be preserved. This is important so that pipes that end in a side-effect function (which generally returns its first argument invisibly) do not print.
Multiple placeholders: each component of the
pipe should only be evaluated once even when there are multiple
placeholders, so that sample(10) %>% cbind(., .)
yields
two columns with the same value. Relatedly,
sample(10) %T>% print() %T>% print()
must print the
same values twice.
Lazy evaluation: steps of the pipe should only
be evaluated when actually needed. This is a useful property as it means
that pipes can handle code like stop("!") %>% try()
,
making pipes capable of capturing a wider range of R expressions.
On the other hand, it might have surprising effects. For instance if a function that suppresses warnings is added to the end of a pipeline, the suppression takes effect for the whole pipeline.
Persistence of piped values: arguments are not necessarily evaluated right away by the piped function. Sometimes they are evaluated long after the pipeline has returned, for example when a function factory is piped. With persistent piped values, the constructed function can be called at any time:
Refcount neutrality: the return value of the pipeline should have a reference count of 1 so it can be mutated in place in further manipulations.
Eager unbinding: pipes are often used with large data objects, so intermediate objects in the pipeline should be unbound as soon as possible so they are available for garbage collection.
Progressive stack: using the pipe should add as
few entries to the call stack as possible, so that
traceback()
is maximally useful.
Lexical side effects: side effects should occur
in the current lexical environment. This way,
NA %>% { foo <- . }
assigns the piped value in the
current environment and NA %>% { return(.) }
returns
from the function that contains the pipeline.
Continuous stack: the pipe should not affect the chain of parent frames. This is important for tree representations of the call stack.
It is possible to have proper visibility and a neutral impact on refcounts with all implementations by being a bit careful, so we’ll only consider the other properties:
Nested | Eager (mask) |
Eager (mask-num) |
Eager (lexical) |
Lazy (mask) |
Lazy (mask-num) |
Lazy (lexical-num) |
|
---|---|---|---|---|---|---|---|
Multiple placeholders | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Lazy evaluation | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
Persistence | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |
Eager unbinding | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
Progressive stack | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Lexical effects | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ |
Continuous stack | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ |
Some properties are a direct reflection of the high level design decisions.
The nested pipe does not assign piped expressions to a placeholder. All the other variants perform this assignment. This means that with a nested rewrite approach, it isn’t possible to have multiple placeholders unless the piped expression is pasted multiple times. This would cause multiple evaluations with deleterious effects:
Assigning to the placeholder within an argument would preserve the nestedness and lazyness. However that wouldn’t work properly because there’s no guarantee that the first argument will be evaluated before the second argument.
For these reasons, the nested pipe does not support multiple placeholders. By contrast, all the other variants assign the result of pipe expressions to the placeholder. There are variations in how the placeholder binding is created (lazily or eagerly, in a mask or in the current environment, with numbered symbols or with a unique symbol) but all these variants allow multiple placeholders.
Because the local variants of the pipe evaluate in a mask, they do not have lexical effects or a continuous stack. This is unlike the lexical variants that evaluate in the current environment.
Unlike the lazy variants, all eager versions implementing the pipe with iterated evaluation do not pass the lazy evaluation criterion.
Secondly, no lazy variant passes the progressive stack criterion. By construction, lazy evaluation requires pushing all the pipe expressions on the stack before evaluation starts. Conversely, all eager variants have a progressive stack.
None of the variants that use numbered placeholders can unbind piped values eagerly. This is how they achieve persistence of these bindings.
The GNU R team is considering implementing the nested approach in
base R with a parse-time code transformation (just like
->
is transformed to <-
by the
parser).
We have implemented three approaches in magrittr:
These approaches have complementary strengths and weaknesses.
The nested pipe does not bind expressions to a placeholder and so can’t support multiple placeholders.
Because it relies on the usual rules of argument application, the nested pipe is lazy.
The pipe expressions are binded as promises within the execution environment of each function. This environment persists as long as a promise holds onto it. Evaluating the promise discards the reference to the environment which becomes available for garbage collection.
For instance, here is a function factory that creates a function. The constructed function returns the value supplied at the time of creation:
This does not cause any issue with the nested pipe:
Because the piped expressions are lazily evaluated, the whole pipeline is pushed on the stack before execution starts. This results in a more complex backtrace than necessary:
faulty <- function() stop("tilt")
f <- function(x) x + 1
g <- function(x) x + 2
h <- function(x) x + 3
faulty() %|>% f() %|>% g() %|>% h()
#> Error in faulty() : tilt
traceback()
#> 7: stop("tilt")
#> 6: faulty()
#> 5: f(faulty())
#> 4: g(f(faulty()))
#> 3: h(g(f(faulty())))
#> 2: .External2(magrittr_pipe) at pipe.R#181
#> 1: faulty() %|>% f() %|>% g() %|>% h()
Also note how the expressions in the backtrace look different from the actual code. This is because of the nested rewrite of the pipeline.
This is a benefit of using the normal R rules of evaluation. Side effects occur in the correct environment:
Control flow has the correct behaviour:
Because evaluation occurs in the current environment, the stack is continuous. Let’s instrument errors with a structured backtrace to see what that means:
The tree representation of the backtrace correctly represents the hierarchy of execution frames:
foobar <- function(x) x %|>% quux()
quux <- function(x) x %|>% stop()
"tilt" %|>% foobar()
#> Error in x %|>% stop() : tilt
rlang::last_trace()
#> <error/rlang_error>
#> tilt
#> Backtrace:
#> █
#> 1. ├─"tilt" %|>% foobar()
#> 2. └─global::foobar("tilt")
#> 3. ├─x %|>% quux()
#> 4. └─global::quux(x)
#> 5. └─x %|>% stop()
Pipe expressions are eagerly assigned to the placeholder. This makes it possible to use the placeholder multiple times without causing multiple evaluations.
Assignment forces eager evaluation of each step.
Because we’re updating the value of .
at each step, the
piped expressions are not persistent. This has subtle effects when the
piped expressions are not evaluated right away.
With the eager pipe we get rather confusing results with the factory
function if we try to call the constructed function in the middle of the
pipeline. In the following snippet the placeholder .
is
binded to the constructed function itself rather than the initial value
TRUE
, by the time the function is called:
fn <- TRUE %!>% factory() %!>% { .() }
fn()
#> function ()
#> x
#> <bytecode: 0x556093f60da8>
#> <environment: 0x556093e61210>
Also, since we’re binding .
in the current environment,
we need to clean it up once the pipeline has returned. At that point,
the placeholder no longer exists:
Or it has been reset to its previous value, if any:
This is the flip side of updating the value of the placeholder at each step. The previous intermediary values can be collected right away.
Since pipe expressions are evaluated one by one as they come, only the relevant part of the pipeline is on the stack when an error is thrown:
faulty <- function() stop("tilt")
f <- function(x) x + 1
g <- function(x) x + 2
h <- function(x) x + 3
faulty() %!>% f() %!>% g() %!>% h()
#> Error in faulty() : tilt
traceback()
#> 4: stop("tilt")
#> 3: faulty()
#> 2: .External2(magrittr_pipe) at pipe.R#163
#> 1: faulty() %!>% f() %!>% g() %!>% h()
Pipe expressions are lazily assigned to the placeholder. This makes it possible to use the placeholder multiple times without causing multiple evaluations.
Arguments are assigned with delayedAssign()
and lazily
evaluated:
The lazy masking pipe uses one masking environment per pipe expression. This allows persistence of the intermediary values and out of order evaluation. The factory function works as expected for instance:
Because we use one mask environment per pipe expression, the intermediary values can be collected as soon as they are no longer needed.
With a lazy pipe the whole pipeline is pushed onto the stack before evaluation.
faulty <- function() stop("tilt")
f <- function(x) x + 1
g <- function(x) x + 2
h <- function(x) x + 3
faulty() %?>% f() %?>% g() %?>% h()
#> Error in faulty() : tilt
traceback()
#> 7: stop("tilt")
#> 6: faulty()
#> 5: f(.)
#> 4: g(.)
#> 3: h(.)
#> 2: .External2(magrittr_pipe) at pipe.R#174
#> 1: faulty() %?>% f() %?>% g() %?>% h()
Note however how the backtrace is less cluttered than with the nested pipe approach, thanks to the placeholder.
The lazy pipe evaluates in a mask. This causes lexical side effects to occur in the incorrect environment.
Stack-sensitive functions like return()
function cannot
find the proper frame environment:
The masking environment causes a discontinuous stack tree:
foobar <- function(x) x %?>% quux()
quux <- function(x) x %?>% stop()
"tilt" %?>% foobar()
#> Error in x %?>% stop() : tilt
rlang::last_trace()
#> <error/rlang_error>
#> tilt
#> Backtrace:
#> █
#> 1. ├─"tilt" %?>% foobar()
#> 2. ├─global::foobar(.)
#> 3. │ └─x %?>% quux()
#> 4. └─global::quux(.)
#> 5. └─x %?>% stop()