Using table.express

The goal of this package is to offer an alternative way of expressing common operations with data.table without sacrificing the performance optimizations that it offers. The foundation for the data manipulation verbs is the dplyr package, which also advocates the piping operator from the magrittr package. The rlang package powers most of this package’s functionality, which means that tidy evaluation can also be supported. There are other resources that provide comprehensive descriptions of these packages, so they will not be explained here.

Even though using data manipulation verbs can improve expressiveness in some cases, this is not always true, so using the traditional data.table syntax might still be preferable in many situations. Unfortunately, it is not entirely straightforward to use the verbs without considerable knowledge of the syntax understood by data.table, as well as its optimizations and special symbols. We will return to the importance of this at the end.

Preliminaries

All verbs defined in this package essentially parse the input given to them and build expressions that data.table understands. These expressions are used in a data.table frame, i.e. in a call similar to DT[i, j, by, ...] where DT is a data.table. The frame’s main elements are the 3 shown: i, j, and by. Let us first list the available single-table verbs and their effects in the frame (two-table joining verbs have their own vignette):

Verbs that set i:
- arrange/order_by
- filter
- filter_on
- filter_sd
- where
- max_by
- min_by
Verbs that set j:
- select
- mutate
- mutate_sd
- transmute
- transmute_sd
Verbs for grouping:
- group_by to specify by
- key_by to specify keyby
Verbs that set more than one clause:
- distinct specifies j and by

The where verb is equivalent to dplyr’s filter, but is always lazy, i.e. it expects another verb to be used afterwards. The order_by verb is simply an alias for arrange.

Some verbs can/must be combined with others, which means that some expressions must be built lazily. It is possible to work in an entirely lazy manner (see below), but, as of version 0.3.0, it is no longer necessary (do note that this makes table.express conflict with dtplyr). Nevertheless, in order to enable some of the expressions to be built lazily, the order in which the verbs are used becomes important:

If desired, group_by/key_by should be used first.
Verbs that set i should be used next, usually specifying whether they should evaluate the frame or not.
Verbs that set j should be used last.

The examples here explain this in more detail with the mtcars data:

data("mtcars")

DT <- mtcars %>%
  as.data.table %T>%
  print

#>       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
#>  2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
#> ---                                                                  
#> 31:  15.0     8   301   335  3.54 3.570 14.60     0     1     5     8
#> 32:  21.4     4   121   109  4.11 2.780 18.60     1     1     4     2

Expression delimiters

The foundation for this package is building expressions that are almost entirely delegated to data.table. In order to explicitly show when an expression is being built and subsequently evaluated, we can use 3 delimiters:

start_expr
end_expr
chain

These also serve as visual reminders that we are not dealing directly with data.tables during the process. We capture the input data.table and start the process with start_expr, and evaluate the final expression with end_expr. Using chain is equivalent to calling end_expr immediately followed by start_expr.

In order to print more information regarding the expression-building process, we can either set options(table.express.verbose = TRUE), or pass .verbose = TRUE to start_expr or chain.

Arranging rows

Let’s start with a lazy example to get an idea of how the built expressions look like. The arrange/order_by verbs add an expression with order to the frame, and let data.table handle it as usual:

DT %>%
  start_expr %>%
  order_by(mpg, -cyl)

#> .DT_[order(mpg, -cyl)]

We see here that the built expression includes a .DT_ pronoun. When the expression is evaluated, the captured data.table is assigned to the evaluation environment as said pronoun. When evaluating eagerly, this whole process happens in the background:

DT %>%
  order_by(mpg, -cyl)

#>       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:  10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4
#>  2:  10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4
#> ---                                                                  
#> 31:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#> 32:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1

Selecting columns

Even though selecting a subset of columns is a common operation, it may be undesirable to do so when working with data.tables because it always leads to deep data copies. Given that data.table supports modification by reference, these copies are necessary. For example, the following would require copying all columns between mpg and disp, since any of them could be modified by reference afterwards.

DT[, mpg:disp][, cyl := as.integer(cyl)]

Therefore, it might be preferable to use mutation verbs to remove columns that are not of interest.

With that said, the select verb can be used as usual:

DT %>%
  select(mpg, am)

#>       mpg    am
#>     <num> <num>
#>  1:  21.0     1
#>  2:  21.0     1
#> ---            
#> 31:  15.0     1
#> 32:  21.4     1

To maintain consistency, even single columns are kept as data.tables:

DT %>%
  select(mpg)

#>       mpg
#>     <num>
#>  1:  21.0
#>  2:  21.0
#> ---      
#> 31:  15.0
#> 32:  21.4

Calls to tidyselect’s helpers or to : are handled specially internally:

DT %>%
  select(mpg:disp, gear:carb)

#>       mpg   cyl  disp  gear  carb
#>     <num> <num> <num> <num> <num>
#>  1:  21.0     6   160     4     4
#>  2:  21.0     6   160     4     4
#> ---                              
#> 31:  15.0     8   301     5     8
#> 32:  21.4     4   121     4     2

DT %>%
  select(contains("M", ignore.case = TRUE))

#>       mpg    am
#>     <num> <num>
#>  1:  21.0     1
#>  2:  21.0     1
#> ---            
#> 31:  15.0     1
#> 32:  21.4     1

Tidy evaluation and the .parse argument can also aid in cases where certain parts of the frame were computed programmatically:

selected <- c("mpg", "am")
DT %>%
  select(!!!selected, .parse = TRUE)

#>       mpg    am
#>     <num> <num>
#>  1:  21.0     1
#>  2:  21.0     1
#> ---            
#> 31:  15.0     1
#> 32:  21.4     1

Transmuting columns

The transmute verb simply wraps everything in its ellipsis in a call to list and assigns the expression to the frame’s j.

DT %>%
  transmute(foo = mpg * 2, bar = exp(cyl))

#>       foo        bar
#>     <num>      <num>
#>  1:  42.0  403.42879
#>  2:  42.0  403.42879
#> ---                 
#> 31:  30.0 2980.95799
#> 32:  42.8   54.59815

Even though select can be used for the same (given the way data.table handles j), the simplicity of transmute’s internals makes it preferable when no tidyselect helpers are needed, since it avoids extra function calls and can reduce overhead significantly in special circumstances.

Summarizing columns

Because of the way data.table handles j, transmute could be used to summarize columns. However, the summarize/summarise verbs can be more expressive, and they check whether the summary value actually has length 1.

Mutating columns

The mutate verb builds an expression with := in order to perform assignment by reference by default.

DT %>%
  start_expr %>%
  mutate(mpg = mpg / 2, hp = log(hp))

#> .DT_[, `:=`(mpg = mpg/2, hp = log(hp))]

DT %>%
  copy %>%
  mutate(mpg = mpg / 2, hp = log(hp)) %>% {
    invisible(print(.))
  }

#>       mpg   cyl  disp       hp  drat    wt  qsec    vs    am  gear  carb
#>     <num> <num> <num>    <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:  10.5     6   160 4.700480  3.90 2.620 16.46     0     1     4     4
#>  2:  10.5     6   160 4.700480  3.90 2.875 17.02     0     1     4     4
#> ---                                                                     
#> 31:   7.5     8   301 5.814131  3.54 3.570 14.60     0     1     5     8
#> 32:  10.7     4   121 4.691348  4.11 2.780 18.60     1     1     4     2

It can also support expressions that already contain :=:

new_vars <- c("x", "y")

DT %>%
  copy %>%
  mutate(!!new_vars := .(1, 2), .unquote_names = FALSE) %>% {
    invisible(print(.))
  }

#>       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     x
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#>  2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> ---                                                                        
#> 31:  15.0     8   301   335  3.54 3.570 14.60     0     1     5     8     1
#> 32:  21.4     4   121   109  4.11 2.780 18.60     1     1     4     2     1
#>         y
#>     <num>
#>  1:     2
#>  2:     2
#> ---      
#> 31:     2
#> 32:     2

Filtering rows

The where/filter verbs work with the i part of the frame:

DT %>%
  filter(vs == 1L, carb > 2L)

#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  19.2     6 167.6   123  3.92  3.44  18.3     1     0     4     4
#> 2:  17.8     6 167.6   123  3.92  3.44  18.9     1     0     4     4

DT %>%
  where(vs == 1L, carb > 2L) %>%
  transmute(mean_mpg = mean(mpg))

#>    mean_mpg
#>       <num>
#> 1:     18.5

As mentioned in the beginning, where can be used as a lazy version of filter, letting us add other clauses to the same frame.

The helper verb filter_sd can be used to apply the same conditions to many columns, and it can use a special pronoun .COL while specifying the expression, lambdas as formulas, as well as tidyselect helpers to choose .SDcols:

DT %>%
  filter_sd(c("mpg", "qsec"), `>`, 20)

#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 2:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1

DT %>%
  filter_sd(c("mpg", "qsec"), .COL > 20)

#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#> 2:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1

DT %>%
  start_expr %>%
  filter_sd(contains("m"), ~ .x > 0)

#> .DT_[mpg > 0 & am > 0]

# like dplyr's filter_if
DT %>%
  start_expr %>%
  filter_sd(~ is.numeric(.x), .x > 0)

#> .DT_[mpg > 0 & cyl > 0 & disp > 0 & hp > 0 & drat > 0 & wt > 
#>     0 & qsec > 0 & vs > 0 & am > 0 & gear > 0 & carb > 0]

In order to enable combination of filtering helpers with other verbs, the .expr parameter can be used to indicate that the expression should be kept, delaying evaluation until a verb that sets j is used:

DT %>%
  copy %>%
  filter_sd(c("vs", "am"), .COL > 0, .expr = TRUE) %>%
  mutate(some_flag = TRUE) %>% {
    invisible(print(.))
  }

#>       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb some_flag
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <lgcl>
#>  1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4        NA
#>  2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4        NA
#> ---                                                                            
#> 31:  15.0     8   301   335  3.54 3.570 14.60     0     1     5     8        NA
#> 32:  21.4     4   121   109  4.11 2.780 18.60     1     1     4     2      TRUE

Using keys or secondary indices

The filter_on verb can be used to build an expression that specifies the on argument of the frame. It accepts key-value pairs where each key is a column in the data, and each value is the corresponding value that the column should have to match:

DT %>%
  start_expr %>%
  filter_on(cyl = 6, am = 0) %T>%
  print %>%
  end_expr

#> .DT_[list(6, 0), on = c("cyl", "am")]

#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
#> 2:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
#> 3:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
#> 4:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4

DT %>%
  filter_on(cyl = 6, am = 0, mult = "first")

#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1

Modifying subset of data

In order to support functionality similar to data.table’s DT[, lapply(.SD, fun), .SDcols = c("...")] syntax, 2 verbs are provided: mutate_sd and transmute_sd.

Starting with mutate_sd, it modifies columns in .SDcols by reference, and columns that are not part of .SDcols are kept:

DT %>%
  mutate_sd(c("mpg", "cyl"), exp)

print(DT)

#>            mpg        cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>          <num>      <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1: 1318815734  403.42879   160   110  3.90 2.620 16.46     0     1     4     4
#>  2: 1318815734  403.42879   160   110  3.90 2.875 17.02     0     1     4     4
#> ---                                                                            
#> 31:    3269017 2980.95799   301   335  3.54 3.570 14.60     0     1     5     8
#> 32: 1967441884   54.59815   121   109  4.11 2.780 18.60     1     1     4     2

DT %>%
  mutate_sd(c("mpg", "cyl"), log)

print(DT)

#>       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
#>  2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
#> ---                                                                  
#> 31:  15.0     8   301   335  3.54 3.570 14.60     0     1     5     8
#> 32:  21.4     4   121   109  4.11 2.780 18.60     1     1     4     2

It would also be possible to remove several columns with mutate_sd:

DT %>%
  copy %>%
  mutate_sd(c("mpg", "cyl"), NULL) %>% {
    invisible(print(.))
  }

#>      disp    hp  drat    wt  qsec    vs    am  gear  carb
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:   160   110  3.90 2.620 16.46     0     1     4     4
#>  2:   160   110  3.90 2.875 17.02     0     1     4     4
#> ---                                                      
#> 31:   301   335  3.54 3.570 14.60     0     1     5     8
#> 32:   121   109  4.11 2.780 18.60     1     1     4     2

Additionally, mutate_sd supports the special .COL pronoun that symbolizes the column that should be modified, as well as lambdas as formulas:

# like dplyr's mutate_if
DT %>%
  start_expr %>%
  mutate_sd(all(.COL %% 1 == 0), as.integer)

#> .DT_[, `:=`(c("cyl", "hp", "vs", "am", "gear", "carb"), lapply(.SD, 
#>     as.integer)), .SDcols = c("cyl", "hp", "vs", "am", "gear", 
#> "carb")]

DT %>%
  start_expr %>%
  mutate_sd(~ all(.x %% 1 == 0), as.integer)

#> .DT_[, `:=`(c("cyl", "hp", "vs", "am", "gear", "carb"), lapply(.SD, 
#>     as.integer)), .SDcols = c("cyl", "hp", "vs", "am", "gear", 
#> "carb")]

On the other hand, transmute_sd never modifies by reference, and supports special expressions to “build” what is chosen as .SDcols. These expressions can use tidyselect helpers, as well as another special pronoun: .COLNAME:

DT %>%
  transmute_sd(starts_with("d"), .COL * 2)

#>      disp  drat
#>     <num> <num>
#>  1:   320  7.80
#>  2:   320  7.80
#> ---            
#> 31:   602  7.08
#> 32:   242  8.22

DT %>%
  transmute_sd(grepl("^d", .COLNAME), .COL * 2)

#>      disp  drat
#>     <num> <num>
#>  1:   320  7.80
#>  2:   320  7.80
#> ---            
#> 31:   602  7.08
#> 32:   242  8.22

# like dplyr's transmute_if
DT %>%
  start_expr %>%
  transmute_sd(~ is.numeric(.x), ~ .x * 2)

#> .DT_[, .transmute_matching(.SD, .which = rlang::quo(~is.numeric(.x)), 
#>     .hows = rlang::quos(~.x * 2))]

Both verbs also support a list of functions for multiple transformations, but mutate_sd performs pairwise transformations by default, whereas transmute_sd performs all combinations:

DT %>%
  copy %>%
  mutate_sd(contains("m"), .(round, -1L)) %>% {
    invisible(print(.))
  }

#>       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>  1:    21     6   160   110  3.90 2.620 16.46     0    -1     4     4
#>  2:    21     6   160   110  3.90 2.875 17.02     0    -1     4     4
#> ---                                                                  
#> 31:    15     8   301   335  3.54 3.570 14.60     0    -1     5     8
#> 32:    21     4   121   109  4.11 2.780 18.60     1    -1     4     2

DT %>%
  transmute_sd(contains("m"), .(min, max, mean))

#>    min.mpg min.am max.mpg max.am mean.mpg mean.am
#>      <num>  <num>   <num>  <num>    <num>   <num>
#> 1:    10.4      0    33.9      1 20.09062 0.40625

Data manipulation by group

Since data.table already supports this by means of its by parameter, the group_by verb simply parses its input and assigns it accordingly:

DT %>%
  group_by(gear) %>%
  select(.N)

#>     gear     N
#>    <num> <int>
#> 1:     4    12
#> 2:     3    15
#> 3:     5     5

The key_by verb does the same but sets the key of the input in order to sort:

DT %>%
  key_by(gear) %>%
  select(.N)

#> Key: <gear>
#>     gear     N
#>    <num> <int>
#> 1:     3    15
#> 2:     4    12
#> 3:     5     5

Using group_by or key_by by itself would not be useful. Therefore, they are entirely lazy, and simply return the expression builder that is used in the background. Using other verbs will normally lead to the evaluation of the expression, though some verbs can delay evaluation, e.g. where and other verbs that set i when their respective .expr parameters are set to TRUE.

Distinct combinations of columns

The distinct verb is a shortcut for:

select(.SD[1]) with a by clause if .keep = TRUE.
Same as above plus specific .SDcols if .keep is a character vector.
unique(.SD) with specific .SDcols if .keep = FALSE.

DT %>%
  distinct(vs, am)

#>       vs    am   mpg   cyl  disp    hp  drat    wt  qsec  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:     0     1  21.0     6   160   110  3.90 2.620 16.46     4     4
#> 2:     1     1  22.8     4   108    93  3.85 2.320 18.61     4     1
#> 3:     1     0  21.4     6   258   110  3.08 3.215 19.44     3     1
#> 4:     0     0  18.7     8   360   175  3.15 3.440 17.02     3     2

DT %>%
  distinct(vsam = vs + am, .keep = names(DT))

#>     vsam   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:     1  21.0     6   160   110  3.90  2.62 16.46     0     1     4     4
#> 2:     2  22.8     4   108    93  3.85  2.32 18.61     1     1     4     1
#> 3:     0  18.7     8   360   175  3.15  3.44 17.02     0     0     3     2

DT %>%
  distinct(vs, am, .keep = FALSE)

#>       vs    am
#>    <num> <num>
#> 1:     0     1
#> 2:     1     1
#> 3:     1     0
#> 4:     0     0

Lazy considerations

When working lazily, i.e. with start_expr/chain/end_expr, there are special considerations that need to be taken into account.

Nesting expressions

Some data.table idioms benefit from nesting expressions, for instance this idiom to find, by groups, rows where a column has its extrema. The helper function nest_expr can help in these cases. When used inside a supported verb, it eagerly builds a functional chain with %>%, where the currently captured data.table enters as .

For example, the linked idiom could be expressed as:

DT %>%
  start_expr(.verbose = TRUE) %>%
  where(nest_expr(.start = FALSE, {
    .[, .I[mpg == max(mpg)], by = vs]$V1
  })) %>%
  end_expr

#> Nesting the result of evaluating the following functional chain:
#> .DT_ %>% {
#>     .[, .I[mpg == max(mpg)], by = vs]$V1
#> }
#> Expression after where.ExprBuilder(...):
#> .DT_[.NEST_0_]
#> Using captured data.table for evaluation.
#> Evaluating:
#> .DT_[.NEST_0_]

#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:  26.0     4 120.3    91  4.43 2.140  16.7     0     1     5     2
#> 2:  33.9     4  71.1    65  4.22 1.835  19.9     1     1     4     1

Nevertheless, this specific logic has been encapsulated in the max_by/min_by verbs.

Automatic expression chaining

By default, the verbs defined in this package automatically start a new frame whenever they want to define one of i/j/by, but the current expression’s frame has already specified it; otherwise they add to the current frame. More complex expressions are thus supported by automatically chaining data.table frames:

DT %>%
  start_expr %>%
  select(mean_mpg = mean(mpg)) %>%
  where(hp > 50L) %>%
  group_by(vs, am, gear) %>%
  order_by(gear, -vs, am) %>%
  filter(mean_mpg > 20) %T>%
  print %>%
  end_expr %>% {
    invisible(print(., nrows = 10L))
  }

#> .DT_[hp > 50L, .select_matching(, mean_mpg = mean(mpg), .negate = FALSE), 
#>     by = list(vs, am, gear)][order(gear, -vs, am)][mean_mpg > 
#>     20]
#>       vs    am  gear mean_mpg
#>    <num> <num> <num>    <num>
#> 1:     1     0     3 20.33333
#> 2:     1     0     4 21.05000
#> 3:     1     1     4 28.03333
#> 4:     0     1     4 21.00000
#> 5:     1     1     5 30.40000

If we wanted to be explicit about chaining whenever possible (see below), we could set options(table.express.chain = FALSE), which would lead to a warning being shown whenever a part of the query is replaced.

Explicit chaining

The automatic chaining mentioned above is enough in most situations. For example the following chains lead to the same result, and therefore have the same semantics:

DT[mpg > 20, mpg * 2]

#>  [1] 42.0 42.0 45.6 42.8 48.8 45.6 64.8 60.8 67.8 43.0 54.6 52.0 60.8 42.8

DT[mpg > 20][, mpg * 2]

#>  [1] 42.0 42.0 45.6 42.8 48.8 45.6 64.8 60.8 67.8 43.0 54.6 52.0 60.8 42.8

However, these two chains have different semantics:

DT[, .(mpg = mpg * 2)][mpg > 40]

#>       mpg
#>     <num>
#>  1:  42.0
#>  2:  42.0
#> ---      
#> 13:  60.8
#> 14:  42.8

DT[mpg > 40, .(mpg = mpg * 2)]

#> Empty data.table (0 rows and 1 cols): mpg

As mentioned above, chain can be used to chain expressions by evaluating the current one with end_expr, and immediately capturing the resulting data.table to start building a new expression. This can be helpful in situations where automatic chaining (or lack thereof) can lead to a change in the expression’s semantics:

DT %>%
  start_expr %>%
  transmute(mpg = mpg * 2) %>%
  filter(mpg > 40) %T>%
  print %>%
  end_expr

#> .DT_[mpg > 40, list(mpg = mpg * 2)]

#> Empty data.table (0 rows and 1 cols): mpg

DT %>%
  start_expr %>%
  transmute(mpg = mpg * 2) %>%
  chain %>%
  filter(mpg > 40) %>%
  end_expr

#>       mpg
#>     <num>
#>  1:  42.0
#>  2:  42.0
#> ---      
#> 13:  60.8
#> 14:  42.8

Eager verbs

In the following cases, the mentioned verbs use the captured data.table eagerly during expression building, even when working lazily:

filter_sd when .SDcols is a tidyselect call or a predicate using .COL.
mutate_sd when .SDcols is a tidyselect call or a predicate using .COL.
Using nest_expr in:
- anti_join’s y.
- inner_join’s y.
- right_join’s y.
- filter or where.
max_by and min_by, since they are based on nest_expr.

This can lead to unexpected results if we don’t keep in mind the expression that is built:

DT[, mpg:disp]

#>       mpg   cyl  disp
#>     <num> <num> <num>
#>  1:  21.0     6   160
#>  2:  21.0     6   160
#> ---                  
#> 31:  15.0     8   301
#> 32:  21.4     4   121

DT %>%
  start_expr %>%
  select(mpg:disp) %>%
  transmute(ans = sqrt(mpg)) %>%
  filter_sd(contains("m"), .COL > 0)

#> .DT_[, .select_matching(.SD, mpg:disp, .negate = FALSE)][mpg > 
#>     0 & am > 0, list(ans = sqrt(mpg))]

The select gets rid of am, but filter_sd sees the columns of DT before any expression has been evaluated. Explicit chaining can help in these cases, capturing intermediate results:

DT %>%
  start_expr %>%
  select(mpg:disp) %>%
  chain %>%
  transmute(ans = sqrt(mpg)) %>%
  filter_sd(contains("m"), .COL > 0)

#> .DT_[mpg > 0, list(ans = sqrt(mpg))]

As simple as possible

Many of the verbs in this package try to help make code more concise whenever possible. However, some of the added convenience requires helper functions that aid with non-standard evaluation, which doesn’t always play nicely with the data.table optimizations mentioned in the beginning. Therefore, it is worth keeping the built expressions as close to traditional data.table syntax as possible. The verbs try to stick to this principle, but in order to build a simple output expression, they also require a simple input.

Let us illustrate this point with transmute_sd, which can be used to compute summaries:

DT %>%
  start_expr %>%
  group_by(am, vs) %>%
  transmute_sd(c("mpg", "disp"), mean) %>%
  frame_append(verbose = TRUE) %T>%
  print %>%
  end_expr

#> .DT_[, c(lapply(.SD, mean)), by = list(am, vs), .SDcols = c(c("mpg", 
#>     "disp")), verbose = TRUE]
#> Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
#> forder.c received 32 rows and 2 columns
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
#> forder.c received a vector type 'integer' length 4
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'c(lapply(.SD, mean))' to 'list(mean(mpg), mean(disp))'
#> GForce optimized j to 'list(gmean(mpg), gmean(disp))' (see ?GForce)
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> This gmean took (narm=FALSE) ... gather took 0.000s
#> 0.000s
#> This gmean took (narm=FALSE) ... gather took 0.000s
#> 0.000s
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)

#>       am    vs      mpg     disp
#>    <num> <num>    <num>    <num>
#> 1:     1     0 19.75000 206.2167
#> 2:     1     1 28.37143  89.8000
#> 3:     0     1 20.74286 175.1143
#> 4:     0     0 15.05000 357.6167

We can see, for example, lapply optimization changed j .... Changing the expressions just a bit, without affecting its semantics, still hinders data.table’s ability to recognize cases it can optimize:

DT %>%
  start_expr %>%
  group_by(am, vs) %>%
  transmute_sd(c("mpg", "disp"), mean(.COL)) %>%
  frame_append(verbose = TRUE) %T>%
  print %>%
  end_expr

#> .DT_[, .transmute_matching(.SD, .which = rlang::quo(c("mpg", 
#>     "disp")), .hows = rlang::quos(mean(.COL))), by = list(am, 
#>     vs), verbose = TRUE]
#> Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
#> forder.c received 32 rows and 2 columns
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
#> forder.c received a vector type 'integer' length 4
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.transmute_matching(.SD, .which = rlang::quo(c("mpg", "disp")), .hows = rlang::quos(mean(.COL)))'
#> GForce is on, but not activated for this query; left j unchanged (see ?GForce)
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
#> 
#>   collecting discontiguous groups took 0.000s for 4 groups
#>   eval(j) took 0.001s for 4 calls
#> 0.001s elapsed (0.001s cpu)

#>       am    vs      mpg     disp
#>    <num> <num>    <num>    <num>
#> 1:     1     0 19.75000 206.2167
#> 2:     1     1 28.37143  89.8000
#> 3:     0     1 20.74286 175.1143
#> 4:     0     0 15.05000 357.6167

Corroborated by the statements j unchanged, GForce FALSE, etc. Nevertheless, a different syntax can enable additional functionality, for instance column renaming:

DT %>%
  group_by(am, vs) %>%
  transmute_sd(c("mpg", "disp"), .(avg = mean(.COL), min = min(.COL)))

#>       am    vs  avg.mpg avg.disp min.mpg min.disp
#>    <num> <num>    <num>    <num>   <num>    <num>
#> 1:     1     0 19.75000 206.2167    15.0    120.3
#> 2:     1     1 28.37143  89.8000    21.4     71.1
#> 3:     0     1 20.74286 175.1143    17.8    120.1
#> 4:     0     0 15.05000 357.6167    10.4    275.8

Thus, in order to decide which syntax to use, we must be aware of what can be optimized by data.table, but as a general rule of thumb, the simpler the better.

DT %>%
  start_expr %>%
  group_by(am, vs) %>%
  transmute_sd(c("mpg", "disp"), .(min, max)) %>%
  frame_append(verbose = TRUE) %T>%
  print %>%
  end_expr

#> .DT_[, c(min = lapply(.SD, min), max = lapply(.SD, max)), by = list(am, 
#>     vs), .SDcols = c(c("mpg", "disp")), verbose = TRUE]
#> Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
#> forder.c received 32 rows and 2 columns
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
#> forder.c received a vector type 'integer' length 4
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'c(min = lapply(.SD, min), max = lapply(.SD, max))' to 'list(min(mpg), min(disp), max(mpg), max(disp))'
#> GForce optimized j to 'list(gmin(mpg), gmin(disp), gmax(mpg), gmax(disp))' (see ?GForce)
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.000s elapsed (0.001s cpu)

#>       am    vs min.mpg min.disp max.mpg max.disp
#>    <num> <num>   <num>    <num>   <num>    <num>
#> 1:     1     0    15.0    120.3    26.0      351
#> 2:     1     1    21.4     71.1    33.9      121
#> 3:     0     1    17.8    120.1    24.4      258
#> 4:     0     0    10.4    275.8    19.2      472