--- title: "Using table.express" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using table.express} %\VignetteEngine{knitr::rmarkdown_notangle} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} library("table.express") library("data.table") # knitr defaults knitr::opts_chunk$set(comment = "#>") options("datatable.print.nrows" = 4L) options("datatable.print.topn" = 2L) ``` The goal of this package is to offer an alternative way of **expressing** common operations with [`data.table`](https://cran.r-project.org/package=data.table) without sacrificing the performance optimizations that it offers. The foundation for the data manipulation verbs is the [`dplyr`](https://cran.r-project.org/package=dplyr) package, which also advocates the piping operator from the [`magrittr`](https://cran.r-project.org/package=magrittr) package. The [`rlang`](https://cran.r-project.org/package=rlang) package powers most of this package's functionality, which means that [tidy evaluation](https://dplyr.tidyverse.org/articles/programming.html) can also be supported. There are other resources that provide comprehensive descriptions of these packages, so they will not be explained here. Even though using data manipulation verbs can improve expressiveness in some cases, this is not always true, so using the traditional `data.table` syntax might still be preferable in many situations. Unfortunately, it is not entirely straightforward to use the verbs without considerable knowledge of the syntax understood by `data.table`, as well as its [optimizations](https://rdrr.io/github/Rdatatable/data.table/man/datatable-optimize.html) and [special symbols](https://rdrr.io/github/Rdatatable/data.table/man/special-symbols.html). We will return to the importance of this at the [end](#as-simple-as-possible). ### Preliminaries All verbs defined in this package essentially parse the input given to them and build expressions that `data.table` understands. These expressions are used in a `data.table` *frame*, i.e. in a call similar to `DT[i, j, by, ...]` where `DT` is a `data.table`. The [frame's main elements](https://rdrr.io/github/Rdatatable/data.table/man/data.table.html) are the 3 shown: `i`, `j`, and `by`. Let us first list the available single-table verbs and their effects in the frame (two-table joining verbs have [their own vignette](https://asardaes.github.io/table.express/articles/joins.html)): - Verbs that set `i`: - `arrange`/`order_by` - `filter` - `filter_on` - `filter_sd` - `where` - `max_by` - `min_by` - Verbs that set `j`: - `select` - `mutate` - `mutate_sd` - `transmute` - `transmute_sd` - Verbs for grouping: - `group_by` to specify `by` - `key_by` to specify `keyby` - Verbs that set more than one clause: - `distinct` specifies `j` and `by` The `where` verb is equivalent to `dplyr`'s `filter`, but is always lazy, i.e. it expects another verb to be used afterwards. The `order_by` verb is simply an alias for `arrange`. Some verbs can/must be combined with others, which means that some expressions must be built *lazily*. It is possible to work in an entirely lazy manner (see [below](#expression-delimiters)), but, as of version 0.3.0, it is no longer necessary (do note that this makes `table.express` conflict with [`dtplyr`](https://cran.r-project.org/package=dtplyr)). Nevertheless, in order to enable some of the expressions to be built lazily, the *order* in which the verbs are used becomes important: - If desired, `group_by`/`key_by` should be used first. - Verbs that set `i` should be used next, usually specifying whether they should evaluate the frame or not. - Verbs that set `j` should be used last. The examples here explain this in more detail with the `mtcars` data: ```{r mtcars} data("mtcars") DT <- mtcars %>% as.data.table %T>% print ``` ### Expression delimiters The foundation for this package is building expressions that are almost entirely delegated to `data.table`. In order to explicitly show when an expression is being built and subsequently evaluated, we can use 3 delimiters: - `start_expr` - `end_expr` - `chain` These also serve as visual reminders that we are not dealing directly with `data.table`s during the process. We capture the input `data.table` and start the process with `start_expr`, and evaluate the final expression with `end_expr`. Using `chain` is equivalent to calling `end_expr` immediately followed by `start_expr`. In order to print more information regarding the expression-building process, we can either set `options(table.express.verbose = TRUE)`, or pass `.verbose = TRUE` to `start_expr` or `chain`. ## Arranging rows Let's start with a lazy example to get an idea of how the built expressions look like. The `arrange`/`order_by` verbs add an expression with `order` to the frame, and let `data.table` handle it as usual: ```{r arrange-lazy} DT %>% start_expr %>% order_by(mpg, -cyl) ``` We see here that the built expression includes a `.DT_` pronoun. When the expression is evaluated, the captured `data.table` is assigned to the evaluation environment as said pronoun. When evaluating eagerly, this whole process happens in the background: ```{r arrange-eager} DT %>% order_by(mpg, -cyl) ``` ## Selecting columns Even though selecting a subset of columns is a common operation, it may be undesirable to do so when working with `data.table`s because it *always* leads to **deep** data copies. Given that `data.table` supports modification by reference, these copies are necessary. For example, the following would require copying all columns between `mpg` and `disp`, since any of them could be modified by reference afterwards. ```{r select-deep-copy, eval = FALSE} DT[, mpg:disp][, cyl := as.integer(cyl)] ``` Therefore, it might be preferable to use mutation verbs to remove columns that are not of interest. With that said, the `select` verb can be used as usual: ```{r select} DT %>% select(mpg, am) ``` To maintain consistency, even single columns are kept as `data.table`s: ```{r select-single} DT %>% select(mpg) ``` Calls to `tidyselect`'s helpers or to `:` are handled specially internally: ```{r select-special} DT %>% select(mpg:disp, gear:carb) DT %>% select(contains("M", ignore.case = TRUE)) ``` Tidy evaluation and the `.parse` argument can also aid in cases where certain parts of the frame were computed programmatically: ```{r select-tidy} selected <- c("mpg", "am") DT %>% select(!!!selected, .parse = TRUE) ``` ## Transmuting columns The `transmute` verb simply wraps everything in its ellipsis in a call to `list` and assigns the expression to the frame's `j`. ```{r transmute} DT %>% transmute(foo = mpg * 2, bar = exp(cyl)) ``` Even though `select` can be used for the same (given the way `data.table` handles `j`), the simplicity of `transmute`'s internals makes it preferable when no `tidyselect` helpers are needed, since it avoids extra function calls and can reduce overhead significantly in special circumstances. ### Summarizing columns Because of the way `data.table` handles `j`, `transmute` could be used to summarize columns. However, the `summarize`/`summarise` verbs can be more expressive, and they check whether the summary value actually has length 1. ## Mutating columns The `mutate` verb builds an expression with `:=` in order to perform assignment by reference by default. ```{r mutate} DT %>% start_expr %>% mutate(mpg = mpg / 2, hp = log(hp)) DT %>% copy %>% mutate(mpg = mpg / 2, hp = log(hp)) %>% { invisible(print(.)) } ``` It can also support expressions that already contain `:=`: ```{r mutate-no-unquote} new_vars <- c("x", "y") DT %>% copy %>% mutate(!!new_vars := .(1, 2), .unquote_names = FALSE) %>% { invisible(print(.)) } ``` ## Filtering rows The `where`/`filter` verbs work with the `i` part of the frame: ```{r filter} DT %>% filter(vs == 1L, carb > 2L) DT %>% where(vs == 1L, carb > 2L) %>% transmute(mean_mpg = mean(mpg)) ``` As mentioned in the beginning, `where` can be used as a lazy version of `filter`, letting us add other clauses to the same frame. The helper verb `filter_sd` can be used to apply the same conditions to many columns, and it can use a special pronoun `.COL` while specifying the expression, lambdas as formulas, as well as `tidyselect` helpers to choose `.SDcols`: ```{r filter-sd} DT %>% filter_sd(c("mpg", "qsec"), `>`, 20) DT %>% filter_sd(c("mpg", "qsec"), .COL > 20) DT %>% start_expr %>% filter_sd(contains("m"), ~ .x > 0) # like dplyr's filter_if DT %>% start_expr %>% filter_sd(~ is.numeric(.x), .x > 0) ``` In order to enable combination of filtering helpers with other verbs, the `.expr` parameter can be used to indicate that the expression should be kept, delaying evaluation until a verb that sets `j` is used: ```{r filter-sd-expr} DT %>% copy %>% filter_sd(c("vs", "am"), .COL > 0, .expr = TRUE) %>% mutate(some_flag = TRUE) %>% { invisible(print(.)) } ``` ### Using keys or secondary indices The `filter_on` verb can be used to build an expression that specifies the `on` argument of the frame. It accepts key-value pairs where each key is a column in the data, and each value is the corresponding value that the column should have to match: ```{r filter-on} DT %>% start_expr %>% filter_on(cyl = 6, am = 0) %T>% print %>% end_expr DT %>% filter_on(cyl = 6, am = 0, mult = "first") ``` ## Modifying subset of data In order to support functionality similar to `data.table`'s `DT[, lapply(.SD, fun), .SDcols = c("...")]` syntax, 2 verbs are provided: `mutate_sd` and `transmute_sd`. Starting with `mutate_sd`, it modifies columns in `.SDcols` by reference, and columns that are not part of `.SDcols` are kept: ```{r mutate-sd} DT %>% mutate_sd(c("mpg", "cyl"), exp) print(DT) DT %>% mutate_sd(c("mpg", "cyl"), log) print(DT) ``` It would also be possible to remove several columns with `mutate_sd`: ```{r mutate-sd-rm} DT %>% copy %>% mutate_sd(c("mpg", "cyl"), NULL) %>% { invisible(print(.)) } ``` Additionally, `mutate_sd` supports the special `.COL` pronoun that symbolizes the column that should be modified, as well as lambdas as formulas: ```{r mutate-sd-col} # like dplyr's mutate_if DT %>% start_expr %>% mutate_sd(all(.COL %% 1 == 0), as.integer) DT %>% start_expr %>% mutate_sd(~ all(.x %% 1 == 0), as.integer) ``` On the other hand, `transmute_sd` never modifies by reference, and supports special expressions to "build" what is chosen as `.SDcols`. These expressions can use `tidyselect` helpers, as well as another special pronoun: `.COLNAME`: ```{r transmute-sd} DT %>% transmute_sd(starts_with("d"), .COL * 2) DT %>% transmute_sd(grepl("^d", .COLNAME), .COL * 2) # like dplyr's transmute_if DT %>% start_expr %>% transmute_sd(~ is.numeric(.x), ~ .x * 2) ``` Both verbs also support a list of functions for multiple transformations, but `mutate_sd` performs pairwise transformations by default, whereas `transmute_sd` performs all combinations: ```{r list-transform-sd} DT %>% copy %>% mutate_sd(contains("m"), .(round, -1L)) %>% { invisible(print(.)) } DT %>% transmute_sd(contains("m"), .(min, max, mean)) ``` ## Data manipulation by group Since `data.table` already supports this by means of its `by` parameter, the `group_by` verb simply parses its input and assigns it accordingly: ```{r group-by} DT %>% group_by(gear) %>% select(.N) ``` The `key_by` verb does the same but sets the key of the input in order to sort: ```{r key-by} DT %>% key_by(gear) %>% select(.N) ``` Using `group_by` or `key_by` by itself would not be useful. Therefore, they are entirely lazy, and simply return the expression builder that is used in the background. Using other verbs will normally lead to the evaluation of the expression, though some verbs can delay evaluation, e.g. `where` and other verbs that set `i` when their respective `.expr` parameters are set to `TRUE`. ## Distinct combinations of columns The `distinct` verb is a shortcut for: - `select(.SD[1])` with a `by` clause if `.keep = TRUE`. - Same as above plus specific `.SDcols` if `.keep` is a character vector. - `unique(.SD)` with specific `.SDcols` if `.keep = FALSE`. ```{r distinct} DT %>% distinct(vs, am) DT %>% distinct(vsam = vs + am, .keep = names(DT)) DT %>% distinct(vs, am, .keep = FALSE) ``` ## Lazy considerations When working lazily, i.e. with `start_expr`/`chain`/`end_expr`, there are special considerations that need to be taken into account. ### Nesting expressions Some `data.table` idioms benefit from nesting expressions, for instance [this idiom](https://stackoverflow.com/a/16574176/5793905) to find, by groups, rows where a column has its extrema. The helper function `nest_expr` can help in these cases. When used inside a supported verb, it [eagerly](#eager-verbs) builds a functional chain with `%>%`, where the currently captured `data.table` enters as `.` For example, the linked idiom could be expressed as: ```{r nest-expr} DT %>% start_expr(.verbose = TRUE) %>% where(nest_expr(.start = FALSE, { .[, .I[mpg == max(mpg)], by = vs]$V1 })) %>% end_expr ``` Nevertheless, this specific logic has been encapsulated in the `max_by`/`min_by` verbs. ### Automatic expression chaining By default, the verbs defined in this package automatically start a new frame whenever they want to define one of `i`/`j`/`by`, but the current expression's frame has already specified it; otherwise they add to the current frame. More complex expressions are thus supported by automatically chaining `data.table` frames: ```{r chaining} DT %>% start_expr %>% select(mean_mpg = mean(mpg)) %>% where(hp > 50L) %>% group_by(vs, am, gear) %>% order_by(gear, -vs, am) %>% filter(mean_mpg > 20) %T>% print %>% end_expr %>% { invisible(print(., nrows = 10L)) } ``` If we wanted to be explicit about chaining whenever possible (see [below](#explicit-chaining)), we could set `options(table.express.chain = FALSE)`, which would lead to a warning being shown whenever a part of the query is replaced. ### Explicit chaining The automatic chaining mentioned above is enough in most situations. For example the following chains lead to the same result, and therefore have the same semantics: ```{r two-chains-one-semantic} DT[mpg > 20, mpg * 2] DT[mpg > 20][, mpg * 2] ``` However, these two chains have different semantics: ```{r two-chains-two-semantics} DT[, .(mpg = mpg * 2)][mpg > 40] DT[mpg > 40, .(mpg = mpg * 2)] ``` As mentioned [above](#expression-delimiters), `chain` can be used to chain expressions by evaluating the current one with `end_expr`, and immediately capturing the resulting `data.table` to start building a new expression. This can be helpful in situations where automatic chaining (or lack thereof) can lead to a change in the expression's semantics: ```{r two-exprs-two-semantics} DT %>% start_expr %>% transmute(mpg = mpg * 2) %>% filter(mpg > 40) %T>% print %>% end_expr DT %>% start_expr %>% transmute(mpg = mpg * 2) %>% chain %>% filter(mpg > 40) %>% end_expr ``` ### Eager verbs In the following cases, the mentioned verbs use the captured `data.table` eagerly during expression building, even when working lazily: - `filter_sd` when `.SDcols` is a `tidyselect` call or a predicate using `.COL`. - `mutate_sd` when `.SDcols` is a `tidyselect` call or a predicate using `.COL`. - Using `nest_expr` in: - `anti_join`'s `y`. - `inner_join`'s `y`. - `right_join`'s `y`. - `filter` or `where`. - `max_by` and `min_by`, since they are based on `nest_expr`. This can lead to unexpected results if we don't keep in mind the expression that is built: ```{r implicit-chain-filter_sd} DT[, mpg:disp] DT %>% start_expr %>% select(mpg:disp) %>% transmute(ans = sqrt(mpg)) %>% filter_sd(contains("m"), .COL > 0) ``` The `select` gets rid of `am`, but `filter_sd` sees the columns of `DT` before any expression has been evaluated. Explicit chaining can help in these cases, capturing intermediate results: ```{r explicit-chain-filter_sd} DT %>% start_expr %>% select(mpg:disp) %>% chain %>% transmute(ans = sqrt(mpg)) %>% filter_sd(contains("m"), .COL > 0) ``` ## As simple as possible Many of the verbs in this package try to help make code more concise whenever possible. However, some of the added convenience requires helper functions that aid with non-standard evaluation, which doesn't always play nicely with the `data.table` optimizations mentioned in the beginning. Therefore, it is worth keeping the built expressions as close to traditional `data.table` syntax as possible. The verbs try to stick to this principle, but in order to build a simple output expression, they also require a simple input. Let us illustrate this point with `transmute_sd`, which can be used to compute summaries: ```{r asap-1} DT %>% start_expr %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), mean) %>% frame_append(verbose = TRUE) %T>% print %>% end_expr ``` We can see, for example, `lapply optimization changed j ...`. Changing the expressions just a bit, without affecting its semantics, still hinders `data.table`'s ability to recognize cases it can optimize: ```{r asap-2} DT %>% start_expr %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), mean(.COL)) %>% frame_append(verbose = TRUE) %T>% print %>% end_expr ``` Corroborated by the statements `j unchanged`, `GForce FALSE`, etc. Nevertheless, a different syntax can enable additional functionality, for instance column renaming: ```{r asap-3} DT %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), .(avg = mean(.COL), min = min(.COL))) ``` Thus, in order to decide which syntax to use, we must be aware of what can be optimized by `data.table`, but as a general rule of thumb, the simpler the better. ```{r asap-4} DT %>% start_expr %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), .(min, max)) %>% frame_append(verbose = TRUE) %T>% print %>% end_expr ```