---
title: "Joining verbs for data.table"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Joining verbs for data.table}
  %\VignetteEngine{knitr::rmarkdown_notangle}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library("table.express")
library("data.table")

# knitr defaults
knitr::opts_chunk$set(comment = "#>",
                      error = TRUE)
```

Almost all joins between 2 `data.table`s use a notation where one of them is used as `i` in a frame applied to the other,
and the joining columns are specified with the `on` parameter.
However, in addition to the "basic" joins,
`data.table` allows for special cases like
rolling joins, summarizing while joining, non-equi joins, etc.
This vignette will describe the notation to apply these joins with verbs defined in `table.express`,
which, like the [single-table verbs](https://asardaes.github.io/table.express/articles/table.express.html),
build `data.table` expressions.

## Basic joins

We'll consider most of the `dplyr` joining verbs in this section:

- `inner_join`
- `left_join`
- `right_join`
- `anti_join`
- `semi_join`
- `full_join`

```{r basic-join-data}
A <- data.table::data.table(x = rep(c("b", "a", "c"), each = 3),
                            y = c(1, 3, 6),
                            v = 1:9)

B <- data.table::data.table(x = c("c", "b"),
                            v2 = 8:7,
                            foo = c(4, 2))

A
B
```

The methods defined in `table.express` accept the `on` part of the expression in their ellipsis:

```{r inner-join}
A %>%
    inner_join(B, x)

A %>%
    inner_join(B, x, v = v2)
```

An important thing to note in the second example above is the order in which the columns are given,
i.e. that `v` is written before `v2`,
since the order is relevant for `data.table`.
We can remember the correct order simply by looking at which `data.table` appears first in the expression,
and knowing that said `data.table`'s columns must appear first in the `on` expressions.
In this case, `A` appears before `B`,
so writing `v2 = v` would not work.

In order to maintain consistency in the ordering just described,
`left_join` automatically swaps expression elements internally in order to build the expression:

```{r left-join}
A %>%
    start_expr %>%
    left_join(B, x, v = v2) %T>%
    print %>%
    end_expr
```

We can also see an extra `.DT_0_` pronoun in the expression.
These special `.DT_*_` pronouns hold the different `data.table`s that have entered the expression so far,
and are automatically assigned to the evaluation's environment.
In this case, `.DT_` holds `B` and `.DT_0_` holds `A`.

No additional considerations are required to use `right_join` or `anti_join`:

```{r right-and-anti-join}
A %>%
    right_join(B, x, v = v2)

A %>%
    anti_join(B, x, v = v2)
```

A `semi_join` is essentially a `right_join` with `nomatch = NULL`,
and where `j` is set to `unique(.SD)`:

```{r semi-join}
setnames(B, "v2", "v")

A %>%
    semi_join(B, x)
```

Finally, `full_join` is basically a wrapper for `merge` specifying `all = TRUE`:

```{r full-join}
A %>%
    full_join(B, x)
```

### Expression chaining

Something to keep in mind is that most joins specify `i` and `on` inside the frame,
so any subsequent verbs that specify `j`, `by`, or `keyby` would still be possible.
In order to enable this,
many joining verbs have an `.expr` parameter that indicates that the expression should be kept,
delaying evaluation until a verb that sets `j` is used.
This can be useful if only a subset of the joined columns is desired:

```{r no-chain-after-join}
A %>%
    left_join(B, x, .expr = TRUE) %>%
    select(x, y, foo)
```

But, when working lazily,
this would require explicit chaining for expressions that should be applied to the joined table:

```{r chain-after-join}
A %>%
    start_expr %>%
    left_join(B, x) %>%
    chain %>%
    group_by(x) %>%
    mutate(y = cumsum(y)) %>%
    end_expr %>% {
        invisible(print(.))
    }
```

This is particularly important if the selection expressions call any function,
e.g. `tidyselect` helpers or even `:` with non-numerics,
because `data.table` does not support that kind of syntax for `j` in the same frame as a join.

## Mutating join

A special `data.table` idiom is described [here as an update join](https://stackoverflow.com/a/54313203/5793905).
In order to highlight the fact that it modifies the left-hand side table by reference,
the `mutate_join` verb is defined in `table.express`.
Said verb accepts the columns to be added in its  `.SDcols` parameter,
possibly with new names:

```{r basic-mutate-join}
A %>%
    mutate_join(B, x, .SDcols = c("foo", v_from_B = "v"))

print(A)
```

A particularity of this idiom is that the number of rows from the resulting join must match the left-hand side exactly or not at all,
so this won't work:

```{r invalid-mutate-join}
B %>%
    mutate_join(A, x, .SDcols = "y")
```

In these cases, we must either use `mult` if appropriate,
or specify a summarizing expression in `.SDcols`:

```{r summarizing-mutate-join}
B %>%
    mutate_join(A, x, .SDcols = "y", mult = "first")

print(B)

B %>%
    mutate_join(A, x, .SDcols = .(y = mean(y)))

print(B)
```

The last example specifies `by = .EACHI` in the joining expression.

## Rolling joins

```{r rolling-join-data, echo=FALSE}
website <- data.table::rbindlist(list(
    data.table::data.table(
        name = rep("Isabel", 5),
        session_start_time = as.POSIXct(c("2016-01-01 11:01",
                                          "2016-01-02 8:59",
                                          "2016-01-05 18:18",
                                          "2016-01-07 19:03",
                                          "2016-01-08 19:01"))
    ),
    data.table::data.table(
        name = "Sally",
        session_start_time = as.POSIXct("2016-01-03 10:00")
    ),
    data.table::data.table(
        name = rep("Francis", 6),
        session_start_time = as.POSIXct(c("2016-01-02 13:09",
                                          "2016-01-03 19:22",
                                          "2016-01-08 8:44",
                                          "2016-01-08 20:22",
                                          "2016-01-10 17:36",
                                          "2016-01-15 16:56"))
    ),
    data.table::data.table(
        name = rep("Erica", 2),
        session_start_time = as.POSIXct(c("2016-01-04 19:12",
                                          "2016-01-04 21:05"))
    ),
    data.table::data.table(
        name = rep("Vivian", 2),
        session_start_time = as.POSIXct(c("2016-01-01 9:10",
                                          "2016-01-09 2:15"))
    )
))

paypal <- data.table::rbindlist(list(
    data.table::data.table(
        name = "Isabel",
        purchase_time = as.POSIXct("2016-01-08 19:10")
    ),
    data.table::data.table(
        name = rep("Sally", 2),
        purchase_time = as.POSIXct(c("2016-01-03 10:06",
                                     "2016-01-03 10:15"))
    ),
    data.table::data.table(
        name = rep("Francis", 3),
        purchase_time = as.POSIXct(c("2016-01-03 19:28",
                                     "2016-01-08 20:33",
                                     "2016-01-10 17:46"))
    ),
    data.table::data.table(
        name = "Erica",
        purchase_time = as.POSIXct("2016-01-03 08:02")
    ),
    data.table::data.table(
        name = "Mom",
        purchase_time = as.POSIXct("2015-12-02 17:58")
    )
))

data.table::setkey(website, name, session_start_time)
data.table::setkey(paypal, name, purchase_time)

website[, session_id := .GRP, by = .(name, session_start_time)]
paypal[, payment_id := .GRP, by = .(name, purchase_time)]
```

A nice blog post describing rolling joins can be found at [R-bloggers](https://www.r-bloggers.com/2016/06/understanding-data-table-rolling-joins/),
so almost the same `website` and `paypal` tables will be used for the examples below.
Another short description with animated depictions can also be found [here](https://www.gormanalysis.com/blog/r-data-table-rolling-joins/).

```{r rolling-join-data-print}
print(website)
print(paypal)
```

In contrast to the blog post,
no `join_time` is added to the tables.
This is done on purpose in order to show what happens with the columns that are rolled.

Let's use a left rolling join to obtain the `session_id` that immediately preceded a purchase, if any:

```{r rolling-left-join}
paypal %>%
    left_join(website, name, purchase_time = session_start_time, roll = Inf)
```

We can see that the rows returned are from the left-hand side (`paypal`),
and since neither Mom nor Erica visited the website before their purchases,
their `session_id` ended as `NA`.

The order of the columns in the `on` expressions is the same as [above](#basic-joins).
The tricky part is that the rolled column ended up with the name from the right-hand side,
but keeping the values from the left-hand side.
If we "invert" the join,
the result is the same,
but the rolled column's name is now from the expression's left-hand side.

```{r rolling-right-join}
website %>%
    right_join(paypal, name, session_start_time = purchase_time, roll = Inf)
```

Note, however, that `roll` stayed equal to `Inf`.
This is because even though the column order in the expressions changed,
we could understand the rolling expressions as follows:

- For `left_join`, the rolling column on the left is `purchase_time`,
  so with `roll = Inf`,
  the values from `session_start_time` are rolled forward onto `purchase_time` to find a match while joining.
- For `right_join`, the rolling column on the right is `purchase_time`,
  so `roll` must stay as `Inf` to keep the same semantics.
  
Now let's say we want to keep all the rows from `website` and find the closest `payment_id` that occurred *after* the visit.
This could be expressed as:

```{r rolling-left-join-2}
website %>%
    left_join(paypal, name, session_start_time = purchase_time, roll = -Inf)
```

In order to simplify the meaning of `rollends` a bit,
we could think of it as missing or being a single `TRUE`/`FALSE`.
If it's missing, rolling works according to the value of `roll`,
otherwise:

- When `rollends = TRUE`, the value of `roll` is inverted *only* for those rows that would have no match otherwise.
- When `rollends = FALSE`, a matching roll will only occur if the column's value falls in a gap with values both before *and* after.

```{r rolling-left-join-rollend}
website %>%
    left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = TRUE)

website %>%
    left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = FALSE)
```

Vivian's `payment_id`s are always `NA` because she has never purchased anything.
On the other hand, no one except Francis has visited the website both before and after a purchase.

## Non-equi joins

Non-equi joins are similar to rolling joins,
but instead of rolling a single row's value,
they can return several values per row.

Using the same data as before,
we could find *all* the `session_id`s that preceded a `payment_id`,
giving "priority" to `paypal`'s rows:

```{r lne-join}
paypal %>%
    left_join(website, name, purchase_time >= session_start_time)
```

Priority above simply means that all rows from `paypal` are returned,
even if they don't have a match in `website`.
Even though a column `session_start_time` appears in the result,
the values contained therein are from `paypal`'s `purchase_time`.

A corresponding right non-equi join would yield the same result,
expecting only a different order in the columns that are part of the comparisons:

```{r rne-join}
website %>%
    right_join(paypal, name, session_start_time <= purchase_time)
```

## Self joins

In case a self join were necessary,
perhaps while using a rolling or non-equi join,
the way `magrittr`'s pipe handles the `.` outside of nested calls wouldn't allow calling a joining verb with `.` both in `x` and `y`.
To work around this,
the following verbs default to an eager self join when `y` is missing:

- `full_join`
- `left_join`
- `mutate_join`

As a somewhat contrived example,
we could add a rolling count of weekly visits per user to the `website` data introduced [above](#rolling-joins):

```{r self-join}
website %>%
    mutate(window_start = session_start_time - as.difftime(7, units = "days")) %>%
    mutate_join(, name, window_start <= session_start_time, session_start_time >= session_start_time,
                .SDcols = .(weekly_visits = .N),
                .by_each = TRUE) %>%
    mutate(window_start = NULL)

print(website)
```