---
title: "Example 1: Basic usage"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Example 1: Basic usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

# Use tidyfst just like dplyr

This part of the vignette is based on `dplyr`'s own vignette. We'll try to reproduce all of its results. First, load the needed packages.

```{r}
library(tidyfst)
library(nycflights13)
library(data.table)

data.table(flights)
```

## Filter rows with `filter_dt()`

```{r}
filter_dt(flights, month == 1 & day == 1)
```

Note that commas cannot be used to separate conditions, so `filter_dt(flights, month == 1, day == 1)` returns an error. Combine conditions with `&` instead.

## Arrange rows with `arrange_dt()`

```{r}
arrange_dt(flights, year, month, day)
```

Use `-` (the minus symbol) to sort a column in descending order:

```{r}
arrange_dt(flights, -arr_delay)
```

## Select columns with `select_dt()`

```{r}
select_dt(flights, year, month, day)
```

`select_dt(flights, year:day)` and `select_dt(flights, -(year:day))` are not supported. However, `select_dt()` can select columns with a regular expression:

```{r}
select_dt(flights, "^dep")
```

Renaming works almost the same way as in `dplyr`:

```{r}
select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)
```

## Add new columns with `mutate_dt()`

```{r}
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
```

However, if a new column depends on a column that was just created in the same call, split the steps into separate calls.
The following code would not work:

```{r,eval=FALSE}
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
```

Instead, use:

```{r}
mutate_dt(flights, gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
```

If you only want to keep the new variables, use `transmute_dt()`:

```{r}
transmute_dt(flights,
  gain = arr_delay - dep_delay
)
```

## Summarise values with `summarise_dt()`

```{r}
summarise_dt(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
```

## Randomly sample rows with `sample_n_dt()` and `sample_frac_dt()`

```{r}
sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)
```

## Grouped operations

For the following `dplyr` code:

```{r,eval=FALSE}
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
```

The grouped summary can be written as follows (the final `filter()` step can then be added by piping into `filter_dt(count > 20 & dist < 2000)`):

```{r}
flights %>%
  summarise_dt(
    count = .N,
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE),
    by = tailnum)
```

`summarise_dt()` (or `summarize_dt()`) has a `by` parameter that specifies the grouping variable(s).
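As a minimal sketch of how `by` behaves, the grouped summary can be tried on a small made-up table. The `toy` table, its values, and the thresholds in the last step are hypothetical and chosen only for illustration; they are not taken from `nycflights13`. The final `filter_dt()` call mirrors the `filter()` step of the `dplyr` code, with the two conditions joined by `&`.

```{r}
library(data.table)
library(tidyfst)

# hypothetical toy data: two planes with a few flights each
toy <- data.table(
  tailnum  = c("N1", "N1", "N2", "N2", "N2"),
  distance = c(100, 300, 500, 700, 900)
)

# one summary row per tailnum
by_plane <- toy %>%
  summarise_dt(count = .N, dist = mean(distance), by = tailnum)

# keep planes with more than 2 flights and a mean distance below 2000
frequent <- by_plane %>%
  filter_dt(count > 2 & dist < 2000)
```

Keeping the summary and the filter as two separate piped calls matches the pattern used throughout this vignette: each `*_dt()` call does one step and returns a data.table.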
We can find the number of planes and the number of flights that go to each possible destination:

```{r}
# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

summarise_dt(flights, planes = uniqueN(tailnum), flights = .N, by = dest) %>%
  arrange_dt(dest)
```

If you need to group by several variables, use:

```{r}
# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day <- summarise(daily, flights = n()))

flights %>%
  summarise_dt(by = .(year, month, day), flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))

flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights))

# (per_year <- summarise(per_month, flights = sum(flights)))

flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights)) %>%
  summarise_dt(by = .(year), flights = sum(flights))
```

# Comparison with data.table syntax

*tidyfst* provides a tidy syntax for *data.table*. By design, *tidyfst* never runs faster than the analogous *data.table* code. Nevertheless, it lets *dplyr* users benefit from that computational performance right away and guides them toward learning more about *data.table* for speed. Below, we compare the syntax of `tidyfst` and `data.table` (following [Introduction to data.table](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html)). This shows how the two differ and lets users choose whichever they prefer. Ideally, *tidyfst* will lead more users to learn about *data.table* and its wonderful features, and to design more extensions for *tidyfst* in the future.

## Data

Because we want a more stable data source, we'll use the flight data from the `nycflights13` package loaded above.
```{r}
library(tidyfst)
library(data.table)
library(nycflights13)

flights = data.table(flights) %>% na.omit()
```

## Subset rows

```{r}
# data.table
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)]

# tidyfst
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin, -dest)
```

## Select column(s)

```{r}
# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)
```

## Mixed computation

```{r}
# data.table
flights[, sum( (arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum( (arr_delay + dep_delay) < 0))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  nrow()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  count_dt()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(.N)
```

The examples above show that *data.table* special symbols such as `.N` can still be used in *tidyfst*.
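To make the point about `.N` concrete, here is a small sketch on a made-up table (the `toy` table and its values are hypothetical). It compares the plain *data.table* count with the same count obtained through `summarise_dt()`, and then uses `.N` together with `by` for per-group counts.

```{r}
library(data.table)
library(tidyfst)

# hypothetical toy data, not taken from nycflights13
toy <- data.table(
  origin = c("JFK", "JFK", "LGA", "JFK", "EWR"),
  month  = c(6L, 6L, 6L, 7L, 6L)
)

# data.table: number of rows matching the condition
n_dt <- toy[origin == "JFK" & month == 6L, .N]

# tidyfst: the same count, with .N used inside summarise_dt()
n_tf <- toy %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(n = .N)

# .N combined with `by` gives one count per group
by_origin <- toy %>%
  summarise_dt(n = .N, by = origin)
```

Naming the result (`n = .N`) keeps the output column name explicit, which makes the two results easier to compare.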
## Refer to columns by names

```{r}
# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year, month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay, -dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))
```

## Aggregations

```{r}
# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin, dest)
flights %>%
  filter_dt(carrier == "AA") %>%
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))
```

Note that `keyby` is currently not used in *tidyfst*. This feature might be included in the future for better performance in order-independent tasks. Moreover, `count_dt()` sorts its result by the count automatically; this can be controlled with the `sort` parameter.
```{r}
# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  count_dt(origin, dest, sort = FALSE) %>%
  arrange_dt(origin, -dest)
flights %>%
  summarise_dt(.N, by = .(dep_delay>0, arr_delay>0))
```

Now let's try a more complex example:

```{r}
# data.table
flights[carrier == "AA",
        lapply(.SD, mean),
        by = .(origin, dest, month),
        .SDcols = c("arr_delay", "dep_delay")]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean)
  )
```

Let me explain what happens here, especially in `group_dt`. First, the rows are filtered by the condition `carrier == "AA"`, then grouped by three variables: `origin`, `dest` and `month`. Finally, every column with "_delay" in its name is summarised by taking its mean. This is a very creative design: it utilizes `.SD` from *data.table* and upgrades the `group_by` function of *dplyr*, because you never need to `ungroup` now; just put the grouped operations inside `group_dt`. And **you can pipe in the group_dt function**. Let's play with it a little further:

```{r}
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean) %>%
      mutate_dt(sum = dep_delay + arr_delay)
  )
```

However, I don't recommend doing this unless you actually need it for the grouped computation (otherwise, just start another pipe after `group_dt`). Now let's end with some easy examples:

```{r}
# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>%
  group_dt(by = month, head(2))
```

Deep inside, *tidyfst* is born from *dplyr* and *data.table*, and uses *stringr* to build flexible APIs, so as to bring the strengths of each into full play.