Before tidyft, I’ve designed a package named tidyfst. Backed by data.table, it is fast and convenient. By then, I was not so interested in modification by reference, which always causes trouble in my workflow. Therefore, I use a lot of functions to make copies so as to suppress the in place replacement. However, when it comes to big data, simply making a new copy of the original data set could be time consuming and memory inefficient. So I tried to write some functions using the feature of modification by reference. This ends up in inconsistency of many functions in the tidyfst package. In the end, I removed all the in place replacement functions in tidyfst and build a new package instead. This is how tidyft comes into being.
You cannot step into the same river twice, for other waters are continually flowing on.
—— Heraclitus
If you try to do data operations on any data.table(s), never use it
again for futher analysis, because it is not the data you know before.
And you might never figure out what have happened and what has been
changed in that process. If you really want to use it again, try make a
copy first using copy()
, which might take extra time and
space (that’s why tidyft avoid doing this all the time).
Another rule is, tidyft only deals with data.table(s), the raw data.frame and other formats such as tibble could not work. If you already have lots of data.frames in the environment, try these codes.
library(tidyft)
#>
#> Life's short, use R.
#>
#> Attaching package: 'tidyft'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
# make copies
copy(iris) -> a
copy(mtcars) -> b
# before
class(a)
#> [1] "data.frame"
class(b)
#> [1] "data.frame"
# convert codes
lapply(ls(),get) %>%
lapply(setDT) %>%
invisible()
# after
class(a)
#> [1] "data.frame"
class(b)
#> [1] "data.frame"
One last thing, while modifications are carried out in place, doesn’t mean that the results could not be showed after operation. The data.table package would return it invisibly, but in tidyft, the final results are always printed if possible. This brings no reduction to the computation performance.
tidyft would not be so powerful without fst. I first introduce this workflow into tidyfst. In such workflow, you do not have to read all data into memory, only import the needed data when necessary. tidyft is not so convenient for in-memory operations, but it works very well (if not best) with the fst workflow. Here we’ll make some examples.
rm(list = ls())
library(tidyft)
# make a large data.frame
iris[rep(1:nrow(iris),1e4),] -> dt
# size: 1500000 rows, 5 columns
dim(dt)
#> [1] 1500000 5
# save as fst table
as_fst(dt) -> ft
# remove the data.frame from RAM
rm(dt)
# inspect the fst table of large iris
ft
#> <fst file>
#> 1500000 rows, 5 columns (dt98176154459.fst)
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <double> <double> <double> <double> <factor>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> -- -- -- -- -- --
#> 1499996 6.7 3.0 5.2 2.3 virginica
#> 1499997 6.3 2.5 5.0 1.9 virginica
#> 1499998 6.5 3.0 5.2 2.0 virginica
#> 1499999 6.2 3.4 5.4 2.3 virginica
#> 1500000 5.9 3.0 5.1 1.8 virginica
summary_fst(ft)
#> <fst file>
#> 1500000 rows, 5 columns (dt98176154459.fst)
#>
#> * 'Sepal.Length': double
#> * 'Sepal.Width' : double
#> * 'Petal.Length': double
#> * 'Petal.Width' : double
#> * 'Species' : factor
# list the variables in the environment
ls() # only the ft exists
#> [1] "ft"
The as_fst
could save any data.frame as “.fst” file in
temporary file and parse it back as fst table. Fst table is small in
RAM, but if you want to get any part of the data.frame, you can get it
in almost no time:
ft %>%
slice_fst(5555:6666) # get 5555 to 6666 row
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 5.0 3.6 1.4 0.2 setosa
#> 2: 5.4 3.9 1.7 0.4 setosa
#> 3: 4.6 3.4 1.4 0.3 setosa
#> 4: 5.0 3.4 1.5 0.2 setosa
#> 5: 4.4 2.9 1.4 0.2 setosa
#> ---
#> 1108: 5.9 3.0 4.2 1.5 versicolor
#> 1109: 6.0 2.2 4.0 1.0 versicolor
#> 1110: 6.1 2.9 4.7 1.4 versicolor
#> 1111: 5.6 2.9 3.6 1.3 versicolor
#> 1112: 6.7 3.1 4.4 1.4 versicolor
Except for slice_fst
, there are also other functions for
subsetting the data, such as
select_fst
,filter_fst
. Good practice is: Make
subsets of the data and use the least needy data to do operations. For
very large data sets, you may try to do tests on a sample of the data
(using slice
or select
to get several rows or
columns) first before you implement a huge operation. Now let’s do a
slightly complex manipulation. We’ll use sys_time_print
to
measure the running time.
sys_time_print({
res = ft %>%
select_fst(Species,Sepal.Length,Sepal.Width) %>%
rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
arrange(group,sl) %>%
filter(sl > 5) %>%
distinct(sl,.keep_all = TRUE) %>%
summarise(sw = max(sw),by = group)
})
#> [1] "Finished in 0.469s elapsed (0.615s cpu)"
res
#> group sw
#> <fctr> <num>
#> 1: setosa 4.4
#> 2: versicolor 3.3
#> 3: virginica 3.8
This should be pretty fast. Becasue when we use the data in fst table, we never get them until using the “_fst” suffix functions, so the tidyft functions never modify the data in the fst file or fst table. That is to say, we do not have to worry about the modification by reference any more. No copies made, fastest ever.
The fst workflow could also be working with other tools, though less efficient. Now let’s compare the performance of tidyft, data.table, dtplyr and dplyr.
rm(list = ls())
library(data.table)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:tidyft':
#>
#> add_count, anti_join, arrange, count, cummean, distinct, filter,
#> full_join, group_by, groups, inner_join, lag, lead, left_join,
#> mutate, nth, pull, relocate, rename, right_join, select,
#> select_vars, semi_join, slice, slice_head, slice_max, slice_min,
#> slice_sample, slice_tail, summarise, transmute, ungroup
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dtplyr)
library(tidyft)
# make a large data.frame
iris[rep(1:nrow(iris),1e4),] -> dt
# size: 1500000 rows, 5 columns
dim(dt)
#> [1] 1500000 5
# save as fst table
as_fst(dt) -> ft
# remove the data.frame from RAM
rm(dt)
bench::mark(
dplyr = ft %>%
select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>%
dplyr::select(-Petal.Length) %>%
dplyr::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
dplyr::arrange(group,sl) %>%
dplyr::filter(sl > 5) %>%
dplyr::distinct(sl,.keep_all = TRUE) %>%
dplyr::group_by(group) %>%
dplyr::summarise(sw = max(sw)),
dtplyr = ft %>%
select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>%
lazy_dt() %>%
dplyr::select(-Petal.Length) %>%
dplyr::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
dplyr::arrange(group,sl) %>%
dplyr::filter(sl > 5) %>%
dplyr::distinct(sl,.keep_all = TRUE) %>%
dplyr::group_by(group) %>%
dplyr::summarise(sw = max(sw)) %>%
as.data.table(),
data.table = ft[,c("Species","Sepal.Length","Sepal.Width","Petal.Length")] %>%
setDT() %>%
.[,.SD,.SDcols = -"Petal.Length"] %>%
setnames(old =c("Species","Sepal.Length","Sepal.Width"),
new = c("group","sl","sw")) %>%
setorder(group,sl) %>%
.[sl>5] %>% unique(by = "sl") %>%
.[,.(sw = max(sw)),by = group],
tidyft = ft %>%
tidyft::select_fst(Species,Sepal.Length,Sepal.Width,Petal.Length) %>%
tidyft::select(-Petal.Length) %>%
tidyft::rename(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
tidyft::arrange(group,sl) %>%
tidyft::filter(sl > 5) %>%
tidyft::distinct(sl,.keep_all = TRUE) %>%
tidyft::summarise(sw = max(sw),by = group),
check = setequal
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 127ms 182ms 5.94 180MB 19.8
#> 2 dtplyr 212ms 258ms 3.87 201MB 15.5
#> 3 data.table 106ms 123ms 7.30 130MB 11.0
#> 4 tidyft 114ms 122ms 7.29 102MB 9.12
Because tidyft is based on data.table, therefore, if you always use
data.table correctly, then tidyft should not perform better than
data.table (I do use some tricks, by never do column selection but
delete the unselected ones instead, which is faster and more memory
efficient than using .SDcols
in data.table). However,
tidyft has a very different syntax, which might be more readable. And
lots of complex operations of data.table has been wrapped in it. This
could save your day to write the correct codes sometimes. I hope all my
time devoted to this work could possibly save some of your valuable time
on data operations of big datasets.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dtplyr_1.3.1.9000 dplyr_1.1.4 data.table_1.16.99 fstcore_0.9.18
#> [5] tidyft_0.9.20 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_1.8.9 compiler_4.4.2 tidyselect_1.2.1
#> [4] Rcpp_1.0.13.5 stringr_1.5.1 parallel_4.4.2
#> [7] jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0
#> [10] R6_2.5.1 generics_0.1.3 knitr_1.49
#> [13] tibble_3.2.1 maketools_1.3.1 bslib_0.8.0
#> [16] pillar_1.9.0 rlang_1.1.4 utf8_1.2.4
#> [19] cachem_1.1.0 stringi_1.8.4.9001 xfun_0.49
#> [22] sass_0.4.9 sys_3.4.3 cli_3.6.3
#> [25] withr_3.0.2 magrittr_2.0.3.9000 digest_0.6.37
#> [28] fst_0.9.9 lifecycle_1.0.4 vctrs_0.6.5.9000
#> [31] bench_1.1.3 evaluate_1.0.1 glue_1.8.0
#> [34] buildtools_1.0.0 profmem_0.6.0 fansi_1.0.6
#> [37] tools_4.4.2 pkgconfig_2.0.3 htmltools_0.5.8.1