我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验,最近我自己在不断测试我的新包tidyfst,因此就拿这个data.table的案例来尝试一下。
本次测试,将不会加载data.table包,但是其实tidyfst里面无处不是data.table的元素,而且也导出了很多内置的data.table函数,比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。
library(tidyfst)
diamonds <- ggplot2::diamonds
n = 1e5 #如果想做工业级测试,可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
to = as.Date("2020-01-01"),
by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE) # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))
dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))
print(dat1)
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 2.23 Premium J VS2 61.0 58 14867 8.39 8.36
#> 2: 0.46 Ideal F SI2 61.5 54 758 4.98 5.01
#> 3: 0.70 Ideal H VVS1 60.9 57 3611 5.70 5.78
#> 4: 1.50 Premium F VS2 61.5 58 14719 7.34 7.32
#> 5: 0.40 Very Good G IF 61.0 59 1154 4.75 4.79
#> ---
#> 99996: 1.14 Ideal F VS2 61.4 56 8017 6.73 6.78
#> 99997: 0.39 Ideal F SI1 61.8 55 886 4.70 4.72
#> 99998: 2.00 Good F SI1 63.0 64 17869 7.67 7.76
#> 99999: 1.00 Premium H VS2 59.7 59 5139 6.57 6.52
#> 100000: 0.30 Ideal G VVS2 60.6 57 878 4.35 4.33
#> 2 variables not shown: [z <num>, dt <Date>]
后面的分析,经常要根据日期进行计算。所以,先对日期进行排序,就能够提高运行速度。在tidyfst中,可以使用arrange_dt
函数来对数据进行原位的各种操作,其中就包括排序。
dat1 = arrange_dt(dat1,dt)
dat1
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 0.34 Very Good H VS2 62.4 60.0 537 4.41 4.44
#> 2: 0.30 Good G VS1 63.6 57.0 776 4.26 4.23
#> 3: 1.12 Ideal F VVS2 61.4 57.0 9634 6.69 6.66
#> 4: 1.00 Premium E SI1 59.1 58.0 4805 6.52 6.44
#> 5: 0.50 Fair D VVS1 65.9 64.0 1792 4.92 5.03
#> ---
#> 99996: 0.32 Ideal J VS1 62.0 54.7 442 4.39 4.42
#> 99997: 0.33 Ideal G VS2 62.5 57.0 743 4.41 4.39
#> 99998: 0.28 Very Good I VS2 61.9 57.0 379 4.16 4.19
#> 99999: 0.30 Ideal G VS2 62.0 56.0 556 4.28 4.30
#> 100000: 0.32 Ideal F VS2 62.4 54.0 645 4.38 4.40
#> 2 variables not shown: [z <num>, dt <Date>]
那么,现在dat1的数据就按照日期排好序了。
在tidyfst中,我设置了一个sys_time_print
函数,可以方便地输出system.time()
函数返回的结果。
sys_time_print({
r1_1 <- dat1 %>%
summarise_dt(
by = .(cut,color),
mean_price = mean(price, na.rm = TRUE),
median_price = median(price, na.rm = TRUE),
max_price = max(price, na.rm = TRUE)
)
})
#> [1] "# Finished in 0.005s elapsed (0.003s cpu)"
r1_1
#> cut color mean_price median_price max_price
#> <ord> <ord> <num> <num> <int>
#> 1: Very Good H 4390.470 3394.0 18803
#> 2: Good G 3999.886 3303.0 18788
#> 3: Ideal F 3309.061 1751.0 18780
#> 4: Premium E 3481.628 1881.0 18426
#> 5: Fair D 4237.899 3205.0 16386
#> 6: Premium J 6416.235 5110.5 18706
#> 7: Premium F 4347.158 2874.0 18791
#> 8: Ideal D 2608.030 1550.5 18693
#> 9: Very Good G 3900.558 2403.0 18818
#> 10: Very Good E 3301.873 2051.0 18731
#> 11: Premium G 4543.181 2780.0 18741
#> 12: Ideal G 3751.035 1881.0 18806
#> 13: Ideal E 2558.501 1429.0 18729
#> 14: Premium H 5215.019 4451.0 18795
#> 15: Premium I 5921.498 4545.5 18823
#> 16: Very Good F 3792.863 2473.0 18777
#> 17: Good D 3369.352 2661.0 18468
#> 18: Fair G 4186.403 2797.0 18574
#> 19: Good E 3293.289 2407.0 18236
#> 20: Premium D 3796.609 2348.0 18286
#> 21: Ideal H 3924.682 2320.0 18659
#> 22: Very Good I 5475.587 4013.0 18500
#> 23: Ideal I 4434.030 2631.0 18779
#> 24: Fair J 5029.502 3422.0 18531
#> 25: Very Good D 3437.133 2242.0 18526
#> 26: Good F 3518.548 2755.0 18686
#> 27: Good I 4966.745 3484.0 18707
#> 28: Good H 4098.280 3323.0 18640
#> 29: Fair H 5454.118 4134.5 18308
#> 30: Fair F 3790.703 2961.0 17995
#> 31: Ideal J 4972.801 4135.0 18508
#> 32: Very Good J 5107.638 4126.5 18430
#> 33: Fair I 4278.831 3022.0 18242
#> 34: Good J 4578.490 3690.0 18325
#> 35: Fair E 3834.358 2996.5 15584
#> cut color mean_price median_price max_price
tidyfst是永远不可能比data.table快的,但是如果你觉得上面的代码更容易掌握、更容易读懂,而在日常工作中多花零点几秒的运行时间没有太大问题(实际上节省了大家的交流时间,甚至就是节省将来自己再次读懂自己代码的时间),tidyfst就值得拥有。
sys_time_print({
r1_2 <- dat1 %>%
arrange_dt(dt,-price) %>%
drop_na_dt(price) %>%
group_dt(
by = dt,
head(1)
)
})
#> [1] "# Finished in 0.028s elapsed (0.053s cpu)"
r1_2
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 2.10 Premium J VS2 58.3 54 12401 8.49
#> 2: 2011-01-02 2.02 Premium I VS2 59.0 59 17893 8.34
#> 3: 2011-01-03 1.61 Ideal F VS1 62.0 55 17414 7.55
#> 4: 2011-01-04 2.19 Premium E SI2 62.4 61 18232 8.31
#> 5: 2011-01-05 2.05 Premium H VS2 60.7 61 16235 8.25
#> ---
#> 3284: 2019-12-28 2.22 Ideal J SI2 62.8 57 13703 8.33
#> 3285: 2019-12-29 2.10 Very Good J VS2 61.3 59 14229 8.21
#> 3286: 2019-12-30 1.55 Very Good H VS2 63.3 56 10546 7.38
#> 3287: 2019-12-31 1.51 Very Good F VVS2 63.1 56 17317 7.32
#> 3288: 2020-01-01 2.04 Premium J VS2 63.0 58 13734 8.02
#> 2 variables not shown: [y <num>, z <num>]
实质上,merge函数已经优化得很好。tidyfst设计*_join
系列函数的时候,只是为了一种不一样的语法结构来帮助实现不同的连接,因为它确实更加直观一些。但是实质上它还是merge.data.table函数的包装版本。
sys_time_print({
r2_1 <- dat1 %>%
left_join_dt(dat2,by = "dt")
})
#> [1] "# Finished in 0.090s elapsed (0.107s cpu)"
r2_1
#> Key: <dt>
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 0.34 Very Good H VS2 62.4 60.0 537 4.41
#> 2: 2011-01-01 0.30 Good G VS1 63.6 57.0 776 4.26
#> 3: 2011-01-01 1.12 Ideal F VVS2 61.4 57.0 9634 6.69
#> 4: 2011-01-01 1.00 Premium E SI1 59.1 58.0 4805 6.52
#> 5: 2011-01-01 0.50 Fair D VVS1 65.9 64.0 1792 4.92
#> ---
#> 99996: 2020-01-01 0.32 Ideal J VS1 62.0 54.7 442 4.39
#> 99997: 2020-01-01 0.33 Ideal G VS2 62.5 57.0 743 4.41
#> 99998: 2020-01-01 0.28 Very Good I VS2 61.9 57.0 379 4.16
#> 99999: 2020-01-01 0.30 Ideal G VS2 62.0 56.0 556 4.28
#> 100000: 2020-01-01 0.32 Ideal F VS2 62.4 54.0 645 4.38
#> 3 variables not shown: [y <num>, z <num>, price1 <int>]
sys_time_print({
mymerge <- function(x, y) left_join_dt(x, y, by = "dt")
r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3))
})
#> [1] "# Finished in 0.012s elapsed (0.023s cpu)"
r2_2
#> Key: <dt>
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 0.34 Very Good H VS2 62.4 60.0 537 4.41
#> 2: 2011-01-01 0.30 Good G VS1 63.6 57.0 776 4.26
#> 3: 2011-01-01 1.12 Ideal F VVS2 61.4 57.0 9634 6.69
#> 4: 2011-01-01 1.00 Premium E SI1 59.1 58.0 4805 6.52
#> 5: 2011-01-01 0.50 Fair D VVS1 65.9 64.0 1792 4.92
#> ---
#> 99996: 2020-01-01 0.32 Ideal J VS1 62.0 54.7 442 4.39
#> 99997: 2020-01-01 0.33 Ideal G VS2 62.5 57.0 743 4.41
#> 99998: 2020-01-01 0.28 Very Good I VS2 61.9 57.0 379 4.16
#> 99999: 2020-01-01 0.30 Ideal G VS2 62.0 56.0 556 4.28
#> 100000: 2020-01-01 0.32 Ideal F VS2 62.4 54.0 645 4.38
#> 4 variables not shown: [y <num>, z <num>, price1 <int>, price2 <int>]
sys_time_print({
mean1 <- function(x) mean(x, na.rm = TRUE)
max1 <- function(x) max(x, na.rm = TRUE)
r3_1 <-dat1 %>%
wider_dt(cut,
value = c("depth", "price"),
name = "color",
fun = list(mean1,max1))
})
#> [1] "# Finished in 0.010s elapsed (0.014s cpu)"
r3_1
#> Key: <cut>
#> cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#> <ord> <num> <num> <num> <num>
#> 1: Fair 64.15688 63.09882 63.41129 64.33663
#> 2: Good 62.32176 62.16065 62.18220 62.60212
#> 3: Very Good 61.70454 61.73751 61.73138 61.83043
#> 4: Premium 61.18875 61.16850 61.25098 61.27894
#> 5: Ideal 61.67241 61.68693 61.65616 61.70585
#> 24 variables not shown: [depth_mean1_H <num>, depth_mean1_I <num>, depth_mean1_J <num>, price_mean1_D <num>, price_mean1_E <num>, price_mean1_F <num>, price_mean1_G <num>, price_mean1_H <num>, price_mean1_I <num>, price_mean1_J <num>, ...]
sys_time_print({
r3_2 <-dat1 %>%
select_dt(cut,color,x,y,z) %>%
longer_dt(cut,color,
name = "xyz",
value = "xyzvalue")
})
#> [1] "# Finished in 0.008s elapsed (0.004s cpu)"
r3_2
#> cut color xyz xyzvalue
#> <ord> <ord> <fctr> <num>
#> 1: Very Good H x 4.41
#> 2: Good G x 4.26
#> 3: Ideal F x 6.69
#> 4: Premium E x 6.52
#> 5: Fair D x 4.92
#> ---
#> 299996: Ideal J z 2.73
#> 299997: Ideal G z 2.75
#> 299998: Very Good I z 2.58
#> 299999: Ideal G z 2.66
#> 300000: Ideal F z 2.74
对于填充空值来说,可以这样操作:
sys_time_print({
dat1 %>% fill_na_dt(price) -> dat1
})
#> [1] "# Finished in 0.003s elapsed (0.004s cpu)"
dat1
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 0.34 Very Good H VS2 62.4 60.0 537 4.41 4.44
#> 2: 0.30 Good G VS1 63.6 57.0 776 4.26 4.23
#> 3: 1.12 Ideal F VVS2 61.4 57.0 9634 6.69 6.66
#> 4: 1.00 Premium E SI1 59.1 58.0 4805 6.52 6.44
#> 5: 0.50 Fair D VVS1 65.9 64.0 1792 4.92 5.03
#> ---
#> 99996: 0.32 Ideal J VS1 62.0 54.7 442 4.39 4.42
#> 99997: 0.33 Ideal G VS2 62.5 57.0 743 4.41 4.39
#> 99998: 0.28 Very Good I VS2 61.9 57.0 379 4.16 4.19
#> 99999: 0.30 Ideal G VS2 62.0 56.0 556 4.28 4.30
#> 100000: 0.32 Ideal F VS2 62.4 54.0 645 4.38 4.40
#> 2 variables not shown: [z <num>, dt <Date>]
sys_time_print({
mutate_dt(dat1,
mean_price = mean(price, na.rm = TRUE),
sd_price = sd(price, na.rm = TRUE),
by = .(cut, color))
})
#> [1] "# Finished in 0.012s elapsed (0.017s cpu)"
dat1
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 0.34 Very Good H VS2 62.4 60.0 537 4.41 4.44
#> 2: 0.30 Good G VS1 63.6 57.0 776 4.26 4.23
#> 3: 1.12 Ideal F VVS2 61.4 57.0 9634 6.69 6.66
#> 4: 1.00 Premium E SI1 59.1 58.0 4805 6.52 6.44
#> 5: 0.50 Fair D VVS1 65.9 64.0 1792 4.92 5.03
#> ---
#> 99996: 0.32 Ideal J VS1 62.0 54.7 442 4.39 4.42
#> 99997: 0.33 Ideal G VS2 62.5 57.0 743 4.41 4.39
#> 99998: 0.28 Very Good I VS2 61.9 57.0 379 4.16 4.19
#> 99999: 0.30 Ideal G VS2 62.0 56.0 556 4.28 4.30
#> 100000: 0.32 Ideal F VS2 62.4 54.0 645 4.38 4.40
#> 2 variables not shown: [z <num>, dt <Date>]
sys_time_print({
dat1 %>%
group_dt(
by = dt,
mutate_dt(id = seq(.N))
) -> dat1
})
#> [1] "# Finished in 1.330s elapsed (1.317s cpu)"
dat1
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 0.34 Very Good H VS2 62.4 60.0 537 4.41
#> 2: 2011-01-01 0.30 Good G VS1 63.6 57.0 776 4.26
#> 3: 2011-01-01 1.12 Ideal F VVS2 61.4 57.0 9634 6.69
#> 4: 2011-01-01 1.00 Premium E SI1 59.1 58.0 4805 6.52
#> 5: 2011-01-01 0.50 Fair D VVS1 65.9 64.0 1792 4.92
#> ---
#> 99996: 2020-01-01 0.32 Ideal J VS1 62.0 54.7 442 4.39
#> 99997: 2020-01-01 0.33 Ideal G VS2 62.5 57.0 743 4.41
#> 99998: 2020-01-01 0.28 Very Good I VS2 61.9 57.0 379 4.16
#> 99999: 2020-01-01 0.30 Ideal G VS2 62.0 56.0 556 4.28
#> 100000: 2020-01-01 0.32 Ideal F VS2 62.4 54.0 645 4.38
#> 3 variables not shown: [y <num>, z <num>, id <int>]
sys_time_print({
dat1 %>%
group_dt(
by = color,
mutate_dt(
MA10_price = frollmean(price, 10),
MSD10_price = frollapply(price, 10, FUN = sd)
)
) -> dat1
})
#> [1] "# Finished in 0.717s elapsed (0.764s cpu)"
dat1
#> color dt carat cut clarity depth table price x
#> <ord> <Date> <num> <ord> <ord> <num> <num> <int> <num>
#> 1: H 2011-01-01 0.34 Very Good VS2 62.4 60 537 4.41
#> 2: H 2011-01-01 1.20 Premium SI2 62.1 57 3965 6.84
#> 3: H 2011-01-01 1.51 Premium SI1 58.0 59 10063 7.57
#> 4: H 2011-01-02 1.00 Premium VS2 60.7 60 4930 6.55
#> 5: H 2011-01-02 1.01 Ideal SI2 61.1 56 4693 6.44
#> ---
#> 99996: I 2019-12-30 0.54 Ideal SI1 62.1 54 1057 5.24
#> 99997: I 2019-12-31 0.90 Very Good VS2 59.3 59 3992 6.27
#> 99998: I 2019-12-31 0.30 Ideal SI1 61.3 56 506 4.33
#> 99999: I 2020-01-01 1.22 Premium VS2 62.3 59 5832 6.79
#> 100000: I 2020-01-01 0.28 Very Good VS2 61.9 57 379 4.16
#> 5 variables not shown: [y <num>, z <num>, id <int>, MA10_price <num>, MSD10_price <num>]
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bench_1.1.3 dplyr_1.1.4 data.table_1.16.99 tidyfst_1.8.1
#> [5] rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 jsonlite_1.8.9 compiler_4.4.2
#> [4] tidyselect_1.2.1 Rcpp_1.0.13.6 stringr_1.5.1
#> [7] parallel_4.4.2 jquerylib_0.1.4 scales_1.3.0
#> [10] yaml_2.3.10 fastmap_1.2.0 ggplot2_3.5.1
#> [13] R6_2.5.1 generics_0.1.3 knitr_1.49
#> [16] tibble_3.2.1 maketools_1.3.1 munsell_0.5.1
#> [19] bslib_0.8.0 pillar_1.9.0 rlang_1.1.4
#> [22] utf8_1.2.4 cachem_1.1.0 stringi_1.8.4.9001
#> [25] xfun_0.49 sass_0.4.9 sys_3.4.3
#> [28] cli_3.6.3 withr_3.0.2 magrittr_2.0.3.9000
#> [31] grid_4.4.2 digest_0.6.37 fst_0.9.9
#> [34] lifecycle_1.0.4 vctrs_0.6.5.9000 evaluate_1.0.1
#> [37] glue_1.8.0 buildtools_1.0.0 colorspace_2.1-1
#> [40] profmem_0.6.0 fansi_1.0.6 fstcore_0.9.18
#> [43] tools_4.4.2 pkgconfig_2.0.3 htmltools_0.5.8.1