--- title: "collapse and dplyr" subtitle: "Fast (Weighted) Aggregations and Transformations in a Piped Workflow" author: "Sebastian Krantz" date: "2021-01-04" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{collapse and dplyr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: cache: true --- This vignette focuses on the integration of *collapse* and the popular *dplyr* package by Hadley Wickham. In particular it will demonstrate how using *collapse*'s fast functions and some fast alternatives for *dplyr* verbs can substantially facilitate and speed up basic data manipulation, grouped and weighted aggregations and transformations, and panel data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a *dplyr* (piped) workflow. *** **Notes:** - This vignette is targeted at *dplyr* / *tidyverse* users. *collapse* is a standalone package and can be programmed efficiently without pipes or *dplyr* verbs. - The 'Introduction to *collapse*' vignette provides a thorough introduction to the package and a built-in structured documentation is available under `help("collapse-documentation")` after installing the package. In addition `help("collapse-package")` provides a compact set of examples for quick-start. - Documentation and vignettes can also be viewed [online](). *** ## 1. Fast Aggregations A key feature of *collapse* is it's broad set of *Fast Statistical Functions* (`fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct`) which are able to substantially speed-up column-wise, grouped and weighted computations on vectors, matrices or data frames. The functions are S3 generic, with a default (vector), matrix and data frame method, as well as a grouped_df method for grouped tibbles used by *dplyr*. The grouped tibble method has the following arguments: ```r FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...) ``` where `w` is a weight variable, and `TRA` and can be used to transform `x` using the computed statistics and one of 10 available transformations (`"replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%"`, discussed in section 2). `na.rm` efficiently removes missing values and is `TRUE` by default. `use.g.names` generates new row-names from the unique combinations of groups (default: disabled), whereas `keep.group_vars` (default: enabled) will keep the grouping columns as is custom in the native `data %>% group_by(...) %>% summarize(...)` workflow in *dplyr*. Finally, `keep.w` regulates whether a weighting variable used is also aggregated and saved in a column. For `fsum, fmean, fmedian, fnth, fvar, fsd` and `fmode` this will compute the sum of the weights in each group, whereas `fprod` returns the product of the weights. With that in mind, let's consider some straightforward applications. 
### 1.1 Simple Aggregations Consider the Groningen Growth and Development Center 10-Sector Database included in *collapse* and introduced in the main vignette: ```r library(collapse) head(GGDC10S) # Country Regioncode Region Variable Year AGR MIN MAN PU # 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA # 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA # 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA # 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA # 5 BWA SSA Sub-saharan Africa VA 1964 16.30154 3.494075 0.7365696 0.1043936 # 6 BWA SSA Sub-saharan Africa VA 1965 15.72700 2.495768 1.0181992 0.1350976 # CON WRT TRA FIRE GOV OTH SUM # 1 NA NA NA NA NA NA NA # 2 NA NA NA NA NA NA NA # 3 NA NA NA NA NA NA NA # 4 NA NA NA NA NA NA NA # 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229 # 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710 # Summarize the Data: # descr(GGDC10S, cols = is_categorical) # aperm(qsu(GGDC10S, ~Variable, cols = is.numeric)) # Efficiently converting to tibble (no deep copy) GGDC10S <- qTBL(GGDC10S) ``` Simple column-wise computations using the fast functions and pipe operators are performed as follows: ```r library(dplyr) GGDC10S %>% fnobs # Number of Observations # Country Regioncode Region Variable Year AGR MIN MAN PU # 5027 5027 5027 5027 5027 4364 4355 4355 4354 # CON WRT TRA FIRE GOV OTH SUM # 4355 4355 4355 4355 3482 4248 4364 GGDC10S %>% fndistinct # Number of distinct values # Country Regioncode Region Variable Year AGR MIN MAN PU # 43 6 6 2 67 4353 4224 4353 4237 # CON WRT TRA FIRE GOV OTH SUM # 4339 4344 4334 4349 3470 4238 4364 GGDC10S %>% select_at(6:16) %>% fmedian # Median # AGR MIN MAN PU CON WRT TRA FIRE GOV # 4394.5194 173.2234 3718.0981 167.9500 1473.4470 3773.6430 1174.8000 960.1251 3928.5127 # OTH SUM # 1433.1722 23186.1936 GGDC10S %>% select_at(6:16) %>% fmean # Mean # AGR MIN MAN PU CON WRT TRA FIRE GOV # 2526696.5 1867908.9 5538491.4 335679.5 1801597.6 3392909.5 1473269.7 1657114.8 1712300.3 # OTH SUM # 1684527.3 21566436.8 GGDC10S %>% fmode # Mode # Country Regioncode Region Variable Year # "USA" "ASI" "Asia" "EMP" "2010" # AGR MIN MAN PU CON # "171.315882316326" "0" "4645.12507642586" "0" "1.34623115930777" # WRT TRA FIRE GOV OTH # "21.8380052682527" "8.97743416914571" "40.0701608636442" "0" "3626.84423577048" # SUM # "37.4822945751317" GGDC10S %>% fmode(drop = FALSE) # Keep data structure intact # # A tibble: 1 × 16 # Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV # * # 1 USA ASI Asia EMP 2010 171. 0 4645. 0 1.35 21.8 8.98 40.1 0 # # ℹ 2 more variables: OTH , SUM ``` Moving on to grouped statistics, we can compute the average value added and employment by sector and country using: ```r GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean # # A tibble: 85 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 EMP ARG 1420. 52.1 1932. 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4 # 2 EMP BOL 964. 56.0 235. 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3 # 3 EMP BRA 17191. 206. 6991. 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4 # 4 EMP BWA 188. 10.5 18.1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2 # 5 EMP CHL 702. 101. 625. 2.94e1 2.96e2 6.95e2 2.58e2 272. NA 1.00e3 3.98e3 # 6 EMP CHN 287744. 7050. 67144. 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5 # 7 EMP COL 3091. 145. 1175. 3.39e1 5.24e2 2.07e3 4.70e2 649. NA 1.73e3 9.89e3 # 8 EMP CRI 231. 1.70 136. 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128. 
6.51e1 8.87e2 # 9 EMP DEW 2490. 407. 8473. 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945. 9.99e2 2.62e4 # 10 EMP DNK 236. 8.03 507. 1.38e1 1.71e2 4.55e2 1.61e2 181. 549. 1.11e2 2.39e3 # # ℹ 75 more rows ``` Similarly we can aggregate using any other of the above functions. It is important to not use *dplyr*'s `summarize` together with these functions since that would eliminate their speed gain. These functions are fast because they are executed only once and carry out the grouped computations in C++, whereas `summarize` will apply the function to each group in the grouped tibble. *** #### Excursus: What is Happening Behind the Scenes? To better explain this point it is perhaps good to shed some light on what is happening behind the scenes of *dplyr* and *collapse*. Fundamentally both packages follow different computing paradigms: *dplyr* is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of *dplyr*. When a data.frame is passed through *group_by*, a 'groups' attribute is attached: ```r GGDC10S %>% group_by(Variable, Country) %>% attr("groups") # # A tibble: 85 × 3 # Variable Country .rows # > # 1 EMP ARG [62] # 2 EMP BOL [61] # 3 EMP BRA [62] # 4 EMP BWA [52] # 5 EMP CHL [63] # 6 EMP CHN [62] # 7 EMP COL [61] # 8 EMP CRI [62] # 9 EMP DEW [61] # 10 EMP DNK [64] # # ℹ 75 more rows ``` This object is a data.frame giving the unique groups and in the third (last) column vectors containing the indices of the rows belonging to that group. A command like `summarize` uses this information to split the data.frame into groups which are then passed sequentially to the function used and later recombined. These steps are also done in C++ which makes *dplyr* quite efficient. Now *collapse* is based around one-pass grouped computations at the C++ level using its own grouped statistical functions. In other words the data is not split and recombined at all but the entire computation is performed in a single C++ loop running through that data and completing the computations for each group simultaneously. This modus operandi is also evident in *collapse* grouping objects. The method `GRP.grouped_df` takes a *dplyr* grouping object from a grouped tibble and efficiently converts it to a *collapse* grouping object: ```r GGDC10S %>% group_by(Variable, Country) %>% GRP %>% str # Class 'GRP' hidden list of 9 # $ N.groups : int 85 # $ group.id : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ... # $ group.sizes : int [1:85] 62 61 62 52 63 62 61 62 61 64 ... # $ groups :List of 2 # ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ... # .. ..- attr(*, "label")= chr "Variable" # .. ..- attr(*, "format.stata")= chr "%9s" # ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ... # .. ..- attr(*, "label")= chr "Country" # .. ..- attr(*, "format.stata")= chr "%9s" # $ group.vars : chr [1:2] "Variable" "Country" # $ ordered : Named logi [1:2] TRUE FALSE # ..- attr(*, "names")= chr [1:2] "ordered" "sorted" # $ order : NULL # $ group.starts: NULL # $ call : language GRP.grouped_df(X = .) ``` This object is a list where the first three elements give the number of groups, the group-id to which each row belongs and a vector of group-sizes. 
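To make this one-pass principle concrete, here is a rough base-R sketch of how the 'group.id' vector of such a grouping object can drive a grouped sum in a single loop over the data. This is an illustration only, not *collapse*'s actual C++ code; the paragraph below describes how `fsum` does the same thing in C++:

```r
# Base-R sketch of a one-pass grouped sum (collapse is already loaded above):
# a single loop over the data fills a result vector of length 'N.groups' --
# no splitting or recombining of the data is involved.
g <- GRP(GGDC10S, ~ Variable + Country)   # collapse grouping object, as shown above
x <- GGDC10S$SUM
res <- numeric(g$N.groups)
for (i in seq_along(x)) {
  if (!is.na(x[i])) res[g$group.id[i]] <- res[g$group.id[i]] + x[i]
}
# res now holds the grouped sums -- essentially what fsum(x, g) computes in optimized C++
```
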
A function like `fsum` uses this information to (for each column) create a result vector of size 'N.groups' and then run through the column using the 'group.id' vector, adding the i'th data point to the 'group.id[i]'th element of the result vector. When the loop is finished, the grouped computation is also finished. *collapse* is therefore faster than *dplyr*: its method of computing involves fewer steps, and it does not need to call statistical functions multiple times. See the benchmark section.

***

### 1.2 More Speed using *collapse* Verbs

*collapse* fast functions do not reach their maximal performance on a grouped tibble created with `group_by` because of the additional conversion cost of the grouping object incurred by `GRP.grouped_df`. This cost is already minimized through the use of C++, but we can do even better by replacing `group_by` with `collapse::fgroup_by`. `fgroup_by` works like `group_by` but does the grouping with `collapse::GRP` (up to 10x faster than `group_by`) and simply attaches a *collapse* grouping object to the grouped_df. Thus the speed gain is twofold: faster grouping and no conversion cost when calling *collapse* functions. Another improvement comes from replacing the *dplyr* verb `select` with `collapse::fselect`, and, for selection using column names, indices or functions, using `collapse::get_vars` instead of `select_at` or `select_if`. Next to `get_vars`, *collapse* also introduces the predicates `num_vars`, `cat_vars`, `char_vars`, `fact_vars`, `logi_vars` and `date_vars` to efficiently select columns by type.

```r
GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian
# # A tibble: 85 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
#
# 1 EMP ARG 1325. 47.4 1988. 1.05e2 7.82e2 1.85e3 5.80e2 464. 1739. 866. 9.74e3
# 2 EMP BOL 943. 53.5 167. 4.46e0 6.60e1 1.32e2 9.70e1 15.3 NA 384. 1.84e3
# 3 EMP BRA 17481. 225. 7208. 3.76e2 4.05e3 6.45e3 1.58e3 4355. 4450. 4479. 5.19e4
# 4 EMP BWA 175. 12.2 13.1 3.71e0 1.90e1 2.11e1 6.75e0 10.4 53.8 31.2 3.61e2
# 5 EMP CHL 690. 93.9 607. 2.58e1 2.30e2 4.84e2 2.05e2 106. NA 900. 3.31e3
# 6 EMP CHN 293915 8150. 61761. 1.14e3 1.06e4 1.70e4 9.56e3 4328. 19468. 9954. 4.45e5
# 7 EMP COL 3006. 84.0 1033. 3.71e1 4.19e2 1.55e3 3.91e2 655. NA 1430. 8.63e3
# 8 EMP CRI 216. 1.49 114. 7.92e0 5.50e1 8.98e1 2.55e1 19.6 122. 60.6 7.19e2
# 9 EMP DEW 2178 320. 8459. 2.47e2 2.10e3 4.45e3 1.53e3 1656 3700 900 2.65e4
# 10 EMP DNK 187. 3.75 508. 1.36e1 1.65e2 4.61e2 1.61e2 169. 642. 104. 2.42e3
# # ℹ 75 more rows

library(microbenchmark)
microbenchmark(collapse = GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian,
               hybrid = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmedian,
               dplyr = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% summarise_all(median, na.rm = TRUE))
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 236.406 263.6095 303.309 295.9175 337.061 419.635 100
# hybrid 2699.317 2894.9690 3573.611 2998.3505 3119.772 56249.212 100
# dplyr 15923.908 16297.8280 18810.943 16742.5140 18578.105 71125.939 100
```

Benchmarks on the different components of this code and with larger data are provided under 'Benchmarks'. Note that a grouped tibble created with `fgroup_by` can no longer be used for grouped computations with *dplyr* verbs like `mutate` or `summarize`.
`fgroup_by` first assigns the class *GRP_df*, which handles printing of grouping information and subsetting, then the object classes (*tbl_df*, *data.table* or whatever else), followed by classes *grouped_df* and *data.frame*, and adds the grouping object in a 'groups' attribute. Since *tbl_df* is assigned before *grouped_df*, the object is treated by the *dplyr* ecosystem like a normal tibble.

```r
class(group_by(GGDC10S, Variable, Country))
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
class(fgroup_by(GGDC10S, Variable, Country))
# [1] "GRP_df" "tbl_df" "tbl" "grouped_df" "data.frame"
```

The function `fungroup` removes classes 'GRP_df' and 'grouped_df' and the 'groups' attribute (and can thus also be used for grouped tibbles created with `dplyr::group_by`). Note that any kind of data frame based class can be grouped with `fgroup_by` and still retains full responsiveness to all methods defined for that class. Functions performing aggregation on the grouped data frame remove the grouping object and classes afterwards, yielding an object with the same class and attributes as the input. The print method shown below reports the grouping variables and then, in square brackets, the number of groups, the average group size (with the standard deviation of group sizes in parentheses) and the range of group sizes:

```r
fgroup_by(GGDC10S, Variable, Country)
# # A tibble: 5,027 × 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
#
# 1 BWA SSA Sub-s… VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA SSA Sub-s… VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA SSA Sub-s… VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA SSA Sub-s… VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA SSA Sub-s… VA 1964 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82
# 6 BWA SSA Sub-s… VA 1965 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70
# 7 BWA SSA Sub-s… VA 1966 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37
# 8 BWA SSA Sub-s… VA 1967 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04
# 9 BWA SSA Sub-s… VA 1968 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03
# 10 BWA SSA Sub-s… VA 1969 21.9 5.24 2.14 0.578 3.47 5.75 2.72 1.23 5.59
# # ℹ 5,017 more rows
# # ℹ 2 more variables: OTH , SUM
# # Grouped by: Variable, Country [85 | 59 (7.7) 4-65]
```

Note further that `fselect` and `get_vars` are not full drop-in replacements for `select` because they do not have a grouped_df method:

```r
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% tail(3)
# # A tibble: 3 × 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
#
# 1 EMP EGY 5206. 29.0 2436. 307. 2733. 2977. 1992. 801. 5539. NA 22020.
# 2 EMP EGY 5186. 27.6 2374. 318. 2795. 3020. 2048. 815. 5636. NA 22219.
# 3 EMP EGY 5161. 24.8 2348. 325. 2931. 3110. 2065. 832. 5736. NA 22533.

GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% tail(3)
# # A tibble: 3 × 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
#
# 1 5206. 29.0 2436. 307. 2733. 2977. 1992. 801. 5539. NA 22020.
# 2 5186. 27.6 2374. 318. 2795. 3020. 2048. 815. 5636. NA 22219.
# 3 5161. 24.8 2348. 325. 2931. 3110. 2065. 832. 5736. NA 22533.
```

Since by default `keep.group_vars = TRUE` in the *Fast Statistical Functions*, the end result is nevertheless the same:

```r
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean %>% tail(3)
# # A tibble: 3 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
#
# 1 VA VEN 6860. 35478. 1.96e4 1.06e3 1.17e4 1.93e4 8.03e3 5.60e3 NA 19986. 1.28e5
# 2 VA ZAF 16419. 42928. 8.76e4 1.38e4 1.64e4 6.83e4 4.53e4 6.64e4 7.58e4 30167.
4.63e5 # 3 VA ZMB 1268849. 1006099. 9.00e5 2.19e5 8.66e5 2.10e6 7.05e5 9.10e5 1.10e6 81871. 9.16e6 GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% tail(3) # # A tibble: 3 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 VA VEN 6860. 35478. 1.96e4 1.06e3 1.17e4 1.93e4 8.03e3 5.60e3 NA 19986. 1.28e5 # 2 VA ZAF 16419. 42928. 8.76e4 1.38e4 1.64e4 6.83e4 4.53e4 6.64e4 7.58e4 30167. 4.63e5 # 3 VA ZMB 1268849. 1006099. 9.00e5 2.19e5 8.66e5 2.10e6 7.05e5 9.10e5 1.10e6 81871. 9.16e6 ``` Another useful verb introduced by *collapse* is `fgroup_vars`, which can be used to efficiently obtain the grouping columns or grouping variables from a grouped tibble: ```r # fgroup_by fully supports grouped tibbles created with group_by or fgroup_by: GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars %>% head(3) # # A tibble: 3 × 2 # Variable Country # # 1 VA BWA # 2 VA BWA # 3 VA BWA GGDC10S %>% fgroup_by(Variable, Country) %>% fgroup_vars %>% head(3) # # A tibble: 3 × 2 # Variable Country # # 1 VA BWA # 2 VA BWA # 3 VA BWA # The other possibilities: GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("unique") %>% head(3) # # A tibble: 3 × 2 # Variable Country # # 1 EMP ARG # 2 EMP BOL # 3 EMP BRA GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("names") # [1] "Variable" "Country" GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("indices") # [1] 4 1 GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_indices") # Variable Country # 4 1 GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("logical") # [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_logical") # Country Regioncode Region Variable Year AGR MIN MAN PU # TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE # CON WRT TRA FIRE GOV OTH SUM # FALSE FALSE FALSE FALSE FALSE FALSE FALSE ``` Another *collapse* verb to mention here is `fsubset`, a faster alternative to `dplyr::filter` which also provides an option to flexibly subset columns after the select argument: ```r # Two equivalent calls, the first is substantially faster GGDC10S %>% fsubset(Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(3) # # A tibble: 3 × 11 # Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 BWA 1991 303. 2647. 473. 161. 580. 807. 233. 433. 1073. # 2 BWA 1992 333. 2691. 537. 178. 679. 725. 285. 517. 1234. # 3 BWA 1993 405. 2625. 567. 219. 634. 772. 350. 673. 1487. GGDC10S %>% filter(Variable == "VA" & Year > 1990) %>% select(Country, Year, AGR:GOV) %>% head(3) # # A tibble: 3 × 11 # Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 BWA 1991 303. 2647. 473. 161. 580. 807. 233. 433. 1073. # 2 BWA 1992 333. 2691. 537. 178. 679. 725. 285. 517. 1234. # 3 BWA 1993 405. 2625. 567. 219. 634. 772. 350. 673. 1487. ``` *collapse* also offers `roworder`, `frename`, `colorder` and `ftransform`/`TRA` as fast replacements for `dplyr::arrange`, `dplyr::rename`, `dplyr::relocate` and `dplyr::mutate`. ### 1.3 Multi-Function Aggregations One can also aggregate with multiple functions at the same time. 
For such operations it is often necessary to use curly braces `{` to prevent first argument injection so that `%>% cbind(FUN1(.), FUN2(.))` does not evaluate as `%>% cbind(., FUN1(.), FUN2(.))`: ```r GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% { cbind(fmedian(.), add_stub(fmean(., keep.group_vars = FALSE), "mean_")) } %>% head(3) # Variable Country AGR MIN MAN PU CON WRT TRA # 1 EMP ARG 1324.5255 47.35255 1987.5912 104.738825 782.40283 1854.612 579.93982 # 2 EMP BOL 943.1612 53.53538 167.1502 4.457895 65.97904 132.225 96.96828 # 3 EMP BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120 # FIRE GOV OTH SUM mean_AGR mean_MIN mean_MAN mean_PU mean_CON # 1 464.39920 1738.836 866.1119 9743.223 1419.8013 52.08903 1931.7602 101.720936 742.4044 # 2 15.34259 NA 384.0678 1842.055 964.2103 56.03295 235.0332 5.346433 122.7827 # 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384 # mean_WRT mean_TRA mean_FIRE mean_GOV mean_OTH mean_SUM # 1 1982.1775 648.5119 627.79291 2043.471 992.4475 10542.177 # 2 281.5164 115.4728 44.56442 NA 395.5650 2220.524 # 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985 ``` The function `add_stub` used above is a *collapse* function adding a prefix (default) or suffix to variables names. The *collapse* predicate `add_vars` provides a more efficient alternative to `cbind.data.frame`. The idea here is 'adding' variables to the data.frame in the first argument i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame: ```r GGDC10S %>% fgroup_by(Variable, Country) %>% { add_vars(get_vars(., "Reg", regex = TRUE) %>% ffirst, # Regular expression matching column names num_vars(.) %>% fmean(keep.group_vars = FALSE) %>% add_stub("mean_"), # num_vars selects all numeric variables fselect(., PU:TRA) %>% fmedian(keep.group_vars = FALSE) %>% add_stub("median_"), fselect(., PU:CON) %>% fmin(keep.group_vars = FALSE) %>% add_stub("min_")) } %>% head(3) # # A tibble: 3 × 22 # Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT # # 1 EMP ARG LAM Latin … 1980. 1420. 52.1 1932. 102. 742. 1982. # 2 EMP BOL LAM Latin … 1980 964. 56.0 235. 5.35 123. 282. # 3 EMP BRA LAM Latin … 1980. 17191. 206. 6991. 365. 3525. 8509. # # ℹ 11 more variables: mean_TRA , mean_FIRE , mean_GOV , mean_OTH , # # mean_SUM , median_PU , median_CON , median_WRT , median_TRA , # # min_PU , min_CON ``` Another nice feature of `add_vars` is that it can also very efficiently reorder columns i.e. bind columns in a different order than they are passed. This can be done by simply specifying the positions the added columns should have in the final data frame, and then `add_vars` shifts the first argument columns to the right to fill in the gaps. ```r GGDC10S %>% fsubset(Variable == "VA", Country, AGR, SUM) %>% fgroup_by(Country) %>% { add_vars(fgroup_vars(.,"unique"), fmean(., keep.group_vars = FALSE) %>% add_stub("mean_"), fsd(., keep.group_vars = FALSE) %>% add_stub("sd_"), pos = c(2,4,3,5)) } %>% head(3) # # A tibble: 3 × 5 # Country mean_AGR sd_AGR mean_SUM sd_SUM # # 1 ARG 14951. 33061. 152534. 301316. # 2 BOL 3300. 4456. 22619. 33173. # 3 BRA 76870. 59442. 1200563. 976963. 
```

A much more compact solution to multi-function and multi-type aggregation is offered by the function `collapg`:

```r
# This aggregates numeric columns using the mean (fmean) and categorical columns with the mode (fmode)
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg %>% head(3)
# # A tibble: 3 × 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE GOV
#
# 1 EMP ARG LAM Latin … 1980. 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043.
# 2 EMP BOL LAM Latin … 1980 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA
# 3 EMP BRA LAM Latin … 1980. 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307.
# # ℹ 2 more variables: OTH , SUM
```

By default it aggregates numeric columns using `fmean` and categorical columns using `fmode`, and preserves the order of all columns. Changing these defaults is very easy:

```r
# This aggregates numeric columns using the median and categorical columns using the last value
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(fmedian, flast) %>% head(3)
# # A tibble: 3 × 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
#
# 1 EMP ARG LAM Latin Amer… 1980. 1325. 47.4 1988. 105. 782. 1855. 580. 464.
# 2 EMP BOL LAM Latin Amer… 1980 943. 53.5 167. 4.46 66.0 132. 97.0 15.3
# 3 EMP BRA LAM Latin Amer… 1980. 17481. 225. 7208. 376. 4055. 6455. 1581. 4355.
# # ℹ 3 more variables: GOV , OTH , SUM
```

One can also apply multiple functions to the numeric and/or categorical data:

```r
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(list(fmean, fmedian), list(first, fmode, flast)) %>% head(3)
# # A tibble: 3 × 32
# Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
#
# 1 EMP ARG LAM LAM LAM Latin America Latin America
# 2 EMP BOL LAM LAM LAM Latin America Latin America
# 3 EMP BRA LAM LAM LAM Latin America Latin America
# # ℹ 25 more variables: flast.Region , fmean.Year , fmedian.Year , fmean.AGR ,
# # fmedian.AGR , fmean.MIN , fmedian.MIN , fmean.MAN , fmedian.MAN ,
# # fmean.PU , fmedian.PU , fmean.CON , fmedian.CON , fmean.WRT ,
# # fmedian.WRT , fmean.TRA , fmedian.TRA , fmean.FIRE , fmedian.FIRE ,
# # fmean.GOV , fmedian.GOV , fmean.OTH , fmedian.OTH , fmean.SUM ,
# # fmedian.SUM
```

Applying multiple functions to only numeric (or only categorical) data allows the result to be returned in a long format:

```r
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(list(fmean, fmedian), cols = is.numeric, return = "long") %>% head(3)
# # A tibble: 3 × 15
# Function Variable Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
#
# 1 fmean EMP ARG 1980. 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992.
# 2 fmean EMP BOL 1980 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396.
# 3 fmean EMP BRA 1980. 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710.
# # ℹ 1 more variable: SUM
```

Finally, `collapg` also makes it very easy to apply aggregator functions to certain columns only:

```r
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(custom = list(fmean = 6:8, fmedian = 10:12)) %>% head(3)
# # A tibble: 3 × 8
# Variable Country AGR MIN MAN CON WRT TRA
#
# 1 EMP ARG 1420. 52.1 1932. 782. 1855. 580.
# 2 EMP BOL 964. 56.0 235. 66.0 132. 97.0
# 3 EMP BRA 17191. 206. 6991. 4055. 6455. 1581.
```

To understand more about `collapg`, look it up in the documentation (`?collapg`).

### 1.4 Weighted Aggregations

Weighted aggregations are possible with the functions `fsum, fprod, fmean, fmedian, fnth, fmode, fvar` and `fsd`.
The implementation is such that by default (option `keep.w = TRUE`) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. `fprod` saves the product of the weights, whereas the other functions save the sum of the weights in a column next to the grouping variables. If `na.rm = TRUE` (the default), rows with missing weights are omitted from the computation. ```r # This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(AGR:SUM) %>% fsd(SUM) %>% head(3) # # A tibble: 3 × 13 # Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH # # 1 EMP ARG 653615. 225. 22.2 176. 20.5 285. 856. 195. 493. 1123. 506. # 2 EMP BOL 135452. 99.7 17.1 168. 4.87 123. 324. 98.1 69.8 NA 258. # 3 EMP BRA 3364925. 1587. 73.8 2952. 93.8 1861. 6285. 1306. 3003. 3621. 4257. # This computes a weighted grouped mode, taking the total EMP / VA as weight GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(AGR:SUM) %>% fmode(SUM) %>% head(3) # # A tibble: 3 × 13 # Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH # # 1 EMP ARG 653615. 1162. 127. 2164. 152. 1415. 3768. 1060. 1748. 4336. 1999. # 2 EMP BOL 135452. 819. 37.6 604. 10.8 433. 893. 333. 321. NA 1057. # 3 EMP BRA 3364925. 16451. 313. 11841. 388. 8154. 21860. 5169. 12011. 12149. 14235. ``` The weighted variance / standard deviation is currently only implemented with frequency weights. Weighted aggregations may also be performed with `collapg`. By default `fsum` is used to compute a sum of the weights, but it is also possible here to aggregate the weights with other functions: ```r # This aggregates numeric colums using the weighted mean (the default) and categorical columns using the weighted mode (the default). # Weights (column SUM) are aggregated using both the sum and the maximum. GGDC10S %>% group_by(Variable, Country) %>% collapg(w = SUM, wFUN = list(fsum, fmax)) %>% head(3) # # A tibble: 3 × 17 # Variable Country fsum.SUM fmax.SUM Regioncode Region Year AGR MIN MAN PU CON WRT # # 1 EMP ARG 653615. 17929. LAM Latin … 1985. 1361. 56.5 1935. 105. 811. 2217. # 2 EMP BOL 135452. 4508. LAM Latin … 1987. 977. 57.9 296. 7.07 167. 400. # 3 EMP BRA 3364925. 102572. LAM Latin … 1989. 17746. 238. 8466. 389. 4436. 11376. # # ℹ 4 more variables: TRA , FIRE , GOV , OTH ``` ## 2. Fast Transformations *collapse* also provides some fast transformations that significantly extend the scope and speed of manipulations that can be performed with `dplyr::mutate`. ### 2.1 Fast Transform and Compute Variables The function `ftransform` can be used to manipulate columns in the same ways as `mutate`: ```r GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>% ftransform(AGR_perc = AGR / SUM * 100, # Computing % of VA in Agriculture AGR_mean = fmean(AGR), # Average Agricultural VA AGR = NULL, SUM = NULL) %>% # Deleting columns AGR and SUM head # # A tibble: 6 × 4 # Country Year AGR_perc AGR_mean # # 1 BWA 1960 NA 5137561. # 2 BWA 1961 NA 5137561. # 3 BWA 1962 NA 5137561. # 4 BWA 1963 NA 5137561. # 5 BWA 1964 43.5 5137561. # 6 BWA 1965 40.0 5137561. 
``` The modification brought by `ftransformv` enables transformations of groups of columns like `dplyr::mutate_at` and `dplyr::mutate_if`: ```r # This replaces variables mpg, carb and wt by their log (.c turns expressions into character vectors) mtcars %>% ftransformv(.c(mpg, carb, wt), log) %>% head # mpg cyl disp hp drat wt qsec vs am gear carb # Mazda RX4 3.044522 6 160 110 3.90 0.9631743 16.46 0 1 4 1.3862944 # Mazda RX4 Wag 3.044522 6 160 110 3.90 1.0560527 17.02 0 1 4 1.3862944 # Datsun 710 3.126761 4 108 93 3.85 0.8415672 18.61 1 1 4 0.0000000 # Hornet 4 Drive 3.063391 6 258 110 3.08 1.1678274 19.44 1 0 3 0.0000000 # Hornet Sportabout 2.928524 8 360 175 3.15 1.2354715 17.02 0 0 3 0.6931472 # Valiant 2.895912 6 225 105 2.76 1.2412686 20.22 1 0 3 0.0000000 # Logging numeric variables iris %>% ftransformv(is.numeric, log) %>% head # Sepal.Length Sepal.Width Petal.Length Petal.Width Species # 1 1.629241 1.252763 0.3364722 -1.6094379 setosa # 2 1.589235 1.098612 0.3364722 -1.6094379 setosa # 3 1.547563 1.163151 0.2623643 -1.6094379 setosa # 4 1.526056 1.131402 0.4054651 -1.6094379 setosa # 5 1.609438 1.280934 0.3364722 -1.6094379 setosa # 6 1.686399 1.360977 0.5306283 -0.9162907 setosa ``` Instead of `column = value` type arguments, it is also possible to pass a single list of transformed variables to `ftransform`, which will be regarded in the same way as an evaluated list of `column = value` arguments. It can be used for more complex transformations: ```r # Logging values and replacing generated Inf values mtcars %>% ftransform(fselect(., mpg, cyl, vs:gear) %>% lapply(log) %>% replace_Inf) %>% head # mpg cyl disp hp drat wt qsec vs am gear carb # Mazda RX4 3.044522 1.791759 160 110 3.90 2.620 16.46 NA 0 1.386294 4 # Mazda RX4 Wag 3.044522 1.791759 160 110 3.90 2.875 17.02 NA 0 1.386294 4 # Datsun 710 3.126761 1.386294 108 93 3.85 2.320 18.61 0 0 1.386294 1 # Hornet 4 Drive 3.063391 1.791759 258 110 3.08 3.215 19.44 0 NA 1.098612 1 # Hornet Sportabout 2.928524 2.079442 360 175 3.15 3.440 17.02 NA NA 1.098612 2 # Valiant 2.895912 1.791759 225 105 2.76 3.460 20.22 0 NA 1.098612 1 ``` If only the computed columns need to be returned, `fcompute` provides an efficient alternative: ```r GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>% fcompute(AGR_perc = AGR / SUM * 100, AGR_mean = fmean(AGR)) %>% head # # A tibble: 6 × 2 # AGR_perc AGR_mean # # 1 NA 5137561. # 2 NA 5137561. # 3 NA 5137561. # 4 NA 5137561. # 5 43.5 5137561. # 6 40.0 5137561. ``` `ftransform` and `fcompute` are an order of magnitude faster than `mutate`, but they do not support grouped computations using arbitrary functions. We will see that this is hardly a limitation as *collapse* provides very efficient and elegant alternative programming mechanisms... ### 2.2 Replacing and Sweeping out Statistics All statistical (scalar-valued) functions in the collapse package (`fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct`) have a `TRA` argument which can be used to efficiently transform data by either (column-wise) replacing data values with computed statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or quoted operator / string. 
The 10 operations supported by `TRA` are:

* 1 - "replace_fill" : replace and overwrite missing values (same as `mutate`)
* 2 - "replace" : replace but preserve missing values
* 3 - "-" : subtract (center)
* 4 - "-+" : subtract group-statistics but add the average of the group statistics
* 5 - "/" : divide (scale)
* 6 - "%" : compute percentages (divide and multiply by 100)
* 7 - "+" : add
* 8 - "*" : multiply
* 9 - "%%" : modulus
* 10 - "-%%" : subtract modulus

Simple transformations are again straightforward to specify:

```r
# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-") %>% head
# # A tibble: 6 × 12
# Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
#
# 1 -22 NA NA NA NA NA NA NA NA NA NA NA
# 2 -21 NA NA NA NA NA NA NA NA NA NA NA
# 3 -20 NA NA NA NA NA NA NA NA NA NA NA
# 4 -19 NA NA NA NA NA NA NA NA NA NA NA
# 5 -18 -4378. -170. -3717. -168. -1473. -3767. -1173. -959. -3924. -1431. -23149.
# 6 -17 -4379. -171. -3717. -168. -1472. -3767. -1173. -959. -3923. -1430. -23147.

# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace") %>% head
# # A tibble: 6 × 4
# Country Regioncode Region Variable
#
# 1 USA ASI Asia EMP
# 2 USA ASI Asia EMP
# 3 USA ASI Asia EMP
# 4 USA ASI Asia EMP
# 5 USA ASI Asia EMP
# 6 USA ASI Asia EMP
```

Similarly for grouped transformations:

```r
# Replacing data with the 1st quartile (25%)
GGDC10S %>% fselect(Variable, Country, AGR:SUM) %>% fgroup_by(Variable, Country) %>% fnth(0.25, TRA = "replace_fill") %>% head(3)
# # A tibble: 3 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
#
# 1 VA BWA 63.5 33.1 27.3 7.36 26.8 31.1 13.2 12.0 33.6 11.5 262.
# 2 VA BWA 63.5 33.1 27.3 7.36 26.8 31.1 13.2 12.0 33.6 11.5 262.
# 3 VA BWA 63.5 33.1 27.3 7.36 26.8 31.1 13.2 12.0 33.6 11.5 262.

# Scaling sectoral data by Variable and Country
GGDC10S %>% fselect(Variable, Country, AGR:SUM) %>% fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% head
# # A tibble: 6 × 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV
#
# 1 VA BWA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 0.0270 0.000556 0.000523 3.88e-4 5.11e-4 0.00194 0.00154 5.23e-4 0.00134
# 6 VA BWA 0.0260 0.000397 0.000723 5.03e-4 1.04e-3 0.00220 0.00180 5.83e-4 0.00158
# # ℹ 2 more variables: OTH , SUM
```

The benchmarks below will demonstrate that these sweeping and replacement operations, performed entirely in C++, compute significantly faster than using `dplyr::mutate`, especially as the number of groups grows large. The S3 generic nature of the *Fast Statistical Functions* further allows us to perform grouped mutations on the fly (together with `ftransform` or `fcompute`), without the need to first create a grouped tibble:

```r
# AGR_gmed = TRUE if AGR is greater than its median value, grouped by Variable and Country
# Note: This calls fmedian.default
settransform(GGDC10S, AGR_gmed = AGR > fmedian(AGR, list(Variable, Country), TRA = "replace"))
tail(GGDC10S, 3)
# # A tibble: 3 × 17
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV
#
# 1 EGY MENA Middle Ea… EMP 2010 5206. 29.0 2436. 307. 2733. 2977. 1992. 801. 5539.
# 2 EGY MENA Middle Ea… EMP 2011 5186. 27.6 2374. 318. 2795. 3020. 2048. 815. 5636.
# 3 EGY MENA Middle Ea… EMP 2012 5161. 24.8 2348. 325. 2931. 3110. 2065. 832. 5736.
# # ℹ 3 more variables: OTH , SUM , AGR_gmed # Dividing (scaling) the sectoral data (columns 6 through 16) by their grouped standard deviation settransformv(GGDC10S, 6:16, fsd, list(Variable, Country), TRA = "/", apply = FALSE) tail(GGDC10S, 3) # # A tibble: 3 × 17 # Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 EGY MENA Middle Ea… EMP 2010 8.41 2.28 4.32 3.56 3.62 3.75 3.75 3.14 3.80 # 2 EGY MENA Middle Ea… EMP 2011 8.38 2.17 4.21 3.68 3.70 3.81 3.86 3.19 3.86 # 3 EGY MENA Middle Ea… EMP 2012 8.34 1.95 4.17 3.76 3.88 3.92 3.89 3.26 3.93 # # ℹ 3 more variables: OTH , SUM , AGR_gmed rm(GGDC10S) ``` Weights are easily added to any grouped transformation: ```r # This subtracts weighted group means from the data, using SUM column as weights.. GGDC10S %>% fselect(Variable, Country, AGR:SUM) %>% fgroup_by(Variable, Country) %>% fmean(SUM, "-") %>% head # # A tibble: 6 × 13 # Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH # # 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 5 VA BWA 37.5 -1301. -13317. -2965. -529. -2746. -6540. -2157. -4431. -7551. -2613. # 6 VA BWA 39.3 -1302. -13318. -2964. -529. -2745. -6540. -2156. -4431. -7550. -2613. ``` Sequential operations are also easily performed: ```r # This scales and then subtracts the median GGDC10S %>% fselect(Variable, Country, AGR:SUM) %>% fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% fmedian(TRA = "-") # # A tibble: 5,027 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # * # 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 5 VA BWA -0.182 -0.235 -0.183 -0.245 -0.118 -0.0820 -0.0724 -0.0661 -0.108 -0.0848 -0.146 # 6 VA BWA -0.183 -0.235 -0.183 -0.245 -0.117 -0.0817 -0.0722 -0.0660 -0.108 -0.0846 -0.146 # 7 VA BWA -0.180 -0.235 -0.183 -0.245 -0.117 -0.0813 -0.0720 -0.0659 -0.107 -0.0843 -0.145 # 8 VA BWA -0.177 -0.235 -0.183 -0.245 -0.117 -0.0826 -0.0724 -0.0659 -0.107 -0.0841 -0.146 # 9 VA BWA -0.174 -0.235 -0.183 -0.245 -0.117 -0.0823 -0.0717 -0.0661 -0.108 -0.0848 -0.146 # 10 VA BWA -0.173 -0.234 -0.182 -0.243 -0.115 -0.0821 -0.0715 -0.0660 -0.108 -0.0846 -0.145 # # ℹ 5,017 more rows # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data: ```r # This adds a groupwise observation count next to each column add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(AGR:SUM) %>% fnobs("replace_fill") %>% add_stub("N_") head(GGDC10S) # # A tibble: 6 × 27 # Country Regioncode Region Variable Year AGR N_AGR MIN N_MIN MAN N_MAN PU N_PU CON # # 1 BWA SSA Sub-sa… VA 1960 NA 47 NA 47 NA 47 NA 47 NA # 2 BWA SSA Sub-sa… VA 1961 NA 47 NA 47 NA 47 NA 47 NA # 3 BWA SSA Sub-sa… VA 1962 NA 47 NA 47 NA 47 NA 47 NA # 4 BWA SSA Sub-sa… VA 1963 NA 47 NA 47 NA 47 NA 47 NA # 5 BWA SSA Sub-sa… VA 1964 16.3 47 3.49 47 0.737 47 0.104 47 0.660 # 6 BWA SSA Sub-sa… VA 1965 15.7 47 2.50 47 1.02 47 0.135 47 1.35 # # ℹ 13 more variables: N_CON , WRT , N_WRT , TRA , N_TRA , FIRE , # # N_FIRE , GOV , N_GOV , OTH , N_OTH , SUM , N_SUM rm(GGDC10S) ``` There are lots of other examples one could construct using the 10 operations and 14 functions listed above, the examples provided just outline the 
suggested programming basics. Performance considerations make it very much worthwhile to spend some time and think how complex operations can be implemented in this programming framework, before defining some function in R and applying it to data using `dplyr::mutate`. ### 2.3 More Control using the `TRA` Function Towards this end, calling `TRA()` directly also facilitates more complex and customized operations. Behind the scenes of the `TRA = ...` argument, the *Fast Statistical Functions* first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace or sweep them out of data points in one of the 10 ways described above. This function can also be called directly by the name of `TRA`. Fundamentally, `TRA` is a generalization of `base::sweep` for column-wise grouped operations^[Row-wise operations are not supported by TRA.]. Direct calls to `TRA` enable more control over inputs and outputs. The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs: ```r # This divides by the product GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fprod(TRA = "/") %>% head # # A tibble: 6 × 11 # AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 NA NA NA NA NA NA NA NA NA # 2 NA NA NA NA NA NA NA NA NA # 3 NA NA NA NA NA NA NA NA NA # 4 NA NA NA NA NA NA NA NA NA # 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117 # 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117 # # ℹ 2 more variables: OTH , SUM # Same thing GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% TRA(fprod(., keep.group_vars = FALSE), "/") %>% head # [same as TRA(.,fprod(., keep.group_vars = FALSE),"/")] # # A tibble: 6 × 11 # AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 NA NA NA NA NA NA NA NA NA # 2 NA NA NA NA NA NA NA NA NA # 3 NA NA NA NA NA NA NA NA NA # 4 NA NA NA NA NA NA NA NA NA # 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117 # 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117 # # ℹ 2 more variables: OTH , SUM ``` `TRA.grouped_df` was designed such that it matches the columns of the statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data frame. Thus it is easily possible to only apply a transformation to the first two sectors: ```r # This only demeans Agriculture (AGR) and Mining (MIN) GGDC10S %>% fgroup_by(Variable, Country) %>% TRA(fselect(., AGR, MIN) %>% fmean(keep.group_vars = FALSE), "-") %>% head # # A tibble: 6 × 16 # Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 BWA SSA Sub-s… VA 1960 NA NA NA NA NA NA NA NA NA # 2 BWA SSA Sub-s… VA 1961 NA NA NA NA NA NA NA NA NA # 3 BWA SSA Sub-s… VA 1962 NA NA NA NA NA NA NA NA NA # 4 BWA SSA Sub-s… VA 1963 NA NA NA NA NA NA NA NA NA # 5 BWA SSA Sub-s… VA 1964 -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82 # 6 BWA SSA Sub-s… VA 1965 -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70 # # ℹ 2 more variables: OTH , SUM ``` Since `TRA` is already built into all *Fast Statistical Functions* as an argument, it is best used in computations where grouped statistics are computed using some other function. ```r # Same as above, with one line of code using fmean.data.frame and ftransform... 
GGDC10S %>% ftransform(fmean(list(AGR = AGR, MIN = MIN), list(Variable, Country), TRA = "-")) %>% head # # A tibble: 6 × 16 # Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV # # 1 BWA SSA Sub-s… VA 1960 NA NA NA NA NA NA NA NA NA # 2 BWA SSA Sub-s… VA 1961 NA NA NA NA NA NA NA NA NA # 3 BWA SSA Sub-s… VA 1962 NA NA NA NA NA NA NA NA NA # 4 BWA SSA Sub-s… VA 1963 NA NA NA NA NA NA NA NA NA # 5 BWA SSA Sub-s… VA 1964 -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82 # 6 BWA SSA Sub-s… VA 1965 -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70 # # ℹ 2 more variables: OTH , SUM ``` Another potential use of `TRA` is to do computations in two- or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations in-between the aggregating and sweeping part: ```r # Get grouped tibble gGGDC <- GGDC10S %>% fgroup_by(Variable, Country) # Get aggregated data gsumGGDC <- gGGDC %>% fselect(AGR:SUM) %>% fsum head(gsumGGDC) # # A tibble: 6 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 EMP ARG 88028. 3230. 1.20e5 6307. 4.60e4 1.23e5 4.02e4 3.89e4 1.27e5 6.15e4 6.54e5 # 2 EMP BOL 58817. 3418. 1.43e4 326. 7.49e3 1.72e4 7.04e3 2.72e3 NA 2.41e4 1.35e5 # 3 EMP BRA 1065864. 12773. 4.33e5 22604. 2.19e5 5.28e5 1.27e5 2.74e5 3.29e5 3.54e5 3.36e6 # 4 EMP BWA 8839. 493. 8.49e2 145. 1.19e3 1.71e3 3.93e2 7.21e2 2.87e3 1.30e3 1.85e4 # 5 EMP CHL 44220. 6389. 3.94e4 1850. 1.86e4 4.38e4 1.63e4 1.72e4 NA 6.32e4 2.51e5 # 6 EMP CHN 17264654. 422972. 4.03e6 96364. 1.25e6 1.73e6 8.36e5 2.96e5 1.36e6 1.86e6 2.91e7 # Get transformed (scaled) data head(TRA(gGGDC, gsumGGDC, "/")) # # A tibble: 6 × 16 # Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT # # 1 BWA SSA Sub-sahar… VA 1960 NA NA NA NA NA NA # 2 BWA SSA Sub-sahar… VA 1961 NA NA NA NA NA NA # 3 BWA SSA Sub-sahar… VA 1962 NA NA NA NA NA NA # 4 BWA SSA Sub-sahar… VA 1963 NA NA NA NA NA NA # 5 BWA SSA Sub-sahar… VA 1964 7.50e-4 1.65e-5 1.66e-5 1.03e-5 1.57e-5 6.82e-5 # 6 BWA SSA Sub-sahar… VA 1965 7.24e-4 1.18e-5 2.30e-5 1.33e-5 3.20e-5 7.72e-5 # # ℹ 5 more variables: TRA , FIRE , GOV , OTH , SUM ``` As discussed, whether using the argument to fast statistical functions or `TRA` directly, these data transformations are essentially a two-step process: Statistics are first computed and then used to transform the original data. Although both steps are efficiently done in C++, it would be even more efficient to do them in a single step without materializing all the statistics before transforming the data. Such slightly more efficient functions are provided for the very commonly applied tasks of centering and averaging data by groups (widely known as 'between'-group and 'within'-group transformations), and scaling and centering data by groups (also known as 'standardizing' data). ### 2.4 Faster Centering, Averaging and Standardizing The functions `fbetween` and `fwithin` are slightly more memory efficient implementations of `fmean` invoked with different `TRA` options: ```r GGDC10S %>% # Same as ... %>% fmean(TRA = "replace") fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween %>% tail(2) # # A tibble: 2 × 11 # AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605. # 2 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605. GGDC10S %>% # Same as ... 
%>% fmean(TRA = "replace_fill") fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween(fill = TRUE) %>% tail(2) # # A tibble: 2 × 11 # AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605. # 2 4444. 34.9 1614. 131. 997. 1307. 799. 320. 2958. NA 12605. GGDC10S %>% # Same as ... %>% fmean(TRA = "-") fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fwithin %>% tail(2) # # A tibble: 2 × 11 # AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 742. -7.35 760. 187. 1798. 1713. 1249. 495. 2678. NA 9614. # 2 717. -10.1 734. 194. 1934. 1803. 1266. 512. 2778. NA 9928. ``` Apart from higher speed, `fwithin` has a `mean` argument to assign an arbitrary mean to centered data, the default being `mean = 0`. A very common choice for such an added mean is just the overall mean of the data, which can be added in by invoking `mean = "overall.mean"`: ```r GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(Country, Variable, AGR:SUM) %>% fwithin(mean = "overall.mean") %>% tail(3) # # A tibble: 3 × 13 # Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 EGY EMP 2527458. 1867903. 5539313. 3.36e5 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 NA 2.16e7 # 2 EGY EMP 2527439. 1867902. 5539251. 3.36e5 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 NA 2.16e7 # 3 EGY EMP 2527413. 1867899. 5539226. 3.36e5 1.80e6 3.39e6 1.47e6 1.66e6 1.72e6 NA 2.16e7 ``` This can also be done using weights. The code below uses the `SUM` column as weights, and then for each variable and each group subtracts out the weighted mean, and then adds the overall weighted column mean back to the centered columns. The `SUM` column is just kept as it is and added after the grouping columns. ```r GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(Country, Variable, AGR:SUM) %>% fwithin(SUM, mean = "overall.mean") %>% tail(3) # # A tibble: 3 × 13 # Country Variable SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH # # 1 EGY EMP 22020. 429066006. 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8 NA # 2 EGY EMP 22219. 429065986. 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8 NA # 3 EGY EMP 22533. 429065961. 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8 NA ``` Another argument to `fwithin` is the `theta` parameter, allowing partial- or quasi-demeaning operations, e.g. `fwithin(gdata, theta = theta)` is equal to `gdata - theta * fbetween(gdata)`. This is particularly useful to prepare data for variance components (also known as 'random-effects') estimation. Apart from `fbetween` and `fwithin`, the function `fscale` exists to efficiently scale and center data, to avoid sequential calls such as `... %>% fsd(TRA = "/") %>% fmean(TRA = "-")`. ```r # This efficiently scales and centers (i.e. 
standardizes) the data GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(Country, Variable, AGR:SUM) %>% fscale # # A tibble: 5,027 × 13 # Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # * # 1 BWA VA NA NA NA NA NA NA NA NA NA NA NA # 2 BWA VA NA NA NA NA NA NA NA NA NA NA NA # 3 BWA VA NA NA NA NA NA NA NA NA NA NA NA # 4 BWA VA NA NA NA NA NA NA NA NA NA NA NA # 5 BWA VA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676 # 6 BWA VA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676 # 7 BWA VA -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676 # 8 BWA VA -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676 # 9 BWA VA -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676 # 10 BWA VA -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675 # # ℹ 5,017 more rows # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` `fscale` also has additional `mean` and `sd` arguments allowing the user to (group-) scale data to an arbitrary mean and standard deviation. Setting `mean = FALSE` just scales the data but preserves the means, and is thus different from `fsd(..., TRA = "/")` which simply divides all values by the standard deviation: ```r # Saving grouped tibble gGGDC <- GGDC10S %>% fgroup_by(Variable, Country) %>% fselect(Country, Variable, AGR:SUM) # Original means head(fmean(gGGDC)) # # A tibble: 6 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4 # 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3 # 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4 # 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2 # 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3 # 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5 # Mean Preserving Scaling head(fmean(fscale(gGGDC, mean = FALSE))) # # A tibble: 6 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4 # 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3 # 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4 # 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2 # 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3 # 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5 head(fsd(fscale(gGGDC, mean = FALSE))) # # A tibble: 6 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 EMP ARG 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 # 2 EMP BOL 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 NA 1.00 1.00 # 3 EMP BRA 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 # 4 EMP BWA 1.00 1.00 1.00 1 1.00 1.00 1.00 1 1.00 1.00 1.00 # 5 EMP CHL 1.00 1 1.00 1.00 1.00 1.00 1.00 1.00 NA 1.00 1.00 # 6 EMP CHN 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 ``` One can also set `mean = "overall.mean"`, which group-centers columns on the overall mean as illustrated with `fwithin`. Another interesting option is setting `sd = "within.sd"`. 
This group-scales data such that every group has a standard deviation equal to the within-standard deviation of the data: ```r # Just using VA data for this example gGGDC <- GGDC10S %>% fsubset(Variable == "VA", Country, AGR:SUM) %>% fgroup_by(Country) # This calculates the within- standard deviation for all columns fsd(num_vars(ungroup(fwithin(gGGDC)))) # AGR MIN MAN PU CON WRT TRA FIRE GOV OTH # 45046972 40122220 75608708 3062688 30811572 44125207 20676901 16030868 20358973 18780869 # SUM # 306429102 # This scales all groups to take on the within- standard deviation while preserving group means fsd(fscale(gGGDC, mean = FALSE, sd = "within.sd")) # # A tibble: 43 × 12 # Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # # 1 ARG 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # 2 BOL 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8 # 3 BRA 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # 4 BWA 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # 5 CHL 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8 # 6 CHN 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # 7 COL 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8 # 8 CRI 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # 9 DEW 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # 10 DNK 45046972. 40122220. 75608708. 3062688. 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8 # # ℹ 33 more rows ``` A grouped scaling operation with both `mean = "overall.mean"` and `sd = "within.sd"` thus efficiently achieves a harmonization of all groups in the first two moments without changing the fundamental properties (in terms of level and scale) of the data. ### 2.5 Lags / Leads, Differences and Growth Rates This section introduces 3 further powerful *collapse* functions: `flag`, `fdiff` and `fgrowth`. The first function, `flag`, efficiently computes sequences of fully identified lags and leads on time series and panel data. 
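To illustrate what 'fully identified' means, here is a small constructed example (the data are made up here and are not part of the original vignette). Because the time variable jumps from 2 to 4, the lag at t = 4 is NA rather than the value observed at t = 2:

```r
# Constructed mini-example of fully identified panel-lags on an irregular series
d <- data.frame(id = c(1, 1, 1, 2, 2),
                t  = c(1, 2, 4, 1, 2),
                x  = c(10, 11, 13, 5, 6))
# Lag by id, using t to identify the time structure
flag(d$x, 1, g = d$id, t = d$t)
```
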
The following code computes 1 fully-identified panel-lag and 1 fully identified panel-lead of each variable in the data: ```r GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% flag(-1:1, Year) # # A tibble: 5,027 × 36 # Country Variable Year F1.AGR AGR L1.AGR F1.MIN MIN L1.MIN F1.MAN MAN L1.MAN F1.PU PU # * # 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA NA # 4 BWA VA 1963 16.3 NA NA 3.49 NA NA 0.737 NA NA 0.104 NA # 5 BWA VA 1964 15.7 16.3 NA 2.50 3.49 NA 1.02 0.737 NA 0.135 0.104 # 6 BWA VA 1965 17.7 15.7 16.3 1.97 2.50 3.49 0.804 1.02 0.737 0.203 0.135 # 7 BWA VA 1966 19.1 17.7 15.7 2.30 1.97 2.50 0.938 0.804 1.02 0.203 0.203 # 8 BWA VA 1967 21.1 19.1 17.7 1.84 2.30 1.97 0.750 0.938 0.804 0.203 0.203 # 9 BWA VA 1968 21.9 21.1 19.1 5.24 1.84 2.30 2.14 0.750 0.938 0.578 0.203 # 10 BWA VA 1969 23.1 21.9 21.1 10.2 5.24 1.84 4.15 2.14 0.750 1.12 0.578 # # ℹ 5,017 more rows # # ℹ 22 more variables: L1.PU , F1.CON , CON , L1.CON , F1.WRT , WRT , # # L1.WRT , F1.TRA , TRA , L1.TRA , F1.FIRE , FIRE , L1.FIRE , # # F1.GOV , GOV , L1.GOV , F1.OTH , OTH , L1.OTH , F1.SUM , # # SUM , L1.SUM # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` If the time-variable passed does not exactly identify the data (i.e. because of repeated values in each group), all 3 functions will issue appropriate error messages. `flag`, `fdiff` and `fgrowth` support irregular time series and unbalanced panels. It is also possible to omit the time-variable if one is certain that the data is sorted: ```r GGDC10S %>% fselect(Variable, Country,AGR:SUM) %>% fgroup_by(Variable, Country) %>% flag # # A tibble: 5,027 × 13 # Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM # * # 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 5 VA BWA NA NA NA NA NA NA NA NA NA NA NA # 6 VA BWA 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5 # 7 VA BWA 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3 # 8 VA BWA 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1 # 9 VA BWA 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4 # 10 VA BWA 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1 # # ℹ 5,017 more rows # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` `fdiff` computes sequences of lagged-leaded and iterated differences as well as quasi-differences and log-differences on time series and panel data. 
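Before the panel example, a small sketch of the underlying relationships may help (constructed here, not part of the original examples): a first difference is simply the series minus its lag, and an iterated (second) difference is the difference of the difference.

```r
# Sketch of fdiff on a plain vector (collapse is already loaded above)
x <- c(2, 4, 7, 11, 16)
fdiff(x)              # first difference: conceptually x - flag(x) -> NA 2 3 4 5
fdiff(x, diff = 2)    # second (iterated) difference, i.e. fdiff(fdiff(x)) -> NA NA 1 1 1
fdiff(x, log = TRUE)  # log-difference: log(x) - log(flag(x))
```
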
The code below computes the 1 and 10 year first and second differences of each variable in the data: ```r GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1:2, Year) # # A tibble: 5,027 × 47 # Country Variable Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN # * # 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA # 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA # 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA # 6 BWA VA 1965 -0.575 NA NA NA -0.998 NA NA NA 0.282 # 7 BWA VA 1966 1.95 2.53 NA NA -0.525 0.473 NA NA -0.214 # 8 BWA VA 1967 1.47 -0.488 NA NA 0.328 0.854 NA NA 0.134 # 9 BWA VA 1968 1.95 0.488 NA NA -0.460 -0.788 NA NA -0.188 # 10 BWA VA 1969 0.763 -1.19 NA NA 3.41 3.87 NA NA 1.39 # # ℹ 5,017 more rows # # ℹ 35 more variables: D2.MAN , L10D1.MAN , L10D2.MAN , D1.PU , D2.PU , # # L10D1.PU , L10D2.PU , D1.CON , D2.CON , L10D1.CON , L10D2.CON , # # D1.WRT , D2.WRT , L10D1.WRT , L10D2.WRT , D1.TRA , D2.TRA , # # L10D1.TRA , L10D2.TRA , D1.FIRE , D2.FIRE , L10D1.FIRE , # # L10D2.FIRE , D1.GOV , D2.GOV , L10D1.GOV , L10D2.GOV , D1.OTH , # # D2.OTH , L10D1.OTH , L10D2.OTH , D1.SUM , D2.SUM , L10D1.SUM , … # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` Log-differences of the form $log(x_t) - log(x_{t-s})$ are also easily computed. ```r GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1, Year, log = TRUE) # # A tibble: 5,027 × 25 # Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN # * # 1 BWA VA 1960 NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA # 4 BWA VA 1963 NA NA NA NA NA NA # 5 BWA VA 1964 NA NA NA NA NA NA # 6 BWA VA 1965 -0.0359 NA -0.336 NA 0.324 NA # 7 BWA VA 1966 0.117 NA -0.236 NA -0.236 NA # 8 BWA VA 1967 0.0796 NA 0.154 NA 0.154 NA # 9 BWA VA 1968 0.0972 NA -0.223 NA -0.223 NA # 10 BWA VA 1969 0.0355 NA 1.05 NA 1.05 NA # # ℹ 5,017 more rows # # ℹ 16 more variables: Dlog1.PU , L10Dlog1.PU , Dlog1.CON , L10Dlog1.CON , # # Dlog1.WRT , L10Dlog1.WRT , Dlog1.TRA , L10Dlog1.TRA , Dlog1.FIRE , # # L10Dlog1.FIRE , Dlog1.GOV , L10Dlog1.GOV , Dlog1.OTH , L10Dlog1.OTH , # # Dlog1.SUM , L10Dlog1.SUM # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` Finally, it is also possible to compute quasi-differences and quasi-log-differences of the form $x_t - \rho x_{t-s}$ or $log(x_t) - \rho log(x_{t-s})$: ```r GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% fdiff(t = Year, rho = 0.95) # # A tibble: 5,027 × 14 # Country Variable Year AGR MIN MAN PU CON WRT TRA FIRE GOV OTH # * # 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA # 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA NA # 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA NA # 6 BWA VA 1965 0.241 -0.824 0.318 0.0359 0.719 1.13 0.363 0.184 1.11 0.454 # 7 BWA VA 1966 2.74 -0.401 -0.163 0.0743 0.0673 1.56 0.312 0.174 0.955 0.449 # 8 BWA VA 1967 2.35 0.427 0.174 0.0101 -0.381 -3.55 -0.323 0.246 0.988 0.465 # 9 BWA VA 1968 2.91 -0.345 -0.141 0.0101 0.365 1.08 0.804 -0.427 -1.66 -0.780 # 10 BWA VA 1969 1.82 3.50 1.43 0.385 2.32 0.841 0.397 0.252 0.818 0.385 # # ℹ 5,017 more rows # # ℹ 1 more variable: SUM # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` The quasi-differencing feature was added to `fdiff` to facilitate the preparation of time series and 
panel data for least-squares estimations suffering from serial correlation following Cochrane & Orcutt (1949). Finally, `fgrowth` computes growth rates in the same way. By default exact growth rates are computed in percentage terms using $(x_t-x_{t-s}) / x_{t-s} \times 100$ (the default argument is `scale = 100`). The user can also request growth rates obtained by log-differencing using $log(x_t/ x_{t-s}) \times 100$. ```r # Exact growth rates, computed as: (x/lag(x) - 1) * 100 GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year) # # A tibble: 5,027 × 25 # Country Variable Year G1.AGR L10G1.AGR G1.MIN L10G1.MIN G1.MAN L10G1.MAN G1.PU L10G1.PU G1.CON # * # 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA # 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA # 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA # 6 BWA VA 1965 -3.52 NA -28.6 NA 38.2 NA 29.4 NA 104. # 7 BWA VA 1966 12.4 NA -21.1 NA -21.1 NA 50 NA 0 # 8 BWA VA 1967 8.29 NA 16.7 NA 16.7 NA 0 NA -33.3 # 9 BWA VA 1968 10.2 NA -20 NA -20 NA 0 NA 35.7 # 10 BWA VA 1969 3.61 NA 185. NA 185. NA 185. NA 185. # # ℹ 5,017 more rows # # ℹ 13 more variables: L10G1.CON , G1.WRT , L10G1.WRT , G1.TRA , # # L10G1.TRA , G1.FIRE , L10G1.FIRE , G1.GOV , L10G1.GOV , G1.OTH , # # L10G1.OTH , G1.SUM , L10G1.SUM # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] # Log-difference growth rates, computed as: log(x / lag(x)) * 100 GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year, logdiff = TRUE) # # A tibble: 5,027 × 25 # Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN # * # 1 BWA VA 1960 NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA # 4 BWA VA 1963 NA NA NA NA NA NA # 5 BWA VA 1964 NA NA NA NA NA NA # 6 BWA VA 1965 -3.59 NA -33.6 NA 32.4 NA # 7 BWA VA 1966 11.7 NA -23.6 NA -23.6 NA # 8 BWA VA 1967 7.96 NA 15.4 NA 15.4 NA # 9 BWA VA 1968 9.72 NA -22.3 NA -22.3 NA # 10 BWA VA 1969 3.55 NA 105. NA 105. NA # # ℹ 5,017 more rows # # ℹ 16 more variables: Dlog1.PU , L10Dlog1.PU , Dlog1.CON , L10Dlog1.CON , # # Dlog1.WRT , L10Dlog1.WRT , Dlog1.TRA , L10Dlog1.TRA , Dlog1.FIRE , # # L10Dlog1.FIRE , Dlog1.GOV , L10Dlog1.GOV , Dlog1.OTH , L10Dlog1.OTH , # # Dlog1.SUM , L10Dlog1.SUM # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` `fdiff` and `fgrowth` can also perform leaded (forward) differences and growth rates (i.e. `... %>% fgrowth(-c(1, 10), 1:2, Year)` would compute one and 10-year leaded first and second differences). Again it is possible to perform sequential operations: ```r # This computes the 1 and 10-year growth rates, for the current period and lagged by one period GGDC10S %>% fselect(-Region, -Regioncode) %>% fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year) # # A tibble: 5,027 × 47 # Country Variable Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN # * # 1 BWA VA 1960 NA NA NA NA NA NA NA # 2 BWA VA 1961 NA NA NA NA NA NA NA # 3 BWA VA 1962 NA NA NA NA NA NA NA # 4 BWA VA 1963 NA NA NA NA NA NA NA # 5 BWA VA 1964 NA NA NA NA NA NA NA # 6 BWA VA 1965 -3.52 NA NA NA -28.6 NA NA # 7 BWA VA 1966 12.4 -3.52 NA NA -21.1 -28.6 NA # 8 BWA VA 1967 8.29 12.4 NA NA 16.7 -21.1 NA # 9 BWA VA 1968 10.2 8.29 NA NA -20 16.7 NA # 10 BWA VA 1969 3.61 10.2 NA NA 185. 
-20 NA # # ℹ 5,017 more rows # # ℹ 37 more variables: L1.L10G1.MIN , G1.MAN , L1.G1.MAN , L10G1.MAN , # # L1.L10G1.MAN , G1.PU , L1.G1.PU , L10G1.PU , L1.L10G1.PU , # # G1.CON , L1.G1.CON , L10G1.CON , L1.L10G1.CON , G1.WRT , # # L1.G1.WRT , L10G1.WRT , L1.L10G1.WRT , G1.TRA , L1.G1.TRA , # # L10G1.TRA , L1.L10G1.TRA , G1.FIRE , L1.G1.FIRE , L10G1.FIRE , # # L1.L10G1.FIRE , G1.GOV , L1.G1.GOV , L10G1.GOV , L1.L10G1.GOV , … # # Grouped by: Variable, Country [85 | 59 (7.7) 4-65] ``` ## 3. Benchmarks This section seeks to demonstrate that the functionality introduced in the preceding 2 sections indeed produces code that evaluates substantially faster than native *dplyr*. To do this properly, the different components of a typical piped call (selecting / subsetting, ordering, grouping, and performing some computation) are benchmarked separately on 2 different data sizes. All benchmarks are run on a Windows 8.1 laptop with a 2x 2.2 GHZ Intel i5 processor, 8GB DDR3 RAM and a Samsung 850 EVO SSD hard drive. ### 3.1 Data Benchmarks are run on the original `GGDC10S` data used throughout this vignette and a larger dataset with approx. 1 million observations, obtained by replicating and row-binding `GGDC10S` 200 times while maintaining unique groups. ```r # This shows the groups in GGDC10S GRP(GGDC10S, ~ Variable + Country) # collapse grouping object of length 5027 with 85 ordered groups # # Call: GRP.default(X = GGDC10S, by = ~Variable + Country), X is unsorted # # Distribution of group sizes: # Min. 1st Qu. Median Mean 3rd Qu. Max. # 4.00 53.00 62.00 59.14 63.00 65.00 # # Groups with sizes: # EMP.ARG EMP.BOL EMP.BRA EMP.BWA EMP.CHL EMP.CHN # 62 61 62 52 63 62 # --- # VA.TWN VA.TZA VA.USA VA.VEN VA.ZAF VA.ZMB # 63 52 65 63 52 52 # This replicates the data 200 times data <- replicate(200, GGDC10S, simplify = FALSE) # This function adds a number i to the country and variable columns of each dataset uniquify <- function(x, i) ftransform(x, lapply(unclass(x)[c(1,4)], paste0, i)) # Making datasets unique and row-binding them data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE) fdim(data) # [1] 1005400 16 # This shows the groups in the replicated data GRP(data, ~ Variable + Country) # collapse grouping object of length 1005400 with 17000 ordered groups # # Call: GRP.default(X = data, by = ~Variable + Country), X is unsorted # # Distribution of group sizes: # Min. 1st Qu. Median Mean 3rd Qu. Max. 
# 4.00 53.00 62.00 59.14 63.00 65.00 # # Groups with sizes: # EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1 # 62 61 62 52 63 62 # --- # VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99 # 63 52 65 63 52 52 gc() # used (Mb) gc trigger (Mb) limit (Mb) max used (Mb) # Ncells 3184710 170.1 8862174 473.3 NA 8862174 473.3 # Vcells 23965820 182.9 147787078 1127.6 16384 445825141 3401.4 ``` ### 3.2 Selecting, Subsetting, Ordering and Grouping ```r ## Selecting columns # Small microbenchmark(dplyr = select(GGDC10S, Country, Variable, AGR:SUM), collapse = fselect(GGDC10S, Country, Variable, AGR:SUM)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 400.775 410.7585 425.43117 416.396 424.637 820.041 100 # collapse 2.911 3.4645 4.59856 4.469 5.412 15.293 100 # Large microbenchmark(dplyr = select(data, Country, Variable, AGR:SUM), collapse = fselect(data, Country, Variable, AGR:SUM)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 388.926 396.429 412.67730 402.9890 411.0455 728.734 100 # collapse 2.870 3.280 4.44686 3.8335 5.3300 12.669 100 ## Subsetting rows # Small microbenchmark(dplyr = filter(GGDC10S, Variable == "VA"), collapse = fsubset(GGDC10S, Variable == "VA")) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 374.084 394.4405 409.23986 401.0005 414.3050 716.475 100 # collapse 39.278 48.2775 55.85307 55.5550 60.4545 103.320 100 # Large microbenchmark(dplyr = filter(data, Variable == "VA"), collapse = fsubset(data, Variable == "VA")) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 4.487409 5.242752 8.352270 5.653223 6.434048 159.13658 100 # collapse 2.840808 3.082359 3.469128 3.163478 3.302714 16.56047 100 ## Ordering rows # Small microbenchmark(dplyr = arrange(GGDC10S, desc(Country), Variable, Year), collapse = roworder(GGDC10S, -Country, Variable, Year)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 1715.112 1867.4270 1983.4726 2015.109 2080.7500 2367.791 100 # collapse 192.495 232.4085 256.3878 247.968 258.7715 1055.381 100 # Large microbenchmark(dplyr = arrange(data, desc(Country), Variable, Year), collapse = roworder(data, -Country, Variable, Year), times = 2) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 89.37512 89.37512 101.05180 101.05180 112.72848 112.72848 2 # collapse 66.46703 66.46703 67.45254 67.45254 68.43806 68.43806 2 ## Grouping # Small microbenchmark(dplyr = group_by(GGDC10S, Country, Variable), collapse = fgroup_by(GGDC10S, Country, Variable)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 778.713 815.1825 911.3484 874.2225 960.3840 1529.874 100 # collapse 146.534 157.6245 198.5921 165.0660 177.3455 1484.241 100 # Large microbenchmark(dplyr = group_by(data, Country, Variable), collapse = fgroup_by(data, Country, Variable), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 34.20294 34.62839 34.88041 34.88432 35.07821 35.48279 10 # collapse 27.89972 28.03211 28.55175 28.36954 29.32283 29.54206 10 ## Computing a new column # Small microbenchmark(dplyr = mutate(GGDC10S, NEW = AGR+1), collapse = ftransform(GGDC10S, NEW = AGR+1)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 317.463 321.7270 333.38822 324.9660 333.7810 631.564 100 # collapse 8.897 11.0495 12.95354 12.4435 14.2065 38.991 100 # Large microbenchmark(dplyr = mutate(data, NEW = AGR+1), collapse = ftransform(data, NEW = AGR+1)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr
637.878 1084.225 1330.006 1164.6665 1291.2335 15869.05 100 # collapse 210.740 657.025 1021.434 698.3735 781.7675 16725.09 100 ## All combined with pipes # Small microbenchmark(dplyr = filter(GGDC10S, Variable == "VA") %>% select(Country, Year, AGR:SUM) %>% arrange(desc(Country), Year) %>% mutate(NEW = AGR+1) %>% group_by(Country), collapse = fsubset(GGDC10S, Variable == "VA", Country, Year, AGR:SUM) %>% roworder(-Country, Year) %>% ftransform(NEW = AGR+1) %>% fgroup_by(Country)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 2982.340 3416.325 3525.7983 3538.464 3668.516 5034.021 100 # collapse 136.858 186.632 214.4681 211.683 243.130 314.470 100 # Large microbenchmark(dplyr = filter(data, Variable == "VA") %>% select(Country, Year, AGR:SUM) %>% arrange(desc(Country), Year) %>% mutate(NEW = AGR+1) %>% group_by(Country), collapse = fsubset(data, Variable == "VA", Country, Year, AGR:SUM) %>% roworder(-Country, Year) %>% ftransform(NEW = AGR+1) %>% fgroup_by(Country), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 7.917182 7.997378 8.142653 8.109943 8.292291 8.423163 10 # collapse 3.080289 3.104028 3.150153 3.140969 3.188365 3.251259 10 gc() # used (Mb) gc trigger (Mb) limit (Mb) max used (Mb) # Ncells 3184728 170.1 8862174 473.3 NA 8862174 473.3 # Vcells 23970594 182.9 75772825 578.2 16384 445825141 3401.4 ``` ### 3.3 Aggregation ```r ## Grouping the data cgGGDC10S <- fgroup_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode) gGGDC10S <- group_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode) cgdata <- fgroup_by(data, Variable, Country) %>% fselect(-Region, -Regioncode) gdata <- group_by(data, Variable, Country) %>% fselect(-Region, -Regioncode) rm(data, GGDC10S) gc() # used (Mb) gc trigger (Mb) limit (Mb) max used (Mb) # Ncells 3201723 171 8862174 473.3 NA 8862174 473.3 # Vcells 23589381 180 75772825 578.2 16384 445825141 3401.4 ## Conversion of Grouping object: This time would be required extra in all hybrid calls ## i.e.
when calling collapse functions on data grouped with dplyr::group_by # Small microbenchmark(GRP(gGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # GRP(gGGDC10S) 8.692 9.2455 10.16021 9.4915 10.086 39.196 100 # Large microbenchmark(GRP(gdata)) # Unit: microseconds # expr min lq mean median uq max neval # GRP(gdata) 885.641 1160.915 1248.258 1237.236 1323.234 1651.398 100 ## Sum # Small microbenchmark(dplyr = summarise_all(gGGDC10S, sum, na.rm = TRUE), collapse = fsum(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 3017.723 3354.1895 3733.4739 3620.9560 3738.441 22135.736 100 # collapse 218.120 227.3655 236.7693 235.1965 244.852 270.805 100 # Large microbenchmark(dplyr = summarise_all(gdata, sum, na.rm = TRUE), collapse = fsum(cgdata), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 272.9737 279.91024 305.02067 283.59737 303.57122 448.07629 10 # collapse 41.5330 41.63214 41.88717 41.77062 41.96059 42.78662 10 ## Mean # Small microbenchmark(dplyr = summarise_all(gGGDC10S, mean.default, na.rm = TRUE), collapse = fmean(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 4360.104 4596.6740 5125.4194 4754.791 5005.710 37144.852 100 # collapse 169.084 174.3935 185.4594 183.434 194.832 221.933 100 # Large microbenchmark(dplyr = summarise_all(gdata, mean.default, na.rm = TRUE), collapse = fmean(cgdata), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 623.5123 642.83748 704.39836 681.32260 786.82731 829.74435 10 # collapse 31.7636 31.88037 32.00222 31.99445 32.08209 32.43875 10 ## Median # Small microbenchmark(dplyr = summarise_all(gGGDC10S, median, na.rm = TRUE), collapse = fmedian(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 14399.118 14849.933 16170.3500 14982.5685 15145.892 33613.235 100 # collapse 137.596 164.902 189.2056 178.1245 214.676 248.624 100 # Large microbenchmark(dplyr = summarise_all(gdata, median, na.rm = TRUE), collapse = fmedian(cgdata), times = 2) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 2826.83036 2826.83036 2828.12912 2828.12912 2829.42788 2829.42788 2 # collapse 19.95564 19.95564 19.98524 19.98524 20.01485 20.01485 2 ## Standard Deviation # Small microbenchmark(dplyr = summarise_all(gGGDC10S, sd, na.rm = TRUE), collapse = fsd(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 8332.635 8612.5215 9365.1216 8712.766 8989.086 25087.982 100 # collapse 242.228 251.0225 269.7849 273.552 282.326 321.891 100 # Large microbenchmark(dplyr = summarise_all(gdata, sd, na.rm = TRUE), collapse = fsd(cgdata), times = 2) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 1375.80363 1375.80363 1409.60358 1409.60358 1443.40352 1443.40352 2 # collapse 46.21713 46.21713 56.88205 56.88205 67.54697 67.54697 2 ## Maximum # Small microbenchmark(dplyr = summarise_all(gGGDC10S, max, na.rm = TRUE), collapse = fmax(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 39964.504 41008.8560 43577.92707 41448.273 44195.1095 58816.550 100 # collapse 68.798 74.7225 87.83389 77.572 100.9215 129.519 100 # Large microbenchmark(dplyr = summarise_all(gdata, max, na.rm = TRUE), collapse = fmax(cgdata), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 480.83804 490.9982 540.7374 517.86136 533.85723 687.14713 10 # collapse 11.40116 11.7745 11.9366 11.85156 11.94908 13.18318 10 ## First Value # Small 
microbenchmark(dplyr = summarise_all(gGGDC10S, first), collapse = ffirst(cgGGDC10S, na.rm = FALSE)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 4147.888 4242.249 4801.88966 4383.248 4701.532 19254.215 100 # collapse 11.685 14.227 26.25476 24.764 35.301 137.514 100 # Large microbenchmark(dplyr = summarise_all(gdata, first), collapse = ffirst(cgdata, na.rm = FALSE), times = 10) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 530327.66 558767.393 637499.226 596503.08 672801.103 969373.660 10 # collapse 872.89 999.088 1087.845 1068.87 1204.416 1289.327 10 ## Number of Distinct Values # Small microbenchmark(dplyr = summarise_all(gGGDC10S, n_distinct, na.rm = TRUE), collapse = fndistinct(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 11316.574 11600.847 12573.1010 11759.435 11939.487 31659.667 100 # collapse 189.051 205.164 226.0933 235.422 239.604 443.661 100 # Large microbenchmark(dplyr = summarise_all(gdata, n_distinct, na.rm = TRUE), collapse = fndistinct(cgdata), times = 5) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 2044.13376 2110.16926 2133.91960 2138.07456 2154.39797 2222.82246 5 # collapse 30.65443 30.94582 31.51081 31.17123 31.17972 33.60286 5 gc() # used (Mb) gc trigger (Mb) limit (Mb) max used (Mb) # Ncells 3972309 212.2 8862174 473.3 NA 8862174 473.3 # Vcells 24857303 189.7 75772825 578.2 16384 445825141 3401.4 ``` Below are some additional benchmarks for weighted aggregations and aggregations using the statistical mode, which cannot easily or efficiently be performed with *dplyr*. ```r ## Weighted Mean # Small microbenchmark(fmean(cgGGDC10S, SUM)) # Unit: microseconds # expr min lq mean median uq max neval # fmean(cgGGDC10S, SUM) 195.488 200.4285 218.2836 211.1295 218.8375 444.276 100 # Large microbenchmark(fmean(cgdata, SUM), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fmean(cgdata, SUM) 34.73516 35.28276 35.66689 35.32257 36.44802 36.80722 10 ## Weighted Standard-Deviation # Small microbenchmark(fsd(cgGGDC10S, SUM)) # Unit: microseconds # expr min lq mean median uq max neval # fsd(cgGGDC10S, SUM) 243.048 244.606 249.2181 246.9635 249.444 323.9 100 # Large microbenchmark(fsd(cgdata, SUM), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fsd(cgdata, SUM) 44.905 44.93116 45.15391 45.01095 45.22677 46.14689 10 ## Statistical Mode # Small microbenchmark(fmode(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # fmode(cgGGDC10S) 245.098 248.3575 253.4809 250.6945 253.9335 420.619 100 # Large microbenchmark(fmode(cgdata), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fmode(cgdata) 40.26151 41.82082 41.63019 41.88382 42.0232 42.0587 10 ## Weighted Statistical Mode # Small microbenchmark(fmode(cgGGDC10S, SUM)) # Unit: microseconds # expr min lq mean median uq max neval # fmode(cgGGDC10S, SUM) 330.993 333.535 337.7744 334.5395 337.3685 447.187 100 # Large microbenchmark(fmode(cgdata, SUM), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fmode(cgdata, SUM) 57.69815 57.78466 57.98187 57.84567 58.09942 58.81835 10 gc() # used (Mb) gc trigger (Mb) limit (Mb) max used (Mb) # Ncells 3971768 212.2 8862174 473.3 NA 8862174 473.3 # Vcells 24853915 189.7 75772825 578.2 16384 445825141 3401.4 ``` ### 3.4 Transformation ```r ## Replacing with group sum # Small microbenchmark(dplyr = mutate_all(gGGDC10S, sum, na.rm = TRUE), collapse = fsum(cgGGDC10S, TRA = "replace_fill")) #
Unit: microseconds # expr min lq mean median uq max neval # dplyr 13088.102 13223.340 14388.9000 13359.7680 14380.05 29060.554 100 # collapse 238.456 273.757 292.1693 293.9905 312.01 388.106 100 # Large microbenchmark(dplyr = mutate_all(gdata, sum, na.rm = TRUE), collapse = fsum(cgdata, TRA = "replace_fill"), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 391.63618 679.62609 662.91807 716.40975 729.7527 749.4973 10 # collapse 49.63788 50.24189 61.77658 55.18416 63.4596 111.6039 10 ## Dividing by group sum # Small microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x/sum(x, na.rm = TRUE)), collapse = fsum(cgGGDC10S, TRA = "/")) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 13058.992 13203.8450 14294.3733 13321.41 13880.796 42300.028 100 # collapse 242.884 268.5295 278.8541 274.29 294.585 330.255 100 # Large microbenchmark(dplyr = mutate_all(gdata, function(x) x/sum(x, na.rm = TRUE)), collapse = fsum(cgdata, TRA = "/"), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 474.9046 654.6199 796.14248 907.32863 942.32567 999.2501 10 # collapse 49.3542 50.9056 84.66647 52.05635 74.51705 325.4319 10 ## Centering # Small microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x-mean.default(x, na.rm = TRUE)), collapse = fwithin(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 14460.04 14769.4095 15977.4942 14859.815 15013.421 37113.077 100 # collapse 203.77 229.7845 246.5043 242.638 266.664 293.191 100 # Large microbenchmark(dplyr = mutate_all(gdata, function(x) x-mean.default(x, na.rm = TRUE)), collapse = fwithin(cgdata), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 893.06503 925.50231 1217.2225 1259.34620 1445.254 1545.5490 10 # collapse 43.90731 56.97093 143.4797 73.39498 152.872 429.3341 10 ## Centering and Scaling (Standardizing) # Small microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)), collapse = fscale(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr 20275.033 21145.524 24976.1242 22214.190 25194.0285 79869.435 100 # collapse 277.775 304.958 323.3613 314.388 338.2705 437.388 100 # Large microbenchmark(dplyr = mutate_all(gdata, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)), collapse = fscale(cgdata), times = 2) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr 2118.97696 2118.97696 2315.9282 2315.9282 2512.87938 2512.87938 2 # collapse 60.17144 60.17144 60.6284 60.6284 61.08537 61.08537 2 ## Lag # Small microbenchmark(dplyr_unordered = mutate(gGGDC10S, across(everything(), dplyr::lag)), collapse_unordered = flag(cgGGDC10S), dplyr_ordered = mutate(gGGDC10S, across(everything(), \(x) dplyr::lag(x, order_by = Year))), collapse_ordered = flag(cgGGDC10S, t = Year)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr_unordered 14495.386 14796.101 17579.85413 15265.3250 15889.7550 49137.721 100 # collapse_unordered 48.544 75.071 90.29225 86.6330 109.6545 225.377 100 # dplyr_ordered 24893.437 25327.607 27521.59809 25904.9275 27136.2190 51312.074 100 # collapse_ordered 80.196 107.953 120.85160 117.5675 131.6715 189.051 100 # Large microbenchmark(dplyr_unordered = mutate(gdata, across(everything(), dplyr::lag)), collapse_unordered = flag(cgdata), dplyr_ordered = mutate(gdata, across(everything(), \(x) dplyr::lag(x, order_by = Year))), collapse_ordered = flag(cgdata, t = Year), times = 2) # Unit: 
milliseconds # expr min lq mean median uq max neval # dplyr_unordered 3461.11500 3461.11500 3471.95821 3471.95821 3482.80142 3482.80142 2 # collapse_unordered 13.71897 13.71897 211.59809 211.59809 409.47721 409.47721 2 # dplyr_ordered 5786.57522 5786.57522 6291.90389 6291.90389 6797.23256 6797.23256 2 # collapse_ordered 25.14399 25.14399 35.36102 35.36102 45.57806 45.57806 2 ## First-Difference (unordered) # Small microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, function(x) x - dplyr::lag(x)), collapse_unordered = fdiff(cgGGDC10S)) # Unit: microseconds # expr min lq mean median uq max neval # dplyr_unordered 25613.274 25878.0725 27951.41954 26257.3225 27226.808 43048.893 100 # collapse_unordered 56.539 72.3035 95.72147 91.6965 102.664 254.077 100 # Large microbenchmark(dplyr_unordered = mutate_all(gdata, function(x) x - dplyr::lag(x)), collapse_unordered = fdiff(cgdata), times = 2) # Unit: milliseconds # expr min lq mean median uq max neval # dplyr_unordered 3287.88487 3287.88487 3425.69703 3425.69703 3563.509 3563.509 2 # collapse_unordered 16.58971 16.58971 23.36885 23.36885 30.148 30.148 2 gc() # used (Mb) gc trigger (Mb) limit (Mb) max used (Mb) # Ncells 3978800 212.5 8862175 473.3 NA 8862175 473.3 # Vcells 24870572 189.8 72805912 555.5 16384 445825141 3401.4 ``` Below again are some benchmarks for transformations not easily or efficiently performed with *dplyr*, such as centering on the overall mean, mean-preserving scaling, weighted scaling and centering, sequences of lags / leads, (iterated) panel-differences and growth rates. ```r # Centering on overall mean microbenchmark(fwithin(cgdata, mean = "overall.mean"), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fwithin(cgdata, mean = "overall.mean") 44.66782 48.03445 52.04073 50.07953 53.67134 71.13221 10 # Weighted Centering microbenchmark(fwithin(cgdata, SUM), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fwithin(cgdata, SUM) 40.45204 42.86833 46.55326 46.18277 47.28202 57.82673 10 microbenchmark(fwithin(cgdata, SUM, mean = "overall.mean"), times = 10) # Unit: milliseconds # expr min lq mean median uq max # fwithin(cgdata, SUM, mean = "overall.mean") 39.99279 40.32256 43.0638 40.60269 41.34366 54.45542 # neval # 10 # Weighted Scaling and Standardizing microbenchmark(fsd(cgdata, SUM, TRA = "/"), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fsd(cgdata, SUM, TRA = "/") 50.19536 50.9145 55.12553 53.23862 56.27094 67.46816 10 microbenchmark(fscale(cgdata, SUM), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fscale(cgdata, SUM) 54.14792 57.64584 60.83251 59.88025 61.16425 72.31928 10 # Sequence of lags and leads microbenchmark(flag(cgdata, -1:1), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # flag(cgdata, -1:1) 26.03902 48.02695 194.8518 257.0652 264.5479 276.5348 10 # Iterated difference microbenchmark(fdiff(cgdata, 1, 2), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fdiff(cgdata, 1, 2) 38.76001 39.83896 44.93731 41.08887 48.98348 63.42528 10 # Growth Rate microbenchmark(fgrowth(cgdata, 1), times = 10) # Unit: milliseconds # expr min lq mean median uq max neval # fgrowth(cgdata, 1) 11.58627 13.81528 18.05776 14.03489 22.34279 31.15811 10 ``` ## References Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). "Patterns of Structural Change in Developing Countries". In J. Weiss & M. Tribe (Eds.), *Routledge Handbook of Industry and Development* (pp. 65-83). Routledge.
Cochrane, D. & Orcutt, G. H. (1949). "Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms". *Journal of the American Statistical Association* 44 (245): 32–61. Prais, S. J. & Winsten, C. B. (1954). "Trend Estimators and Serial Correlation". *Cowles Commission Discussion Paper No. 383*. Chicago.