When I design
longer_dt
and wider_dt
, I could find the
pivot_longer
and pivot_wider
in
tidyr
and melt
and dcast
in
data.table
. Still, designing this API is not easy, as my
goal is to let users use it with least pain. Here we would try to
reproduce the results in the vignette of tidyr
(https://tidyr.tidyverse.org/articles/pivot.html). First
load the packages:
First inspect the data:
relig_income
#> # A tibble: 18 × 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122
#> 2 Atheist 12 27 37 52 35 70 73
#> 3 Buddhist 27 21 30 34 33 58 62
#> 4 Catholic 418 617 732 670 638 1116 949
#> 5 Don’t k… 15 14 15 11 10 35 21
#> 6 Evangel… 575 869 1064 982 881 1486 949
#> 7 Hindu 1 9 7 9 11 34 47
#> 8 Histori… 228 244 236 238 197 223 131
#> 9 Jehovah… 20 27 24 24 21 30 15
#> 10 Jewish 19 19 25 25 30 95 69
#> 11 Mainlin… 289 495 619 655 651 1107 939
#> 12 Mormon 29 40 48 51 56 112 85
#> 13 Muslim 6 7 9 10 9 23 16
#> 14 Orthodox 13 17 23 32 32 47 38
#> 15 Other C… 9 7 11 13 13 14 18
#> 16 Other F… 20 33 40 46 49 63 46
#> 17 Other W… 5 2 3 4 2 7 3
#> 18 Unaffil… 217 299 374 365 341 528 407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> # `Don't know/refused` <dbl>
In tidyr
, to get the longer format you need:
In tidyfst
, we have:
relig_income %>%
longer_dt("religion",name = "income",value = "count")
#> religion income count
#> <char> <fctr> <num>
#> 1: Agnostic <$10k 27
#> 2: Atheist <$10k 12
#> 3: Buddhist <$10k 27
#> 4: Catholic <$10k 418
#> 5: Don’t know/refused <$10k 15
#> ---
#> 176: Orthodox Don't know/refused 73
#> 177: Other Christian Don't know/refused 18
#> 178: Other Faiths Don't know/refused 71
#> 179: Other World Religions Don't know/refused 8
#> 180: Unaffiliated Don't know/refused 597
Another example from tidyr
:
billboard
# tidyr way:
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
# tidyfst way:
billboard %>%
longer_dt(-"wk",
name = "week",
value = "rank",
na.rm = TRUE
)
# regex could select the groups to keep, and minus could select the reverse
A warning would could come out because the merging column has different data types and do the coercion automatically.
## data
fish_encounters
#> # A tibble: 114 × 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> 7 4842 BCW 1
#> 8 4842 BCE2 1
#> 9 4842 BCW2 1
#> 10 4842 MAE 1
#> # ℹ 104 more rows
## tidyr way:
fish_encounters %>%
pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 × 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA NA
#> 5 4847 1 1 1 NA NA NA NA NA NA NA NA
#> 6 4848 1 1 1 1 NA NA NA NA NA NA NA
#> 7 4849 1 1 NA NA NA NA NA NA NA NA NA
#> 8 4850 1 1 NA 1 1 1 1 NA NA NA NA
#> 9 4851 1 1 NA NA NA NA NA NA NA NA NA
#> 10 4854 1 1 NA NA NA NA NA NA NA NA NA
#> 11 4855 1 1 1 1 1 NA NA NA NA NA NA
#> 12 4857 1 1 1 1 1 1 1 1 1 NA NA
#> 13 4858 1 1 1 1 1 1 1 1 1 1 1
#> 14 4859 1 1 1 1 1 NA NA NA NA NA NA
#> 15 4861 1 1 1 1 1 1 1 1 1 1 1
#> 16 4862 1 1 1 1 1 1 1 1 1 NA NA
#> 17 4863 1 1 NA NA NA NA NA NA NA NA NA
#> 18 4864 1 1 NA NA NA NA NA NA NA NA NA
#> 19 4865 1 1 1 NA NA NA NA NA NA NA NA
## tidyfst way:
fish_encounters %>%
wider_dt(name = "station",value = "seen")
#> Key: <fish>
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2
#> <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1: 4842 1 1 1 1 1 1 1 1 1
#> 2: 4843 1 1 1 1 1 1 1 1 1
#> 3: 4844 1 1 1 1 1 1 1 1 1
#> 4: 4845 1 1 1 1 1 NA NA NA NA
#> 5: 4847 1 1 1 NA NA NA NA NA NA
#> 6: 4848 1 1 1 1 NA NA NA NA NA
#> 7: 4849 1 1 NA NA NA NA NA NA NA
#> 8: 4850 1 1 NA 1 1 1 1 NA NA
#> 9: 4851 1 1 NA NA NA NA NA NA NA
#> 10: 4854 1 1 NA NA NA NA NA NA NA
#> 11: 4855 1 1 1 1 1 NA NA NA NA
#> 12: 4857 1 1 1 1 1 1 1 1 1
#> 13: 4858 1 1 1 1 1 1 1 1 1
#> 14: 4859 1 1 1 1 1 NA NA NA NA
#> 15: 4861 1 1 1 1 1 1 1 1 1
#> 16: 4862 1 1 1 1 1 1 1 1 1
#> 17: 4863 1 1 NA NA NA NA NA NA NA
#> 18: 4864 1 1 NA NA NA NA NA NA NA
#> 19: 4865 1 1 1 NA NA NA NA NA NA
#> 2 variables not shown: [MAE <int>, MAW <int>]
# if no keeped groups are selected, use all except for name and value columns
If you want to fill with 0s, use:
fish_encounters %>%
wider_dt(name = "station",value = "seen",fill = 0)
#> Key: <fish>
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2
#> <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1: 4842 1 1 1 1 1 1 1 1 1
#> 2: 4843 1 1 1 1 1 1 1 1 1
#> 3: 4844 1 1 1 1 1 1 1 1 1
#> 4: 4845 1 1 1 1 1 0 0 0 0
#> 5: 4847 1 1 1 0 0 0 0 0 0
#> 6: 4848 1 1 1 1 0 0 0 0 0
#> 7: 4849 1 1 0 0 0 0 0 0 0
#> 8: 4850 1 1 0 1 1 1 1 0 0
#> 9: 4851 1 1 0 0 0 0 0 0 0
#> 10: 4854 1 1 0 0 0 0 0 0 0
#> 11: 4855 1 1 1 1 1 0 0 0 0
#> 12: 4857 1 1 1 1 1 1 1 1 1
#> 13: 4858 1 1 1 1 1 1 1 1 1
#> 14: 4859 1 1 1 1 1 0 0 0 0
#> 15: 4861 1 1 1 1 1 1 1 1 1
#> 16: 4862 1 1 1 1 1 1 1 1 1
#> 17: 4863 1 1 0 0 0 0 0 0 0
#> 18: 4864 1 1 0 0 0 0 0 0 0
#> 19: 4865 1 1 1 0 0 0 0 0 0
#> 2 variables not shown: [MAE <int>, MAW <int>]
Note that the parameter of name
and value
should always be provided and should be explicit called (with the
parameter names attached).
This example comes from data.table (https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html), and has been used in tidyr too. We’ll try to do it in tidyfst in this example. If we have a data.frame as below:
family <- fread("family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA")
family
#> family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
#> <int> <int> <IDat> <IDat> <IDat> <int>
#> 1: 1 30 1998-11-26 2000-01-29 <NA> 1
#> 2: 2 27 1996-06-22 <NA> <NA> 2
#> 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2
#> 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1
#> 5: 5 29 2000-12-05 2005-02-28 <NA> 2
#> 2 variables not shown: [gender_child2 <int>, gender_child3 <int>]
And want to reshape the data.table to be like this:
# family_id age_mother child dob gender
# <int> <int> <char> <char> <char>
# 1: 1 30 child1 1998-11-26 1
# 2: 1 30 child2 2000-01-29 2
# 3: 1 30 child3 <NA> <NA>
# 4: 2 27 child1 1996-06-22 2
# 5: 2 27 child2 <NA> <NA>
# 6: 2 27 child3 <NA> <NA>
# 7: 3 26 child1 2002-07-11 2
# 8: 3 26 child2 2004-04-05 2
# 9: 3 26 child3 2007-09-02 1
# 10: 4 32 child1 2004-10-10 1
# 11: 4 32 child2 2009-08-27 1
# 12: 4 32 child3 2012-07-21 1
# 13: 5 29 child1 2000-12-05 2
# 14: 5 29 child2 2005-02-28 1
# 15: 5 29 child3 <NA> <NA>
The data.table::dcast
and
tidyr::pivot_longer
could transfer it in one step, however,
not so easy to understand. Here we’ll do it step by step to see what
actually happens in this transfer.
family %>%
longer_dt(1:2) %>%
separate_dt("name",into = c("class","child")) %>%
wider_dt(-"class|value",
name = "class",
value = "value")
#> Key: <family_id, age_mother, child>
#> family_id age_mother child dob gender
#> <int> <int> <char> <IDat> <IDat>
#> 1: 1 30 child1 1998-11-26 1970-01-02
#> 2: 1 30 child2 2000-01-29 1970-01-03
#> 3: 1 30 child3 <NA> <NA>
#> 4: 2 27 child1 1996-06-22 1970-01-03
#> 5: 2 27 child2 <NA> <NA>
#> 6: 2 27 child3 <NA> <NA>
#> 7: 3 26 child1 2002-07-11 1970-01-03
#> 8: 3 26 child2 2004-04-05 1970-01-03
#> 9: 3 26 child3 2007-09-02 1970-01-02
#> 10: 4 32 child1 2004-10-10 1970-01-02
#> 11: 4 32 child2 2009-08-27 1970-01-02
#> 12: 4 32 child3 2012-07-21 1970-01-02
#> 13: 5 29 child1 2000-12-05 1970-01-03
#> 14: 5 29 child2 2005-02-28 1970-01-02
#> 15: 5 29 child3 <NA> <NA>
In such a process, we could find that we actually get a longer table,
then separate it, and wider it later. tidyfst is not going to
support the complicated transfer in one step, because it might be easier
to implement, but much harder to understand 3 procedures in 1 step. If
you still prefer that way, use data.table::dcast
and
tidyr::pivot_longer
instead.