The package README uses the gapminder dataset to demonstrate nplyr’s functionality. In this case (and other similar cases) the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.
library(nplyr)
library(dplyr)
gm_nest <-
gapminder::gapminder_unfiltered %>%
tidyr::nest(country_data = -continent)
gm_nest
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [578 × 5]>
#> 2 Europe <tibble [1,302 × 5]>
#> 3 Africa <tibble [637 × 5]>
#> 4 Americas <tibble [470 × 5]>
#> 5 FSU <tibble [139 × 5]>
#> 6 Oceania <tibble [187 × 5]>
# we can use nplyr to perform operations on the nested data
gm_nest %>%
nest_filter(country_data, year == max(year)) %>%
nest_mutate(country_data, pop_millions = pop/1000000) %>%
slice_head(n = 1) %>%
tidyr::unnest(country_data)
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # ℹ 33 more rows
# in this case, we could have obtained the same result with tidyr and dplyr
gm_nest %>%
tidyr::unnest(country_data) %>%
group_by(continent) %>%
filter(year == max(year)) %>%
mutate(pop_millions = pop/1000000) %>%
ungroup() %>%
filter(continent == "Asia")
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # ℹ 33 more rows
Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!
A motivating example
Consider a set of surveys that an organization might use to gather market data. It is common for organizations to have separate surveys for separate purposes but gather the same baseline set of data across all surveys (for example, a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:
Survey 1: Job
- How old are you? (multiple choice)
- What city do you live in? (multiple choice)
- What field do you work in? (multiple choice)
- Overall, how satisfied are you with your job? (multiple choice)
- What is your annual salary? (numeric entry)
Survey 2: Personal Life
- How old are you? (multiple choice)
- What city do you live in (multiple choice)
- What field do you work in? (multiple choice)
- Overall, how satisfied are you with your personal life (multiple choice)
- Please provide additional detail (text entry)
In this scenario, both surveys are collecting demographic information
— age, location, and industry — but differ in the questions. A
convenient way to get the response files into the environment would be
to use purrr::map()
to read in each file to a nested data frame.
path <- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"
surveys <-
tibble::tibble(survey_file = c("job_survey", "personal_survey")) %>%
mutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))
surveys
#> # A tibble: 2 × 2
#> survey_file survey_data
#> <chr> <list>
#> 1 job_survey <spc_tbl_ [500 × 6]>
#> 2 personal_survey <spc_tbl_ [750 × 6]>
tidyr::unnest()
can usually handle idiosyncrasies in layout when unnesting but in this
case unnesting throws an error!
surveys %>%
tidyr::unnest(survey_data)
#> Error in `list_unchop()`:
#> ! Can't combine `x[[1]]$Q5` <double> and `x[[2]]$Q5` <character>.
This is because the surveys share column names but not necessarily
column types! In this case, both data frames contain a column named
“Q5”, but in job_survey
it’s a double and in
personal_survey
it’s a character.
surveys %>%
slice(1) %>%
tidyr::unnest(survey_data) %>%
glimpse()
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1 <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2 <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3 <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4 <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5 <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…
surveys %>%
slice(2) %>%
tidyr::unnest(survey_data) %>%
glimpse()
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1 <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2 <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3 <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4 <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5 <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…
We could potentially get around this issue with unnesting by reading
in all columns as characters via
readr::read_csv(x, col_types = cols(.default = "c"))
, but
this presents its own challenges. Q5
would still be better
represented as a double in job_survey
and from the survey
question text Q4
has similar, but distinctly different,
meanings across the survey files.
This is where nplyr comes into play! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.
surveys <-
surveys %>%
nest_mutate(survey_data,
age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
nest_group_by(survey_data, Q3) %>%
nest_add_count(survey_data,
name = "n_respondents_in_industry") %>%
nest_mutate(survey_data,
median_industry_age = median(Q1)) %>%
nest_ungroup(survey_data)
surveys %>%
slice(1) %>%
tidyr::unnest(survey_data)
#> # A tibble: 500 × 10
#> survey_file survey_name Q1 Q2 Q3 Q4 Q5 age_group
#> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 job_survey job 100 Austin Consulting Some… 163 Retireme…
#> 2 job_survey job 81 San Antonio Consulting Neit… 48 Retireme…
#> 3 job_survey job 51 Austin Consulting Extr… 190 Adult
#> 4 job_survey job 81 Austin Technology Extr… 25 Retireme…
#> 5 job_survey job 80 Dallas Consulting Extr… 143 Retireme…
#> 6 job_survey job 32 Fort Worth Energy Some… 233 Adult
#> 7 job_survey job 65 Dallas Consulting Some… 43 Retireme…
#> 8 job_survey job 57 Houston Healthcare Some… 243 Adult
#> 9 job_survey job 43 Dallas Government S… Neit… 158 Adult
#> 10 job_survey job 94 Fort Worth Healthcare Extr… 235 Retireme…
#> # ℹ 490 more rows
#> # ℹ 2 more variables: n_respondents_in_industry <int>,
#> # median_industry_age <dbl>
surveys %>%
slice(2) %>%
tidyr::unnest(survey_data)
#> # A tibble: 750 × 10
#> survey_file survey_name Q1 Q2 Q3 Q4 Q5 age_group
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 personal_survey personal 91 Austin Energy Neit… Blan… Retireme…
#> 2 personal_survey personal 32 San Antonio Healthca… Extr… Elem… Adult
#> 3 personal_survey personal 40 San Antonio Consulti… Some… Eget… Adult
#> 4 personal_survey personal 23 Austin Consulti… Extr… Scel… Adult
#> 5 personal_survey personal 88 Dallas Technolo… Neit… Aene… Retireme…
#> 6 personal_survey personal 69 Fort Worth Technolo… Neit… Inte… Retireme…
#> 7 personal_survey personal 96 Houston Healthca… Extr… Blan… Retireme…
#> 8 personal_survey personal 40 Houston Consulti… Extr… Scel… Adult
#> 9 personal_survey personal 57 Fort Worth Energy Extr… Pede… Adult
#> 10 personal_survey personal 40 Fort Worth Healthca… Extr… Phar… Adult
#> # ℹ 740 more rows
#> # ℹ 2 more variables: n_respondents_in_industry <int>,
#> # median_industry_age <dbl>