Efficient list unnesting

The tutorial vignette("1-when-to-use-rrapply") describes the use of rrapply() when applied directly to data formatted as a nested list. If there is no specific reason to keep the data in the form of a nested list, it is often more practical to transform the nested list into a more manageable rectangular format and execute any further data processing steps on the unnested object (e.g. a data.frame). For this purpose, rrapply() includes the options how = "melt" and how = "bind" to unnest a nested list to either a long or a wide data.frame. The following sections explain these two options in more detail and provide several examples of their usage.

Unnest to long data.frame with how = "melt"

The option how = "melt" unnests a nested list to a long or melted data.frame similar in format to reshape2::melt() applied to a nested list. The rows of the melted data.frame contain the individual node paths of the elements in the nested list after pruning (based on the condition or classes arguments). The "value" column is a vector- or list-column with the values of the terminal nodes identical to the result returned by how = "flatten", see also vignette("1-when-to-use-rrapply").
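As a minimal sketch of the melted format, consider a small hypothetical list toy (not part of the package data). Each terminal node should become one row, with the node path spread over the columns L1, L2, ... and shorter paths padded with NA:

```r
library(rrapply)

## small hypothetical nested list, used only for illustration
toy <- list(a = list(b = 1, c = 2), d = 3)

## one row per terminal node: node path in L1, L2, ... and node value in "value"
toy_melt <- rrapply(toy, how = "melt")
toy_melt

## the values correspond to the terminal nodes returned by how = "flatten"
toy_flat <- rrapply(toy, how = "flatten")
all.equal(unname(unlist(toy_melt$value)), unname(unlist(toy_flat)))
```

The node d sits at depth one, so its L2 entry is expected to be NA, analogous to the L5 column in the renewable energy example.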

In comparison to reshape2::melt(), rrapply() provides additional flexibility to filter or transform specific list elements before melting a nested list using e.g. the f, classes or condition arguments. More importantly, rrapply() is optimized specifically for nested lists, whereas reshape2::melt() was mainly aimed at melting data.frames before it was superseded by tidyr::gather() and the more recent tidyr::pivot_longer(). For this reason, reshape2::melt() tends to be quite slow when applied to larger nested lists.
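The filtering and transforming mentioned above can be sketched on a small hypothetical list toy_na (an illustration only, not package data). Here the condition argument drops missing terminal nodes and f rescales the remaining ones before the list is melted:

```r
library(rrapply)

## hypothetical nested list with one missing terminal node
toy_na <- list(a = list(b = 1, c = NA), d = 3)

toy_na_melt <- rrapply(
  toy_na,
  condition = Negate(is.na),   ## keep only non-missing terminal nodes
  f = function(x) x * 10,      ## transform the nodes that are kept
  how = "melt"
)
toy_na_melt
```

Only the nodes a.b and d pass the condition, so the melted data.frame is expected to contain two rows with the transformed values.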

For illustration purposes, we use the same dataset renewable_energy_by_country as in vignette("1-when-to-use-rrapply"), a nested list containing the per country shares of renewable energy as a percentage in the total energy consumption in 2016.

library(rrapply)
data("renewable_energy_by_country")

First, let us convert the nested list to a melted data.frame:

system.time(
  renewable_energy_melt <- rrapply(renewable_energy_by_country, how = "melt")
)
#>    user  system elapsed 
#>   0.002   0.001   0.002

head(renewable_energy_melt, 10)
#>       L1     L2                 L3             L4
#> 1  World Africa    Northern Africa        Algeria
#> 2  World Africa    Northern Africa          Egypt
#> 3  World Africa    Northern Africa          Libya
#> 4  World Africa    Northern Africa        Morocco
#> 5  World Africa    Northern Africa          Sudan
#> 6  World Africa    Northern Africa        Tunisia
#> 7  World Africa    Northern Africa Western Sahara
#> 8  World Africa Sub-Saharan Africa Eastern Africa
#> 9  World Africa Sub-Saharan Africa Eastern Africa
#> 10 World Africa Sub-Saharan Africa Eastern Africa
#>                                L5 value
#> 1                            <NA>  0.08
#> 2                            <NA>  5.69
#> 3                            <NA>  1.64
#> 4                            <NA> 11.02
#> 5                            <NA> 61.64
#> 6                            <NA> 12.47
#> 7                            <NA>    NA
#> 8  British Indian Ocean Territory    NA
#> 9                         Burundi 89.22
#> 10                        Comoros 41.92

As data processing and reshaping for data.frames is familiar R territory, any subsequent processing tasks are likely more straightforward to execute using the melted data.frame than using the original nested list. For instance, we can easily filter all Western European countries with subset() (or with e.g. dplyr or data.table):

renewable_energy_melt_west_eu <- subset(renewable_energy_melt, L3 == "Western Europe")

renewable_energy_melt_west_eu
#>        L1     L2             L3            L4   L5 value
#> 212 World Europe Western Europe       Austria <NA> 34.67
#> 213 World Europe Western Europe       Belgium <NA>  9.14
#> 214 World Europe Western Europe        France <NA> 14.74
#> 215 World Europe Western Europe       Germany <NA> 14.17
#> 216 World Europe Western Europe Liechtenstein <NA> 62.93
#> 217 World Europe Western Europe    Luxembourg <NA> 13.54
#> 218 World Europe Western Europe        Monaco <NA>    NA
#> 219 World Europe Western Europe   Netherlands <NA>  5.78
#> 220 World Europe Western Europe   Switzerland <NA> 25.49

For completeness, note that a similar result can also be obtained directly with rrapply() using the .xparents argument:

rrapply(
  renewable_energy_by_country, 
  condition = function(x, .xparents) "Western Europe" %in% .xparents,
  how = "melt"
)
#>      L1     L2             L3            L4 value
#> 1 World Europe Western Europe       Austria 34.67
#> 2 World Europe Western Europe       Belgium  9.14
#> 3 World Europe Western Europe        France 14.74
#> 4 World Europe Western Europe       Germany 14.17
#> 5 World Europe Western Europe Liechtenstein 62.93
#> 6 World Europe Western Europe    Luxembourg 13.54
#> 7 World Europe Western Europe        Monaco    NA
#> 8 World Europe Western Europe   Netherlands  5.78
#> 9 World Europe Western Europe   Switzerland 25.49

Here, the L5 column is no longer present as the pruned list does not contain any elements at this depth before melting to a long data.frame.

Now let us obtain the same result with reshape2::melt():

system.time(
  renewable_energy_melt_reshape2 <- reshape2::melt(renewable_energy_by_country)
)
#>    user  system elapsed 
#>   0.133   0.011   0.145

head(renewable_energy_melt_reshape2, 10)
#>    value             L4                             L5                 L3
#> 1   0.08        Algeria                           <NA>    Northern Africa
#> 2   5.69          Egypt                           <NA>    Northern Africa
#> 3   1.64          Libya                           <NA>    Northern Africa
#> 4  11.02        Morocco                           <NA>    Northern Africa
#> 5  61.64          Sudan                           <NA>    Northern Africa
#> 6  12.47        Tunisia                           <NA>    Northern Africa
#> 7     NA Western Sahara                           <NA>    Northern Africa
#> 8     NA Eastern Africa British Indian Ocean Territory Sub-Saharan Africa
#> 9  89.22 Eastern Africa                        Burundi Sub-Saharan Africa
#> 10 41.92 Eastern Africa                        Comoros Sub-Saharan Africa
#>        L2    L1
#> 1  Africa World
#> 2  Africa World
#> 3  Africa World
#> 4  Africa World
#> 5  Africa World
#> 6  Africa World
#> 7  Africa World
#> 8  Africa World
#> 9  Africa World
#> 10 Africa World

rrapply() orders the columns differently by default and the "value" column might follow slightly different coercion rules, but other than that the data contained in the melted data.frame is identical. For a medium-sized list as used here, the computational speed of reshape2::melt() is acceptable for practical usage. However, its computational efficiency quickly decreases when melting larger or more deeply nested lists:

## helper function to generate named nested list
new_list <- function(n, dmax, d = 0, name = NULL) {
  x <- vector(mode = "list", length = n)
  names(x) <- if(!is.null(name)) paste(name, seq_len(n), sep = ".") else seq_len(n)
  for(i in seq_len(n)) {
    if(d + 1 < dmax) {
      x[[i]] <- Recall(n, dmax, d + 1, paste(c(name, i), collapse = "."))
    } else {
      x[[i]] <- 1L
    }
  }
  return(x)
}

## generate a large shallow list with 3 layers and a total of 10^6 elements
shallow_list <- new_list(n = 100, dmax = 3)
str(shallow_list, list.len = 2)
#> List of 100
#>  $ 1  :List of 100
#>   ..$ 1.1  :List of 100
#>   .. ..$ 1.1.1  : int 1
#>   .. ..$ 1.1.2  : int 1
#>   .. .. [list output truncated]
#>   ..$ 1.2  :List of 100
#>   .. ..$ 1.2.1  : int 1
#>   .. ..$ 1.2.2  : int 1
#>   .. .. [list output truncated]
#>   .. [list output truncated]
#>  $ 2  :List of 100
#>   ..$ 2.1  :List of 100
#>   .. ..$ 2.1.1  : int 1
#>   .. ..$ 2.1.2  : int 1
#>   .. .. [list output truncated]
#>   ..$ 2.2  :List of 100
#>   .. ..$ 2.2.1  : int 1
#>   .. ..$ 2.2.2  : int 1
#>   .. .. [list output truncated]
#>   .. [list output truncated]
#>   [list output truncated]

## benchmark timing with rrapply
system.time(shallow_melt <- rrapply(shallow_list, how = "melt")) 
#>    user  system elapsed 
#>   1.147   0.040   1.187
head(shallow_melt)
#>   L1  L2    L3 value
#> 1  1 1.1 1.1.1     1
#> 2  1 1.1 1.1.2     1
#> 3  1 1.1 1.1.3     1
#> 4  1 1.1 1.1.4     1
#> 5  1 1.1 1.1.5     1
#> 6  1 1.1 1.1.6     1

## benchmark timing with reshape2::melt
system.time(shallow_melt_reshape2 <- reshape2::melt(shallow_list))
#>    user  system elapsed 
#> 160.264   0.048 160.342
head(shallow_melt_reshape2)
#>   value    L3  L2 L1
#> 1     1 1.1.1 1.1  1
#> 2     1 1.1.2 1.1  1
#> 3     1 1.1.3 1.1  1
#> 4     1 1.1.4 1.1  1
#> 5     1 1.1.5 1.1  1
#> 6     1 1.1.6 1.1  1

## generate a deeply nested list with 18 layers and a total of 2^18 elements
deep_list <- new_list(n = 2, dmax = 18)
str(deep_list, max.level = 3)
#> List of 2
#>  $ 1:List of 2
#>   ..$ 1.1:List of 2
#>   .. ..$ 1.1.1:List of 2
#>   .. ..$ 1.1.2:List of 2
#>   ..$ 1.2:List of 2
#>   .. ..$ 1.2.1:List of 2
#>   .. ..$ 1.2.2:List of 2
#>  $ 2:List of 2
#>   ..$ 2.1:List of 2
#>   .. ..$ 2.1.1:List of 2
#>   .. ..$ 2.1.2:List of 2
#>   ..$ 2.2:List of 2
#>   .. ..$ 2.2.1:List of 2
#>   .. ..$ 2.2.2:List of 2

## benchmark timing with rrapply
system.time(deep_melt <- rrapply(deep_list, how = "melt")) 
#>    user  system elapsed 
#>   0.313   0.048   0.361

## benchmark timing with reshape2::melt
system.time(deep_melt_reshape2 <- reshape2::melt(deep_list))
#>    user  system elapsed 
#> 131.846   0.092 131.957

Although one is unlikely to encounter such large or deeply nested lists in practice, these artificial examples illustrate that reshape2::melt() is not well suited to converting large nested lists to melted data.frames.

Unnest to wide data.frame with how = "bind"

The option how = "bind" unnests a nested list to a wide data.frame and is only useful to unnest nested lists containing repeated observations of the same variables. Consider for instance the pokedex dataset as in vignette("1-when-to-use-rrapply"), a nested list containing up to 17 property variables for each of the 151 original Pokémon.

data("pokedex")

## all 151 Pokemon entries
str(pokedex, list.len = 3)
#> List of 1
#>  $ pokemon:List of 151
#>   ..$ :List of 16
#>   .. ..$ id            : int 1
#>   .. ..$ num           : chr "001"
#>   .. ..$ name          : chr "Bulbasaur"
#>   .. .. [list output truncated]
#>   ..$ :List of 17
#>   .. ..$ id            : int 2
#>   .. ..$ num           : chr "002"
#>   .. ..$ name          : chr "Ivysaur"
#>   .. .. [list output truncated]
#>   ..$ :List of 15
#>   .. ..$ id            : int 3
#>   .. ..$ num           : chr "003"
#>   .. ..$ name          : chr "Venusaur"
#>   .. .. [list output truncated]
#>   .. [list output truncated]

## single Pokemon entry
str(pokedex[["pokemon"]][[1]])
#> List of 16
#>  $ id            : int 1
#>  $ num           : chr "001"
#>  $ name          : chr "Bulbasaur"
#>  $ img           : chr "http://www.serebii.net/pokemongo/pokemon/001.png"
#>  $ type          : chr [1:2] "Grass" "Poison"
#>  $ height        : chr "0.71 m"
#>  $ weight        : chr "6.9 kg"
#>  $ candy         : chr "Bulbasaur Candy"
#>  $ candy_count   : int 25
#>  $ egg           : chr "2 km"
#>  $ spawn_chance  : num 0.69
#>  $ avg_spawns    : int 69
#>  $ spawn_time    : chr "20:00"
#>  $ multipliers   : num 1.58
#>  $ weaknesses    : chr [1:4] "Fire" "Ice" "Flying" "Psychic"
#>  $ next_evolution:List of 2
#>   ..$ :List of 2
#>   .. ..$ num : chr "002"
#>   .. ..$ name: chr "Ivysaur"
#>   ..$ :List of 2
#>   .. ..$ num : chr "003"
#>   .. ..$ name: chr "Venusaur"

Setting how = "bind" unnests the pokedex list to a wide data.frame with a single row entry for each Pokémon:

pokedex_wide <- rrapply(pokedex, how = "bind")

## display few data.frame columns
head(pokedex_wide[, c(1:3, 5:10)], n = 5)
#>   pokemon.id pokemon.num pokemon.name  pokemon.type pokemon.height
#> 1          1         001    Bulbasaur Grass, Poison         0.71 m
#> 2          2         002      Ivysaur Grass, Poison         0.99 m
#> 3          3         003     Venusaur Grass, Poison         2.01 m
#> 4          4         004   Charmander          Fire         0.61 m
#> 5          5         005   Charmeleon          Fire         1.09 m
#>   pokemon.weight    pokemon.candy pokemon.candy_count pokemon.egg
#> 1         6.9 kg  Bulbasaur Candy                  25        2 km
#> 2        13.0 kg  Bulbasaur Candy                 100 Not in Eggs
#> 3       100.0 kg  Bulbasaur Candy                  NA Not in Eggs
#> 4         8.5 kg Charmander Candy                  25        2 km
#> 5        19.0 kg Charmander Candy                 100 Not in Eggs

## display all data.frame columns
str(pokedex_wide, max.level = 1, vec.len = 1)
#> 'data.frame':    151 obs. of  25 variables:
#>  $ pokemon.id                   : int  1 2 ...
#>  $ pokemon.num                  : chr  "001" ...
#>  $ pokemon.name                 : chr  "Bulbasaur" ...
#>  $ pokemon.img                  : chr  "http://www.serebii.net/pokemongo/pokemon/001.png" ...
#>  $ pokemon.type                 :List of 151
#>  $ pokemon.height               : chr  "0.71 m" ...
#>  $ pokemon.weight               : chr  "6.9 kg" ...
#>  $ pokemon.candy                : chr  "Bulbasaur Candy" ...
#>  $ pokemon.candy_count          : int  25 100 ...
#>  $ pokemon.egg                  : chr  "2 km" ...
#>  $ pokemon.spawn_chance         : num  0.69 0.042 ...
#>  $ pokemon.avg_spawns           : num  69 4.2 ...
#>  $ pokemon.spawn_time           : chr  "20:00" ...
#>  $ pokemon.multipliers          :List of 151
#>  $ pokemon.weaknesses           :List of 151
#>  $ pokemon.next_evolution.1.num : chr  "002" ...
#>  $ pokemon.next_evolution.1.name: chr  "Ivysaur" ...
#>  $ pokemon.next_evolution.2.num : chr  "003" ...
#>  $ pokemon.next_evolution.2.name: chr  "Venusaur" ...
#>  $ pokemon.prev_evolution.1.num : chr  NA ...
#>  $ pokemon.prev_evolution.1.name: chr  NA ...
#>  $ pokemon.prev_evolution.2.num : chr  NA ...
#>  $ pokemon.prev_evolution.2.name: chr  NA ...
#>  $ pokemon.next_evolution.3.num : chr  NA ...
#>  $ pokemon.next_evolution.3.name: chr  NA ...

Each Pokémon sublist has been flattened to a single wide row in the data.frame. The 151 rows are stacked and aligned by matching variable names, with missing variables replaced by NAs (similar to data.table::rbindlist(..., fill = TRUE)). Note that any nested variables, such as next_evolution and prev_evolution, are unnested as individual data.frame columns similar to repeated application of tidyr::unnest_wider() to a data.frame with nested list-columns.
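To make the alignment by variable name concrete, here is a minimal sketch on a hypothetical list obs of three repeated observations with partly missing and partly nested variables (the list and its variable names are made up for illustration):

```r
library(rrapply)

## hypothetical repeated observations; not every variable is present in each one
obs <- list(
  list(id = 1L, name = "a", info = list(x = 1, y = 2)),
  list(id = 2L, name = "b"),
  list(id = 3L, info = list(x = 3))
)

## one row per observation; nested variables become separate columns
obs_wide <- rrapply(obs, how = "bind")
str(obs_wide)
```

The nested variable info is expected to be unnested into the columns info.x and info.y, and variables missing from an observation (such as name in the third one) should be filled with NA.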

Remark: The discovery of repeated observations (e.g. individual Pokémon) is based on a depth-first search of the nested list. Conceptually, the nested list is traversed to detect any multi-element sublists without names or with the same name repeated for each list element. When such a sublist is detected, each of its elements is flattened and row-bound into a wide data.frame.

Comparison to common alternatives

Several common alternatives for unnesting lists containing repeated observations, some of which were alluded to in the previous section, include data.table::rbindlist(), dplyr::bind_rows(), and tidyr’s dedicated rectangling functions unnest_longer(), unnest_wider() and hoist().

The first two functions are most useful for lists of repeated data.frames (or lists) containing no further levels of nesting (e.g. data.frames without list-columns):

## bind_rows() works fine with simple lists 
pokedex_wide_dplyr <- dplyr::bind_rows(lapply(pokedex[["pokemon"]], `[`, 1:4))
head(pokedex_wide_dplyr)
#> # A tibble: 6 x 4
#>      id num   name       img                                             
#>   <int> <chr> <chr>      <chr>                                           
#> 1     1 001   Bulbasaur  http://www.serebii.net/pokemongo/pokemon/001.png
#> 2     2 002   Ivysaur    http://www.serebii.net/pokemongo/pokemon/002.png
#> 3     3 003   Venusaur   http://www.serebii.net/pokemongo/pokemon/003.png
#> 4     4 004   Charmander http://www.serebii.net/pokemongo/pokemon/004.png
#> 5     5 005   Charmeleon http://www.serebii.net/pokemongo/pokemon/005.png
#> 6     6 006   Charizard  http://www.serebii.net/pokemongo/pokemon/006.png

## but does not work well with list-columns
tryCatch(dplyr::bind_rows(lapply(pokedex[["pokemon"]], `[`, 1:5)), error = function(err) err$message)  
#> [1] "Internal error in `vec_assign()`: `value` should have been recycled to fit `x`."

## rbindlist() works fine with simple lists 
pokedex_wide_dt <- data.table::rbindlist(lapply(pokedex[["pokemon"]], `[`, 1:4))  
head(pokedex_wide_dt)
#>    id num       name                                              img
#> 1:  1 001  Bulbasaur http://www.serebii.net/pokemongo/pokemon/001.png
#> 2:  2 002    Ivysaur http://www.serebii.net/pokemongo/pokemon/002.png
#> 3:  3 003   Venusaur http://www.serebii.net/pokemongo/pokemon/003.png
#> 4:  4 004 Charmander http://www.serebii.net/pokemongo/pokemon/004.png
#> 5:  5 005 Charmeleon http://www.serebii.net/pokemongo/pokemon/005.png
#> 6:  6 006  Charizard http://www.serebii.net/pokemongo/pokemon/006.png

## but is not ideal for nested variables
pokedex_wide_dt2 <- data.table::rbindlist(lapply(pokedex[["pokemon"]], `[`, c("name", "prev_evolution")), fill = TRUE)
#> Warning in data.table::rbindlist(lapply(pokedex[["pokemon"]], `[`, c("name", :
#> Column 2 ['NA'] of item 1 is length 0. This (and 78 others like it) has been
#> filled with NA (NULL for list columns) to make each item uniform.
head(pokedex_wide_dt2)
#>          name NA prev_evolution
#> 1:  Bulbasaur NA               
#> 2:    Ivysaur NA      <list[2]>
#> 3:   Venusaur NA      <list[2]>
#> 4:   Venusaur NA      <list[2]>
#> 5: Charmander NA               
#> 6: Charmeleon NA      <list[2]>

The rectangling functions in the tidyr package offer a lot more flexibility. A result similar to rrapply(pokedex, how = "bind") can be obtained with:

library(tidyr)
library(tibble)

pokedex_wide_tidyr <- as_tibble(pokedex) %>%
  unnest_wider(pokemon) %>%
  unnest_wider(next_evolution, names_sep = ".") %>%
  unnest_wider(prev_evolution, names_sep = ".") %>%
  unnest_wider(next_evolution.1, names_sep = ".") %>%
  unnest_wider(next_evolution.2, names_sep = ".") %>%
  unnest_wider(next_evolution.3, names_sep = ".") %>%
  unnest_wider(prev_evolution.1, names_sep = ".") %>%
  unnest_wider(prev_evolution.2, names_sep = ".") 

pokedex_wide_tidyr
#> # A tibble: 151 x 25
#>       id num   name  img   type  height weight candy candy_count egg  
#>    <int> <chr> <chr> <chr> <lis> <chr>  <chr>  <chr>       <int> <chr>
#>  1     1 001   Bulb… http… <chr… 0.71 m 6.9 kg Bulb…          25 2 km 
#>  2     2 002   Ivys… http… <chr… 0.99 m 13.0 … Bulb…         100 Not …
#>  3     3 003   Venu… http… <chr… 2.01 m 100.0… Bulb…          NA Not …
#>  4     4 004   Char… http… <chr… 0.61 m 8.5 kg Char…          25 2 km 
#>  5     5 005   Char… http… <chr… 1.09 m 19.0 … Char…         100 Not …
#>  6     6 006   Char… http… <chr… 1.70 m 90.5 … Char…          NA Not …
#>  7     7 007   Squi… http… <chr… 0.51 m 9.0 kg Squi…          25 2 km 
#>  8     8 008   Wart… http… <chr… 0.99 m 22.5 … Squi…         100 Not …
#>  9     9 009   Blas… http… <chr… 1.60 m 85.5 … Squi…          NA Not …
#> 10    10 010   Cate… http… <chr… 0.30 m 2.9 kg Cate…          12 2 km 
#> # … with 141 more rows, and 15 more variables: spawn_chance <dbl>,
#> #   avg_spawns <dbl>, spawn_time <chr>, multipliers <list>, weaknesses <list>,
#> #   next_evolution.1.num <chr>, next_evolution.1.name <chr>,
#> #   next_evolution.2.num <chr>, next_evolution.2.name <chr>,
#> #   next_evolution.3.num <chr>, next_evolution.3.name <chr>,
#> #   prev_evolution.1.num <chr>, prev_evolution.1.name <chr>,
#> #   prev_evolution.2.num <chr>, prev_evolution.2.name <chr>

The option how = "bind" in rrapply() is arguably less flexible as it always expands the nested list to a data.frame that is as wide as possible. On the other hand, the increased flexibility and interpretability in tidyr’s rectangling functions come at the cost of computational efficiency, which can quickly become a bottleneck when unnesting large nested lists:

## replicate the original pokedex 1000 times
pokedex_large <- list(pokemon = do.call(c, replicate(1E3, pokedex[["pokemon"]], simplify = FALSE)))

system.time({
  rrapply(pokedex_large, how = "bind")
})
#>    user  system elapsed 
#>   1.440   0.016   1.456

## unnest only first layer of next_evolution and prev_evolution
system.time({
  as_tibble(pokedex_large) %>%
    unnest_wider(pokemon) %>%
    unnest_wider(next_evolution, names_sep = ".") %>%
    unnest_wider(prev_evolution, names_sep = ".") 
})
#>    user  system elapsed 
#> 162.144   0.316 162.461

Note that the chained calls to unnest_wider() above only unnest the first layer of the list-columns next_evolution and prev_evolution, and not any of the resulting children list-columns, which would only further increase computation time.
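Although how = "bind" always expands the selected nodes as wide as possible, the set of columns can still be restricted up front with the condition argument, in the same way as in the gh_repos example further below. A minimal sketch on a hypothetical list obs_sel (made up for illustration):

```r
library(rrapply)

## hypothetical repeated observations with a nested variable
obs_sel <- list(
  list(id = 1L, name = "a", info = list(x = 1, y = 2)),
  list(id = 2L, name = "b", info = list(x = 3, y = 4))
)

## keep only the terminal nodes named "id" or "name" before binding
obs_sel_wide <- rrapply(
  obs_sel,
  condition = function(x, .xname) .xname %in% c("id", "name"),
  how = "bind"
)
obs_sel_wide
```

The nested info nodes do not pass the condition, so the resulting data.frame is expected to contain only the id and name columns.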

Remark: The following approach to simultaneously unnest the nested list and post-process the resulting data.frame columns is not particularly efficient:

## Combine Pokemon evolution names in single call
system.time({
  pokedex_large_evolutions <- rrapply(
    pokedex_large,
    classes = c("character", "list"),
    condition = function(x, .xparents) any(grepl("name|evolution", .xparents)),
    f = function(x) if(is.list(x)) sapply(x, `[[`, "name") else x,
    how = "bind"
  )
})
#>    user  system elapsed 
#>  15.002   0.008  15.010

head(pokedex_large_evolutions, n = 9)
#>   pokemon.name pokemon.next_evolution pokemon.prev_evolution
#> 1    Bulbasaur      Ivysaur, Venusaur                     NA
#> 2      Ivysaur               Venusaur              Bulbasaur
#> 3     Venusaur                     NA     Bulbasaur, Ivysaur
#> 4   Charmander  Charmeleon, Charizard                     NA
#> 5   Charmeleon              Charizard             Charmander
#> 6    Charizard                     NA Charmander, Charmeleon
#> 7     Squirtle   Wartortle, Blastoise                     NA
#> 8    Wartortle              Blastoise               Squirtle
#> 9    Blastoise                     NA    Squirtle, Wartortle

The reason is that the condition and f functions have to be applied to each node in the nested list individually and cannot be vectorized in any way, which makes this call very time-consuming. It is likely more efficient to post-process the data.frame outside the call to rrapply(), e.g. using data.table:

library(data.table, quietly = TRUE)

## Combine Pokemon evolution names in multiple calls
system.time({
  pokedex_large_evolutions2 <- as.data.table(rrapply(pokedex_large, how = "bind"))
  pokedex_large_evolutions2[, pokemon.next_evolution := as.list(transpose(.SD)), .SDcols = patterns("next_evolution.*name")]
  pokedex_large_evolutions2[, pokemon.prev_evolution := as.list(transpose(.SD)), .SDcols = patterns("prev_evolution.*name")]
})
#>    user  system elapsed 
#>   3.488   0.000   3.487

head(pokedex_large_evolutions2[, .(pokemon.name, pokemon.next_evolution, pokemon.prev_evolution)])
#>    pokemon.name  pokemon.next_evolution pokemon.prev_evolution
#> 1:    Bulbasaur     Ivysaur,Venusaur,NA                  NA,NA
#> 2:      Ivysaur          Venusaur,NA,NA           Bulbasaur,NA
#> 3:     Venusaur                NA,NA,NA      Bulbasaur,Ivysaur
#> 4:   Charmander Charmeleon,Charizard,NA                  NA,NA
#> 5:   Charmeleon         Charizard,NA,NA          Charmander,NA
#> 6:    Charizard                NA,NA,NA  Charmander,Charmeleon
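The same idea of unnesting first and post-processing afterwards can also be sketched in base R. The example below uses a small hypothetical list mons (made up for illustration) and assumes that the nested evo names end up in columns whose names contain "evo" and end in "name":

```r
library(rrapply)

## hypothetical repeated observations with a nested list of evolutions
mons <- list(
  list(name = "A", evo = list(list(name = "B"), list(name = "C"))),
  list(name = "B", evo = list(list(name = "C"))),
  list(name = "C")
)

## step 1: unnest to a wide data.frame without any f or condition functions
mons_wide <- rrapply(mons, how = "bind")

## step 2: post-process on the data.frame columns,
## collapsing the non-missing evolution names of each row into one string
evo_cols <- grep("evo.*name$", names(mons_wide), value = TRUE)
mons_wide$evo_names <- unname(apply(
  mons_wide[evo_cols], 1,
  function(r) paste(r[!is.na(r)], collapse = ", ")
))

mons_wide[, c("name", "evo_names")]
```

In contrast to the data.table version above, the missing entries are dropped here before the names are combined, so an observation without evolutions yields an empty string.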

Additional examples

We conclude this section by replicating some of the examples in the tidyr vignette https://tidyr.tidyverse.org/articles/rectangle.html. The nested list datasets are publicly available in the repurrrsive package.

library(repurrrsive)

## github users dataset
str(gh_users, list.len = 3)
#> List of 6
#>  $ :List of 30
#>   ..$ login              : chr "gaborcsardi"
#>   ..$ id                 : int 660288
#>   ..$ avatar_url         : chr "https://avatars.githubusercontent.com/u/660288?v=3"
#>   .. [list output truncated]
#>  $ :List of 30
#>   ..$ login              : chr "jennybc"
#>   ..$ id                 : int 599454
#>   ..$ avatar_url         : chr "https://avatars.githubusercontent.com/u/599454?v=3"
#>   .. [list output truncated]
#>  $ :List of 30
#>   ..$ login              : chr "jtleek"
#>   ..$ id                 : int 1571674
#>   ..$ avatar_url         : chr "https://avatars.githubusercontent.com/u/1571674?v=3"
#>   .. [list output truncated]
#>   [list output truncated]

## return all variables in result
gh_users_wide <- rrapply(gh_users, how = "bind")

head(gh_users_wide[, c("login", "followers", "url", "html_url")])
#>         login followers                                      url
#> 1 gaborcsardi       303 https://api.github.com/users/gaborcsardi
#> 2     jennybc       780     https://api.github.com/users/jennybc
#> 3      jtleek      3958      https://api.github.com/users/jtleek
#> 4  juliasilge       115  https://api.github.com/users/juliasilge
#> 5      leeper       213      https://api.github.com/users/leeper
#> 6    masalmon        34    https://api.github.com/users/masalmon
#>                         html_url
#> 1 https://github.com/gaborcsardi
#> 2     https://github.com/jennybc
#> 3      https://github.com/jtleek
#> 4  https://github.com/juliasilge
#> 5      https://github.com/leeper
#> 6    https://github.com/masalmon

## github repos dataset
str(gh_repos, list.len = 2)
#> List of 6
#>  $ :List of 30
#>   ..$ :List of 68
#>   .. ..$ id               : int 61160198
#>   .. ..$ name             : chr "after"
#>   .. .. [list output truncated]
#>   ..$ :List of 68
#>   .. ..$ id               : int 40500181
#>   .. ..$ name             : chr "argufy"
#>   .. .. [list output truncated]
#>   .. [list output truncated]
#>  $ :List of 30
#>   ..$ :List of 68
#>   .. ..$ id               : int 14756210
#>   .. ..$ name             : chr "2013-11_sfu"
#>   .. .. [list output truncated]
#>   ..$ :List of 68
#>   .. ..$ id               : int 14152301
#>   .. ..$ name             : chr "2014-01-27-miami"
#>   .. .. [list output truncated]
#>   .. [list output truncated]
#>   [list output truncated]

## return only a subset of variables
gh_repos_wide <- rrapply(
  unlist(gh_repos, recursive = FALSE),
  condition = function(x, .xname) .xname %in% c("login", "name", "homepage", "watchers_count"),
  how = "bind"
)

head(gh_repos_wide)
#>          name owner.login homepage watchers_count
#> 1       after gaborcsardi     NULL              5
#> 2      argufy gaborcsardi     NULL             19
#> 3         ask gaborcsardi     NULL              5
#> 4 baseimports gaborcsardi     NULL              0
#> 5      citest gaborcsardi     NULL              0
#> 6  clisymbols gaborcsardi                      18

## GoT characters dataset
str(got_chars, list.len = 3)
#> List of 30
#>  $ :List of 18
#>   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1022"
#>   ..$ id         : int 1022
#>   ..$ name       : chr "Theon Greyjoy"
#>   .. [list output truncated]
#>  $ :List of 18
#>   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1052"
#>   ..$ id         : int 1052
#>   ..$ name       : chr "Tyrion Lannister"
#>   .. [list output truncated]
#>  $ :List of 18
#>   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1074"
#>   ..$ id         : int 1074
#>   ..$ name       : chr "Victarion Greyjoy"
#>   .. [list output truncated]
#>   [list output truncated]

## return only list-columns
got_chars_wide <- rrapply(got_chars, how = "bind")
got_chars_wide <- rrapply(got_chars_wide, classes = "list", how = "prune")

head(as_tibble(got_chars_wide))
#> # A tibble: 6 x 7
#>   titles    aliases    allegiances books     povBooks  tvSeries  playedBy 
#>   <list>    <list>     <list>      <list>    <list>    <list>    <list>   
#> 1 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#> 2 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#> 3 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#> 4 <chr [1]> <chr [1]>  <lgl [1]>   <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#> 5 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#> 6 <chr [1]> <chr [1]>  <lgl [1]>   <chr [2]> <chr [1]> <chr [1]> <chr [1]>

Efficient unmelting of melted data.frames

Let’s return to the example of the melted long data.frame in the first section containing only the renewable energy shares of Western European countries:

renewable_energy_melt_west_eu
#>        L1     L2             L3            L4   L5 value
#> 212 World Europe Western Europe       Austria <NA> 34.67
#> 213 World Europe Western Europe       Belgium <NA>  9.14
#> 214 World Europe Western Europe        France <NA> 14.74
#> 215 World Europe Western Europe       Germany <NA> 14.17
#> 216 World Europe Western Europe Liechtenstein <NA> 62.93
#> 217 World Europe Western Europe    Luxembourg <NA> 13.54
#> 218 World Europe Western Europe        Monaco <NA>    NA
#> 219 World Europe Western Europe   Netherlands <NA>  5.78
#> 220 World Europe Western Europe   Switzerland <NA> 25.49

Suppose that this data.frame needs to be converted back to a nested list object in order to write it to a JSON or XML object, or perhaps for some tree visualization purpose. Writing a recursive function to restore the nested list can prove to be quite a time-consuming and error-prone task. Base R’s unlist() function has an inverse function relist(), but relist() requires a skeleton nested list to repopulate, and such a skeleton is clearly unavailable in the current context. In particular, since we have filtered entries from the melted data.frame, we can no longer use the original list as a template object without filtering nodes from the original list as well.

For this purpose, rrapply() includes the additional option how = "unmelt" that performs the inverse operation of how = "melt". No skeleton object is needed in this case, only an ordinary data.frame in the format returned by how = "melt". To illustrate, we can convert the melted data.frame above to a nested list as follows:

renewable_energy_west_eu_unmelt <- rrapply(renewable_energy_melt_west_eu, how = "unmelt")
str(renewable_energy_west_eu_unmelt, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Europe:List of 1
#>   .. ..$ Western Europe:List of 9
#>   .. .. ..$ Austria      : num 34.7
#>   .. .. ..$ Belgium      : num 9.14
#>   .. .. ..$ France       : num 14.7
#>   .. .. ..$ Germany      : num 14.2
#>   .. .. ..$ Liechtenstein: num 62.9
#>   .. .. ..$ Luxembourg   : num 13.5
#>   .. .. ..$ Monaco       : num NA
#>   .. .. ..$ Netherlands  : num 5.78
#>   .. .. ..$ Switzerland  : num 25.5
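Since only the melted format matters, the input does not have to originate from a call with how = "melt". As a minimal sketch, the hypothetical data.frame paths below is written by hand in that format (node paths in the columns L1, L2 and node values in the last column) and then unmelted:

```r
library(rrapply)

## hand-written data.frame in the melted format (hypothetical example data)
paths <- data.frame(
  L1 = c("a", "a", "d"),
  L2 = c("b", "c", NA),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

## rebuild the nested list from the node paths and values
nested <- rrapply(paths, how = "unmelt")
str(nested)
```

The rows with L1 equal to "a" are expected to be collected under a single list element a with children b and c, whereas d becomes a terminal node at the first level.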

Remark 1: how = "unmelt" is based on a greedy approach parsing data.frame rows as list elements starting from the top of the data.frame. That is, rrapply() continues collecting children nodes as long as the parent node name remains unchanged. If, for instance, we wish to create two separate nodes (on the same level) with the name "Western Europe", these nodes should not be listed after one another in the melted data.frame as rrapply() will group all children under a single "Western Europe" list element.
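The row-order dependence described in Remark 1 can be sketched with two hypothetical melted data.frames that contain the same rows in a different order. Based on the behavior described above, adjacent rows sharing the parent name "A" should be grouped under one node, whereas separated rows should produce two nodes named "A":

```r
library(rrapply)

## rows with parent "A" are adjacent
adjacent <- data.frame(
  L1 = c("A", "A", "B"),
  L2 = c("x", "y", "z"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

## rows with parent "A" are separated by a row with parent "B"
separated <- data.frame(
  L1 = c("A", "B", "A"),
  L2 = c("x", "z", "y"),
  value = c(1, 3, 2),
  stringsAsFactors = FALSE
)

adjacent_list <- rrapply(adjacent, how = "unmelt")
separated_list <- rrapply(separated, how = "unmelt")

## compare the names of the top-level nodes
names(adjacent_list)
names(separated_list)
```

Sorting a melted data.frame before unmelting can therefore change the structure of the resulting nested list.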

Remark 2: Internally, how = "unmelt" reconstructs a nested list from the melted data.frame and subsequently follows the same conceptual framework as how = "replace". Any other function arguments, such as f and condition should therefore be used in exactly the same way as one would use them for how = "replace" applied to a nested list object.
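As a minimal sketch of Remark 2, the call below unmelts a hypothetical melted data.frame shares while rescaling its values. Following the how = "replace" semantics, f should only be applied to the nodes that satisfy condition, and the remaining nodes should be kept unchanged:

```r
library(rrapply)

## hypothetical melted data.frame with one missing value
shares <- data.frame(
  L1 = c("a", "a", "d"),
  L2 = c("b", "c", NA),
  value = c(10, NA, 30),
  stringsAsFactors = FALSE
)

shares_list <- rrapply(
  shares,
  condition = function(x) !is.na(x),   ## only non-missing nodes are transformed
  f = function(x) x / 100,             ## rescale percentages to fractions
  how = "unmelt"
)
str(shares_list)
```

The missing node a.c is expected to remain in the list as NA, since how = "replace" does not prune nodes that fail the condition.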

Remark 3: how = "unmelt" does not (currently) restore the attributes of intermediate list nodes and is therefore not an exact inverse of how = "melt". In the other direction, unmelting a melted data.frame and melting the result again always reproduces the original melted data.frame:

renewable_energy_unmelt <- rrapply(renewable_energy_melt, how = "unmelt")
renewable_energy_remelt <- rrapply(renewable_energy_unmelt, how = "melt")

identical(renewable_energy_melt, renewable_energy_remelt)
#> [1] TRUE

In terms of computational effort, rrapply()’s how = "unmelt" can be equally or more efficient than base R’s relist() even though there is no template list object that can be repopulated. This is illustrated using the large list objects generated above:

## deeply nested list with 18 layers and a total of 2^18 elements
## benchmark timing with rrapply how = "unmelt"
system.time(deep_unmelt <- rrapply(deep_melt, how = "unmelt")) 
#>    user  system elapsed 
#>   0.169   0.000   0.170
identical(deep_unmelt, deep_list)
#> [1] TRUE

## benchmark timing with relist
deep_unlist <- unlist(as.relistable(deep_list))
system.time(deep_relist <- relist(deep_unlist))
#>    user  system elapsed 
#>  22.327   0.000  22.328

## large shallow list with 3 layers and a total of 10^6 elements
## benchmark timing with rrapply how = "unmelt"
system.time(shallow_unmelt <- rrapply(shallow_melt, how = "unmelt")) 
#>    user  system elapsed 
#>   0.133   0.000   0.133
identical(shallow_unmelt, shallow_list)
#> [1] TRUE

## benchmark timing with relist
shallow_unlist <- unlist(as.relistable(shallow_list))
system.time(shallow_relist <- relist(shallow_unlist))
#>    user  system elapsed 
#>   7.913   0.000   7.913