Efficient list melting

The tutorial vignette("when-to-use-rrapply") describes the use of rrapply() when applied directly to data formatted as a nested list. If there is no specific reason to keep the data in the form of a nested list, it is often more practical to transform the nested list into a more manageable rectangular format and execute any further data processing steps on the unnested object (e.g. a data.frame).

For this purpose, rrapply() includes the option how = "melt", which returns a melted data.frame of the nested list similar in format to the "list"-method of reshape2::melt(). The rows of the melted data.frame contain the individual node paths of the elements in the nested list after pruning (based on the condition or classes arguments). The "value" column is a list-column with the values of the terminal nodes analogous to the flattened list returned by how = "flatten", see also vignette("when-to-use-rrapply").

In comparison to reshape2::melt(), rrapply() provides additional flexibility to filter or transform specific list elements before melting a nested list using e.g. the f, classes or condition arguments. More importantly, rrapply() is optimized specifically for nested lists, whereas reshape2::melt() was mainly aimed at melting data.frames before it was superseded by tidyr::gather() and the more recent tidyr::pivot_longer(). For this reason, reshape2::melt() tends to be quite slow when applied to larger nested lists as demonstrated below.
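As a minimal sketch of filtering and transforming before melting, the example below melts a small hypothetical list (`toy` and its values are made up for illustration and are not part of the package data). It assumes only the `condition`, `f` and `how` arguments described above:

```r
library(rrapply)

# Hypothetical toy list: two numeric shares (in percent) and one text node
toy <- list(
  a = list(b = 20, c = "no data"),
  d = 50
)

# Keep only the numeric leaves, convert percentages to fractions, then melt
toy_melted <- rrapply(
  toy,
  condition = function(x) is.numeric(x),
  f = function(x) x / 100,
  how = "melt"
)

toy_melted
```

The text node `c` fails the condition and is pruned, so the melted data.frame should contain one row per remaining numeric leaf, with `NA` padding in `L2` for the shallower node `d`.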

For illustration purposes, we use the same dataset renewable_energy_by_country as in vignette("when-to-use-rrapply"), a nested list containing the shares of renewable energy as a percentage in the total energy consumption per country in 2016.

library(rrapply)
data("renewable_energy_by_country")

First, let us convert the nested list to a melted data.frame:

system.time(
  renewable_energy_melted <- rrapply(renewable_energy_by_country, how = "melt")
)
#>    user  system elapsed 
#>   0.001   0.000   0.002

head(renewable_energy_melted, 10)
#>       L1     L2                 L3             L4
#> 1  World Africa    Northern Africa        Algeria
#> 2  World Africa    Northern Africa          Egypt
#> 3  World Africa    Northern Africa          Libya
#> 4  World Africa    Northern Africa        Morocco
#> 5  World Africa    Northern Africa          Sudan
#> 6  World Africa    Northern Africa        Tunisia
#> 7  World Africa    Northern Africa Western Sahara
#> 8  World Africa Sub-Saharan Africa Eastern Africa
#> 9  World Africa Sub-Saharan Africa Eastern Africa
#> 10 World Africa Sub-Saharan Africa Eastern Africa
#>                                L5 value
#> 1                            <NA>  0.08
#> 2                            <NA>  5.69
#> 3                            <NA>  1.64
#> 4                            <NA> 11.02
#> 5                            <NA> 61.64
#> 6                            <NA> 12.47
#> 7                            <NA>    NA
#> 8  British Indian Ocean Territory    NA
#> 9                         Burundi 89.22
#> 10                        Comoros 41.92

As data processing and reshaping for data.frames is familiar R territory, any subsequent processing tasks are likely more straightforward to execute using the melted data.frame than using the original nested list. For instance, we can easily filter all Western European countries with base R’s subset() (or analogous dplyr or data.table commands among others):

renewable_energy_melted_west_eu <- subset(renewable_energy_melted, L3 == "Western Europe")

renewable_energy_melted_west_eu
#>        L1     L2             L3            L4   L5 value
#> 212 World Europe Western Europe       Austria <NA> 34.67
#> 213 World Europe Western Europe       Belgium <NA>  9.14
#> 214 World Europe Western Europe        France <NA> 14.74
#> 215 World Europe Western Europe       Germany <NA> 14.17
#> 216 World Europe Western Europe Liechtenstein <NA> 62.93
#> 217 World Europe Western Europe    Luxembourg <NA> 13.54
#> 218 World Europe Western Europe        Monaco <NA>    NA
#> 219 World Europe Western Europe   Netherlands <NA>  5.78
#> 220 World Europe Western Europe   Switzerland <NA> 25.49
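Other routine data.frame operations follow the same pattern. The sketch below uses base R only and a hand-copied handful of the rows printed above (the object names `energy`, `region_means` and `above_5` are illustrative). It first flattens the "value" list-column, which is only appropriate here because every element is a single number or NA:

```r
# Hand-copied subset of the melted rows shown above (L3, L4 and value only)
energy <- data.frame(
  L3 = c(rep("Northern Africa", 3), rep("Western Europe", 3)),
  L4 = c("Algeria", "Egypt", "Libya", "Austria", "Belgium", "Monaco"),
  stringsAsFactors = FALSE
)

# Mimic the "value" list-column of the melted data.frame
energy$value <- list(0.08, 5.69, 1.64, 34.67, 9.14, NA)

# Flatten the list-column to a numeric vector (all elements have length one)
energy$value <- unlist(energy$value)

# Mean renewable energy share per region, ignoring missing values
region_means <- tapply(energy$value, energy$L3, mean, na.rm = TRUE)
region_means

# Countries with a share above 5%, largest first
above_5 <- subset(energy, !is.na(value) & value > 5)
above_5[order(-above_5$value), c("L4", "value")]
```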

For completeness, note that a similar result can also be obtained directly with rrapply() using the .xparents argument:

rrapply(
  renewable_energy_by_country,
  condition = function(x, .xparents) "Western Europe" %in% .xparents,
  how = "melt"
)
#>      L1     L2             L3            L4 value
#> 1 World Europe Western Europe       Austria 34.67
#> 2 World Europe Western Europe       Belgium  9.14
#> 3 World Europe Western Europe        France 14.74
#> 4 World Europe Western Europe       Germany 14.17
#> 5 World Europe Western Europe Liechtenstein 62.93
#> 6 World Europe Western Europe    Luxembourg 13.54
#> 7 World Europe Western Europe        Monaco    NA
#> 8 World Europe Western Europe   Netherlands  5.78
#> 9 World Europe Western Europe   Switzerland 25.49

Here, the L5 column is no longer present as the pruned list does not contain any elements at this depth before melting to a data.frame.

The .xparents argument is currently only available in the development version of rrapply installed with devtools::install_github("JorisChau/rrapply").

Now let us produce the same result with reshape2::melt():

system.time(
  renewable_energy_melted_reshape2 <- reshape2::melt(renewable_energy_by_country)
)
#>    user  system elapsed 
#>   0.133   0.007   0.140
head(renewable_energy_melted_reshape2, 10)
#>    value             L4                             L5                 L3
#> 1   0.08        Algeria                           <NA>    Northern Africa
#> 2   5.69          Egypt                           <NA>    Northern Africa
#> 3   1.64          Libya                           <NA>    Northern Africa
#> 4  11.02        Morocco                           <NA>    Northern Africa
#> 5  61.64          Sudan                           <NA>    Northern Africa
#> 6  12.47        Tunisia                           <NA>    Northern Africa
#> 7     NA Western Sahara                           <NA>    Northern Africa
#> 8     NA Eastern Africa British Indian Ocean Territory Sub-Saharan Africa
#> 9  89.22 Eastern Africa                        Burundi Sub-Saharan Africa
#> 10 41.92 Eastern Africa                        Comoros Sub-Saharan Africa
#>        L2    L1
#> 1  Africa World
#> 2  Africa World
#> 3  Africa World
#> 4  Africa World
#> 5  Africa World
#> 6  Africa World
#> 7  Africa World
#> 8  Africa World
#> 9  Africa World
#> 10 Africa World

rrapply() orders the columns differently by default and produces a "value" list-column instead of reshape2::melt()’s numeric "value" column, but otherwise the data contained in the two melted data.frames is identical. For a medium-sized list as used here, the speed of reshape2::melt() is acceptable in practice. However, its performance degrades quickly when melting larger or more deeply nested lists:

## generate a large shallow list with 3 layers and a total of 10^6 elements
ls_shallow <- rrapply(
  setNames(replicate(100L, NA, simplify = FALSE), as.character(1:100)),
  condition = function(x, .xpos) length(.xpos) < 3,
  f = function(x, .xpos) setNames(replicate(100L, 1L, simplify = FALSE),
                                  paste(paste(.xpos, collapse = "."), as.character(1:100), sep = ".")),
  feverywhere = "recurse"
)

## benchmark timing with rrapply how = "melt"
system.time(ls_shallow_melted <- rrapply(ls_shallow, how = "melt"))
#>    user  system elapsed 
#>   0.485   0.020   0.506
head(ls_shallow_melted)
#>   L1  L2    L3 value
#> 1  1 1.1 1.1.1     1
#> 2  1 1.1 1.1.2     1
#> 3  1 1.1 1.1.3     1
#> 4  1 1.1 1.1.4     1
#> 5  1 1.1 1.1.5     1
#> 6  1 1.1 1.1.6     1

## benchmark timing with reshape2::melt
system.time(ls_shallow_melted_reshape2 <- reshape2::melt(ls_shallow))
#>    user  system elapsed 
#> 191.471   0.047 191.537
head(ls_shallow_melted_reshape2)
#>   value    L3  L2 L1
#> 1     1 1.1.1 1.1  1
#> 2     1 1.1.2 1.1  1
#> 3     1 1.1.3 1.1  1
#> 4     1 1.1.4 1.1  1
#> 5     1 1.1.5 1.1  1
#> 6     1 1.1.6 1.1  1

## generate a deeply nested list with 18 layers and a total of 2^18 elements
ls_deep <- rrapply(
  list("1" = NA, "2" = NA),
  condition = function(x, .xpos) length(.xpos) < 18,
  f = function(x, .xpos) setNames(list(1L, 2L),
                                  paste(paste(.xpos, collapse = "."), c("1", "2"), sep = ".")),
  feverywhere = "recurse"
)

## benchmark timing with rrapply how = "melt"
system.time(ls_deep_melted <- rrapply(ls_deep, how = "melt"))
#>    user  system elapsed 
#>   0.380   0.012   0.393

## benchmark timing with reshape2::melt
system.time(ls_deep_melted_reshape2 <- reshape2::melt(ls_deep))
#>    user  system elapsed 
#> 167.753   0.072 167.841

Although such large or deeply nested lists are unlikely to be encountered in practice, these artificial examples illustrate that reshape2::melt() is not well suited to converting large nested lists to melted data.frames.

Efficient unmelting of melted data.frames

The option how = "unmelt" is currently only available in the development version of rrapply installed with devtools::install_github("JorisChau/rrapply").

Let’s return to the example of the melted data.frame containing only the renewable energy shares of Western European countries:

renewable_energy_melted_west_eu
#>        L1     L2             L3            L4   L5 value
#> 212 World Europe Western Europe       Austria <NA> 34.67
#> 213 World Europe Western Europe       Belgium <NA>  9.14
#> 214 World Europe Western Europe        France <NA> 14.74
#> 215 World Europe Western Europe       Germany <NA> 14.17
#> 216 World Europe Western Europe Liechtenstein <NA> 62.93
#> 217 World Europe Western Europe    Luxembourg <NA> 13.54
#> 218 World Europe Western Europe        Monaco <NA>    NA
#> 219 World Europe Western Europe   Netherlands <NA>  5.78
#> 220 World Europe Western Europe   Switzerland <NA> 25.49

Suppose that this data.frame needs to be converted back to a nested list object in order to write it to a JSON- or XML-object, or for some other purpose. Writing a recursive function to restore the nested list can prove to be quite a time-consuming task. Base R’s unlist() function has an inverse function relist(), but relist() requires a skeleton nested list to repopulate, and such a skeleton is clearly unavailable in the current context. In particular, since we have filtered entries from the melted data.frame, we can no longer use the original list as a template object, without filtering nodes from the original list as well.
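To make the skeleton requirement concrete, here is a minimal base R sketch on a small hypothetical list (`nested` is made up for illustration). The round trip through unlist() and relist() only works because the original list is passed back in as the skeleton:

```r
# Hypothetical nested list used only for illustration
nested <- list(a = list(b = 1, c = 2), d = 3)

# unlist() flattens the values and encodes the node paths in the names
flat <- unlist(nested)
names(flat)
#> [1] "a.b" "a.c" "d"

# relist() needs the original list as a skeleton to rebuild the nesting
restored <- relist(flat, skeleton = nested)
all.equal(restored, nested)

# After dropping an entry, the flat vector has fewer elements than the
# skeleton has leaves, so the original list no longer fits as a template
flat_filtered <- flat[names(flat) != "a.c"]
length(flat_filtered) == length(unlist(nested))
```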

For this purpose, rrapply() includes the additional option how = "unmelt" that performs the inverse operation of how = "melt". No skeleton object is needed in this case, only a data.frame in the format returned by how = "melt". To illustrate, we can convert the melted data.frame above to a nested list as follows:

renewable_energy_west_eu_unmelted <- rrapply(renewable_energy_melted_west_eu, how = "unmelt")
str(renewable_energy_west_eu_unmelted, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Europe:List of 1
#>   .. ..$ Western Europe:List of 9
#>   .. .. ..$ Austria      : num 34.7
#>   .. .. ..$ Belgium      : num 9.14
#>   .. .. ..$ France       : num 14.7
#>   .. .. ..$ Germany      : num 14.2
#>   .. .. ..$ Liechtenstein: num 62.9
#>   .. .. ..$ Luxembourg   : num 13.5
#>   .. .. ..$ Monaco       : logi NA
#>   .. .. ..$ Netherlands  : num 5.78
#>   .. .. ..$ Switzerland  : num 25.5

Remark 1: how = "unmelt" uses a greedy approach, parsing data.frame rows as list elements starting from the top of the data.frame. That is, rrapply() continues collecting child nodes as long as the parent node name remains unchanged. If, for instance, we wish to create two separate nodes (on the same level) with the name "Western Europe", the rows of these nodes should not be listed directly after one another in the melted data.frame, as rrapply() will otherwise group all children under a single "Western Europe" list element.
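The effect of row order described in Remark 1 can be sketched with a small hand-built data.frame. The data.frame `melted` below is hypothetical and is assumed to match the layout returned by how = "melt" (character name columns followed by a "value" list-column):

```r
library(rrapply)

# Hypothetical melted data.frame: the two rows for parent "a" are not adjacent
melted <- data.frame(
  L1 = c("a", "b", "a"),
  L2 = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
melted$value <- list(1, 2, 3)

# Non-adjacent rows: following Remark 1, "a" should appear as two separate nodes
separate <- rrapply(melted, how = "unmelt")
names(separate)

# Sorting on L1 makes the rows adjacent, so the children of "a" are grouped
grouped <- rrapply(melted[order(melted$L1), ], how = "unmelt")
names(grouped)
```

Sorting the melted data.frame on its name columns before unmelting is therefore a simple way to merge nodes with the same name, while leaving rows interleaved keeps them apart.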

Remark 2: Internally, how = "unmelt" reconstructs a nested list from the melted data.frame and subsequently follows the same conceptual framework as how = "replace". Any other function arguments, such as f and condition should therefore be used in exactly the same way as one would use them for how = "replace" applied to a nested list object.
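As a small sketch of Remark 2, the example below passes f and condition together with how = "unmelt". The data.frame `melted` is hypothetical, and the behavior described in the comments is what the remark implies, under the assumption that nodes failing the condition are kept unchanged, as with how = "replace":

```r
library(rrapply)

# Hypothetical melted data.frame in the layout returned by how = "melt"
melted <- data.frame(
  L1 = c("a", "b"),
  L2 = c("x", "y"),
  stringsAsFactors = FALSE
)
melted$value <- list(1, 2)

# Double only the values larger than 1 while unmelting; as with
# how = "replace", the remaining values should be left as they are
unmelted <- rrapply(
  melted,
  condition = function(x) x > 1,
  f = function(x) x * 2,
  how = "unmelt"
)

str(unmelted)
```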

Remark 3: how = "unmelt" does not (currently) restore the attributes of intermediate list nodes and is therefore not an exact inverse of how = "melt". In the other direction, unmelting a melted data.frame and then melting the result again always reproduces the original melted data.frame:

renewable_energy_unmelted <- rrapply(renewable_energy_melted, how = "unmelt")
renewable_energy_remelted <- rrapply(renewable_energy_unmelted, how = "melt")

all.equal(renewable_energy_melted, renewable_energy_remelted)
#> [1] TRUE

In terms of computational effort, rrapply()’s how = "unmelt" can be equally or more efficient than base R’s relist() even though there is no template list object that can be repopulated. This is illustrated using the large list objects generated above:

## deeply nested list with 18 layers and a total of 2^18 elements
## benchmark timing with rrapply how = "unmelt"
system.time(ls_deep_unmelted <- rrapply(ls_deep_melted, how = "unmelt"))
#>    user  system elapsed 
#>   0.168   0.000   0.168
all.equal(ls_deep_unmelted, ls_deep)
#> [1] TRUE

## benchmark timing with relist
ls_deep_unlist <- unlist(as.relistable(ls_deep))
system.time(ls_deep_relist <- relist(ls_deep_unlist))
#>    user  system elapsed 
#>  24.626   0.088  24.716

## large shallow list with 3 layers and a total of 10^6 elements
## benchmark timing with rrapply how = "unmelt"
system.time(ls_shallow_unmelted <- rrapply(ls_shallow_melted, how = "unmelt"))
#>    user  system elapsed 
#>   0.112   0.000   0.111
all.equal(ls_shallow_unmelted, ls_shallow)
#> [1] TRUE

## benchmark timing with relist
ls_shallow_unlist <- unlist(as.relistable(ls_shallow))
system.time(ls_shallow_relist <- relist(ls_shallow_unlist))
#>    user  system elapsed 
#>   7.259   0.024   7.283