vignettes/articles/efficient-melting-unmelting.Rmd
The tutorial vignette("when-to-use-rrapply") describes the use of rrapply() when applied directly to data formatted as a nested list. If there is no specific reason to keep the data in the form of a nested list, it is often more practical to transform the nested list into a more manageable rectangular format and execute any further data processing steps on the unnested object (e.g. a data.frame).
For this purpose, rrapply() includes the option how = "melt", which returns a melted data.frame of the nested list similar in format to the "list"-method of reshape2::melt(). The rows of the melted data.frame contain the individual node paths of the elements in the nested list after pruning (based on the condition or classes arguments). The "value" column is a list-column with the values of the terminal nodes, analogous to the flattened list returned by how = "flatten"; see also vignette("when-to-use-rrapply").
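To make the melted format concrete, the following is a minimal base R sketch that produces the same kind of layout for a small toy list. The helpers collect_leaves() and melt_sketch() are hypothetical names introduced only for this illustration; they are not part of rrapply and make no claim about how rrapply() is implemented. The sketch assumes that every list element is named.

```r
# Hypothetical helper (not part of rrapply): collect every terminal node of a
# named nested list together with the names on the path leading to it.
collect_leaves <- function(x, path = character(0)) {
  if (!is.list(x)) {
    return(list(list(path = path, value = x)))
  }
  children <- Map(function(child, nm) collect_leaves(child, c(path, nm)), x, names(x))
  unname(do.call(c, children))
}

# Hypothetical helper: arrange the collected paths as columns L1, L2, ...
# and store the terminal values in a "value" list-column.
melt_sketch <- function(x) {
  leaves <- collect_leaves(x)
  depth <- max(vapply(leaves, function(l) length(l$path), integer(1)))
  # Pad shorter paths with NA so that every row has the same number of columns
  paths <- do.call(rbind, lapply(leaves, function(l) {
    c(l$path, rep(NA_character_, depth - length(l$path)))
  }))
  out <- as.data.frame(paths, stringsAsFactors = FALSE)
  names(out) <- paste0("L", seq_len(depth))
  out$value <- lapply(leaves, function(l) l$value)
  out
}

# Toy input: one terminal node at depth 1 and two terminal nodes at depth 2
toy <- list(a = 1, b = list(c = 2, d = "x"))
melted <- melt_sketch(toy)
melted
```

Each row corresponds to one terminal node. Shallower paths are padded with NA, which is why a list-column is a natural choice for "value": terminal nodes may hold values of different types.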
In comparison to reshape2::melt(), rrapply() provides additional flexibility to filter or transform specific list elements before melting a nested list, e.g. using the f, classes or condition arguments. More importantly, rrapply() is optimized specifically for nested lists, whereas reshape2::melt() was mainly aimed at melting data.frames before it was superseded by tidyr::gather() and the more recent tidyr::pivot_longer(). For this reason, reshape2::melt() tends to be quite slow when applied to larger nested lists, as demonstrated below.
For illustration purposes, we use the same dataset renewable_energy_by_country as in vignette("when-to-use-rrapply"), a nested list containing the shares of renewable energy as a percentage of the total energy consumption per country in 2016.
First, let us convert the nested list to a melted data.frame:
system.time(
  renewable_energy_melted <- rrapply(renewable_energy_by_country, how = "melt")
)
#>    user  system elapsed
#>   0.001   0.000   0.002

head(renewable_energy_melted, 10)
#>       L1     L2                 L3             L4
#> 1  World Africa    Northern Africa        Algeria
#> 2  World Africa    Northern Africa          Egypt
#> 3  World Africa    Northern Africa          Libya
#> 4  World Africa    Northern Africa        Morocco
#> 5  World Africa    Northern Africa          Sudan
#> 6  World Africa    Northern Africa        Tunisia
#> 7  World Africa    Northern Africa Western Sahara
#> 8  World Africa Sub-Saharan Africa Eastern Africa
#> 9  World Africa Sub-Saharan Africa Eastern Africa
#> 10 World Africa Sub-Saharan Africa Eastern Africa
#>                                L5 value
#> 1                            <NA>  0.08
#> 2                            <NA>  5.69
#> 3                            <NA>  1.64
#> 4                            <NA> 11.02
#> 5                            <NA> 61.64
#> 6                            <NA> 12.47
#> 7                            <NA>    NA
#> 8  British Indian Ocean Territory    NA
#> 9                         Burundi 89.22
#> 10                        Comoros 41.92
As data processing and reshaping for data.frames is familiar R territory, any subsequent processing tasks are likely more straightforward to execute using the melted data.frame than using the original nested list. For instance, we can easily filter all Western European countries with base R’s subset() (or analogous dplyr or data.table commands, among others):
renewable_energy_melted_west_eu <- subset(renewable_energy_melted, L3 == "Western Europe")
renewable_energy_melted_west_eu
#>        L1     L2             L3            L4   L5 value
#> 212 World Europe Western Europe       Austria <NA> 34.67
#> 213 World Europe Western Europe       Belgium <NA>  9.14
#> 214 World Europe Western Europe        France <NA> 14.74
#> 215 World Europe Western Europe       Germany <NA> 14.17
#> 216 World Europe Western Europe Liechtenstein <NA> 62.93
#> 217 World Europe Western Europe    Luxembourg <NA> 13.54
#> 218 World Europe Western Europe        Monaco <NA>    NA
#> 219 World Europe Western Europe   Netherlands <NA>  5.78
#> 220 World Europe Western Europe   Switzerland <NA> 25.49
For completeness, note that a similar result can also be obtained directly with rrapply() using the .xparents argument:
rrapply(
  renewable_energy_by_country,
  condition = function(x, .xparents) "Western Europe" %in% .xparents,
  how = "melt"
)
#>      L1     L2             L3            L4 value
#> 1 World Europe Western Europe       Austria 34.67
#> 2 World Europe Western Europe       Belgium  9.14
#> 3 World Europe Western Europe        France 14.74
#> 4 World Europe Western Europe       Germany 14.17
#> 5 World Europe Western Europe Liechtenstein 62.93
#> 6 World Europe Western Europe    Luxembourg 13.54
#> 7 World Europe Western Europe        Monaco    NA
#> 8 World Europe Western Europe   Netherlands  5.78
#> 9 World Europe Western Europe   Switzerland 25.49
Here, the L5 column is no longer present, as the pruned list does not contain any elements at this depth before melting to a data.frame.
The .xparents argument is currently only available in the development version of rrapply, installed with devtools::install_github("JorisChau/rrapply").
Next, let us obtain the same result with reshape2::melt():
system.time(
  renewable_energy_melted_reshape2 <- reshape2::melt(renewable_energy_by_country)
)
#>    user  system elapsed
#>   0.133   0.007   0.140

head(renewable_energy_melted_reshape2, 10)
#>    value             L4                             L5                 L3
#> 1   0.08        Algeria                           <NA>    Northern Africa
#> 2   5.69          Egypt                           <NA>    Northern Africa
#> 3   1.64          Libya                           <NA>    Northern Africa
#> 4  11.02        Morocco                           <NA>    Northern Africa
#> 5  61.64          Sudan                           <NA>    Northern Africa
#> 6  12.47        Tunisia                           <NA>    Northern Africa
#> 7     NA Western Sahara                           <NA>    Northern Africa
#> 8     NA Eastern Africa British Indian Ocean Territory Sub-Saharan Africa
#> 9  89.22 Eastern Africa                        Burundi Sub-Saharan Africa
#> 10 41.92 Eastern Africa                        Comoros Sub-Saharan Africa
#>        L2    L1
#> 1  Africa World
#> 2  Africa World
#> 3  Africa World
#> 4  Africa World
#> 5  Africa World
#> 6  Africa World
#> 7  Africa World
#> 8  Africa World
#> 9  Africa World
#> 10 Africa World
rrapply() orders the columns differently by default and produces a "value" list-column instead of reshape2::melt()’s numeric "value" column, but other than that the data contained in the two melted data.frames is identical. For a medium-sized list as used here, the computational speed of reshape2::melt() is acceptable for practical usage. However, its computational efficiency quickly decreases when melting larger or more deeply nested lists:
## generate a large shallow list with 3 layers and a total of 10^6 elements
ls_shallow <- rrapply(
  setNames(replicate(100L, NA, simplify = FALSE), as.character(1:100)),
  condition = function(x, .xpos) length(.xpos) < 3,
  f = function(x, .xpos) setNames(
    replicate(100L, 1L, simplify = FALSE),
    paste(paste(.xpos, collapse = "."), as.character(1:100), sep = ".")
  ),
  feverywhere = "recurse"
)

## benchmark timing with rrapply how = "melt"
system.time(ls_shallow_melted <- rrapply(ls_shallow, how = "melt"))
#>    user  system elapsed
#>   0.485   0.020   0.506

head(ls_shallow_melted)
#>   L1  L2    L3 value
#> 1  1 1.1 1.1.1     1
#> 2  1 1.1 1.1.2     1
#> 3  1 1.1 1.1.3     1
#> 4  1 1.1 1.1.4     1
#> 5  1 1.1 1.1.5     1
#> 6  1 1.1 1.1.6     1

## benchmark timing with reshape2::melt
system.time(ls_shallow_melted_reshape2 <- reshape2::melt(ls_shallow))
#>    user  system elapsed
#> 191.471   0.047 191.537

head(ls_shallow_melted_reshape2)
#>   value    L3  L2 L1
#> 1     1 1.1.1 1.1  1
#> 2     1 1.1.2 1.1  1
#> 3     1 1.1.3 1.1  1
#> 4     1 1.1.4 1.1  1
#> 5     1 1.1.5 1.1  1
#> 6     1 1.1.6 1.1  1
## generate a deeply nested list with 18 layers and a total of 2^18 elements
ls_deep <- rrapply(
  list("1" = NA, "2" = NA),
  condition = function(x, .xpos) length(.xpos) < 18,
  f = function(x, .xpos) setNames(
    list(1L, 2L),
    paste(paste(.xpos, collapse = "."), c("1", "2"), sep = ".")
  ),
  feverywhere = "recurse"
)

## benchmark timing with rrapply how = "melt"
system.time(ls_deep_melted <- rrapply(ls_deep, how = "melt"))
#>    user  system elapsed
#>   0.380   0.012   0.393

## benchmark timing with reshape2::melt
system.time(ls_deep_melted_reshape2 <- reshape2::melt(ls_deep))
#>    user  system elapsed
#> 167.753   0.072 167.841
Although such large or deeply nested lists are unlikely to be encountered in practice, these artificial examples illustrate that reshape2::melt() is not well suited to converting large nested lists to melted data.frames.
The option how = "unmelt" is currently only available in the development version of rrapply, installed with devtools::install_github("JorisChau/rrapply").
Let’s return to the example of the melted data.frame containing only the renewable energy shares of Western European countries:
renewable_energy_melted_west_eu
#>        L1     L2             L3            L4   L5 value
#> 212 World Europe Western Europe       Austria <NA> 34.67
#> 213 World Europe Western Europe       Belgium <NA>  9.14
#> 214 World Europe Western Europe        France <NA> 14.74
#> 215 World Europe Western Europe       Germany <NA> 14.17
#> 216 World Europe Western Europe Liechtenstein <NA> 62.93
#> 217 World Europe Western Europe    Luxembourg <NA> 13.54
#> 218 World Europe Western Europe        Monaco <NA>    NA
#> 219 World Europe Western Europe   Netherlands <NA>  5.78
#> 220 World Europe Western Europe   Switzerland <NA> 25.49
Suppose that this data.frame needs to be converted back to a nested list object, for instance to write it to a JSON or XML object. Writing a recursive function to restore the nested list can prove to be quite a time-consuming task. Base R’s unlist() function has an inverse function relist(), but relist() requires a skeleton nested list to repopulate, and such a skeleton is not available in the current context. In particular, since we have filtered entries from the melted data.frame, we can no longer use the original list as a template object without filtering the same nodes from the original list as well.
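The dependence of relist() on a skeleton can be seen in a small base R example. The toy list and the variable names below are made up for illustration only:

```r
# A small nested list acts as the skeleton (template)
skeleton <- list(a = 1, b = list(c = 2, d = 3))

# unlist() flattens the values into a named vector; the nesting itself is lost
flat <- unlist(skeleton)

# relist() rebuilds the nesting only because the skeleton is supplied
restored <- relist(flat * 10, skeleton)
str(restored)

# Without a skeleton there is nothing to repopulate and relist() signals an error
no_skeleton <- tryCatch(relist(flat), error = function(e) conditionMessage(e))
no_skeleton
```

Once entries have been filtered from the flattened data, the number of values no longer matches the skeleton, so the template would have to be pruned in the same way before relist() could be used.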
For this purpose, rrapply() includes the additional option how = "unmelt", which performs the inverse operation of how = "melt". No skeleton object is needed in this case, only a data.frame in the format returned by how = "melt". To illustrate, we can convert the melted data.frame above to a nested list as follows:
renewable_energy_west_eu_unmelted <- rrapply(renewable_energy_melted_west_eu, how = "unmelt")
str(renewable_energy_west_eu_unmelted, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Europe:List of 1
#>   .. ..$ Western Europe:List of 9
#>   .. .. ..$ Austria      : num 34.7
#>   .. .. ..$ Belgium      : num 9.14
#>   .. .. ..$ France       : num 14.7
#>   .. .. ..$ Germany      : num 14.2
#>   .. .. ..$ Liechtenstein: num 62.9
#>   .. .. ..$ Luxembourg   : num 13.5
#>   .. .. ..$ Monaco       : logi NA
#>   .. .. ..$ Netherlands  : num 5.78
#>   .. .. ..$ Switzerland  : num 25.5
Remark 1: how = "unmelt" is based on a greedy approach that parses data.frame rows as list elements starting from the top of the data.frame. That is, rrapply() continues collecting child nodes as long as the parent node name remains unchanged. If, for instance, we wish to create two separate nodes (on the same level) with the name "Western Europe", these nodes should not be listed directly after one another in the melted data.frame, as rrapply() will group all children under a single "Western Europe" list element.
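The grouping rule in Remark 1 can be imitated in a few lines of base R. The function unmelt_sketch() below is a hypothetical helper written only for this illustration, under the assumption that the greedy rule amounts to grouping runs of consecutive rows with the same node name; it is not rrapply's implementation and ignores edge cases such as duplicated terminal names.

```r
# Hypothetical helper (not part of rrapply): rebuild a nested list from a
# melted data.frame by grouping *consecutive* rows that share a node name.
unmelt_sketch <- function(df) {
  runs <- rle(as.character(df[[1L]]))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  out <- vector("list", length(runs$values))
  names(out) <- runs$values
  for (i in seq_along(starts)) {
    rows <- df[starts[i]:ends[i], , drop = FALSE]
    # A node is terminal when no deeper name column holds a name
    if (ncol(rows) == 2L || all(is.na(rows[[2L]]))) {
      out[[i]] <- rows[[ncol(rows)]][[1L]]
    } else {
      out[[i]] <- unmelt_sketch(rows[, -1L, drop = FALSE])
    }
  }
  out
}

# Toy melted data.frame: the two "A" runs are separated by a "B" row
melted_toy <- data.frame(
  L1 = c("A", "A", "B", "A"),
  L2 = c("x", "y", "x", "z"),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
nested_toy <- unmelt_sketch(melted_toy)
str(nested_toy)
```

In this toy input the first two "A" rows are merged into one node, while the last "A" row becomes a separate node because a "B" row sits in between. Sorting the rows by L1 first would instead collect all three children under a single "A" node.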
Remark 2: Internally, how = "unmelt" reconstructs a nested list from the melted data.frame and subsequently follows the same conceptual framework as how = "replace". Any other function arguments, such as f and condition, should therefore be used in exactly the same way as one would use them for how = "replace" applied to a nested list object.
Remark 3: how = "unmelt" does not (currently) restore the attributes of intermediate list nodes and is therefore not an exact inverse of how = "melt". The reverse composition, unmelting a melted data.frame and then melting the result again, always reproduces the original melted data.frame:
renewable_energy_unmelted <- rrapply(renewable_energy_melted, how = "unmelt")
renewable_energy_remelted <- rrapply(renewable_energy_unmelted, how = "melt")
all.equal(renewable_energy_melted, renewable_energy_remelted)
#> [1] TRUE
In terms of computational effort, rrapply()’s how = "unmelt" can be as efficient as, or more efficient than, base R’s relist(), even though there is no template list object to repopulate. This is illustrated using the large list objects generated above:
## deeply nested list with 18 layers and a total of 2^18 elements
## benchmark timing with rrapply how = "unmelt"
system.time(ls_deep_unmelted <- rrapply(ls_deep_melted, how = "unmelt"))
#>    user  system elapsed
#>   0.168   0.000   0.168
all.equal(ls_deep_unmelted, ls_deep)
#> [1] TRUE

## benchmark timing with relist
ls_deep_unlist <- unlist(as.relistable(ls_deep))
system.time(ls_deep_relist <- relist(ls_deep_unlist))
#>    user  system elapsed
#>  24.626   0.088  24.716

## large shallow list with 3 layers and a total of 10^6 elements
## benchmark timing with rrapply how = "unmelt"
system.time(ls_shallow_unmelted <- rrapply(ls_shallow_melted, how = "unmelt"))
#>    user  system elapsed
#>   0.112   0.000   0.111
all.equal(ls_shallow_unmelted, ls_shallow)
#> [1] TRUE

## benchmark timing with relist
ls_shallow_unlist <- unlist(as.relistable(ls_shallow))
system.time(ls_shallow_relist <- relist(ls_shallow_unlist))
#>    user  system elapsed
#>   7.259   0.024   7.283