List recursion in R

The nested list below outlines the genealogy of several famous mathematicians. Each list element carries an additional "given" attribute holding the mathematician’s given name. The numeric values at the leaf nodes are the total numbers of descendants according to the Mathematics Genealogy Project as of June 2020. If no descendants are recorded, the leaf node holds a missing value (NA).

students <- list(
  Bernoulli = structure(list(
    Bernoulli = structure(list(
      Bernoulli = structure(1, given = "Daniel"),
      Euler = structure(list(
        Euler = structure(NA, given = "Johann"),
        Lagrange = structure(list(
          Fourier = structure(68679, given = "Jean-Baptiste"), 
          Plana = structure(NA, given = "Giovanni"),
          Poisson = structure(113435, given = "Simeon")
        ), given = "Joseph")
      ), given = "Leonhard")
    ), given = "Johann"),
    Bernoulli = structure(NA, given = "Nikolaus")
  ), given = "Jacob")
)

str(students, give.attr = FALSE)
#> List of 1
#>  $ Bernoulli:List of 2
#>   ..$ Bernoulli:List of 2
#>   .. ..$ Bernoulli: num 1
#>   .. ..$ Euler    :List of 2
#>   .. .. ..$ Euler   : logi NA
#>   .. .. ..$ Lagrange:List of 3
#>   .. .. .. ..$ Fourier: num 68679
#>   .. .. .. ..$ Plana  : logi NA
#>   .. .. .. ..$ Poisson: num 113435
#>   ..$ Bernoulli: logi NA

Consider the following exercise in list recursion:

Filter all descendants of ‘Leonhard Euler’ while keeping the original list structure and replace all missing values by zero.

Here is a possible (not so efficient) solution using recursion with the Recall() function:

prune_replace_Euler <- function(x) {
  i <- 1
  while (i <= length(x)) {
    if (identical(names(x)[i], "Euler") && identical(attr(x[[i]], "given"), "Leonhard")) {
      ## subtree of Leonhard Euler: keep it and replace missing values by zero
      x[[i]] <- rapply(x[[i]], f = function(x) replace(x, is.na(x), 0), how = "replace")
      i <- i + 1
    } else {
      if (is.list(x[[i]])) {
        ## recurse into the sublist; a NULL result removes element i from x
        val <- Recall(x[[i]])
        x[[i]] <- val
        i <- i + !is.null(val)
      } else {
        ## drop leaf nodes that are not part of Euler's subtree
        x[[i]] <- NULL
      }
      if (all(sapply(x, is.null))) {
        x <- NULL
      }
    }
  }
  return(x)
}

str(prune_replace_Euler(students), give.attr = FALSE)
#> List of 1
#>  $ Bernoulli:List of 1
#>   ..$ Bernoulli:List of 1
#>   .. ..$ Euler:List of 2
#>   .. .. ..$ Euler   : num 0
#>   .. .. ..$ Lagrange:List of 3
#>   .. .. .. ..$ Fourier: num 68679
#>   .. .. .. ..$ Plana  : num 0
#>   .. .. .. ..$ Poisson: num 113435

This works, but is hardly the kind of code we would like to write for such a seemingly simple data exploration question. Moreover, this code is not very easy to follow or reason about, which makes it time-consuming to update or modify for future tasks.

Another approach would be to unnest the list into a more manageable (e.g. rectangular) format, or to use specialized packages such as igraph or data.tree to make pruning or modifying node entries more straightforward. Note that care must be taken to include the node attributes correctly in the transformed object, as the node names themselves are not unique in the given example. This is a sensible approach for more extensive data analysis, but for simple data exploration we would ideally keep the list in its original format, to reduce the number of processing steps and minimize the possibility of introducing mistakes in the code.

The recursive function above makes use of rapply(), a member of the base-R apply-family of functions, which applies a function recursively to the elements of a nested list and lets us decide how the returned result is structured. If you are not familiar with rapply(), it may be useful to first read the function documentation (help("rapply")) or the first section of the rrapply-package vignette (browseVignettes(package = "rrapply")).
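As a quick refresher, the sketch below runs base rapply() on a small toy list (hypothetical data, not taken from the genealogy example) to contrast the three values of how. The helper name na_to_zero is only chosen for illustration.

```r
## Toy nested list (hypothetical data)
x <- list(a = 1, b = list(c = "text", d = NA_real_))

## f replaces missing values by zero; classes restricts f to numeric leaves
na_to_zero <- function(v) replace(v, is.na(v), 0)

## how = "replace": leaves that do not match classes are kept as they are
replaced <- rapply(x, f = na_to_zero, classes = "numeric", how = "replace")

## how = "list": same shape, but non-matching leaves become deflt (NULL by default)
listed <- rapply(x, f = na_to_zero, classes = "numeric", how = "list")

## how = "unlist": as how = "list", followed by unlist()
unlisted <- rapply(x, f = na_to_zero, classes = "numeric", how = "unlist")

str(replaced)
str(listed)
unlisted
```

Note that the character leaf survives only with how = "replace"; with the other two options it is replaced by the default value and therefore disappears after unlisting.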

Although quite useful, rapply() is not always sufficiently flexible in practice, e.g. for pruning elements of a nested list (as demonstrated above). The rrapply() function is an attempt to enhance and update base rapply() to make it more generally applicable in the context of list recursion. rrapply() builds on the native implementation of base rapply() in R’s C-interface and for this reason requires no external dependencies.

When to use `rrapply()`

For illustration purposes, we will use the datasets renewable_energy_by_country and pokedex included in the rrapply-package. renewable_energy_by_country is a nested list containing the share of renewable energy as a percentage of total energy consumption per country in 2016. The data is publicly available at the United Nations Open SDG Data Hub (UNSD-SDG07). The 249 countries and areas are structured based on their geographical location according to the United Nations M49 standard (UNSD-M49). The numeric values listed for each country are percentages; if no data is available, the value of the country is NA. pokedex is a nested list containing various property values for each of the 151 original Pokémon.

library(rrapply)
data("renewable_energy_by_country")

For convenience, we subset only the values for countries and areas in Oceania from renewable_energy_by_country:

renewable_oceania <- renewable_energy_by_country[["World"]]["Oceania"]
str(renewable_oceania, list.len = 3, give.attr = FALSE)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 6
#>   .. ..$ Australia                        : num 9.32
#>   .. ..$ Christmas Island                 : logi NA
#>   .. ..$ Cocos (Keeling) Islands          : logi NA
#>   .. .. [list output truncated]
#>   ..$ Melanesia                :List of 5
#>   .. ..$ Fiji            : num 24.4
#>   .. ..$ New Caledonia   : num 4.03
#>   .. ..$ Papua New Guinea: num 50.3
#>   .. .. [list output truncated]
#>   ..$ Micronesia               :List of 8
#>   .. ..$ Guam                                : num 3.03
#>   .. ..$ Kiribati                            : num 45.4
#>   .. ..$ Marshall Islands                    : num 11.8
#>   .. .. [list output truncated]
#>   .. [list output truncated]

List pruning and unnesting

With base rapply(), there is no convenient way to prune or filter elements from the input list. The rrapply() function adds an option how = "prune" to prune all list elements not subject to application of the function f from a nested list. The original list structure is retained, similar to the non-pruned versions how = "replace" and how = "list". Using how = "prune", we can for instance drop all NA elements from the list while preserving the original list structure:

## Drop all logical NA's while preserving list structure 
na_drop_oceania <- rrapply(
  renewable_oceania,
  f = identity,
  classes = "numeric",
  how = "prune"
)
str(na_drop_oceania, list.len = 3, give.attr = FALSE)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 2
#>   .. ..$ Australia  : num 9.32
#>   .. ..$ New Zealand: num 32.8
#>   ..$ Melanesia                :List of 5
#>   .. ..$ Fiji            : num 24.4
#>   .. ..$ New Caledonia   : num 4.03
#>   .. ..$ Papua New Guinea: num 50.3
#>   .. .. [list output truncated]
#>   ..$ Micronesia               :List of 7
#>   .. ..$ Guam                            : num 3.03
#>   .. ..$ Kiribati                        : num 45.4
#>   .. ..$ Marshall Islands                : num 11.8
#>   .. .. [list output truncated]
#>   .. [list output truncated]
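To make concrete what pruning means here, the following base-R sketch mimics the effect of how = "prune" on a small toy list. The helper prune_list() and the toy list are hypothetical and written only for illustration; this is not how rrapply() is implemented internally.

```r
## Hypothetical helper: keep only the leaves for which keep() is TRUE and
## drop sublists that end up empty, preserving the remaining nesting
prune_list <- function(x, keep) {
  if (!is.list(x)) {
    return(if (isTRUE(keep(x))) x else NULL)
  }
  out <- lapply(x, prune_list, keep = keep)
  out <- out[!vapply(out, is.null, logical(1L))]
  if (length(out) > 0) out else NULL
}

## Toy list: logical NA marks a missing value, as in renewable_oceania
toy <- list(A = list(a1 = 1.5, a2 = NA), B = list(b1 = NA), C = 3)

pruned <- prune_list(toy, keep = is.numeric)
str(pruned)
```

The sublist B contains only missing values and is removed entirely, while A keeps its position and its single numeric leaf.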

Alternatively, set how = "flatten" to return a flattened unnested version of the pruned list. This is more efficient than first returning the pruned list with how = "prune" and unlisting or flattening the list in a subsequent step.

## Drop all logical NA's and return unnested list
na_drop_oceania2 <- rrapply(
  renewable_oceania,
  f = identity,
  classes = "numeric",
  how = "flatten"
)
head(na_drop_oceania2, n = 10)
#>        Australia      New Zealand             Fiji    New Caledonia 
#>             9.32            32.76            24.36             4.03 
#> Papua New Guinea  Solomon Islands          Vanuatu             Guam 
#>            50.34            65.73            33.67             3.03 
#>         Kiribati Marshall Islands 
#>            45.43            11.75

Or, use how = "melt" to return a melted data.frame of the pruned list similar in format to reshape2::melt() applied to a nested list. The rows of the melted data.frame contain the node paths of the elements in the pruned list. The "value" column contains the values of the terminal nodes analogous to the flattened list returned by how = "flatten".

## Drop all logical NA's and return melted data.frame
na_drop_oceania3 <- rrapply(
  renewable_oceania,
  f = identity,
  classes = "numeric",
  how = "melt"
)
head(na_drop_oceania3)
#>        L1                        L2               L3 value
#> 1 Oceania Australia and New Zealand        Australia  9.32
#> 2 Oceania Australia and New Zealand      New Zealand 32.76
#> 3 Oceania                 Melanesia             Fiji 24.36
#> 4 Oceania                 Melanesia    New Caledonia  4.03
#> 5 Oceania                 Melanesia Papua New Guinea 50.34
#> 6 Oceania                 Melanesia  Solomon Islands 65.73

If no names are present in a sublist of the input list, how = "melt" replaces the names in the melted data.frame by list element indices "1", "2", etc.:

## Remove all names at L2 
## (skip this for now, these arguments are explained in the following sections)
oceania_unnamed <- rrapply(
  renewable_oceania,
  classes = "list",
  condition = function(x, .xname) .xname == "Oceania",
  f = unname
)

## Drop all logical NA's and return melted data.frame
na_drop_oceania4 <- rrapply(
  oceania_unnamed,
  f = identity,
  classes = "numeric",
  how = "melt"
)
head(na_drop_oceania4)
#>        L1 L2               L3 value
#> 1 Oceania  1        Australia  9.32
#> 2 Oceania  1      New Zealand 32.76
#> 3 Oceania  2             Fiji 24.36
#> 4 Oceania  2    New Caledonia  4.03
#> 5 Oceania  2 Papua New Guinea 50.34
#> 6 Oceania  2  Solomon Islands 65.73

Nested lists containing repeated observations, such as pokedex, can be unnested with how = "bind", which expands and binds the repeated observations (i.e. sublists) as individual rows in a wide data.frame. This is similar in format to repeated application of tidyr::unnest_wider() to a nested data.frame.

## Nested list of Pokemon properties in Pokemon GO
data("pokedex")

str(pokedex, list.len = 3)
#> List of 1
#>  $ pokemon:List of 151
#>   ..$ :List of 16
#>   .. ..$ id            : int 1
#>   .. ..$ num           : chr "001"
#>   .. ..$ name          : chr "Bulbasaur"
#>   .. .. [list output truncated]
#>   ..$ :List of 17
#>   .. ..$ id            : int 2
#>   .. ..$ num           : chr "002"
#>   .. ..$ name          : chr "Ivysaur"
#>   .. .. [list output truncated]
#>   ..$ :List of 15
#>   .. ..$ id            : int 3
#>   .. ..$ num           : chr "003"
#>   .. ..$ name          : chr "Venusaur"
#>   .. .. [list output truncated]
#>   .. [list output truncated]

## Unnest list as a wide data.frame
pokedex_wide <- rrapply(pokedex, how = "bind")

head(pokedex_wide[, c(1:3, 5:10)], n = 5)
#>   pokemon.id pokemon.num pokemon.name  pokemon.type pokemon.height
#> 1          1         001    Bulbasaur Grass, Poison         0.71 m
#> 2          2         002      Ivysaur Grass, Poison         0.99 m
#> 3          3         003     Venusaur Grass, Poison         2.01 m
#> 4          4         004   Charmander          Fire         0.61 m
#> 5          5         005   Charmeleon          Fire         1.09 m
#>   pokemon.weight    pokemon.candy pokemon.candy_count pokemon.egg
#> 1         6.9 kg  Bulbasaur Candy                  25        2 km
#> 2        13.0 kg  Bulbasaur Candy                 100 Not in Eggs
#> 3       100.0 kg  Bulbasaur Candy                  NA Not in Eggs
#> 4         8.5 kg Charmander Candy                  25        2 km
#> 5        19.0 kg Charmander Candy                 100 Not in Eggs

Condition function

Base rapply() allows us to apply a function f to list elements of certain types or classes via the classes argument. rrapply() generalizes this concept via the condition argument, which accepts any function of the principal argument to use as a condition or predicate for applying f to a subset of list elements. Conceptually, the f function is applied to all leaf elements for which the condition function evaluates to exactly TRUE, similar to isTRUE(). If the condition argument is missing, f is applied to all leaf elements. In combination with how = "prune", the condition function provides additional flexibility in selecting and filtering elements from a nested list. Using the condition argument, we can rewrite the above function call that drops all NA’s from the list so that it better reflects our purpose:

## drop all NA elements using condition function
na_drop_oceania3 <- rrapply(
  renewable_oceania,
  condition = Negate(is.na),
  f = identity,
  how = "prune"
)
str(na_drop_oceania3, list.len = 3, give.attr = FALSE)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 2
#>   .. ..$ Australia  : num 9.32
#>   .. ..$ New Zealand: num 32.8
#>   ..$ Melanesia                :List of 5
#>   .. ..$ Fiji            : num 24.4
#>   .. ..$ New Caledonia   : num 4.03
#>   .. ..$ Papua New Guinea: num 50.3
#>   .. .. [list output truncated]
#>   ..$ Micronesia               :List of 7
#>   .. ..$ Guam                            : num 3.03
#>   .. ..$ Kiribati                        : num 45.4
#>   .. ..$ Marshall Islands                : num 11.8
#>   .. .. [list output truncated]
#>   .. [list output truncated]

It is more interesting to consider a condition that cannot be expressed through the classes argument. For instance, we can filter all countries with values that satisfy a certain numeric condition:

## filter all countries with values above 85%
renewable_energy_above_85 <- rrapply(
  renewable_energy_by_country, 
  condition = function(x) x > 85, 
  how = "prune"
)
str(renewable_energy_above_85, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Africa:List of 1
#>   .. ..$ Sub-Saharan Africa:List of 3
#>   .. .. ..$ Eastern Africa:List of 7
#>   .. .. .. ..$ Burundi                    : num 89.2
#>   .. .. .. ..$ Ethiopia                   : num 91.9
#>   .. .. .. ..$ Rwanda                     : num 86
#>   .. .. .. ..$ Somalia                    : num 94.7
#>   .. .. .. ..$ Uganda                     : num 88.6
#>   .. .. .. ..$ United Republic of Tanzania: num 86.1
#>   .. .. .. ..$ Zambia                     : num 88.5
#>   .. .. ..$ Middle Africa :List of 2
#>   .. .. .. ..$ Chad                            : num 85.3
#>   .. .. .. ..$ Democratic Republic of the Congo: num 97
#>   .. .. ..$ Western Africa:List of 1
#>   .. .. .. ..$ Guinea-Bissau: num 86.5

## or by passing arguments to condition via ...
renewable_energy_equal_0 <- rrapply(
  renewable_energy_by_country, 
  condition = "==", 
  e2 = 0, 
  how = "prune"
)
str(renewable_energy_equal_0, give.attr = FALSE)
#> List of 1
#>  $ World:List of 4
#>   ..$ Americas:List of 1
#>   .. ..$ Latin America and the Caribbean:List of 1
#>   .. .. ..$ Caribbean:List of 1
#>   .. .. .. ..$ Antigua and Barbuda: num 0
#>   ..$ Asia    :List of 1
#>   .. ..$ Western Asia:List of 4
#>   .. .. ..$ Bahrain: num 0
#>   .. .. ..$ Kuwait : num 0
#>   .. .. ..$ Oman   : num 0
#>   .. .. ..$ Qatar  : num 0
#>   ..$ Europe  :List of 2
#>   .. ..$ Northern Europe:List of 1
#>   .. .. ..$ Channel Islands:List of 1
#>   .. .. .. ..$ Guernsey: num 0
#>   .. ..$ Southern Europe:List of 1
#>   .. .. ..$ Gibraltar: num 0
#>   ..$ Oceania :List of 2
#>   .. ..$ Micronesia:List of 1
#>   .. .. ..$ Northern Mariana Islands: num 0
#>   .. ..$ Polynesia :List of 1
#>   .. .. ..$ Wallis and Futuna Islands: num 0

Remark: Note that the NA elements are not returned, as the condition function does not evaluate to TRUE for NA values. Also, the f argument is allowed to be missing, in which case no function is applied to the leaf elements.
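The remark about NA values follows from the isTRUE()-like evaluation of the condition. The base-R sketch below shows this for a predicate of the same form as the one used above; the name is_above_85 is only chosen for illustration, and the snippet does not call rrapply() itself.

```r
## Predicate of the same form as the condition used above
is_above_85 <- function(x) x > 85

kept        <- isTRUE(is_above_85(97))          # condition is TRUE
dropped     <- isTRUE(is_above_85(12))          # condition is FALSE
dropped_na  <- isTRUE(is_above_85(NA))          # condition is NA, not TRUE
dropped_len <- isTRUE(is_above_85(numeric(0)))  # condition is logical(0), not TRUE

c(kept = kept, dropped = dropped, dropped_na = dropped_na, dropped_len = dropped_len)
```

Only the first case counts as satisfied, which is why missing values never pass a comparison-based condition unless they are handled explicitly.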

As the condition function is a generalization of the classes argument to have more flexible control of the predicate, it is also possible to use the deflt argument together with how = "list" or how = "unlist" to set a default value to all leaf elements for which the condition does not evaluate to TRUE:

## replace all NA elements by zero
na_zero_oceania_list <- rrapply(
  renewable_oceania, 
  condition = Negate(is.na), 
  deflt = 0, 
  how = "list"
)
str(na_zero_oceania_list, list.len = 3, give.attr = FALSE)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 6
#>   .. ..$ Australia                        : num 9.32
#>   .. ..$ Christmas Island                 : num 0
#>   .. ..$ Cocos (Keeling) Islands          : num 0
#>   .. .. [list output truncated]
#>   ..$ Melanesia                :List of 5
#>   .. ..$ Fiji            : num 24.4
#>   .. ..$ New Caledonia   : num 4.03
#>   .. ..$ Papua New Guinea: num 50.3
#>   .. .. [list output truncated]
#>   ..$ Micronesia               :List of 8
#>   .. ..$ Guam                                : num 3.03
#>   .. ..$ Kiribati                            : num 45.4
#>   .. ..$ Marshall Islands                    : num 11.8
#>   .. .. [list output truncated]
#>   .. [list output truncated]

To be consistent with base rapply(), the deflt argument can still only be used together with how = "list" or how = "unlist".
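For comparison, this is how the deflt argument behaves in base rapply() on a toy list (hypothetical data): the default value is substituted with how = "list", whereas how = "replace" leaves non-matching leaves untouched.

```r
## Toy list: logical NA marks a missing value
toy <- list(a = 1.5, b = list(c = NA, d = 2))

## how = "list": leaves that are not numeric are replaced by deflt
filled <- rapply(toy, f = identity, classes = "numeric", deflt = 0, how = "list")

## how = "replace": deflt is not used, the logical NA is kept as it is
kept <- rapply(toy, f = identity, classes = "numeric", deflt = 0, how = "replace")

str(filled)
str(kept)
```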

Using the `...` argument

The first argument to f always evaluates to the content of the leaf element to which f is applied. Any further arguments that are independent of the node content (besides the special arguments .xname, .xpos, .xparents and .xsiblings discussed below) can be supplied via the ... argument. Since rrapply() accepts a function in two of its arguments, f and condition, any further arguments supplied via ... also need to be defined as function arguments in both the f and condition functions (if present), even if they are not used in the function itself.

To clarify, consider the following example where we replace all NA elements by a value defined in a separate argument newvalue:

## this is not ok!
tryCatch({
  rrapply(
    renewable_oceania, 
    condition = is.na, 
    f = function(x, newvalue) newvalue, 
    newvalue = 0, 
    how = "replace"
  )
}, error = function(error) error$message)
#> [1] "2 arguments passed to 'is.na' which requires 1"

## this is ok
na_zero_oceania_replace3 <- rrapply(
  renewable_oceania, 
  condition = function(x, newvalue) is.na(x), 
  f = function(x, newvalue) newvalue, 
  newvalue = 0, 
  how = "replace"
)
str(na_zero_oceania_replace3, list.len = 3, give.attr = FALSE)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 6
#>   .. ..$ Australia                        : num 9.32
#>   .. ..$ Christmas Island                 : num 0
#>   .. ..$ Cocos (Keeling) Islands          : num 0
#>   .. .. [list output truncated]
#>   ..$ Melanesia                :List of 5
#>   .. ..$ Fiji            : num 24.4
#>   .. ..$ New Caledonia   : num 4.03
#>   .. ..$ Papua New Guinea: num 50.3
#>   .. .. [list output truncated]
#>   ..$ Micronesia               :List of 8
#>   .. ..$ Guam                                : num 3.03
#>   .. ..$ Kiribati                            : num 45.4
#>   .. ..$ Marshall Islands                    : num 11.8
#>   .. .. [list output truncated]
#>   .. [list output truncated]
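The error in the first call can be reproduced outside of rrapply(): is.na() accepts a single argument, so handing it the extra newvalue argument fails, whereas a wrapper that declares newvalue simply ignores it. The sketch below uses plain base R; the names is_na_cond and fill are hypothetical.

```r
## Calling is.na() with an additional argument fails
msg <- tryCatch(
  is.na(NA, newvalue = 0),
  error = function(e) conditionMessage(e)
)
msg

## Wrappers that declare newvalue accept the same call
is_na_cond <- function(x, newvalue) is.na(x)
fill <- function(x, newvalue) newvalue

is_na_cond(NA, newvalue = 0)
fill(NA, newvalue = 0)
```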

Special arguments `.xname`, `.xpos`, `.xparents` and `.xsiblings`

In base rapply(), the f function only has access to the content of the list element under evaluation through its principal argument, and there is no convenient way to access its name or location in the nested list from inside the f function. This makes rapply() impractical if we want to apply a function f that relies on, e.g., the name or position of a node, as in the students example above. To overcome this limitation, rrapply() allows the use of the special arguments .xname, .xpos, .xparents and .xsiblings inside the f and condition functions (in addition to the principal function argument). .xname evaluates to the name of the list element; .xpos evaluates to the position of the element in the nested list, structured as an integer vector; .xparents evaluates to a vector of all parent node names in the path to the current list element; and .xsiblings evaluates to the parent list containing the current list element and all of its direct siblings.

Using the .xname and .xpos arguments, we can transform or filter list elements based on their names or positions in the nested list:

## apply a function using the name of the node
renewable_oceania_text <- rrapply(
  renewable_oceania,
  f = function(x, .xname) sprintf("Renewable energy in %s: %.2f%%", .xname, x),
  condition = Negate(is.na),
  how = "flatten"
)
head(renewable_oceania_text, n = 5)
#>                                      Australia 
#>         "Renewable energy in Australia: 9.32%" 
#>                                    New Zealand 
#>      "Renewable energy in New Zealand: 32.76%" 
#>                                           Fiji 
#>             "Renewable energy in Fiji: 24.36%" 
#>                                  New Caledonia 
#>     "Renewable energy in New Caledonia: 4.03%" 
#>                               Papua New Guinea 
#> "Renewable energy in Papua New Guinea: 50.34%"

## extract values based on country names
renewable_benelux <- rrapply(
  renewable_energy_by_country,
  condition = function(x, .xname) .xname %in% c("Belgium", "Netherlands", "Luxembourg"),
  how = "prune"
)
str(renewable_benelux, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Europe:List of 1
#>   .. ..$ Western Europe:List of 3
#>   .. .. ..$ Belgium    : num 9.14
#>   .. .. ..$ Luxembourg : num 13.5
#>   .. .. ..$ Netherlands: num 5.78

Knowing that Europe is located under the node renewable_energy_by_country[[c(1, 5)]], we can filter all European countries with a renewable energy share above 50 percent by using the .xpos argument:

## filter European countries with value above 50%
renewable_europe_above_50 <- rrapply(
  renewable_energy_by_country,
  condition = function(x, .xpos) identical(head(.xpos, 2), c(1L, 5L)) & x > 50,
  how = "prune"
)
str(renewable_europe_above_50, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Europe:List of 2
#>   .. ..$ Northern Europe:List of 3
#>   .. .. ..$ Iceland: num 78.1
#>   .. .. ..$ Norway : num 59.5
#>   .. .. ..$ Sweden : num 51.4
#>   .. ..$ Western Europe :List of 1
#>   .. .. ..$ Liechtenstein: num 62.9

This can be done more conveniently using the .xparents argument, as this does not require us to look up the location of Europe in the list beforehand:

## filter European countries with value above 50%
renewable_europe_above_50 <- rrapply(
  renewable_energy_by_country,
  condition = function(x, .xparents) "Europe" %in% .xparents & x > 50,
  how = "prune"
)
str(renewable_europe_above_50, give.attr = FALSE)
#> List of 1
#>  $ World:List of 1
#>   ..$ Europe:List of 2
#>   .. ..$ Northern Europe:List of 3
#>   .. .. ..$ Iceland: num 78.1
#>   .. .. ..$ Norway : num 59.5
#>   .. .. ..$ Sweden : num 51.4
#>   .. ..$ Western Europe :List of 1
#>   .. .. ..$ Liechtenstein: num 62.9

Using the .xpos argument, we could look up the location of a particular country in the nested list:

## look up position of Sweden in list
(xpos_sweden <- rrapply(
  renewable_energy_by_country,
  condition = function(x, .xname) identical(.xname, "Sweden"),
  f = function(x, .xpos) .xpos,
  how = "flatten"
))
#> $Sweden
#> [1]  1  5  2 14
renewable_energy_by_country[[xpos_sweden$Sweden]]
#> [1] 51.35
#> attr(,"M49-code")
#> [1] "752"
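A position vector of this kind is an ordinary recursive index, so it can be passed directly to [[ in base R. The sketch below illustrates this on a toy excerpt (the nesting is simplified and hypothetical; the values are taken from the outputs shown in this text) and also recovers the node names along the path.

```r
## Toy excerpt with simplified nesting
toy <- list(World = list(
  Oceania = list(Australia = 9.32),
  Europe = list(Belgium = 9.14, Sweden = 51.35)
))

## position of Sweden: 1st element, then 2nd, then 2nd
pos <- c(1L, 2L, 2L)
by_position <- toy[[pos]]

## the same element addressed by its path of names
by_name <- toy[[c("World", "Europe", "Sweden")]]

## walk down the list to collect the names along the position vector
path_names <- character(length(pos))
node <- toy
for (k in seq_along(pos)) {
  path_names[k] <- names(node)[pos[k]]
  node <- node[[pos[k]]]
}
path_names
```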

Alternatively, using the .xsiblings argument, we could look up the direct neighbors of a particular country in the nested list:

## look up sibling countries of Sweden in list
siblings_sweden <- rrapply(
  renewable_energy_by_country,
  condition = function(x, .xsiblings) "Sweden" %in% names(.xsiblings),
  how = "flatten"
)
head(siblings_sweden, n = 10)
#> Aland Islands       Denmark       Estonia Faroe Islands       Finland 
#>            NA         33.06         26.55          4.24         42.03 
#>       Iceland       Ireland   Isle of Man        Latvia     Lithuania 
#>         78.07          8.65          4.30         38.48         31.42

We could also use the .xpos argument to determine the maximum depth of the list or the length of the longest sublist as follows:

## maximum list depth
depth_all <- rrapply(
  renewable_energy_by_country, 
  f = function(x, .xpos) length(.xpos), 
  how = "unlist"
)
max(depth_all) 
#> [1] 5

## longest sublist length
sublist_count <- rrapply(
  renewable_energy_by_country, 
  f = function(x, .xpos) max(.xpos), 
  how = "unlist"
)
max(sublist_count)
#> [1] 28
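As a cross-check of these two quantities, both can also be computed with short recursive helpers in base R. The functions list_depth() and longest_sublist() and the toy list are hypothetical and only meant as a sketch.

```r
## Toy nested list (hypothetical data)
toy <- list(a = 1, b = list(c = 2, d = list(e = 3), f = 4))

## depth of the deepest leaf, counting one level per enclosing list
list_depth <- function(x) {
  if (!is.list(x)) return(0L)
  1L + max(0L, vapply(x, list_depth, integer(1L)))
}

## length of the longest (sub)list anywhere in the structure
longest_sublist <- function(x) {
  if (!is.list(x)) return(0L)
  max(length(x), vapply(x, longest_sublist, integer(1L)))
}

list_depth(toy)
longest_sublist(toy)
```

In this toy list the deepest leaf e sits at position c(2, 2, 1), a vector of length 3, and the longest sublist is b with three elements.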

When unnesting nested lists with how = "bind", the .xname, .xpos or .xparents arguments can be useful to parse only a particular set of data.frame columns:

## parse only Pokemon number, name and type columns 
pokedex_small <- rrapply(
  pokedex,
  condition = function(x, .xpos, .xname) {
    length(.xpos) < 4 & .xname %in% c("num", "name", "type")
    },
  how = "bind"
)

head(pokedex_small)
#>   pokemon.num pokemon.name  pokemon.type
#> 1         001    Bulbasaur Grass, Poison
#> 2         002      Ivysaur Grass, Poison
#> 3         003     Venusaur Grass, Poison
#> 4         004   Charmander          Fire
#> 5         005   Charmeleon          Fire
#> 6         006    Charizard  Fire, Flying

Avoid recursing into list nodes

By default, if classes = "ANY", both base rapply() and rrapply() recurse into any list-like element. Using classes = "list" in base rapply() has no effect, as the function descends into any list node before evaluating the classes argument. In contrast, rrapply() does detect classes = "list", in which case the f function is applied to list elements that satisfy the condition function. If the condition is not satisfied for a list element, rrapply() recurses further into the sublist, applies the f function to the nodes that satisfy condition, and so on. The use of classes = "list" tells rrapply() not to descend into list objects by default. For this reason, this behavior can only be triggered with the classes argument and not through the use of e.g. condition = is.list.
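The behavior of base rapply() described here is easy to verify by counting how often f is called. The sketch below uses a toy list and a hypothetical counting function count_visits(); it only exercises base R.

```r
## Toy list (nesting simplified, values taken from the outputs in this text)
toy <- list(Europe = list(Belgium = 9.14, Sweden = 51.35))

## count how often f is called
calls <- 0
count_visits <- function(x) {
  calls <<- calls + 1
  x
}

## classes = "list": f is never called, since rapply() only tests leaves
res_list <- rapply(toy, f = count_visits, classes = "list", how = "replace")
calls_list <- calls

## classes = "ANY": f is called once per leaf
res_any <- rapply(toy, f = count_visits, classes = "ANY", how = "replace")
calls_any <- calls - calls_list

c(calls_list = calls_list, calls_any = calls_any)
```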

The choice classes = "list" is useful to calculate summary statistics across nodes or to look up the position of intermediate nodes in a nested list. To illustrate, we can return the mean and standard deviation of the renewable energy share in Europe as follows:

## compute mean and standard deviation of Europe
rrapply(
  renewable_energy_by_country,  
  classes = "list",
  condition = function(x, .xname) .xname == "Europe",
  f = function(x) list(
    mean = mean(unlist(x), na.rm = TRUE), 
    sd = sd(unlist(x), na.rm = TRUE)
  ),
  how = "flatten"
)
#> $Europe
#> $Europe$mean
#> [1] 22.36565
#> 
#> $Europe$sd
#> [1] 17.12639

Remark: Note that the principal x argument in the f function now evaluates to a list for the node satisfying the condition. For this reason, we first unlist the sublist before passing it to mean and sd.

We can use the .xpos argument to apply the f function only at specific locations or depths in the nested list. For instance, we could return the mean renewable energy shares for each continent by making use of the fact that the .xpos vector of each continent has length (i.e. depth) 2:

## compute mean value of each continent
renewable_continent_summary <- rrapply(
  renewable_energy_by_country,  
  classes = "list",
  condition = function(x, .xpos) length(.xpos) == 2,
  f = function(x) mean(unlist(x), na.rm = TRUE)
)

## Antarctica has a missing value
str(renewable_continent_summary, give.attr = FALSE)
#> List of 1
#>  $ World:List of 6
#>   ..$ Africa    : num 54.3
#>   ..$ Americas  : num 18.2
#>   ..$ Antarctica: logi NA
#>   ..$ Asia      : num 17.9
#>   ..$ Europe    : num 22.4
#>   ..$ Oceania   : num 17.8
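When the depth of the nodes of interest is fixed and known, a comparable summary can be written with lapply() in base R. The sketch below uses a toy excerpt (simplified nesting, values taken from the outputs in this text), so the resulting means are not the continent means reported above.

```r
## Toy excerpt; logical NA marks a missing value
toy <- list(World = list(
  Europe = list(
    `Western Europe` = list(Belgium = 9.14, Netherlands = 5.78),
    `Northern Europe` = list(Sweden = 51.35)
  ),
  Oceania = list(
    `Australia and New Zealand` = list(Australia = 9.32, `Christmas Island` = NA)
  )
))

## flatten each node at depth 2 and average its non-missing leaves
means <- lapply(toy[["World"]], function(x) mean(unlist(x), na.rm = TRUE))
str(means)
```

Unlike the rrapply() call, this hard-codes the level at which the summary is computed and returns only that level instead of the surrounding list structure.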

If classes = "list", the f function is only applied to the (non-terminal) list nodes. If classes = "ANY", the f function is only applied to the (terminal) non-list nodes. To apply f to both terminal and non-terminal nodes of the nested list, we combine the two as classes = c("list", "ANY"). This is illustrated by searching both terminal and non-terminal nodes for the country or region with M49-code "155" (Western Europe):

## Filter country or region by M49-code
rrapply(
  renewable_energy_by_country,
  classes = c("list", "ANY"), 
  condition = function(x) attr(x, "M49-code") == "155",
  f = function(x, .xname) .xname,
  how = "unlist"
)
#> World.Europe.Western Europe 
#>            "Western Europe"

Another illustrative example is to preprocess certain list nodes before unnesting the nested list. Here, we extract the Pokémon evolutions in pokedex with and without pre-processing of the Pokémon evolution list nodes:

## Pokemon evolution columns without pre-processing
pokedex_wide1 <- rrapply(
  pokedex,
  condition = function(x, .xparents) any(grepl("name|evolution", .xparents)),
  how = "bind"
)

head(pokedex_wide1, n = 3)
#>   pokemon.name pokemon.next_evolution.1.num pokemon.next_evolution.1.name
#> 1    Bulbasaur                          002                       Ivysaur
#> 2      Ivysaur                          003                      Venusaur
#> 3     Venusaur                         <NA>                          <NA>
#>   pokemon.next_evolution.2.num pokemon.next_evolution.2.name
#> 1                          003                      Venusaur
#> 2                         <NA>                          <NA>
#> 3                         <NA>                          <NA>
#>   pokemon.prev_evolution.1.num pokemon.prev_evolution.1.name
#> 1                         <NA>                          <NA>
#> 2                          001                     Bulbasaur
#> 3                          001                     Bulbasaur
#>   pokemon.prev_evolution.2.num pokemon.prev_evolution.2.name
#> 1                         <NA>                          <NA>
#> 2                         <NA>                          <NA>
#> 3                          002                       Ivysaur
#>   pokemon.next_evolution.3.num pokemon.next_evolution.3.name
#> 1                         <NA>                          <NA>
#> 2                         <NA>                          <NA>
#> 3                         <NA>                          <NA>

## Pokemon evolution columns simplified to character vectors 
pokedex_wide2 <- rrapply(
  pokedex,
  classes = c("character", "list"),
  condition = function(x, .xparents) any(grepl("name|evolution", .xparents)),
  f = function(x) if(is.list(x)) sapply(x, `[[`, "name") else x,
  how = "bind"
)
    
head(pokedex_wide2, n = 9)
#>   pokemon.name pokemon.next_evolution pokemon.prev_evolution
#> 1    Bulbasaur      Ivysaur, Venusaur                     NA
#> 2      Ivysaur               Venusaur              Bulbasaur
#> 3     Venusaur                     NA     Bulbasaur, Ivysaur
#> 4   Charmander  Charmeleon, Charizard                     NA
#> 5   Charmeleon              Charizard             Charmander
#> 6    Charizard                     NA Charmander, Charmeleon
#> 7     Squirtle   Wartortle, Blastoise                     NA
#> 8    Wartortle              Blastoise               Squirtle
#> 9    Blastoise                     NA    Squirtle, Wartortle

Recursive list node modification

If classes = "list", rrapply() applies the f function to any list element that satisfies the condition function, but will not recurse further into these list elements. This makes it impossible, for instance, to recursively update the name of each list element in a nested list, as rrapply() stops recursing after updating the first list layer. For this purpose, set classes = "list" combined with how = "recurse", in which case rrapply() recurses further into any updated list element after application of the f function (using how = "replace"). In this context, the condition argument is interpreted as a passing criterion: as long as the condition and classes arguments are satisfied, rrapply() will try to recurse further into any list-like element.

Using how = "recurse", we can recursively replace all names in renewable_energy_by_country by their M49-codes:

## replace country names by M-49 codes
renewable_M49 <- rrapply(
  list(renewable_energy_by_country), 
  classes = "list",
  f = function(x) {
    names(x) <- vapply(x, attr, character(1L), which = "M49-code")
    return(x)
  },
  how = "recurse"
)

str(renewable_M49[[1]], max.level = 3, list.len = 3, give.attr = FALSE)
#> List of 1
#>  $ 001:List of 6
#>   ..$ 002:List of 2
#>   .. ..$ 015:List of 7
#>   .. ..$ 202:List of 4
#>   ..$ 019:List of 2
#>   .. ..$ 419:List of 3
#>   .. ..$ 021:List of 5
#>   ..$ 010: logi NA
#>   .. [list output truncated]

Remark: Here we passed list(renewable_energy_by_country) to the call of rrapply() in order to start application of the f function at the level of the list renewable_energy_by_country itself, instead of starting at its list elements.
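
To see the mechanics on a smaller object, the following sketch applies the same pattern to a hypothetical three-level list (nested is made up for illustration and is not part of the data used above) and upper-cases every node name. It assumes the rrapply package is installed in a version that supports how = "recurse".

```r
library(rrapply)

## hypothetical toy list, used for illustration only
nested <- list(a = list(b = list(c = 1, d = 2), e = 3))

## wrap in list() so that f is first applied to `nested` itself
nested_upper <- rrapply(
  list(nested),
  classes = "list",
  f = function(x) {
    names(x) <- toupper(names(x))
    x
  },
  how = "recurse"
)

## inspect the renamed structure
str(nested_upper[[1]])
```

Since f only touches names(x), the values at the leaves should be unaffected; only the names set on each list layer change.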

Miscellaneous

Avoid recursing into data.frames

If classes = "ANY", rrapply() recurses into any list-like object, in the same way as base rapply(). Since data.frames are list-like objects, the f function is applied to the individual columns instead of to the data.frame as a whole. To avoid this behavior, set classes = "data.frame", in which case the f and condition functions are applied directly to the data.frame object itself and not to its columns. Note that with base rapply(), classes = "data.frame" has no effect, as rapply() descends into the columns of any data.frame object before evaluating the classes argument.

## create a list of data.frames
oceania_df <- list(
  Oceania = lapply(
    renewable_oceania[["Oceania"]], 
    FUN = function(x) data.frame(
      Name = names(x), 
      value = unlist(x), 
      stringsAsFactors = FALSE
    )
  )
)

## this does not work!
tryCatch({
  rrapply(
    oceania_df,
    f = function(x) subset(x, !is.na(value)), ## filter NA-rows of data.frame
    how = "replace"
  )
}, error = function(error) error$message)
#> [1] "object 'value' not found"

## this does work
rrapply(
  oceania_df,
  classes = "data.frame",
  f = function(x) subset(x, !is.na(value)),
  how = "replace"
)
#> $Oceania
#> $Oceania$`Australia and New Zealand`
#>                    Name value
#> Australia     Australia  9.32
#> New Zealand New Zealand 32.76
#> 
#> $Oceania$Melanesia
#>                              Name value
#> Fiji                         Fiji 24.36
#> New Caledonia       New Caledonia  4.03
#> Papua New Guinea Papua New Guinea 50.34
#> Solomon Islands   Solomon Islands 65.73
#> Vanuatu                   Vanuatu 33.67
#> 
#> $Oceania$Micronesia
#>                                                              Name value
#> Guam                                                         Guam  3.03
#> Kiribati                                                 Kiribati 45.43
#> Marshall Islands                                 Marshall Islands 11.75
#> Micronesia (Federated States of) Micronesia (Federated States of)  1.64
#> Nauru                                                       Nauru 31.44
#> Northern Mariana Islands                 Northern Mariana Islands  0.00
#> Palau                                                       Palau  0.02
#> 
#> $Oceania$Polynesia
#>                                                Name value
#> American Samoa                       American Samoa  1.00
#> Cook Islands                           Cook Islands  1.90
#> French Polynesia                   French Polynesia 11.06
#> Niue                                           Niue 22.07
#> Samoa                                         Samoa 27.30
#> Tonga                                         Tonga  1.98
#> Tuvalu                                       Tuvalu 11.76
#> Wallis and Futuna Islands Wallis and Futuna Islands  0.00

Remark: The same result can also be obtained using classes = "list" and checking that the list element under evaluation is a data.frame:

rrapply(
  oceania_df,
  classes = "list",
  condition = is.data.frame,
  f = function(x) subset(x, !is.na(value)),
  how = "replace"
)
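
The statement that base rapply() descends into data.frame columns before looking at classes can be checked with a small base-R sketch. The list df_list below is a hypothetical example, not part of the data used in this article.

```r
## hypothetical list holding a single small data.frame
df_list <- list(
  df = data.frame(a = c(1, NA), b = c("x", "y"), stringsAsFactors = FALSE)
)

## record the class of every object that f is called on:
## f sees the columns, never the data.frame itself
visited <- rapply(df_list, f = function(x) class(x)[1], how = "unlist")
visited

## with classes = "data.frame" no column matches, so f is never reached
untouched <- rapply(
  df_list,
  f = function(x) stop("f should not be reached"),
  classes = "data.frame",
  how = "replace"
)
is.data.frame(untouched$df)
```

The second call completes without error because f is never evaluated, which is why classes = "data.frame" cannot be used with base rapply() to operate on whole data.frames.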

List attributes

Base rapply() may produce different results with how = "replace" and how = "list" when the list carries attributes: the former preserves intermediate list attributes, whereas the latter does not. To avoid unexpected behavior, rrapply() always preserves intermediate list attributes when using how = "replace", how = "list" or how = "prune". If how = "flatten", how = "melt", how = "bind" or how = "unlist", intermediate list attributes cannot be preserved, as the result is no longer a nested list.

## how = "list" now preserves all list attributes
na_drop_oceania_list_attr2 <- rrapply(
  renewable_oceania, 
  f = function(x) replace(x, is.na(x), 0), 
  how = "list"
)

str(na_drop_oceania_list_attr2, max.level = 2)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 6
#>   .. ..- attr(*, "M49-code")= chr "053"
#>   ..$ Melanesia                :List of 5
#>   .. ..- attr(*, "M49-code")= chr "054"
#>   ..$ Micronesia               :List of 8
#>   .. ..- attr(*, "M49-code")= chr "057"
#>   ..$ Polynesia                :List of 10
#>   .. ..- attr(*, "M49-code")= chr "061"
#>   ..- attr(*, "M49-code")= chr "009"

## how = "prune" also preserves list attributes
na_drop_oceania_attr <- rrapply(
  renewable_oceania, 
  condition = Negate(is.na), 
  how = "prune"
)
str(na_drop_oceania_attr, max.level = 2)
#> List of 1
#>  $ Oceania:List of 4
#>   ..$ Australia and New Zealand:List of 2
#>   .. ..- attr(*, "M49-code")= chr "053"
#>   ..$ Melanesia                :List of 5
#>   .. ..- attr(*, "M49-code")= chr "054"
#>   ..$ Micronesia               :List of 7
#>   .. ..- attr(*, "M49-code")= chr "057"
#>   ..$ Polynesia                :List of 8
#>   .. ..- attr(*, "M49-code")= chr "061"
#>   ..- attr(*, "M49-code")= chr "009"
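
For comparison, the difference in base rapply() mentioned at the start of this section can be reproduced with a minimal base-R sketch. The list x and its code attribute are hypothetical and only serve as an illustration.

```r
## hypothetical list with an attribute on an intermediate node
x <- list(region = structure(list(a = 1, b = NA), code = "001"))
na_to_zero <- function(v) replace(v, is.na(v), 0)

x_replace <- rapply(x, f = na_to_zero, how = "replace")
x_list <- rapply(x, f = na_to_zero, how = "list")

attr(x_replace$region, "code")  ## attribute is kept with how = "replace"
attr(x_list$region, "code")     ## attribute is dropped with how = "list"
```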

Using rrapply() on data.frames

The previous sections explained how to avoid recursing into list-like elements using rrapply(). However, it can also be useful to exploit the property that a data.frame is a list-like object and use base rapply() to apply a function f to data.frame columns of certain classes. For instance, it is straightforward to standardize all numeric columns in the iris dataset by their sample mean and standard deviation:

iris_standard <- rapply(iris, f = scale, classes = "numeric", how = "replace")
head(iris_standard)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
#> 2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
#> 3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
#> 4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
#> 5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
#> 6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa

Using the condition argument in rrapply(), we gain additional flexibility in selecting the columns to which f is applied. For instance, we can apply the f function only to the Sepal columns using the .xname argument:

iris_standard_sepal <- rrapply(
  iris,                    
  condition = function(x, .xname) grepl("Sepal", .xname), 
  f = scale
)
head(iris_standard_sepal)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1   -0.8976739  1.01560199          1.4         0.2  setosa
#> 2   -1.1392005 -0.13153881          1.4         0.2  setosa
#> 3   -1.3807271  0.32731751          1.3         0.2  setosa
#> 4   -1.5014904  0.09788935          1.5         0.2  setosa
#> 5   -1.0184372  1.24503015          1.4         0.2  setosa
#> 6   -0.5353840  1.93331463          1.7         0.4  setosa
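
The classes and condition arguments can also be combined. As a sketch, assuming the rrapply package is installed, the call below standardizes only the numeric columns whose name contains "Petal". The anonymous standardizing function is used here in place of scale() so that the columns stay plain numeric vectors; this is a choice made for this illustration, not something the example above requires.

```r
library(rrapply)

iris_petal <- rrapply(
  iris,
  classes = "numeric",
  condition = function(x, .xname) grepl("Petal", .xname),
  f = function(x) (x - mean(x)) / sd(x),  ## plain-vector alternative to scale()
  how = "replace"
)
head(iris_petal)
```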

Instead of mutating columns, we can also transmute columns (i.e. keeping only the columns to which f is applied) by setting how = "prune":

iris_standard_transmute <- rrapply(
  iris, 
  f = scale, 
  classes = "numeric", 
  how = "prune"
)
head(iris_standard_transmute)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1   -0.8976739  1.01560199    -1.335752   -1.311052
#> 2   -1.1392005 -0.13153881    -1.335752   -1.311052
#> 3   -1.3807271  0.32731751    -1.392399   -1.311052
#> 4   -1.5014904  0.09788935    -1.279104   -1.311052
#> 5   -1.0184372  1.24503015    -1.335752   -1.311052
#> 6   -0.5353840  1.93331463    -1.165809   -1.048667

To summarize a set of selected columns, use how = "flatten" instead of how = "prune", as the latter preserves list attributes (including data.frame dimensions), which should not be kept for a summary.

## summarize columns with how = "flatten"
iris_standard_summarize <- rrapply(
  iris, 
  f = summary, 
  classes = "numeric", 
  how = "flatten"
)
iris_standard_summarize
#> $Sepal.Length
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   4.300   5.100   5.800   5.843   6.400   7.900 
#> 
#> $Sepal.Width
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   2.000   2.800   3.000   3.057   3.300   4.400 
#> 
#> $Petal.Length
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.600   4.350   3.758   5.100   6.900 
#> 
#> $Petal.Width
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.100   0.300   1.300   1.199   1.800   2.500

Conclusion

To conclude, let us return to the list recursion exercise in the first section. Using rrapply(), we can solve the task in a more readable and less error-prone fashion. A possible approach is to split the task into two steps as follows:

## look up the position of Euler (Leonhard)
xpos_Euler <- rrapply(
  students, 
  classes = "list",
  condition = function(x, .xname) .xname == "Euler" && attr(x, "given") == "Leonhard",
  f = function(x, .xpos) .xpos,
  how = "flatten"
)[[1]]

xpos_Euler 
#> [1] 1 1 2

## filter descendants of Euler (Leonhard) and replace missing values by zero
students_Euler <- rrapply(
  students,
  condition = function(x, .xpos) identical(.xpos[seq_along(xpos_Euler)], xpos_Euler), 
  f = function(x) replace(x, is.na(x), 0),
  how = "prune"
)

str(students_Euler, give.attr = FALSE)
#> List of 1
#>  $ Bernoulli:List of 1
#>   ..$ Bernoulli:List of 1
#>   .. ..$ Euler:List of 2
#>   .. .. ..$ Euler   : num 0
#>   .. .. ..$ Lagrange:List of 3
#>   .. .. .. ..$ Fourier: num 68679
#>   .. .. .. ..$ Plana  : num 0
#>   .. .. .. ..$ Poisson: num 113435
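
The condition in the second step is a prefix test on position vectors: a node lies at or below Euler (Leonhard) exactly when its position starts with xpos_Euler. The base-R sketch below isolates that test; the helper is_below() and the list toy are hypothetical names introduced for illustration, and the positions are written as integer vectors on the assumption that this matches what .xpos contains.

```r
## hypothetical position of the root node of interest
root <- c(1L, 1L, 2L)

## TRUE if `pos` starts with `root`, i.e. the node lies at or below `root`
is_below <- function(pos, root) identical(pos[seq_along(root)], root)

is_below(c(1L, 1L, 2L, 2L, 1L), root)  ## a node underneath root
is_below(c(1L, 1L, 1L), root)          ## a sibling branch
is_below(c(1L, 2L), root)              ## shorter than root: padded with NA

## base R `[[` accepts such a vector for recursive indexing
toy <- list(list(list("x", list("y", "z"))))
toy[[root]]
```

Indexing beyond the end of a shorter position vector yields NA, so positions above the root never match and no separate length check is needed.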

Because the only other node named Euler (Johann Euler) is itself a descendant of Leonhard Euler, we can further simplify this into a single function call using the .xparents argument:

## filter descendants of Euler (Leonhard) and replace missing values by zero
students_Euler <- rrapply(
  students,
  condition = function(x, .xparents) "Euler" %in% .xparents,
  f = function(x) replace(x, is.na(x), 0),
  how = "prune"
)

str(students_Euler, give.attr = FALSE)
#> List of 1
#>  $ Bernoulli:List of 1
#>   ..$ Bernoulli:List of 1
#>   .. ..$ Euler:List of 2
#>   .. .. ..$ Euler   : num 0
#>   .. .. ..$ Lagrange:List of 3
#>   .. .. .. ..$ Fourier: num 68679
#>   .. .. .. ..$ Plana  : num 0
#>   .. .. .. ..$ Poisson: num 113435

For additional information and details, see the package vignette (browseVignettes(package = "rrapply")) or the other articles in the Articles section.

Efficient list recursion with rrapply

Joris Chau

Latest date: 2021-02-22
