Introduction

In many cases you need to repeat the same action across multiple objects, for instance loading many files, or computing summary statistics across many vectors of observations. Instead of repeating the same operation manually for every object - which is not only time consuming, but especially prone to mistakes - you can use for loops.

However for can be quite verbose, and especially in case you need to nest them - i.e. running a loop inside a loop - it can be difficult to inspect the code for errors during the analysis and especially in the future.

Base R already provides some functions to avoid the creation of for loops, with the family of apply functions. However sometimes the syntax can be different across functions, and still a bit verbose.

The tidyverse provides functions that help getting rid of for loops for good using the purrr package. Below there is just an example. More details can be found in the iteration chapter of R for Data Science and in the functionals chapter of Advanced R

library(tidyverse)
library(kableExtra)
library(DT)
options(digits=2)

Let’s say you collected data in 8 different runs of an experiment. For instance the time, in seconds, spent freezing, running or grooming in 10 participants after a given stimulus in each subsequent run.

For our example we will create some random data. The code below creates 8 dataframes with 10 observations for three distinct variables. It already uses the map function that we are going to explain later, so for now you can just disregard it, and come back later to understand what it does as an excercise.

1:8 %>% map( function(x) {
  tibble(
    SUBID = map(1:10, ~ paste0("sub_",.x) ) %>% unlist(),
    freezing = runif(10)*10 * log(x+1),
    running = runif(10)*10,
    grooming = runif(10)*10
  ) %>%
    write_csv(paste0("run_",x,".csv"))
})

We obtain 8 csv files with our data.

myfiles <- list.files(pattern = ".csv", full.names = T)
myfiles
## [1] "./run_1.csv" "./run_2.csv" "./run_3.csv" "./run_4.csv" "./run_5.csv" "./run_6.csv" "./run_7.csv"
## [8] "./run_8.csv"
read.csv("run_1.csv")
##     SUBID freezing running grooming
## 1   sub_1     6.52    1.03      6.6
## 2   sub_2     0.32    9.00      7.1
## 3   sub_3     3.66    2.46      5.4
## 4   sub_4     6.19    0.42      5.9
## 5   sub_5     3.82    3.28      2.9
## 6   sub_6     3.17    9.55      1.5
## 7   sub_7     6.63    8.90      9.6
## 8   sub_8     3.14    6.93      9.0
## 9   sub_9     4.70    6.41      6.9
## 10 sub_10     3.97    9.94      8.0

purrr::map

Now you want to load everything in the same dataframe (i.e. table), for instance to carry out a RM-ANOVA. You could use a for loop to load all the files:

allruns = vector(mode = "list", length = 8)

for (run in 1:length(allruns)) {
  allruns[[run]] <- read.csv( myfiles[[run]] )
}

# allruns

Or you could use the map function inside the purrr package

allruns <- map(myfiles, read.csv)

# allruns

In other words you passed to every element of the list myfiles the function read.csv

Note the advantages:

  • you do not need to write extra code to initialize an empty list, since the result is automatically stored in a list
  • you don’t need to provide the total number of files,
  • the syntax is much more concise (and when you get used to it, also much more readable).

purrr::map2

To carry out the RM-ANOVA, you need to combine all the tables into one singe dataframe, but also retain information about the different run.

The idea is the same as before: you have a function that creates a column with the run numba in each run’s data table. This means that you want to provide two lists: (1) the list containing the table of each run and (2) the list of filenames.

alldata <- map2(allruns, myfiles, function(run, file) run %>% mutate(run = file)) %>% bind_rows()

or with a more concise syntax:

alldata <- map2_df(allruns, myfiles, ~ .x %>% mutate(run = .y))

You might have noticed that here I used a specific flavor of map, that is map_df, which returns a dataframe (or a tibble in the tidyverse language) instead of the default list, so that I can drop the final bind_rows().

purrr::pmap

As you might expect, there is also a function pmap which allows you to pass an arbitrary number of tables. I personally prefer this syntax since it allows me to pipe the list into it:

alldata <- list(allruns, myfiles) %>% pmap_df(~ .x %>% mutate(run = .y))

datatable(alldata, options = list(dom = 'tp'))

map is similar to group_by for dataframes

Finally, note that the map function - and its variation, such as pmap, is a similar operator for list to the group_by operator inside dataframes.

For instance let’s say that you want to get the mean and standard deviation for every variable in each run:

descriptives <- alldata %>% 
  group_by(run) %>%
  summarise(
    across(where(is.numeric), list(mean = mean, sd = sd)),
    .groups = "drop"
  ) %>% ungroup() 


descriptives %>% 
  kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
run freezing_mean freezing_sd running_mean running_sd grooming_mean grooming_sd
./run_1.csv 4.2 1.9 5.8 3.7 6.3 2.5
./run_2.csv 3.7 2.2 3.5 2.9 4.3 2.7
./run_3.csv 8.6 3.6 4.3 2.1 5.3 3.4
./run_4.csv 7.6 4.2 5.7 3.4 4.8 2.5
./run_5.csv 9.6 5.8 5.1 3.0 4.3 2.8
./run_6.csv 7.5 3.3 5.0 2.4 5.8 2.1
./run_7.csv 12.3 5.6 5.5 2.6 3.9 2.3
./run_8.csv 13.0 6.6 4.8 2.7 4.7 2.6