Short note about tidyeval

Following Jenny Bryan’s talk on tidyeval at rstudio::conf 2019, I decided to write this short note (mainly as a reminder to myself).

What is tidyeval?

Tidy evaluation is the tidyverse’s framework for non-standard evaluation (NSE). It lets us refer to column names as bare symbols and pass them between functions. This is the “classic” behaviour of most tidyverse functions. For example, we use:

library(tidyverse)

mtcars %>% 
  select(mpg, cyl)
##                      mpg cyl
## Mazda RX4           21.0   6
## Mazda RX4 Wag       21.0   6
## Datsun 710          22.8   4
## Hornet 4 Drive      21.4   6
## Hornet Sportabout   18.7   8
## Valiant             18.1   6
## Duster 360          14.3   8
## Merc 240D           24.4   4
## Merc 230            22.8   4
## Merc 280            19.2   6
## Merc 280C           17.8   6
## Merc 450SE          16.4   8
## Merc 450SL          17.3   8
## Merc 450SLC         15.2   8
## Cadillac Fleetwood  10.4   8
## Lincoln Continental 10.4   8
## Chrysler Imperial   14.7   8
## Fiat 128            32.4   4
## Honda Civic         30.4   4
## Toyota Corolla      33.9   4
## Toyota Corona       21.5   4
## Dodge Challenger    15.5   8
## AMC Javelin         15.2   8
## Camaro Z28          13.3   8
## Pontiac Firebird    19.2   8
## Fiat X1-9           27.3   4
## Porsche 914-2       26.0   4
## Lotus Europa        30.4   4
## Ford Pantera L      15.8   8
## Ferrari Dino        19.7   6
## Maserati Bora       15.0   8
## Volvo 142E          21.4   4

The two variables were selected from the mtcars data set, and we specified them as bare names, without quotation marks. They are symbols, not character strings (although select() also accepts character strings).

But suppose we want to pass variables “tidy style” into our own functions, which then perform different operations on them.
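To see why this needs special handling, here is a minimal sketch of what happens when a bare column name is forwarded through a function without any quoting. The function name naive_group is hypothetical and not part of the examples in this note. The sketch assumes that no object called cyl exists in the calling environment, and it wraps the call in tryCatch() so that the script keeps running.

```r
library(dplyr)

# Hypothetical function: forwards its argument to group_by() without quoting it
naive_group <- function(dataset, groupby_vars) {
  dataset %>%
    group_by(groupby_vars) %>%
    summarize_all(mean)
}

# group_by() looks for a column named groupby_vars, falls back to the function
# argument, and then tries to evaluate cyl as an ordinary object, which fails
result <- tryCatch(
  naive_group(mtcars, cyl),
  error = function(e) e
)

inherits(result, "error")   # did the call raise an error?
conditionMessage(result)    # the message reported by dplyr
```

The variations below avoid this failure by capturing the argument before it is evaluated.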

Variation one - a basic example

We’ll start simple: a function which has two parameters. The first parameter is a dataset. The second parameter is a grouping variable. The mean of every other variable in the data set is computed using summarize_all.

test1 <- function(dataset, groupby_vars){
  # Capture the caller's expression (e.g. cyl) as a quosure
  grouping_vars <- enquo(groupby_vars)
  dataset %>% 
    group_by(!! grouping_vars) %>%   # unquote it inside group_by()
    summarize_all(mean)              # mean of every non-grouping column
}

mtcars %>%
  select(cyl:carb) %>%
  test1(groupby_vars = cyl)
## # A tibble: 3 × 10
##     cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
## 2     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
## 3     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5

We can see that mtcars was grouped by cyl, which was passed as a bare name (not a character string). The function test1 took it, enquo()-ed it, and then used it in the tidy chain via !!. The function enquo() captures the expression supplied by the caller, together with its environment, as a “quosure”. The !! operator then unquotes the quosure inside group_by(), so that it is evaluated as a column of the data.
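To make the quosure less abstract, the following sketch returns it from a function and inspects it, and then rewrites the grouping step with the {{ }} ("curly-curly") operator, which is shorthand for enquo() followed by !!. The names capture_var and group_means are hypothetical, and the sketch assumes rlang 0.4.0 or later, where {{ }} is available.

```r
library(dplyr)

# Hypothetical helper: returns the quosure instead of using it
capture_var <- function(var) {
  enquo(var)
}

q <- capture_var(cyl)
rlang::is_quosure(q)   # checks that q is a quosure
rlang::as_label(q)     # the captured expression, as a string

# Same grouping step as test1, written with {{ }} instead of enquo() and !!
group_means <- function(dataset, groupby_var) {
  dataset %>%
    group_by({{ groupby_var }}) %>%
    summarize_all(mean)
}

res <- mtcars %>%
  select(cyl:carb) %>%
  group_means(cyl)
res
```

The explicit enquo() and !! pair is still useful when the captured variable has to be stored or reused, as in the later variations.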

Passing arguments using ...

A slightly more complex situation is passing multiple arguments to the function. Suppose that this time we want a function that takes one variable to group by, followed by the variables to be summarized:

test2 <- function(dataset, groupby_vars, ...){
  grouping_vars <- enquo(groupby_vars)
  dataset %>% 
    group_by(!! grouping_vars) %>%
    summarize_at(vars(...), mean)   # the dots are forwarded to vars() as-is
}

mtcars %>%
  select(cyl:carb) %>%
  test2(groupby_vars = cyl, disp:drat)
## # A tibble: 3 × 4
##     cyl  disp    hp  drat
##   <dbl> <dbl> <dbl> <dbl>
## 1     4  105.  82.6  4.07
## 2     6  183. 122.   3.59
## 3     8  353. 209.   3.23

What happened is that test2 treats the grouping variable the same way that test1 treated it, but it also passes along the variables disp:drat through the dots.
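If the forwarded variables need to be inspected or reused before they are passed on, the dots can also be captured explicitly. The following is a minimal sketch, assuming enquos() (the plural of enquo()) as re-exported by dplyr, together with the !!! splice operator. The name test2_enquos is hypothetical and not part of the original examples.

```r
library(dplyr)

test2_enquos <- function(dataset, groupby_vars, ...) {
  grouping_vars <- enquo(groupby_vars)
  summary_vars <- enquos(...)   # a list of quosures, one per argument in ...
  dataset %>%
    group_by(!!grouping_vars) %>%
    summarize_at(vars(!!!summary_vars), mean)   # !!! splices the list into vars()
}

res <- mtcars %>%
  select(cyl:carb) %>%
  test2_enquos(groupby_vars = cyl, disp:drat)
names(res)
```

For simple forwarding, passing ... straight through as in test2 is shorter; capturing with enquos() only pays off when the function needs to do something with the arguments first.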

Maximum flexibility - multiple enquo()s

Sometimes passing the dots, i.e., ..., is not enough. For example, we may want to specify different behaviour for different columns of the data frame (e.g., compute the mean for some and the standard deviation for others). In such cases we need a more flexible version. We can extend the flexibility of this approach by using multiple enquo()s.

test3 <- function(dataset, groupby_vars, computemean_vars, computestd_vars){
  grouping_vars <- enquo(groupby_vars)
  mean_vars <- enquo(computemean_vars)
  std_vars <- enquo(computestd_vars)
  # Means for one set of columns...
  dataset %>% 
    group_by(!! grouping_vars) %>%
    summarize_at(vars(!!mean_vars), mean) %>%
    # ...joined to the standard deviations of another set
    left_join(dataset %>%
                group_by(!! grouping_vars) %>%
                summarize_at(vars(!!std_vars), sd))
}
mtcars %>% 
  test3(groupby_vars = cyl, disp:drat, wt:carb)
## Joining, by = "cyl"
## # A tibble: 3 × 10
##     cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     4  105.  82.6  4.07 0.570  1.68 0.302 0.467 0.539 0.522
## 2     6  183. 122.   3.59 0.356  1.71 0.535 0.535 0.690 1.81 
## 3     8  353. 209.   3.23 0.759  1.20 0     0.363 0.726 1.56

In the resulting table, the first column cyl is the grouping variable, columns disp through drat have the mean of the corresponding variables, and columns wt through carb have their standard deviation computed.
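As a possible alternative to the two grouped summaries and the join, the same table can be sketched in a single summarise() call. This assumes dplyr 1.0.0 or later, where across() is available, and uses {{ }} in place of enquo() and !!. The name test3_across is hypothetical, and this is only one way to write it, not a replacement for the approach above.

```r
library(dplyr)

test3_across <- function(dataset, groupby_vars, computemean_vars, computestd_vars) {
  dataset %>%
    group_by({{ groupby_vars }}) %>%
    summarise(
      across({{ computemean_vars }}, mean),   # means for the first set of columns
      across({{ computestd_vars }}, sd)       # standard deviations for the second set
    )
}

res <- mtcars %>%
  test3_across(groupby_vars = cyl, disp:drat, wt:carb)
res
```

Because everything happens in one grouped summary, no join (and no join message) is involved. The two column sets should not overlap, since each column can only be summarized once under its own name.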

Additional uses of tidy evaluation

Tidy evaluation is very useful when building flexible functions, but also when using the ggplot2 syntax within functions, and even more so in Shiny applications, where input parameters need to be passed on as grouping or plotting parameters.

However, this is a topic for a different post.

Conclusions

Tidy evaluation gives you powerful tools: it offers a great degree of flexibility, but it’s a bit tricky to master.

My suggestion is that if you’re trying to master tidy evaluation, just think about your use case: which of the three variations presented in this post does it most resemble?

Work your way up, from the simplest version (if it works for you) to the most complex (but most flexible) one.

