Programming with dplyr


This post is published on Medium and available as an Rmd notebook.

With dplyr version 1.0, there are new ways to write functions that wrap dplyr verbs. The "Programming with dplyr" vignette in the package documentation is the best reference.

If you're familiar with using sym to convert from standard to non-standard evaluation, the following progression should show you how to replace (and extend) your code. It should be mostly find-and-replace! If you have a function that takes a character vector and uses an *_at verb, compare the old approach (option 1 below, with *_at) with the "Super version" at the bottom to see what changes.

Old option 1: use *_at verbs

This is the old way of writing a dplyr function:

max_by_at <- function(data, var, by="") {
  data %>%
    group_by_at(by) %>%
    summarise_at(var, max, na.rm = TRUE)
}

Let's try it out:

starwars %>% max_by_at("height", by="gender")
starwars %>% max_by_at(c("height", "mass"), by="gender")
starwars %>% max_by_at(c("height", "mass"), by=c("sex", "gender"))

That works well for strings, but it fails for bare (unquoted) column names:

testthat::expect_error(starwars %>% max_by_at(height, by=gender))
testthat::expect_error(starwars %>% max_by_at("height", by=gender))
testthat::expect_error(starwars %>% max_by_at(height, by="gender"))
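To see why these calls fail, it can help to look at the error itself. The following is a minimal sketch that repeats max_by_at from above and captures the condition with tryCatch; the variable msg is only an illustrative name. The *_at verbs evaluate var as an ordinary R argument, so a bare height is looked up as a regular object, which does not exist outside the data frame.

```r
library(dplyr)

# Same definition as in the post, repeated so the sketch is self-contained
max_by_at <- function(data, var, by = "") {
  data %>%
    group_by_at(by) %>%
    summarise_at(var, max, na.rm = TRUE)
}

# `height` is evaluated as a normal R object, not as a column of `starwars`,
# so the call fails; we capture the message instead of stopping the script.
msg <- tryCatch(
  {
    starwars %>% max_by_at(height, by = "gender")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```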

Old(ish) option 2: use across

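across() is the replacement for the *_at verbs and covers the same ground: it works for strings and character vectors, but not for bare column names. Wrapping the character vectors in all_of() tells tidyselect explicitly that they hold column names (passing an external vector directly was deprecated in tidyselect 1.1.0), and a formula passes na.rm without relying on across()'s extra arguments, which are deprecated as of dplyr 1.1.0:

max_by_across <- function(data, var, by = character()) {
  data %>%
    # all_of() marks `by` and `var` as character vectors of column names
    group_by(across(all_of(by))) %>%
    summarise(across(all_of(var), ~ max(.x, na.rm = TRUE)), .groups = "keep")
}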
starwars %>% max_by_across("height", by="gender")
starwars %>% max_by_across(c("height", "mass"), by="gender")
starwars %>% max_by_across(c("height", "mass"), by=c("sex", "gender"))
testthat::expect_error(starwars %>% max_by_across(height, by=gender))
testthat::expect_error(starwars %>% max_by_across("height", by=gender))
testthat::expect_error(starwars %>% max_by_across(height, by="gender"))

Old option 3: convert strings to symbols with sym

max_by_1 <- function(data, var, by="") {
  data %>%
    group_by(!!sym(by)) %>%
    summarise(maximum = max(!!sym(var), na.rm = TRUE))
}

It doesn't work for bare column names:

testthat::expect_error(starwars %>% max_by_1(height))
testthat::expect_error(starwars %>% max_by_1(height, by=gender))

It does work for strings:

starwars %>% max_by_1("height")
starwars %>% max_by_1("height", by="gender")

But it doesn't work for character vectors with more than one element (so it's less general than across):

testthat::expect_error(starwars %>% max_by_1(c("height", "mass")))
testthat::expect_error(starwars %>% max_by_1("height", by=c("gender", "sex")))
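For completeness, the sym approach can be stretched to several grouping columns with syms and the splice operator !!!. This is only a sketch, not part of the original progression: max_by_syms is a hypothetical name, and the by = character() default is an assumption made so that an empty grouping splices nothing.

```r
library(dplyr)
library(rlang)  # sym() and syms() are defined in rlang

# Hypothetical variant of max_by_1 that accepts several grouping columns
max_by_syms <- function(data, var, by = character()) {
  data %>%
    # syms() turns each string into a symbol; !!! splices them as separate arguments
    group_by(!!!syms(by)) %>%
    summarise(maximum = max(!!sym(var), na.rm = TRUE), .groups = "drop")
}

res <- starwars %>% max_by_syms("height", by = c("gender", "sex"))
print(res)
```

This still only accepts strings, and var is still limited to a single column.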

Better with braces

Check out this improved version!

It works for bare column names, so we can call it like any dplyr function that uses non-standard evaluation, and we can still pass in strings by converting them with sym.

max_by_2 <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

It does work for bare column names, which is pretty cool:

starwars %>% max_by_2(height)
starwars %>% max_by_2(height, by=gender)

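It does not work for strings out of the box. These calls run without an error, but each string is treated as a constant value rather than as a column name: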

starwars %>% max_by_2("height")
starwars %>% max_by_2("height", by="gender")
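A quick way to see what goes wrong is to inspect the result. The sketch below repeats max_by_2 and stores the output in res (an illustrative name). Because "height" is a plain string inside max(), the "maximum" is just that string, and the data collapses to a single group.

```r
library(dplyr)

# Same definition as in the post, repeated so the sketch is self-contained
max_by_2 <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

# Both strings are embraced as constants, not looked up as columns
res <- starwars %>% max_by_2("height", by = "gender")
print(res)
```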

We can work around this with sym:

starwars %>% max_by_2(!!sym("height"))
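If a function should only ever take strings, the .data pronoun described in the programming vignette is another option that avoids sym altogether. The following is a sketch under that assumption; max_by_data is a hypothetical name and not one of the versions in this post.

```r
library(dplyr)

# Hypothetical string-only variant: .data[[...]] looks up a column by name
max_by_data <- function(data, var, by) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(maximum = max(.data[[var]], na.rm = TRUE))
}

res <- starwars %>% max_by_data("height", by = "gender")
print(res)
```

Like sym, this handles one column name at a time.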

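It does not work for vectors of column names. The first call runs but returns a single maximum pooled over both columns, and the second raises an error: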

starwars %>% max_by_2(c(height, mass))
testthat::expect_error(starwars %>% max_by_2(height, by=c(gender, sex)))
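The pooled result is easy to miss because nothing fails. As a sketch, with max_by_2 repeated and res as an illustrative name, the call below groups by gender and shows that c(height, mass) is concatenated into one long vector before max() is applied, so there is a single maximum column instead of one per input column.

```r
library(dplyr)

# Same definition as in the post, repeated so the sketch is self-contained
max_by_2 <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

# c(height, mass) is evaluated inside max(), pooling both columns per group
res <- starwars %>% max_by_2(c(height, mass), by = gender)
print(res)
```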

Super version

We'll use across() to allow strings, vectors of bare column names, and even character vectors. The default for by becomes an empty vector, c(), and we simply wrap the {{ }} with across():

max_by_3 <- function(data, var, by = c()) {
  data %>%
    group_by(across({{ by }})) %>%
    # A formula passes na.rm; across()'s extra arguments are deprecated as of dplyr 1.1.0
    summarise(across({{ var }}, ~ max(.x, na.rm = TRUE), .names = "max_{.col}"), .groups = "keep")
}

It works for bare column names:

starwars %>% max_by_3(height)
starwars %>% max_by_3(height, by=gender)

It works for strings:

starwars %>% max_by_3("height")
starwars %>% max_by_3("height", by="gender")

It works for vectors of bare column names:

starwars %>% max_by_3(c(height, mass))
starwars %>% max_by_3(height, by=c(gender, sex))
starwars %>% max_by_3(c(height, mass), by=c(gender, sex))

It works for character vectors, including mixes of strings and bare names:

starwars %>% max_by_3(c("height", "mass"))
starwars %>% max_by_3("height", by=c(gender, sex))
starwars %>% max_by_3(c("height", "mass"), by=c("gender", "sex"))
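As a sanity check, the bare-name and string interfaces should give the same answer. The sketch below restates max_by_3, here written with a formula for na.rm, and compares the two results; bare and strings are illustrative names.

```r
library(dplyr)

# max_by_3 restated so the sketch is self-contained
max_by_3 <- function(data, var, by = c()) {
  data %>%
    group_by(across({{ by }})) %>%
    summarise(
      across({{ var }}, ~ max(.x, na.rm = TRUE), .names = "max_{.col}"),
      .groups = "keep"
    )
}

bare    <- starwars %>% max_by_3(c(height, mass), by = c(gender, sex))
strings <- starwars %>% max_by_3(c("height", "mass"), by = c("gender", "sex"))

# Same columns and same values, whichever way the columns were named
print(names(bare))
print(identical(bare$max_height, strings$max_height))
```

Groups in which every value is missing produce -Inf along with a warning from max(), which is expected behaviour for na.rm = TRUE on an empty input.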

Now you've seen how to write some very flexible functions using the new powers of dplyr programming. Enjoy!