Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2022-08-08

Using across() to create multiple columns

Nicholas Tierney

Categories: rstats functions Tags: rstats functions across tidyverse mutate

10 minute read

pigeons across the drain, Nick Tierney, film, olympus XA

A colleague of mine the other day had a question along the lines of:

How do I add multiple columns that give the ranks of values in corresponding columns

And I ended up cooking up a really fun example of using across from dplyr. I thought it would be fun to share!

Let’s give a little more detail.

Load up the tidyverse.

library(tidyverse)#> ── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──#>  ggplot2 3.3.6      purrr   0.3.4
#>  tibble  3.1.7      dplyr   1.0.9
#>  tidyr   1.2.0      stringr 1.4.0
#>  readr   2.1.2      forcats 0.5.1#> ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()

The data they had referred to concentrations of a quantity, that they wanted to rank. We can create these like so:

dat <- tibble(
  x_1 = runif(10),
  x_2 = runif(10),
  x_3 = runif(10)
)

dat#> # A tibble: 10 × 3
#>       x_1   x_2   x_3
#>     <dbl> <dbl> <dbl>
#>  1 0.0712 0.784 0.875
#>  2 0.295  0.734 0.185
#>  3 0.385  0.844 0.690
#>  4 0.648  0.480 0.807
#>  5 0.272  0.323 0.868
#>  6 0.554  0.590 0.859
#>  7 0.421  0.270 0.433
#>  8 0.724  0.732 0.870
#>  9 0.654  0.973 0.606
#> 10 0.159  0.782 0.720

Or for fun, I did this:

vals <- runif(30)

dat <- as_tibble(
  x = matrix(vals, nrow = 10),
  .name_repair = janitor::make_clean_names
) %>%
  rename(
    x_1 = x
  )

dat#> # A tibble: 10 × 3
#>       x_1    x_2   x_3
#>     <dbl>  <dbl> <dbl>
#>  1 0.927  0.661  0.454
#>  2 0.792  0.0989 0.970
#>  3 0.346  0.0544 0.278
#>  4 0.634  0.989  0.360
#>  5 0.0605 0.741  0.675
#>  6 0.373  0.729  0.581
#>  7 0.0359 0.429  0.151
#>  8 0.819  0.277  0.436
#>  9 0.242  0.308  0.452
#> 10 0.0654 0.292  0.620

An aside: prefer dat over df. I’ve been burned recently by using df as my go-to name for data frames, but it turns out that df is actually a function in R, for calculating the density of an F distribution. Which is awesome, but sometimes leads to funny errors that you might not expect.

OK, so what my friend wanted was something like this:

dat %>%  
  mutate(
    x_1_rank = rank(x_1),
    x_2_rank = rank(x_2),
    x_3_rank = rank(x_3)
  )#> # A tibble: 10 × 6
#>       x_1    x_2   x_3 x_1_rank x_2_rank x_3_rank
#>     <dbl>  <dbl> <dbl>    <dbl>    <dbl>    <dbl>
#>  1 0.927  0.661  0.454       10        7        6
#>  2 0.792  0.0989 0.970        8        2       10
#>  3 0.346  0.0544 0.278        5        1        2
#>  4 0.634  0.989  0.360        7       10        3
#>  5 0.0605 0.741  0.675        2        9        9
#>  6 0.373  0.729  0.581        6        8        7
#>  7 0.0359 0.429  0.151        1        6        1
#>  8 0.819  0.277  0.436        9        3        4
#>  9 0.242  0.308  0.452        4        5        5
#> 10 0.0654 0.292  0.620        3        4        8

Now, that’s relatively easy. But they actually have a lot of these concentration columns - sometimes just 10, other times many times that number.

The reason you might want to avoid writing this out each time is because it will involve a lot of reptition, where you might accidentally copy something twice or not increment the numbers. E.g.,

dat %>%  
  mutate(
    x_1_rank = rank(x_1),
    x_2_rank = rank(x_2),
    x_3_rank = rank(x_2)
  )#> # A tibble: 10 × 6
#>       x_1    x_2   x_3 x_1_rank x_2_rank x_3_rank
#>     <dbl>  <dbl> <dbl>    <dbl>    <dbl>    <dbl>
#>  1 0.927  0.661  0.454       10        7        7
#>  2 0.792  0.0989 0.970        8        2        2
#>  3 0.346  0.0544 0.278        5        1        1
#>  4 0.634  0.989  0.360        7       10       10
#>  5 0.0605 0.741  0.675        2        9        9
#>  6 0.373  0.729  0.581        6        8        8
#>  7 0.0359 0.429  0.151        1        6        6
#>  8 0.819  0.277  0.436        9        3        3
#>  9 0.242  0.308  0.452        4        5        5
#> 10 0.0654 0.292  0.620        3        4        4

See the error? x_3_rank is actually from rank(x_2)

So the question is, can you generalise this type of column creation?

Yes! Yes we can. With across. It looks like this.

dat %>% 
  mutate(
    across(
      .cols = everything(),
      .fns = rank
    )
  )#> # A tibble: 10 × 3
#>      x_1   x_2   x_3
#>    <dbl> <dbl> <dbl>
#>  1    10     7     6
#>  2     8     2    10
#>  3     5     1     2
#>  4     7    10     3
#>  5     2     9     9
#>  6     6     8     7
#>  7     1     6     1
#>  8     9     3     4
#>  9     4     5     5
#> 10     3     4     8

This says:

mutate across all columns, and use the function, rank, on each one.

Uh, but we want the other columns preserved…how do we do that?

With .names - we specify a special pattern on how we want to name the columns with .names = "{.col}_rank", which says: “call it the name of the column, and then add”_rank” to it.

dat %>% 
  mutate(
    across(
      .cols = everything(),
      .fns = rank,
      .names = "{.col}_rank"
    )
  )#> # A tibble: 10 × 6
#>       x_1    x_2   x_3 x_1_rank x_2_rank x_3_rank
#>     <dbl>  <dbl> <dbl>    <dbl>    <dbl>    <dbl>
#>  1 0.927  0.661  0.454       10        7        6
#>  2 0.792  0.0989 0.970        8        2       10
#>  3 0.346  0.0544 0.278        5        1        2
#>  4 0.634  0.989  0.360        7       10        3
#>  5 0.0605 0.741  0.675        2        9        9
#>  6 0.373  0.729  0.581        6        8        7
#>  7 0.0359 0.429  0.151        1        6        1
#>  8 0.819  0.277  0.436        9        3        4
#>  9 0.242  0.308  0.452        4        5        5
#> 10 0.0654 0.292  0.620        3        4        8

OK, but what if we’ve got data that contains other columns we don’t want to apply rank to - say, some data that looks like this:

dat_more <- dat %>% 
  rowid_to_column(var = "id") %>% 
  mutate(code = sample(LETTERS, n()),
         .after = id)

dat_more#> # A tibble: 10 × 5
#>       id code     x_1    x_2   x_3
#>    <int> <chr>  <dbl>  <dbl> <dbl>
#>  1     1 Q     0.927  0.661  0.454
#>  2     2 R     0.792  0.0989 0.970
#>  3     3 Y     0.346  0.0544 0.278
#>  4     4 B     0.634  0.989  0.360
#>  5     5 T     0.0605 0.741  0.675
#>  6     6 A     0.373  0.729  0.581
#>  7     7 M     0.0359 0.429  0.151
#>  8     8 K     0.819  0.277  0.436
#>  9     9 C     0.242  0.308  0.452
#> 10    10 G     0.0654 0.292  0.620

We can tell it which columns to pay attention to, or even avoid, like so:

# pay attention to "x_"
dat_more_rank <- dat_more %>% 
  mutate(
    across(
      .cols = starts_with("x_"),
      .fns = rank,
      .names = "{.col}_rank"
    )
  )

dat_more_rank#> # A tibble: 10 × 8
#>       id code     x_1    x_2   x_3 x_1_rank x_2_rank x_3_rank
#>    <int> <chr>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>    <dbl>
#>  1     1 Q     0.927  0.661  0.454       10        7        6
#>  2     2 R     0.792  0.0989 0.970        8        2       10
#>  3     3 Y     0.346  0.0544 0.278        5        1        2
#>  4     4 B     0.634  0.989  0.360        7       10        3
#>  5     5 T     0.0605 0.741  0.675        2        9        9
#>  6     6 A     0.373  0.729  0.581        6        8        7
#>  7     7 M     0.0359 0.429  0.151        1        6        1
#>  8     8 K     0.819  0.277  0.436        9        3        4
#>  9     9 C     0.242  0.308  0.452        4        5        5
#> 10    10 G     0.0654 0.292  0.620        3        4        8
# do it to everything EXCEPT
dat_more %>% 
  mutate(
    across(
      .cols = -c(id, code),
      .fns = rank,
      .names = "{.col}_rank"
    )
  )#> # A tibble: 10 × 8
#>       id code     x_1    x_2   x_3 x_1_rank x_2_rank x_3_rank
#>    <int> <chr>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>    <dbl>
#>  1     1 Q     0.927  0.661  0.454       10        7        6
#>  2     2 R     0.792  0.0989 0.970        8        2       10
#>  3     3 Y     0.346  0.0544 0.278        5        1        2
#>  4     4 B     0.634  0.989  0.360        7       10        3
#>  5     5 T     0.0605 0.741  0.675        2        9        9
#>  6     6 A     0.373  0.729  0.581        6        8        7
#>  7     7 M     0.0359 0.429  0.151        1        6        1
#>  8     8 K     0.819  0.277  0.436        9        3        4
#>  9     9 C     0.242  0.308  0.452        4        5        5
#> 10    10 G     0.0654 0.292  0.620        3        4        8

OK, but can we make this look a little bit nicer by having the columns be like x_1, then x_1_rank and so on?

Sure! We can take advantage of the fact that we’ve got a pretty systematic naming scheme in this case.

new_col_order <- sort(names(dat_more_rank))

new_col_order#> [1] "code"     "id"       "x_1"      "x_1_rank" "x_2"      "x_2_rank" "x_3"     
#> [8] "x_3_rank"

We can use relocate to specify a new column order:

dat_more_rank %>% 
  relocate(
    id,
    code,
    new_col_order
  )#> Note: Using an external vector in selections is ambiguous.
#>  Use `all_of(new_col_order)` instead of `new_col_order` to silence this message.
#>  See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.#> # A tibble: 10 × 8
#>       id code     x_1 x_1_rank    x_2 x_2_rank   x_3 x_3_rank
#>    <int> <chr>  <dbl>    <dbl>  <dbl>    <dbl> <dbl>    <dbl>
#>  1     1 Q     0.927        10 0.661         7 0.454        6
#>  2     2 R     0.792         8 0.0989        2 0.970       10
#>  3     3 Y     0.346         5 0.0544        1 0.278        2
#>  4     4 B     0.634         7 0.989        10 0.360        3
#>  5     5 T     0.0605        2 0.741         9 0.675        9
#>  6     6 A     0.373         6 0.729         8 0.581        7
#>  7     7 M     0.0359        1 0.429         6 0.151        1
#>  8     8 K     0.819         9 0.277         3 0.436        4
#>  9     9 C     0.242         4 0.308         5 0.452        5
#> 10    10 G     0.0654        3 0.292         4 0.620        8

Huh, apparently we need to use all_of, since using an external vector is ambiguous! Right you are, tidyverse team.

dat_more_rank %>% 
  relocate(
    id,
    code,
    all_of(new_col_order)
  )#> # A tibble: 10 × 8
#>       id code     x_1 x_1_rank    x_2 x_2_rank   x_3 x_3_rank
#>    <int> <chr>  <dbl>    <dbl>  <dbl>    <dbl> <dbl>    <dbl>
#>  1     1 Q     0.927        10 0.661         7 0.454        6
#>  2     2 R     0.792         8 0.0989        2 0.970       10
#>  3     3 Y     0.346         5 0.0544        1 0.278        2
#>  4     4 B     0.634         7 0.989        10 0.360        3
#>  5     5 T     0.0605        2 0.741         9 0.675        9
#>  6     6 A     0.373         6 0.729         8 0.581        7
#>  7     7 M     0.0359        1 0.429         6 0.151        1
#>  8     8 K     0.819         9 0.277         3 0.436        4
#>  9     9 C     0.242         4 0.308         5 0.452        5
#> 10    10 G     0.0654        3 0.292         4 0.620        8

Or alternatively:

dat_more_rank %>% 
  relocate(
    id,
    code,
    sort(names(.))
  )#> # A tibble: 10 × 8
#>       id code     x_1 x_1_rank    x_2 x_2_rank   x_3 x_3_rank
#>    <int> <chr>  <dbl>    <dbl>  <dbl>    <dbl> <dbl>    <dbl>
#>  1     1 Q     0.927        10 0.661         7 0.454        6
#>  2     2 R     0.792         8 0.0989        2 0.970       10
#>  3     3 Y     0.346         5 0.0544        1 0.278        2
#>  4     4 B     0.634         7 0.989        10 0.360        3
#>  5     5 T     0.0605        2 0.741         9 0.675        9
#>  6     6 A     0.373         6 0.729         8 0.581        7
#>  7     7 M     0.0359        1 0.429         6 0.151        1
#>  8     8 K     0.819         9 0.277         3 0.436        4
#>  9     9 C     0.242         4 0.308         5 0.452        5
#> 10    10 G     0.0654        3 0.292         4 0.620        8

relocate is a relatively new operation, it just exists to relocate existing columns, and won’t remove other columns. In the past I would have done something like:

dat_more_rank %>% 
  select(
    id,
    code,
    all_of(new_col_order),
    everything()
  )#> # A tibble: 10 × 8
#>       id code     x_1 x_1_rank    x_2 x_2_rank   x_3 x_3_rank
#>    <int> <chr>  <dbl>    <dbl>  <dbl>    <dbl> <dbl>    <dbl>
#>  1     1 Q     0.927        10 0.661         7 0.454        6
#>  2     2 R     0.792         8 0.0989        2 0.970       10
#>  3     3 Y     0.346         5 0.0544        1 0.278        2
#>  4     4 B     0.634         7 0.989        10 0.360        3
#>  5     5 T     0.0605        2 0.741         9 0.675        9
#>  6     6 A     0.373         6 0.729         8 0.581        7
#>  7     7 M     0.0359        1 0.429         6 0.151        1
#>  8     8 K     0.819         9 0.277         3 0.436        4
#>  9     9 C     0.242         4 0.308         5 0.452        5
#> 10    10 G     0.0654        3 0.292         4 0.620        8

With that everything part there being to capture any columns I forgot about, because select will only keep the specified columns.

There are many ways we can do these kinds of things, but to me this felt like a nice fun example.

End

Anywho, that’s a bit of tidyverse magic, maybe it’ll be useful for you!