pigeons across the drain, Nick Tierney, film, olympus XA
A colleague of mine the other day had a question along the lines of:
How do I add multiple columns that give the ranks of values in corresponding columns
And I ended up cooking up a really fun example of using across
from dplyr
. I thought it would be fun to share!
Let’s give a little more detail.
Load up the tidyverse.
library(tidyverse)#> ── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──#> ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
#> ✔ tibble 3.1.7 ✔ dplyr 1.0.9
#> ✔ tidyr 1.2.0 ✔ stringr 1.4.0
#> ✔ readr 2.1.2 ✔ forcats 0.5.1#> ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
The data they had referred to concentrations of a quantity, that they wanted to rank. We can create these like so:
dat <- tibble(
x_1 = runif(10),
x_2 = runif(10),
x_3 = runif(10)
)
dat#> # A tibble: 10 × 3
#> x_1 x_2 x_3
#> <dbl> <dbl> <dbl>
#> 1 0.0712 0.784 0.875
#> 2 0.295 0.734 0.185
#> 3 0.385 0.844 0.690
#> 4 0.648 0.480 0.807
#> 5 0.272 0.323 0.868
#> 6 0.554 0.590 0.859
#> 7 0.421 0.270 0.433
#> 8 0.724 0.732 0.870
#> 9 0.654 0.973 0.606
#> 10 0.159 0.782 0.720
Or for fun, I did this:
vals <- runif(30)
dat <- as_tibble(
x = matrix(vals, nrow = 10),
.name_repair = janitor::make_clean_names
) %>%
rename(
x_1 = x
)
dat#> # A tibble: 10 × 3
#> x_1 x_2 x_3
#> <dbl> <dbl> <dbl>
#> 1 0.927 0.661 0.454
#> 2 0.792 0.0989 0.970
#> 3 0.346 0.0544 0.278
#> 4 0.634 0.989 0.360
#> 5 0.0605 0.741 0.675
#> 6 0.373 0.729 0.581
#> 7 0.0359 0.429 0.151
#> 8 0.819 0.277 0.436
#> 9 0.242 0.308 0.452
#> 10 0.0654 0.292 0.620
An aside: prefer
dat
overdf
. I’ve been burned recently by usingdf
as my go-to name for data frames, but it turns out thatdf
is actually a function in R, for calculating the density of an F distribution. Which is awesome, but sometimes leads to funny errors that you might not expect.
OK, so what my friend wanted was something like this:
dat %>%
mutate(
x_1_rank = rank(x_1),
x_2_rank = rank(x_2),
x_3_rank = rank(x_3)
)#> # A tibble: 10 × 6
#> x_1 x_2 x_3 x_1_rank x_2_rank x_3_rank
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.927 0.661 0.454 10 7 6
#> 2 0.792 0.0989 0.970 8 2 10
#> 3 0.346 0.0544 0.278 5 1 2
#> 4 0.634 0.989 0.360 7 10 3
#> 5 0.0605 0.741 0.675 2 9 9
#> 6 0.373 0.729 0.581 6 8 7
#> 7 0.0359 0.429 0.151 1 6 1
#> 8 0.819 0.277 0.436 9 3 4
#> 9 0.242 0.308 0.452 4 5 5
#> 10 0.0654 0.292 0.620 3 4 8
Now, that’s relatively easy. But they actually have a lot of these concentration columns - sometimes just 10, other times many times that number.
The reason you might want to avoid writing this out each time is because it will involve a lot of reptition, where you might accidentally copy something twice or not increment the numbers. E.g.,
dat %>%
mutate(
x_1_rank = rank(x_1),
x_2_rank = rank(x_2),
x_3_rank = rank(x_2)
)#> # A tibble: 10 × 6
#> x_1 x_2 x_3 x_1_rank x_2_rank x_3_rank
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.927 0.661 0.454 10 7 7
#> 2 0.792 0.0989 0.970 8 2 2
#> 3 0.346 0.0544 0.278 5 1 1
#> 4 0.634 0.989 0.360 7 10 10
#> 5 0.0605 0.741 0.675 2 9 9
#> 6 0.373 0.729 0.581 6 8 8
#> 7 0.0359 0.429 0.151 1 6 6
#> 8 0.819 0.277 0.436 9 3 3
#> 9 0.242 0.308 0.452 4 5 5
#> 10 0.0654 0.292 0.620 3 4 4
See the error? x_3_rank
is actually from rank(x_2)
So the question is, can you generalise this type of column creation?
Yes! Yes we can. With across
. It looks like this.
dat %>%
mutate(
across(
.cols = everything(),
.fns = rank
)
)#> # A tibble: 10 × 3
#> x_1 x_2 x_3
#> <dbl> <dbl> <dbl>
#> 1 10 7 6
#> 2 8 2 10
#> 3 5 1 2
#> 4 7 10 3
#> 5 2 9 9
#> 6 6 8 7
#> 7 1 6 1
#> 8 9 3 4
#> 9 4 5 5
#> 10 3 4 8
This says:
mutate across all columns, and use the function,
rank
, on each one.
Uh, but we want the other columns preserved…how do we do that?
With .names
- we specify a special pattern on how we want to name the columns with .names = "{.col}_rank"
, which says: “call it the name of the column, and then add”_rank” to it.
dat %>%
mutate(
across(
.cols = everything(),
.fns = rank,
.names = "{.col}_rank"
)
)#> # A tibble: 10 × 6
#> x_1 x_2 x_3 x_1_rank x_2_rank x_3_rank
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.927 0.661 0.454 10 7 6
#> 2 0.792 0.0989 0.970 8 2 10
#> 3 0.346 0.0544 0.278 5 1 2
#> 4 0.634 0.989 0.360 7 10 3
#> 5 0.0605 0.741 0.675 2 9 9
#> 6 0.373 0.729 0.581 6 8 7
#> 7 0.0359 0.429 0.151 1 6 1
#> 8 0.819 0.277 0.436 9 3 4
#> 9 0.242 0.308 0.452 4 5 5
#> 10 0.0654 0.292 0.620 3 4 8
OK, but what if we’ve got data that contains other columns we don’t want to apply rank
to - say, some data that looks like this:
dat_more <- dat %>%
rowid_to_column(var = "id") %>%
mutate(code = sample(LETTERS, n()),
.after = id)
dat_more#> # A tibble: 10 × 5
#> id code x_1 x_2 x_3
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 0.661 0.454
#> 2 2 R 0.792 0.0989 0.970
#> 3 3 Y 0.346 0.0544 0.278
#> 4 4 B 0.634 0.989 0.360
#> 5 5 T 0.0605 0.741 0.675
#> 6 6 A 0.373 0.729 0.581
#> 7 7 M 0.0359 0.429 0.151
#> 8 8 K 0.819 0.277 0.436
#> 9 9 C 0.242 0.308 0.452
#> 10 10 G 0.0654 0.292 0.620
We can tell it which columns to pay attention to, or even avoid, like so:
# pay attention to "x_"
dat_more_rank <- dat_more %>%
mutate(
across(
.cols = starts_with("x_"),
.fns = rank,
.names = "{.col}_rank"
)
)
dat_more_rank#> # A tibble: 10 × 8
#> id code x_1 x_2 x_3 x_1_rank x_2_rank x_3_rank
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 0.661 0.454 10 7 6
#> 2 2 R 0.792 0.0989 0.970 8 2 10
#> 3 3 Y 0.346 0.0544 0.278 5 1 2
#> 4 4 B 0.634 0.989 0.360 7 10 3
#> 5 5 T 0.0605 0.741 0.675 2 9 9
#> 6 6 A 0.373 0.729 0.581 6 8 7
#> 7 7 M 0.0359 0.429 0.151 1 6 1
#> 8 8 K 0.819 0.277 0.436 9 3 4
#> 9 9 C 0.242 0.308 0.452 4 5 5
#> 10 10 G 0.0654 0.292 0.620 3 4 8
# do it to everything EXCEPT
dat_more %>%
mutate(
across(
.cols = -c(id, code),
.fns = rank,
.names = "{.col}_rank"
)
)#> # A tibble: 10 × 8
#> id code x_1 x_2 x_3 x_1_rank x_2_rank x_3_rank
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 0.661 0.454 10 7 6
#> 2 2 R 0.792 0.0989 0.970 8 2 10
#> 3 3 Y 0.346 0.0544 0.278 5 1 2
#> 4 4 B 0.634 0.989 0.360 7 10 3
#> 5 5 T 0.0605 0.741 0.675 2 9 9
#> 6 6 A 0.373 0.729 0.581 6 8 7
#> 7 7 M 0.0359 0.429 0.151 1 6 1
#> 8 8 K 0.819 0.277 0.436 9 3 4
#> 9 9 C 0.242 0.308 0.452 4 5 5
#> 10 10 G 0.0654 0.292 0.620 3 4 8
OK, but can we make this look a little bit nicer by having the columns be like x_1
, then x_1_rank
and so on?
Sure! We can take advantage of the fact that we’ve got a pretty systematic naming scheme in this case.
new_col_order <- sort(names(dat_more_rank))
new_col_order#> [1] "code" "id" "x_1" "x_1_rank" "x_2" "x_2_rank" "x_3"
#> [8] "x_3_rank"
We can use relocate
to specify a new column order:
dat_more_rank %>%
relocate(
id,
code,
new_col_order
)#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(new_col_order)` instead of `new_col_order` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.#> # A tibble: 10 × 8
#> id code x_1 x_1_rank x_2 x_2_rank x_3 x_3_rank
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 10 0.661 7 0.454 6
#> 2 2 R 0.792 8 0.0989 2 0.970 10
#> 3 3 Y 0.346 5 0.0544 1 0.278 2
#> 4 4 B 0.634 7 0.989 10 0.360 3
#> 5 5 T 0.0605 2 0.741 9 0.675 9
#> 6 6 A 0.373 6 0.729 8 0.581 7
#> 7 7 M 0.0359 1 0.429 6 0.151 1
#> 8 8 K 0.819 9 0.277 3 0.436 4
#> 9 9 C 0.242 4 0.308 5 0.452 5
#> 10 10 G 0.0654 3 0.292 4 0.620 8
Huh, apparently we need to use all_of
, since using an external vector is ambiguous! Right you are, tidyverse team.
dat_more_rank %>%
relocate(
id,
code,
all_of(new_col_order)
)#> # A tibble: 10 × 8
#> id code x_1 x_1_rank x_2 x_2_rank x_3 x_3_rank
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 10 0.661 7 0.454 6
#> 2 2 R 0.792 8 0.0989 2 0.970 10
#> 3 3 Y 0.346 5 0.0544 1 0.278 2
#> 4 4 B 0.634 7 0.989 10 0.360 3
#> 5 5 T 0.0605 2 0.741 9 0.675 9
#> 6 6 A 0.373 6 0.729 8 0.581 7
#> 7 7 M 0.0359 1 0.429 6 0.151 1
#> 8 8 K 0.819 9 0.277 3 0.436 4
#> 9 9 C 0.242 4 0.308 5 0.452 5
#> 10 10 G 0.0654 3 0.292 4 0.620 8
Or alternatively:
dat_more_rank %>%
relocate(
id,
code,
sort(names(.))
)#> # A tibble: 10 × 8
#> id code x_1 x_1_rank x_2 x_2_rank x_3 x_3_rank
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 10 0.661 7 0.454 6
#> 2 2 R 0.792 8 0.0989 2 0.970 10
#> 3 3 Y 0.346 5 0.0544 1 0.278 2
#> 4 4 B 0.634 7 0.989 10 0.360 3
#> 5 5 T 0.0605 2 0.741 9 0.675 9
#> 6 6 A 0.373 6 0.729 8 0.581 7
#> 7 7 M 0.0359 1 0.429 6 0.151 1
#> 8 8 K 0.819 9 0.277 3 0.436 4
#> 9 9 C 0.242 4 0.308 5 0.452 5
#> 10 10 G 0.0654 3 0.292 4 0.620 8
relocate
is a relatively new operation, it just exists to relocate existing columns, and won’t remove other columns. In the past I would have done something like:
dat_more_rank %>%
select(
id,
code,
all_of(new_col_order),
everything()
)#> # A tibble: 10 × 8
#> id code x_1 x_1_rank x_2 x_2_rank x_3 x_3_rank
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Q 0.927 10 0.661 7 0.454 6
#> 2 2 R 0.792 8 0.0989 2 0.970 10
#> 3 3 Y 0.346 5 0.0544 1 0.278 2
#> 4 4 B 0.634 7 0.989 10 0.360 3
#> 5 5 T 0.0605 2 0.741 9 0.675 9
#> 6 6 A 0.373 6 0.729 8 0.581 7
#> 7 7 M 0.0359 1 0.429 6 0.151 1
#> 8 8 K 0.819 9 0.277 3 0.436 4
#> 9 9 C 0.242 4 0.308 5 0.452 5
#> 10 10 G 0.0654 3 0.292 4 0.620 8
With that everything
part there being to capture any columns I forgot about, because select
will only keep the specified columns.
There are many ways we can do these kinds of things, but to me this felt like a nice fun example.
End
Anywho, that’s a bit of tidyverse magic, maybe it’ll be useful for you!