Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2020-06-08

Just Quickly: Rowwise Median in dplyr

Nicholas Tierney

Categories: rstats Tags: rstats dplyr

3 minute read

The tidyverse team recently completed a 1.0.0 release for dplyr, which was a pretty big deal, and it included a bunch of new features. One of the things that I really enjoyed was that they wrote a series of blog posts describing new features in the release. This was great, because we got to see what was coming up, and great because people tried them out and gave them feedback. Then, the tidyverse listened, and changed behaviour based on feedback from the community.

Isn’t that great?

Let’s celebrate something from the tidyverse today: rowwise. This function has actually been around for a while, but I never really used it, for some reason. A student recently had an issue where they had data like this:

library(tibble)
income <- tibble(income_range = c("0-74",
                                  "75-145",
                                  "150-325",
                                  "325+"),
                 count = c(125,
                           170, 
                           215,
                           250))

income
#> # A tibble: 4 x 2
#>   income_range count
#>   <chr>        <dbl>
#> 1 0-74           125
#> 2 75-145         170
#> 3 150-325        215
#> 4 325+           250

They wanted to calculate the median of income_range.

This presents an interesting problem, with a few steps:

  1. Separate the range values into two columns.
  2. Calculate the median of each of those pairs of numbers.

We can get ourselves into a better position by separating out income_range into two columns, lower, and upper, and converting the contents. We can use separate from tidyr. It is kind of magical. While you can specify a specific thing that separates the numbers, separate has a nice bit og magic that just finds the most likely character to separate on.

library(tidyr)
income_sep <- income %>% 
  separate(col = income_range,
           into = c("lower", "upper"),
           convert = TRUE)
income_sep
#> # A tibble: 4 x 3
#>   lower upper count
#>   <int> <int> <dbl>
#> 1     0    74   125
#> 2    75   145   170
#> 3   150   325   215
#> 4   325    NA   250

So now have a lower and an upper range of values, and we want to calculate the median of these.

This…gets a little bit tricky.

At first instinct, you might try something like this:

calculate the median based on the lower and upper columns:

library(dplyr, warn.conflicts = FALSE)
income_sep %>% 
  mutate(med = median(c(lower, upper), na.rm = TRUE))
#> # A tibble: 4 x 4
#>   lower upper count   med
#>   <int> <int> <dbl> <int>
#> 1     0    74   125   145
#> 2    75   145   170   145
#> 3   150   325   215   145
#> 4   325    NA   250   145

But this doesn’t give us what we want. It just gives us the median of the vector, I think?

Anyway, how do we solve this?

We can now call rowwise() and calculate the median based on the lower and upper, and it will consider each row and take the median of those two numbers:

income_sep %>% 
  rowwise() %>% 
  mutate(med = median(c(lower, upper), na.rm = TRUE))
#> # A tibble: 4 x 4
#> # Rowwise: 
#>   lower upper count   med
#>   <int> <int> <dbl> <dbl>
#> 1     0    74   125   37 
#> 2    75   145   170  110 
#> 3   150   325   215  238.
#> 4   325    NA   250  325

And I think that’s pretty neat.

Thanks, tidyverse team!