Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2022-08-09

Pyramid Plots in ggplot2

Nicholas Tierney

Categories: data visualisation covid19 ggplot2 colour palette rstats Tags: covid19 data visualisation colour rstats pyramid ggplot2 functions

15 minute read

Two Abandoned Trolleys, Nick Tierney, Film, Olympus XA

I recently had to make some pyramid plots in R. They are a useful way to compare age structures of populations. They look like this:

Age population structure of Sydney vs Melbourne

It’s been a while since I had to make them, and in the past I’ve cooked them up in a relatively bespoke way. This time I needed to be able to have some level of programming to them - I wanted to be able to provide any two LGAs in Australia and then make a pyramid plot of them that looked nice.

I started by looking at this SO thread - overall this gave me what I wanted, but I thought I’d walk through my solution, as it is a little different, and might hopefully be useful to others.

The data

The data is comes from the Australian Bureau of Statistics, which is then cleaned up, and packaged in the conmat R package, which I maintain. Before I show that, I’ll load the packages I need, tidyverse , and conmat:

library(tidyverse)#> ── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──#>  ggplot2 3.3.6      purrr   0.3.4
#>  tibble  3.1.7      dplyr   1.0.9
#>  tidyr   1.2.0      stringr 1.4.0
#>  readr   2.1.2      forcats 0.5.1#> ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()library(conmat)#> Warning: package 'conmat' was built under R version 4.2.1

You can access the age structured population for a given LGA like so:

# |label: LGA-hobart
abs_age_lga("Hobart (C)")#> # A tibble: 18 × 4
#>    lga        lower.age.limit  year population
#>    <chr>                <dbl> <dbl>      <dbl>
#>  1 Hobart (C)               0  2020       2442
#>  2 Hobart (C)               5  2020       2833
#>  3 Hobart (C)              10  2020       2771
#>  4 Hobart (C)              15  2020       3038
#>  5 Hobart (C)              20  2020       4982
#>  6 Hobart (C)              25  2020       5132
#>  7 Hobart (C)              30  2020       4196
#>  8 Hobart (C)              35  2020       3510
#>  9 Hobart (C)              40  2020       3214
#> 10 Hobart (C)              45  2020       3428
#> 11 Hobart (C)              50  2020       3202
#> 12 Hobart (C)              55  2020       3381
#> 13 Hobart (C)              60  2020       3291
#> 14 Hobart (C)              65  2020       3036
#> 15 Hobart (C)              70  2020       2512
#> 16 Hobart (C)              75  2020       1721
#> 17 Hobart (C)              80  2020       1189
#> 18 Hobart (C)              85  2020       1372

In our case, we want to combine two LGAs and compare them. So let’s write a little helper function, two_abs_age_lga:

two_abs_age_lga <- function(lga_1, lga_2){
  bind_rows(
    abs_age_lga(lga_1),
    abs_age_lga(lga_2)
  )
}

So now we can get say, Melbourne and Sydney, like so:

melb_syd <- two_abs_age_lga("Melbourne (C)", "Sydney (C)")
melb_syd#> # A tibble: 36 × 4
#>    lga           lower.age.limit  year population
#>    <chr>                   <dbl> <dbl>      <dbl>
#>  1 Melbourne (C)               0  2020       4882
#>  2 Melbourne (C)               5  2020       3450
#>  3 Melbourne (C)              10  2020       2675
#>  4 Melbourne (C)              15  2020       9396
#>  5 Melbourne (C)              20  2020      38434
#>  6 Melbourne (C)              25  2020      38546
#>  7 Melbourne (C)              30  2020      25834
#>  8 Melbourne (C)              35  2020      15072
#>  9 Melbourne (C)              40  2020       8554
#> 10 Melbourne (C)              45  2020       6753
#> # … with 26 more rowstail(melb_syd)#> # A tibble: 6 × 4
#>   lga        lower.age.limit  year population
#>   <chr>                <dbl> <dbl>      <dbl>
#> 1 Sydney (C)              60  2020       8950
#> 2 Sydney (C)              65  2020       7377
#> 3 Sydney (C)              70  2020       6308
#> 4 Sydney (C)              75  2020       4002
#> 5 Sydney (C)              80  2020       2583
#> 6 Sydney (C)              85  2020       2506

The data wrangling

Now, to get the data into this plot, what we want is a plot of the population against each age group. To illustrate this I’ll just use Melbourne data:

melb <- abs_age_lga("Melbourne (C)")
ggplot(melb,
       aes(x = population,
           y = lower.age.limit,
           fill = lga)) + 
  geom_col()

Ack, we need the y variable to be a factor

ggplot(melb,
       aes(x = population,
           y = factor(lower.age.limit),
           fill = lga)) + 
  geom_col()

But what we actually want are two plots, one going the other way for the other group. We can make this happen by making the population negative:

syd <- abs_age_lga("Sydney (C)")
ggplot(syd,
       aes(x = -population,
           y = factor(lower.age.limit),
           fill = lga)) + 
  geom_col()

And then we want to combine those two plots - so we can write something bespoke, like this. Let’s also make that age limit data a factor while we are here

melb_syd_pyramid <- melb_syd %>% 
  mutate(
    population = case_when(
      lga == "Sydney (C)" ~ -population,
      TRUE ~ population
    ),
    lower.age.limit = as_factor(lower.age.limit)
  )

melb_syd_pyramid#> # A tibble: 36 × 4
#>    lga           lower.age.limit  year population
#>    <chr>         <fct>           <dbl>      <dbl>
#>  1 Melbourne (C) 0                2020       4882
#>  2 Melbourne (C) 5                2020       3450
#>  3 Melbourne (C) 10               2020       2675
#>  4 Melbourne (C) 15               2020       9396
#>  5 Melbourne (C) 20               2020      38434
#>  6 Melbourne (C) 25               2020      38546
#>  7 Melbourne (C) 30               2020      25834
#>  8 Melbourne (C) 35               2020      15072
#>  9 Melbourne (C) 40               2020       8554
#> 10 Melbourne (C) 45               2020       6753
#> # … with 26 more rows

So now we can get this:

ggplot(melb_syd_pyramid,
         aes(x = population,
             y = lower.age.limit,
             fill = lga)) +
    geom_col() 

And that’s pretty good!

Let’s tidy it up a little bit - we want to change the axis on the bottom to be positive in both directions, so we’ll need to specify the break points for the axis marks, as well as the labels. So we want a sequence from one end to the other, but for it to be symmetric. Let’s start by getting the range, by using the range function:

pop_range <- range(melb_syd_pyramid$population)
pop_range#> [1] -45686  38546

Now we want a sequence from there to the end value. We can use seq to make a sequence

pop_range_seq <- seq(pop_range[1], pop_range[2], by = 15000)
pop_range_seq#> [1] -45686 -30686 -15686   -686  14314  29314

Briefly, what we want to do is something like this:

ggplot(melb_syd_pyramid,
         aes(x = population,
             y = lower.age.limit,
             fill = lga)) +
    geom_col() +
    scale_x_continuous(breaks  = pop_range_seq,
                       labels = abs(pop_range_seq))

Ugh, but the numbers aren’t very nice. We want some nicer numbers. We can use a negative number in round to give us something slightly better.

round(pop_range, -2)#> [1] -45700  38500

But I’d like to round to the nearest 500…and we also need to have a 0 in there as well.

base::pretty()

Turns out this is a pretty common thing to do, and base R has us covered, with pretty()! It even includes 0!

pretty(pop_range)#> [1] -60000 -40000 -20000      0  20000  40000

There are loads of options, but we’ll just specify n to change the number of breaks

pop_range_breaks <- pretty(pop_range, n = 7)

So now we have something that looks pretty good

ggplot(melb_syd_pyramid,
         aes(x = population,
             y = lower.age.limit,
             fill = lga)) +
    geom_col() +
    scale_x_continuous(breaks  = pop_range_breaks,
                       labels = abs(pop_range_breaks))

OK, but a few more gripes:

scales::comma()

A really cool function! It takes number input and adds commas into them.

scales::comma(10)#> [1] "10"scales::comma(100)#> [1] "100"scales::comma(1000)#> [1] "1,000"scales::comma(1000000)#> [1] "1,000,000"

Let’s add that in, along with the changes to the legend, as well as an improved colour scale

ggplot(melb_syd_pyramid,
         aes(x = population,
             y = lower.age.limit,
             fill = lga)) +
    geom_col() +
    scale_x_continuous(breaks  = pop_range_breaks,
                       labels = scales::comma(abs(pop_range_breaks))) + 
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal() +
  theme(legend.position = "top") 

This doesn’t generalise to other data

Unfortunately, if I want to do this to another dataset, which doesn’t have Sydney and Melbourne, I’ll need to write custom code each time. We can do a little better, let’s write this as a function, which takes two inputs, the name of each of the LGAs we want to explore.

This involves a few steps - first, we need to modify the data as we’ve done above. Let’s call this a prep_pop_pyramid function. First let’s scratch out what it will do.

So we need to get the data.

melb_syd <- two_abs_age_lga("Melbourne (C)", "Sydney (C)")

Then we need to multiply one of the populations by -1, so that we get the reflected population in the other direction. We had previously done:

melb_syd %>% 
  mutate(
    population = case_when(
      lga == "Sydney (C)" ~ -population,
      TRUE ~ population
    )
  )#> # A tibble: 36 × 4
#>    lga           lower.age.limit  year population
#>    <chr>                   <dbl> <dbl>      <dbl>
#>  1 Melbourne (C)               0  2020       4882
#>  2 Melbourne (C)               5  2020       3450
#>  3 Melbourne (C)              10  2020       2675
#>  4 Melbourne (C)              15  2020       9396
#>  5 Melbourne (C)              20  2020      38434
#>  6 Melbourne (C)              25  2020      38546
#>  7 Melbourne (C)              30  2020      25834
#>  8 Melbourne (C)              35  2020      15072
#>  9 Melbourne (C)              40  2020       8554
#> 10 Melbourne (C)              45  2020       6753
#> # … with 26 more rows

But this will not generalise to any two LGA names. So instead we can assign one of the LGAs to a number, which we can do by first grouping by lga, then using cur_group_id(), to label the groups as a number.

melb_syd %>% 
    group_by(lga) %>%
    mutate(
      age_multiplier = cur_group_id(),
      .after = lga
    )#> # A tibble: 36 × 5
#> # Groups:   lga [2]
#>    lga           age_multiplier lower.age.limit  year population
#>    <chr>                  <int>           <dbl> <dbl>      <dbl>
#>  1 Melbourne (C)              1               0  2020       4882
#>  2 Melbourne (C)              1               5  2020       3450
#>  3 Melbourne (C)              1              10  2020       2675
#>  4 Melbourne (C)              1              15  2020       9396
#>  5 Melbourne (C)              1              20  2020      38434
#>  6 Melbourne (C)              1              25  2020      38546
#>  7 Melbourne (C)              1              30  2020      25834
#>  8 Melbourne (C)              1              35  2020      15072
#>  9 Melbourne (C)              1              40  2020       8554
#> 10 Melbourne (C)              1              45  2020       6753
#> # … with 26 more rows

This then gives us a 1 or a 2. We can then set this to be negative if it is the second group.

melb_syd %>% 
    group_by(lga) %>%
    mutate(
      age_multiplier = cur_group_id(),
      age_multiplier = case_when(age_multiplier == 2 ~ 1,
                                 TRUE ~ -1),
      .after = lga
    ) #> # A tibble: 36 × 5
#> # Groups:   lga [2]
#>    lga           age_multiplier lower.age.limit  year population
#>    <chr>                  <dbl>           <dbl> <dbl>      <dbl>
#>  1 Melbourne (C)             -1               0  2020       4882
#>  2 Melbourne (C)             -1               5  2020       3450
#>  3 Melbourne (C)             -1              10  2020       2675
#>  4 Melbourne (C)             -1              15  2020       9396
#>  5 Melbourne (C)             -1              20  2020      38434
#>  6 Melbourne (C)             -1              25  2020      38546
#>  7 Melbourne (C)             -1              30  2020      25834
#>  8 Melbourne (C)             -1              35  2020      15072
#>  9 Melbourne (C)             -1              40  2020       8554
#> 10 Melbourne (C)             -1              45  2020       6753
#> # … with 26 more rows

Then, we have something that we can use to multiply the population by -1 ! Let’s also take this chance to turn age into a factor.

melb_syd %>% 
    group_by(lga) %>%
    mutate(
      age_multiplier = cur_group_id(),
      age_multiplier = case_when(age_multiplier == 2 ~ 1,
                                 TRUE ~ -1),
      .after = lga
    ) %>%
    ungroup() %>%
    mutate(age = as_factor(lower.age.limit),
           population = population * age_multiplier)#> # A tibble: 36 × 6
#>    lga           age_multiplier lower.age.limit  year population age  
#>    <chr>                  <dbl>           <dbl> <dbl>      <dbl> <fct>
#>  1 Melbourne (C)             -1               0  2020      -4882 0    
#>  2 Melbourne (C)             -1               5  2020      -3450 5    
#>  3 Melbourne (C)             -1              10  2020      -2675 10   
#>  4 Melbourne (C)             -1              15  2020      -9396 15   
#>  5 Melbourne (C)             -1              20  2020     -38434 20   
#>  6 Melbourne (C)             -1              25  2020     -38546 25   
#>  7 Melbourne (C)             -1              30  2020     -25834 30   
#>  8 Melbourne (C)             -1              35  2020     -15072 35   
#>  9 Melbourne (C)             -1              40  2020      -8554 40   
#> 10 Melbourne (C)             -1              45  2020      -6753 45   
#> # … with 26 more rows

And then let’s wrap this into a function:

prep_pop_pyramid <- function(data){
  data %>% 
    group_by(lga) %>%
    mutate(
      age_multiplier = cur_group_id(),
      age_multiplier = case_when(age_multiplier == 2 ~ 1,
                                 TRUE ~ -1),
      .after = lga
    ) %>%
    ungroup() %>%
    mutate(age = as_factor(lower.age.limit),
           population = population * age_multiplier)
}

OK so now let’s put those two parts together:

melb_syd <- two_abs_age_lga(
  "Melbourne (C)",
  "Sydney (C)"
) %>% 
  prep_pop_pyramid()

melb_syd#> # A tibble: 36 × 6
#>    lga           age_multiplier lower.age.limit  year population age  
#>    <chr>                  <dbl>           <dbl> <dbl>      <dbl> <fct>
#>  1 Melbourne (C)             -1               0  2020      -4882 0    
#>  2 Melbourne (C)             -1               5  2020      -3450 5    
#>  3 Melbourne (C)             -1              10  2020      -2675 10   
#>  4 Melbourne (C)             -1              15  2020      -9396 15   
#>  5 Melbourne (C)             -1              20  2020     -38434 20   
#>  6 Melbourne (C)             -1              25  2020     -38546 25   
#>  7 Melbourne (C)             -1              30  2020     -25834 30   
#>  8 Melbourne (C)             -1              35  2020     -15072 35   
#>  9 Melbourne (C)             -1              40  2020      -8554 40   
#> 10 Melbourne (C)             -1              45  2020      -6753 45   
#> # … with 26 more rows

And now let’s wrap the plotting code into a function:

plot_pop_pyramid <- function(data){
  pop_range <- range(data$population)
  
  age_range_seq <- pretty(pop_range, n = 10)
  
  ggplot(data,
         aes(x = population,
             y = age,
             fill = lga)) +
    geom_col() +
    scale_x_continuous(breaks  = age_range_seq,
                       labels = scales::comma(abs(age_range_seq))) +
    scale_fill_brewer(palette = "Dark2",
                      guide = guide_legend(
                        title = ""
                      )) +
    theme_minimal() +
    theme(legend.position = "top")
}
plot_pop_pyramid(melb_syd)

End

And that was the end of this blog post. But something about the plots kind of bothered me…

Bonus - centering

Well, there’s actually one final step here that you might be interested in - which is centering the “0” of the population. Perhaps there is another way around this, but the approach that I ended up doing was the following:

pretty_symmetric <- function(range, n = 5){
  range_1 <- c(-range[1], range[2])
  range_2 <- c(range[1], -range[2])
  
  pretty_vec_1 <- pretty(range_1)
  pretty_vec_2 <- pretty(range_2)
  
  pretty(
    c(pretty_vec_1, pretty_vec_2), 
    n = n
  )
  
}

pretty_symmetric(c(-5000, 10000))#> [1] -10000  -5000      0   5000  10000

So then we replace pretty, with pretty_symmetric , and then need to add this other bit, expand_limits which ensures the limits are kept

plot_pop_pyramid <- function(data){
  pop_range <- range(data$population)
  
  age_range_seq <- pretty_symmetric(pop_range, n = 5)
  
  ggplot(data,
         aes(x = population,
             y = age,
             fill = lga)) +
    geom_col() +
    scale_x_continuous(breaks  = age_range_seq,
                       labels = scales::comma(abs(age_range_seq))) +
    expand_limits(x = range(age_range_seq)) + 
    scale_fill_brewer(palette = "Dark2",
                      guide = guide_legend(
                        title = ""
                      )) +
    theme_minimal() +
    theme(legend.position = "top")
}
plot_pop_pyramid(melb_syd)

This has a nice benefit of facilitating stacking the plots with something like patchwork and the 0s will be aligned.

melb_syd <- two_abs_age_lga(
  "Melbourne (C)",
  "Sydney (C)"
) 

bris_hobart <- two_abs_age_lga(
  "Brisbane (C)",
  "Hobart (C)"
)
alice_perth <- two_abs_age_lga(
  "Alice Springs (T)",
  "Perth (C)"
) 

Then we can stack them together with patchwork

library(patchwork)
melb_syd %>% prep_pop_pyramid() %>% plot_pop_pyramid() /
bris_hobart %>% prep_pop_pyramid() %>% plot_pop_pyramid() /
alice_perth %>% prep_pop_pyramid() %>% plot_pop_pyramid()

Bonus Bonus - per capita

Ugh. OK. These plots are nice, but really, these should be normalised by the total population.

What does that mean? Why would we want to do that? Well, if you have two cities, one with HEAPS of people, and other with 1000 times less people - the numbers are going to be harder to compare. And if the point you are trying to convey is:

Hey, these two cities have very different population sizes

Then we’ve done that above.

But what we actually kind of care about the most is:

Hey, do these two cities have similar age distributions?

So like, do they have the same proportion of 20-somethings in both cities? Regardless of the fact that one city has like 100 or 1000 times more people?

So, we should fix it.

Where do we fix it? We need to change the data creation function. Adding a new column, population_per_capita, with something like this:

mutate(population_per_capita = population / sum(population))

Which divides the population by the total population. This is sometimes called “normalising”. Effectively, we are comparing each age population to the total population.

Also importantly, we need to do this still in the group_by(lga) below

prep_pop_pyramid <- function(data) {
  data %>%
    group_by(lga) %>%
    mutate(
      age_multiplier = cur_group_id(),
      age_multiplier = case_when(
        age_multiplier == 2 ~ 1,
        TRUE ~ -1
      ),
      .after = lga
    ) %>%
    mutate(population_per_capita = population / sum(population)) %>%
    ungroup() %>%
    mutate(
      age = as_factor(lower.age.limit),
      population = population * age_multiplier,
      population_per_capita = population_per_capita * age_multiplier
    )
}

And since we’ve changed the data that we are using now - we aren’t using population anymore, and I hard coded the plotting code, we need to re-write the function. Perhaps with the benefit of hindsight it would have been better to not hard code variables in the plotting function, but here we are.

plot_pop_pyramid_per_capita <- function(data) {
  pop_range <- range(data$population_per_capita)

  age_range_seq <- pretty_symmetric(pop_range, n = 5)

  ggplot(
    data,
    aes(
      x = population_per_capita,
      y = age,
      fill = lga
    )
  ) +
    geom_col() +
    scale_x_continuous(
      breaks = age_range_seq,
      labels = abs(age_range_seq)
    ) +
    expand_limits(x = range(age_range_seq)) +
    scale_fill_brewer(
      palette = "Dark2",
      guide = guide_legend(
        title = ""
      )
    ) +
    theme_minimal() +
    theme(legend.position = "top")
}

All together now

melb_syd %>% prep_pop_pyramid() %>% plot_pop_pyramid_per_capita() /
bris_hobart %>% prep_pop_pyramid() %>% plot_pop_pyramid_per_capita() /
alice_perth %>% prep_pop_pyramid() %>% plot_pop_pyramid_per_capita()

And now for comparison to drive home the difference between the per capita pyramid and the raw population pyramid - most notably, there is a huge difference between brisbane and hobart - previously the population numbers in Brisbane drowned out the differnces in Hobart

bris_hobart_pyramid <- bris_hobart %>% prep_pop_pyramid()
plot_pop_pyramid(bris_hobart_pyramid) /
  plot_pop_pyramid_per_capita(bris_hobart_pyramid)

Another way to present this is as a regular bar graph, which more strongly places our focus on comparison between age groups.

bris_hobart_pyramid %>% 
  mutate(
    # reverse age_multiplier
    population = case_when(
      age_multiplier == -1 ~ population * -1,
      TRUE ~ population
    ),
    population_per_capita = case_when(
      age_multiplier == -1 ~ population_per_capita * -1,
      TRUE ~ population_per_capita
    )
  ) %>% 
  pivot_longer(
    cols = c(population,
             population_per_capita),
    names_to = "pop_type",
    values_to = "pop_value"
  ) %>% 
  ggplot(aes(x = age,
             y = pop_value,
             fill = lga)) + 
  geom_col(position = "dodge") +
  facet_wrap(~pop_type,
             scales = "free_y",
             ncol = 1) + 
  scale_fill_brewer(
      palette = "Dark2",
      guide = guide_legend(
        title = ""
      )
    ) +
    theme_minimal() +
    theme(legend.position = "top")

In some ways, I think I prefer this graph - there are many ways to present information, and in this case it helps us more clearly compare the differences between two places. It also strongly highlights the difference between using per capita and raw population values.

PS

I actually wrote this up initially to make a figure for a grant. Then, unhappy with the fact that I had to alter the code 2 times to make the figure, I converted it into a targets pipeline, which you can see here on github: https://github.com/njtierney/target-pop-pyramid

Also fun fact, the figure didn’t make it into the grant. But now you’ve got this blog post, so that’s pretty neat, I guess?