Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2020-09-07

When Was the Last Day We Hit 20c?

Nicholas Tierney

Categories: rstats Tags: rstats bomrang

14 minute read

It’s starting to get warmer here in Melbourne. Today was the first day of over 20 degrees Celsius we’ve had in…well how long? It feels like a while. We hit this middle part of the year and we don’t get over 20C for a while. But I’d like to know the last time it got over 20 degrees Celsius. How do we do that? With the bomrang package by Adam Sparks!

First, you’ll need the latest version of bomrang - you should be on version 0.7.0.9000 or greater:

# remotes::install_github("ropensci/bomrang")
suppressPackageStartupMessages(library(bomrang))
packageVersion("bomrang")

#> [1] '0.7.0.9000'

Next, we sweep for stations nearby Melbourne. I know there are packages (like geocoder) that help get the coordinates, but when I just need one coordinate I usually just copy it from google maps.

melb_stations <- sweep_for_stations(latlon = c(-37.800372,
                                                        144.996162))

melb_stations

#>        site dist                          name start  end      lat      lon
#>   1: 086338   86      MELBOURNE (OLYMPIC PARK)  2013 2020 -37.8255 144.9816
#>   2: 086220   86       ST KILDA HARBOUR - RMYS  2006 2020 -37.8642 144.9639
#>   3: 086351   86 BUNDOORA (LATROBE UNIVERSITY)  1979 2020 -37.7163 145.0453
#>   4: 086068   86                      VIEWBANK  1999 2020 -37.7408 145.0972
#>   5: 086038   86              ESSENDON AIRPORT  1929 2020 -37.7276 144.9066
#>  ---                                                                       
#> 883: 300039  300                            G3  1999 2020 -70.8919  69.8725
#> 884: 200284  200          COCOS ISLAND AIRPORT  1901 2020 -12.1892  96.8344
#> 885: 300028  300       HEARD ISLAND (THE SPIT)  1992 2020 -53.1082  73.7225
#> 886: 300005  300     HEARD ISLAND (ATLAS COVE)  1948 2020 -53.0190  73.3918
#> 887: 300001  300                        MAWSON  1954 2020 -67.6017  62.8753
#>      state elev bar_ht   wmo state_code
#>   1:   VIC  7.5    7.5 95936          V
#>   2:   VIC  6.4     NA 95864          V
#>   3:   VIC 83.0     NA 95873          V
#>   4:   VIC 66.1   66.4 95874          V
#>   5:   VIC 78.4   78.8 95866          V
#>  ---                                   
#> 883:   ANT 84.0   84.0 89767          T
#> 884:    WA  3.0    4.0 96996          W
#> 885:   ANT 12.0   12.5 94997          T
#> 886:   ANT  3.0    3.5 95997          T
#> 887:   ANT  9.9   16.0 89564          T
#>                                                         url    distance
#>   1: http://www.bom.gov.au/fwo/IDV60801/IDV60801.95936.json    3.073013
#>   2: http://www.bom.gov.au/fwo/IDV60801/IDV60801.95864.json    7.642001
#>   3: http://www.bom.gov.au/fwo/IDV60801/IDV60801.95873.json   10.298180
#>   4: http://www.bom.gov.au/fwo/IDV60801/IDV60801.95874.json   11.079185
#>   5: http://www.bom.gov.au/fwo/IDV60801/IDV60801.95866.json   11.289837
#>  ---                                                                   
#> 883: http://www.bom.gov.au/fwo/IDT60803/IDT60803.89767.json 5536.751202
#> 884: http://www.bom.gov.au/fwo/IDW60801/IDW60801.96996.json 5544.800897
#> 885: http://www.bom.gov.au/fwo/IDT60803/IDT60803.94997.json 5562.364714
#> 886: http://www.bom.gov.au/fwo/IDT60803/IDT60803.95997.json 5586.092643
#> 887: http://www.bom.gov.au/fwo/IDT60803/IDT60803.89564.json 5844.280345

This then returns information on the number of stations in Melbourne and their location.

We take the first station and get the max temperature.

melbourne_weather <- get_historical(stationid = melb_stations$site[1],
                                    type = "max")

#> Warning: The list of available stations for `type = rain` is currently empty.
#> This is likely a temporary error in the Bureau of Meteorology's
#> database and may cause requests for rain station data to fail.

#> Data saved as /var/folders/mw/gj7418356js6s29x7wn8crfmljy4wh/T//RtmpfgXQLS/IDCJAC0010_086338_1800_Data.csv


melbourne_weather

#>   --- Australian Bureau of Meteorology (BOM) Data Resource ---
#>   (Original Request Parameters)
#>   Station:    MELBOURNE (OLYMPIC PARK) [086338] 
#>   Location:    lat: -37.8255, lon: 144.9816
#>   Measurement / Origin:  Max / Historical
#>   Timespan:    2013-06-01 -- 2020-09-01 [7.3 years]
#>   ---------------------------------------------------------------  
#>       product_code station_number year month day max_temperature accum_days_max
#>    1:   IDCJAC0010          86338 2013     1   1              NA             NA
#>    2:   IDCJAC0010          86338 2013     1   2              NA             NA
#>    3:   IDCJAC0010          86338 2013     1   3              NA             NA
#>    4:   IDCJAC0010          86338 2013     1   4              NA             NA
#>    5:   IDCJAC0010          86338 2013     1   5              NA             NA
#>   ---                                                                          
#> 2812:   IDCJAC0010          86338 2020     9  12            16.0              1
#> 2813:   IDCJAC0010          86338 2020     9  13            17.2              1
#> 2814:   IDCJAC0010          86338 2020     9  14            16.6              1
#> 2815:   IDCJAC0010          86338 2020     9  15            15.8              1
#> 2816:   IDCJAC0010          86338 2020     9  16            22.8              1
#>       quality
#>    1:        
#>    2:        
#>    3:        
#>    4:        
#>    5:        
#>   ---        
#> 2812:       N
#> 2813:       N
#> 2814:       N
#> 2815:       N
#> 2816:       N

We can convert this into a date format with one of my favourite R functions of all time, ISOdate() (also ISOdatetime() is great):

melb_weather_date <- melbourne_weather %>% 
  mutate(date = ISOdate(year, month, day))

melb_weather_date

#>   --- Australian Bureau of Meteorology (BOM) Data Resource ---
#>   (Original Request Parameters)
#>   Station:    MELBOURNE (OLYMPIC PARK) [086338] 
#>   Location:    lat: -37.8255, lon: 144.9816
#>   Measurement / Origin:  Max / Historical
#>   Timespan:    2013-06-01 -- 2020-09-01 [7.3 years]
#>   ---------------------------------------------------------------  
#>       product_code station_number year month day max_temperature accum_days_max
#>    1:   IDCJAC0010          86338 2013     1   1              NA             NA
#>    2:   IDCJAC0010          86338 2013     1   2              NA             NA
#>    3:   IDCJAC0010          86338 2013     1   3              NA             NA
#>    4:   IDCJAC0010          86338 2013     1   4              NA             NA
#>    5:   IDCJAC0010          86338 2013     1   5              NA             NA
#>   ---                                                                          
#> 2812:   IDCJAC0010          86338 2020     9  12            16.0              1
#> 2813:   IDCJAC0010          86338 2020     9  13            17.2              1
#> 2814:   IDCJAC0010          86338 2020     9  14            16.6              1
#> 2815:   IDCJAC0010          86338 2020     9  15            15.8              1
#> 2816:   IDCJAC0010          86338 2020     9  16            22.8              1
#>       quality                date
#>    1:         2013-01-01 12:00:00
#>    2:         2013-01-02 12:00:00
#>    3:         2013-01-03 12:00:00
#>    4:         2013-01-04 12:00:00
#>    5:         2013-01-05 12:00:00
#>   ---                            
#> 2812:       N 2020-09-12 12:00:00
#> 2813:       N 2020-09-13 12:00:00
#> 2814:       N 2020-09-14 12:00:00
#> 2815:       N 2020-09-15 12:00:00
#> 2816:       N 2020-09-16 12:00:00

Now we can look at when it last hit 20C for the past little while:

suppressPackageStartupMessages(library(ggplot2))

ggplot(melb_weather_date,
       aes(x = date,
           y = max_temperature)) + 
  geom_line() + 
  geom_hline(yintercept = 20, colour = "salmon")

#> Warning: Removed 151 row(s) containing missing values (geom_path).

Let’s just focus on 2020:

suppressPackageStartupMessages(library(dplyr))
melb_weather_2020 <- melb_weather_date %>% 
  filter(year == 2020)

melb_2020_plot <- 
ggplot(melb_weather_2020,
       aes(x = date,
           y = max_temperature)) + 
  geom_line() + 
  geom_hline(yintercept = 20, colour = "salmon")

melb_2020_plot

So now rather than squinting at a graph, I want to return the last times it was over 20C.

I was tempted to try something like this:

melb_weather_2020 %>% 
  filter(max_temperature > 20) %>% 
  ggplot(aes(x = date,
           y = max_temperature)) + 
  geom_line() + 
  geom_hline(yintercept = 20, colour = "salmon")

But this doesn’t realllly help me. We just see a bit gap. I still need to squint. Let’s take a look at the plot again:

melb_2020_plot

I want to be able to identify when it stopped being over 20, and then when it started again.

Buckle up, we’re going into the realm of RLE - Run Length Encodings.

Earo Wang first showed this to me I first arrived at Monash, (where it was used for miss_var_run).

rle counts the length of a “run” of a vector. Here’s an example to explain from the help file:

z <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
rle(z)

#> Run Length Encoding
#>   lengths: int [1:5] 2 2 1 1 3
#>   values : logi [1:5] TRUE FALSE TRUE FALSE TRUE

So this tells us that we have 2 runs of the same thing, then 2 more, then 1, 1, and 3. Another way I think about it is is like a simplified/summary of storing a number. Here’s another brief example to help get this solidified:

rle(1:10)

#> Run Length Encoding
#>   lengths: int [1:10] 1 1 1 1 1 1 1 1 1 1
#>   values : int [1:10] 1 2 3 4 5 6 7 8 9 10

rle(c(1,2,2,3,3,3))

#> Run Length Encoding
#>   lengths: int [1:3] 1 2 3
#>   values : num [1:3] 1 2 3

rle(c(1,1,1,2,2,3))

#> Run Length Encoding
#>   lengths: int [1:3] 3 2 1
#>   values : num [1:3] 1 2 3

So, we can use rle to calculate the run of the times that temperature was below 20. We can calculate a new column that is TRUE when temperature is below, and FALSE otherwise:

melb_weather_2020 %>% 
  mutate(below_20 = max_temperature < 20) 

#>   --- Australian Bureau of Meteorology (BOM) Data Resource ---
#>   (Original Request Parameters)
#>   Station:    MELBOURNE (OLYMPIC PARK) [086338] 
#>   Location:    lat: -37.8255, lon: 144.9816
#>   Measurement / Origin:  Max / Historical
#>   Timespan:    2013-06-01 -- 2020-09-01 [7.3 years]
#>   ---------------------------------------------------------------  
#>      product_code station_number year month day max_temperature accum_days_max
#>   1:   IDCJAC0010          86338 2020     1   1            24.9              1
#>   2:   IDCJAC0010          86338 2020     1   2            25.0              1
#>   3:   IDCJAC0010          86338 2020     1   3            36.6              1
#>   4:   IDCJAC0010          86338 2020     1   4            26.8              1
#>   5:   IDCJAC0010          86338 2020     1   5            16.7              1
#>  ---                                                                          
#> 256:   IDCJAC0010          86338 2020     9  12            16.0              1
#> 257:   IDCJAC0010          86338 2020     9  13            17.2              1
#> 258:   IDCJAC0010          86338 2020     9  14            16.6              1
#> 259:   IDCJAC0010          86338 2020     9  15            15.8              1
#> 260:   IDCJAC0010          86338 2020     9  16            22.8              1
#>      quality                date below_20
#>   1:       Y 2020-01-01 12:00:00    FALSE
#>   2:       Y 2020-01-02 12:00:00    FALSE
#>   3:       Y 2020-01-03 12:00:00    FALSE
#>   4:       Y 2020-01-04 12:00:00    FALSE
#>   5:       Y 2020-01-05 12:00:00     TRUE
#>  ---                                     
#> 256:       N 2020-09-12 12:00:00     TRUE
#> 257:       N 2020-09-13 12:00:00     TRUE
#> 258:       N 2020-09-14 12:00:00     TRUE
#> 259:       N 2020-09-15 12:00:00     TRUE
#> 260:       N 2020-09-16 12:00:00    FALSE

then pull it out

melb_weather_2020 %>% 
  mutate(below_20 = max_temperature < 20) %>% 
  pull(below_20) 

#>   [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
#>  [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
#>  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [49] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#>  [61] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [73] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
#>  [85]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#>  [97]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
#> [109]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
#> [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
#> [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [145]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [157]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [169]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [193]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [205]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [217]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [229]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#> [241]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
#> [253]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

And then use rle on the vector of TRUE/FALSE

melb_rle <- melb_weather_2020 %>% 
  mutate(below_20 = max_temperature < 20) %>% 
  pull(below_20) %>% 
  rle()

melb_rle

#> Run Length Encoding
#>   lengths: int [1:43] 4 2 4 2 3 1 17 2 10 1 ...
#>   values : logi [1:43] FALSE TRUE FALSE TRUE FALSE TRUE ...

This provides two named vectors, lengths and values. Lengths is the number of times a corresponding value repeats.

We can then identify what the largest gap was by taking the max of lengths:

# what was the largest gap?
what_gap <- max(melb_rle$lengths)

what_gap

#> [1] 111

Neat! So we now know that the period between the start and the end spanned 111 days.

So, when was that?

This part involves a few steps, let me break it down.

We want to find that date it was last over 20C - which is the day before it had that 111 day stretch. We can get this by summing up all the numbers of run lengths before the 111 day stretch.

First we calculate which position has the max with which.max. I like to print the vector with it for my own sanity’s sake:

which.max(melb_rle$lengths)

#> [1] 34

melb_rle$lengths

#>  [1]   4   2   4   2   3   1  17   2  10   1   3   3   4   2   3   2   2   1   7
#> [20]   2   5   6   8   5   2   3   3   3   1   3   2  10   2 111   1   5   1   4
#> [39]   1   2   1   5   1

So it is the 34th one.

But we want the second last one - the day before 111, so we subtract 1.

last_day_position <- which.max(melb_rle$lengths) - 1
last_day_position

#> [1] 33

We then want 1 through to that number so we can get all the numbers out, using seq_len:

seq_last_day_position <- seq_len(last_day_position)
seq_last_day_position

#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33

We then subset this lengths like so, and sum them up to get the number of days before it went below 20C:

melb_rle$lengths[seq_last_day_position]

#>  [1]  4  2  4  2  3  1 17  2 10  1  3  3  4  2  3  2  2  1  7  2  5  6  8  5  2
#> [26]  3  3  3  1  3  2 10  2

n_days_before_20c <- sum(melb_rle$lengths[seq_last_day_position])
n_days_before_20c

#> [1] 128

Now…what date is that?

We can use lubridate to calculate it like so, by saying, the number of days from the start of 2020:

library(lubridate)

#> 
#> Attaching package: 'lubridate'

#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

# when was that?
last_day_of_20c <- ymd("2020-01-01") + days(n_days_before_20c)
last_day_of_20c

#> [1] "2020-05-08"

Awesome, now when did it finish?

# when did it end?
first_day_of_20c <- last_day_of_20c + days(what_gap)
first_day_of_20c

#> [1] "2020-08-27"

And just to come full circle, let’s check the difference

first_day_of_20c - last_day_of_20c

#> Time difference of 111 days

whew! That was a lot of work. I almost forgot what I was trying to answer so let’s put this in a header now

When was the last time in 2020 Melbourne was above 20 Celsius?

Post Script

In retrospect there is more to add to this plot, and probably other simpler ways to calculate this, I’m keen to hear your thoughts below.

I suspect this sort of function will come in handy in the future as we start to compare COVID19 days, to help us answer: “when was the last day of … cases”.

I’m trying to get this blog post out in one sitting so I’ll skip over writing a function for this, this time Here is a function and an example below of this, which might be useful:

n_days_below <- function(x, below){
  rle_below <- rle(x < below)
  max(rle_below$lengths)
}

last_day_below <- function(x, below){
  rle_below <- rle(x < below)
  last_day_position <- which.max(rle_below$lengths) - 1
  seq_last_day_position <- seq_len(last_day_position)
  n_days_before_below <- sum(rle_below$lengths[seq_last_day_position])
  n_days_before_below
}

summary_days_below <- function(x, below, origin){
  
  vec_n_days_below <- n_days_below(x, below)
  vec_last_day_below <- n_days_below(x, below)
  
  last_day_of_below <- lubridate::ymd(origin) + lubridate::days(vec_last_day_below)
  first_day_of_below <- last_day_of_below + vec_n_days_below
  
  list(last_date_below = last_day_of_below,
       first_date_above = first_day_of_below,
       time_between = first_day_of_below - last_day_of_below)
  
}

n_days_below(melb_weather_2020$max_temperature, 20)

#> [1] 111

last_day_below(melb_weather_2020$max_temperature, 20)

#> [1] 128

summary_days_below(melb_weather_2020$max_temperature, 20, "2020-01-01")

#> $last_date_below
#> [1] "2020-04-21"
#> 
#> $first_date_above
#> [1] "2020-08-10"
#> 
#> $time_between
#> Time difference of 111 days

A side note on the history of bomrang

I take great joy in seeing bomrang and using it. I first posted an issue about it in 2016 at the rOpenSci unconf which was held in the USA, but Miles, Jessie, and I (And also Alex Simmons) participated remotely, with Bob Rudis, Brooke Anderson, Maëlle Salmon and a few folks. My memory is a bit hazy because in order to keep pace with the USA we had to stay up all night, and I don’t really remember the days before or after it. It got somewhere, but I didn’t really know what we were doing, I just thought, “this should be a thing, why can’t I get Australian weather data from R?”, but had no idea what I was doing with web scraping, and knew nothing about the weather other than what I’d seen on the news and that lightning can indeed strike the same place twice. So I suggested the idea, but everyone else provided the know how. It was hard, because it turned out the BoM weather bureau API didn’t exist and required a bit of domain expertise

Next, it was picked up by much more experienced team at the first Australian rOpenSci Ozunconf (then, au-unconf), in 2016 in Brisbane. Happily they knew what they were doing, and has experience with Australian weather data. The project grew from there, being headed by Adam Sparks, initially it was called bomr, but Di Cook suggested, “bomrang”, the name stuck, and Adam submitted it to rOpenSci for package review, and the package became more mature and in it’s current state of glory.