Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

Tidyverse Case Study: Exploring the Billboard Charts

2017/11/07

Categories: rstats rbloggers

Data packages are something that have been on my mind a bit lately, having recently worked on the ozroaddeaths data access package at the rOpenSci ozunconf. I was reading Joe Rickert’s R views blogpost about data packages in R, and saw the billboard package by Mikkel Krogsholm, which provides:

… data sets regarding songs on the Billboard Hot 100 list from 1960 to 2016, including ranks for the given year, musical features, and lyrics.

This seemed like a really cool dataset to look at, so last weekend I started to have a dig around and noticed that it had some nice examples of data munging with tidyverse packages and friends, and it seemed like it would make a nice blogpost case study, of sorts.

So, this blogpost walks through how you might start to unpack the data, clean it, and draw some interesting conclusions. I also wanted to avoid the “draw the rest of the fucking owl” problem.

This means that we don’t start with a perfectly clean dataset, and I try to take a bit of time to walk through some of the code.

So, first things first, we’re going to load up our packages:

library(billboard)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(visdat)

Data Munging

According to the help file for wiki_hot_100s, this data contains:

57 years of Billboards Hot 100 songs. The data is scraped from Wikipedia from the urls ‘https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_' and then the year added. Example: https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960. One year has more than a 100 songs due to a tie.

It then states the following info about the variables:

A data frame with 5701 rows and 4 variables:

no: the rank that the song had that year

title: the title of the song

artist: the artist of the song

year: year

Before we go ahead and jump into analysing the data, it’s a good idea to do have a look at the data, with vis_dat.

vis_dat(wiki_hot_100s)

vis_dat gives you a birds eye view of the data, showing the class of each variable, and also displays missing data (if there is any). In this case, we can see that there is no missing data, and that everything is a character.

I would expect year and rank (marked as “no”) to be numbers, not characters, so let’s have a closer look at these.

But first, let’s convert it to a tibble, which, among other things, is basically the same as a normal data.frame, but gives a nice print method, so it only shows the first 10 rows by default, and gives some text describing the data type.

billboard_raw <- as_tibble(wiki_hot_100s)
billboard_raw
## # A tibble: 5,701 x 4
##       no                     title              artist  year
##    <chr>                     <chr>               <chr> <chr>
##  1     1 Theme from A Summer Place         Percy Faith  1960
##  2     2          He'll Have to Go          Jim Reeves  1960
##  3     3             Cathy's Clown The Everly Brothers  1960
##  4     4              Running Bear      Johnny Preston  1960
##  5     5                Teen Angel        Mark Dinning  1960
##  6     6                 I'm Sorry          Brenda Lee  1960
##  7     7         It's Now or Never       Elvis Presley  1960
##  8     8                 Handy Man         Jimmy Jones  1960
##  9     9              Stuck on You       Elvis Presley  1960
## 10    10                 The Twist      Chubby Checker  1960
## # ... with 5,691 more rows

This is nice because it means you don’t have to consciously think about what might happen when you want to view the data - it doesn’t just vomit out the ENTIRE dataframe when you type wiki_hot_100s.

OK, so we can see that no and year are characters, but really they should be numeric, as they aren’t written as “Number One” and “Nineteen Sixty”, so let’s coerce them into numbers with as.numeric. We also create another variable, rank, which is a more descriptive word than no.

billboard_raw %>%
  mutate(rank = as.numeric(no),
         year = as.numeric(year))
## Warning in evalq(as.numeric(no), <environment>): NAs introduced by coercion
## # A tibble: 5,701 x 5
##       no                     title              artist  year  rank
##    <chr>                     <chr>               <chr> <dbl> <dbl>
##  1     1 Theme from A Summer Place         Percy Faith  1960     1
##  2     2          He'll Have to Go          Jim Reeves  1960     2
##  3     3             Cathy's Clown The Everly Brothers  1960     3
##  4     4              Running Bear      Johnny Preston  1960     4
##  5     5                Teen Angel        Mark Dinning  1960     5
##  6     6                 I'm Sorry          Brenda Lee  1960     6
##  7     7         It's Now or Never       Elvis Presley  1960     7
##  8     8                 Handy Man         Jimmy Jones  1960     8
##  9     9              Stuck on You       Elvis Presley  1960     9
## 10    10                 The Twist      Chubby Checker  1960    10
## # ... with 5,691 more rows

This gives us a warning - “NAs introduced by coercion”. This tells us something went wrong. Let’s see where those values are NA, and what’s going on there by filtering our observations to see only those that are missing using filter(is.na(rank)).

billboard_raw %>%
  mutate(rank = as.numeric(no),
         year = as.numeric(year)) %>%
  filter(is.na(rank))
## Warning in evalq(as.numeric(no), <environment>): NAs introduced by coercion
## # A tibble: 5 x 5
##      no                          title                    artist  year
##   <chr>                          <chr>                     <chr> <dbl>
## 1   Tie                         Let Me Paul Revere & the Raiders  1969
## 2   Tie                 Rock Your Baby             George McCrae  1974
## 3   Tie Never, Never Gonna Give You Up               Barry White  1974
## 4   Tie              The Lord's Prayer         Sister Janet Mead  1974
## 5   Tie   My Mistake (Was to Love You)  Diana Ross & Marvin Gaye  1974
## # ... with 1 more variables: rank <dbl>

Huh, turns out that “Tie” is a category for 1974 and 1969, this was mentioned in the description of the data in the help file, but I just assumed that they would be given the same rank.

Now, we could look at the surrounding rows of the “Tie” columns more closely by simply filtering according to those years, 1969 and 1974, and using View().

BUT, if you had like a thousand, or say, one bajillion rows to look at then this doesn’t really work.

So instead we are going to find the row numbers where this occurs, and then look at them.

We do this by first creating a new variable containing the row numbers (1 through to the number of rows in the dataset), then filter those observations that are marked as “Tie”, then pull out the row_numbers as a vector with the relatively new pull function, which returns the values from a dataframe as a vector, rather than a smaller dataframe, which you would get with select.

row_tie <- billboard_raw %>%
  mutate(row_num = 1:n()) %>%
  filter(no == "Tie") %>%
  pull(row_num)

row_tie
## [1] 1001 1439 1456 1487 1493

These row locations, 1001, 1439, 1456, 1487, 1493, are where these tie’s occur.

What we want to do now is look at the rows before and after these ones with the tie, to get a sense of what rank these Tied values have. You can think of this as padding out the rows around it. We can pad out the numbers by adding one and subtracting one, to get the rows around the tied rows, and then sort them, to give the row numbers with the ties plus the previous and subsequent rows.

row_pad <- c(row_tie, 
             row_tie + 1, 
             row_tie -1) %>% 
  sort()

row_pad
##  [1] 1000 1001 1002 1438 1439 1440 1455 1456 1457 1486 1487 1488 1492 1493
## [15] 1494

Finally, you can then use this new vector and tell dplyr to display the slice of rows containing those numbers:

billboard_raw %>% slice(row_pad)
## # A tibble: 15 x 4
##       no                                 title                    artist
##    <chr>                                 <chr>                     <chr>
##  1   100                    Sweet Cream Ladies              The Box Tops
##  2   Tie                                Let Me Paul Revere & the Raiders
##  3     1            Bridge over Troubled Water         Simon & Garfunkel
##  4    37                  Nothing from Nothing             Billy Preston
##  5   Tie                        Rock Your Baby             George McCrae
##  6    39                      Top of the World            The Carpenters
##  7    54                      You Won't See Me               Anne Murray
##  8   Tie        Never, Never Gonna Give You Up               Barry White
##  9    56                Tell Me Something Good        Rufus & Chaka Khan
## 10    85 I'll Have to Say I Love You in a Song                 Jim Croce
## 11   Tie                     The Lord's Prayer         Sister Janet Mead
## 12    87         Trying To Hold On To My Woman             Lamont Dozier
## 13    91                          Helen Wheels  Paul McCartney and Wings
## 14   Tie          My Mistake (Was to Love You)  Diana Ross & Marvin Gaye
## 15    93                         Wildwood Weed              Jim Stafford
## # ... with 1 more variables: year <chr>

What we see here is that a “Tie” takes on the value of the previous row. so we can now replace this with the value of the previous row, using the lag function, which turns a sequence like this:

x <- c(1,2,3)

Into this:

lag(x)
## [1] NA  1  2
billboard_raw %>%
  mutate(rank = as.numeric(no)) %>%
  mutate(rank_lag = lag(rank)) %>%
  select(no,rank,rank_lag) %>%
  mutate(rank_rep = if_else(is.na(rank),
                            true = rank_lag,
                            false = rank)) %>%
  filter(is.na(rank))
## Warning in evalq(as.numeric(no), <environment>): NAs introduced by coercion
## # A tibble: 5 x 4
##      no  rank rank_lag rank_rep
##   <chr> <dbl>    <dbl>    <dbl>
## 1   Tie    NA      100      100
## 2   Tie    NA       37       37
## 3   Tie    NA       54       54
## 4   Tie    NA       85       85
## 5   Tie    NA       91       91

OK, now let’s quickly check what these values are:

billboard_raw %>% 
  mutate(rank = as.numeric(no)) %>%
  mutate(rank_lag = lag(rank)) %>%
  select(no,rank,rank_lag) %>%
  mutate(rank_rep = if_else(is.na(rank),
                            true = rank_lag,
                            false = rank)) %>%
  slice(row_pad)
## Warning in evalq(as.numeric(no), <environment>): NAs introduced by coercion
## # A tibble: 15 x 4
##       no  rank rank_lag rank_rep
##    <chr> <dbl>    <dbl>    <dbl>
##  1   100   100       99      100
##  2   Tie    NA      100      100
##  3     1     1       NA        1
##  4    37    37       36       37
##  5   Tie    NA       37       37
##  6    39    39       NA       39
##  7    54    54       53       54
##  8   Tie    NA       54       54
##  9    56    56       NA       56
## 10    85    85       84       85
## 11   Tie    NA       85       85
## 12    87    87       NA       87
## 13    91    91       90       91
## 14   Tie    NA       91       91
## 15    93    93       NA       93

OK great, so now we can combine all of these steps (also renaming the column no to be rank). This is our clean data, so we call it billboard_clean.

billboard_clean <- billboard_raw %>%
  mutate(rank = as.numeric(no),
         year = as.numeric(year),
         rank_lag = lag(rank),
         rank_rep = if_else(condition = is.na(rank),
                            true = rank_lag,
                            false = rank)) %>%
  select(rank_rep,
         title,
         artist,
         year) %>%
  rename(rank = rank_rep)
## Warning in evalq(as.numeric(no), <environment>): NAs introduced by coercion

Note: This is a common pattern that I like to use - build up chains of commands with the magrittr pipe operator until you are confident that they work. Then, you combine them together to create an object. This saves you creating intermediate objects like “data_1”, “data_check_lag”, “data_insert_lag_lead”, etc.

So now we’ve got a clean dataset, let’s think about something interesting to look at in this cool dataset!

One hit wonders

Let’s look at the one hit wonders in this list.

We first add the count, which creates a count of the number of times that each artist appears, and adds it back onto the dataset as a column. It saves doing some fandangling with joins, and was suggested by David Robinson in yet another awesome Rstats moment / design decision being made on twitter.

billboard_clean %>%
  add_count(artist)
## # A tibble: 5,701 x 5
##     rank                     title              artist  year     n
##    <dbl>                     <chr>               <chr> <dbl> <int>
##  1     1 Theme from A Summer Place         Percy Faith  1960     1
##  2     2          He'll Have to Go          Jim Reeves  1960     1
##  3     3             Cathy's Clown The Everly Brothers  1960     6
##  4     4              Running Bear      Johnny Preston  1960     2
##  5     5                Teen Angel        Mark Dinning  1960     1
##  6     6                 I'm Sorry          Brenda Lee  1960    11
##  7     7         It's Now or Never       Elvis Presley  1960    16
##  8     8                 Handy Man         Jimmy Jones  1960     2
##  9     9              Stuck on You       Elvis Presley  1960    16
## 10    10                 The Twist      Chubby Checker  1960     5
## # ... with 5,691 more rows

What we then want to do is see how many bands have the rank of 1 and have only appeared once (n = 1).

# What about one hit wonders?
  
billboard_clean %>%
  add_count(artist) %>%
  filter(rank == 1, n == 1)
## # A tibble: 15 x 5
##     rank                        title
##    <dbl>                        <chr>
##  1     1    Theme from A Summer Place
##  2     1        Stranger on the Shore
##  3     1   Ballad of the Green Berets
##  4     1             To Sir With Love
##  5     1                 Sugar, Sugar
##  6     1                   My Sharona
##  7     1             Careless Whisper
##  8     1  That's What Friends Are For
##  9     1                        Yeah!
## 10     1                      Bad Day
## 11     1                          Low
## 12     1 Somebody That I Used to Know
## 13     1                  Thrift Shop
## 14     1                        Happy
## 15     1                  Uptown Funk
## # ... with 3 more variables: artist <chr>, year <dbl>, n <int>

15 songs!

I would have thought more.

Alright, some interesting ones here. Some personal favourites include:

I’ve actually turned these into a public spotify playlist if you want to listen to them - it’s interesting to track the progress over time of these songs.

Of course, there are a few problems with this - there are a few artists on there who do special duets and “one offs”, like “Dionne and Friends”, which is made up of Dionne Warwick, Elton John, Gladys Knight, and Stevie Wonder, three of who have many hits. Also, some of these artists are more likely to enter the list again, for example, it wouldn’t surprise me if Macklemore, Pharrell, and Mark Ronson make it in again in the future. So while they are currently one hit wonders, it doesn’t mean that they won’t have another hit.

Hmm, on that note, what is the spread of years between entering the list twice or more? Let’s calculate the distance from the latest year to the earliest year for each artist, and look at those who have had a “long reign”, say more than 10 years.

billboard_clean %>%
  group_by(artist) %>%
  summarise(year_dist = max(year) - min(year)) %>%
  filter(year_dist > 0) %>%
  arrange(-year_dist) %>%
  filter(year_dist > 10) %>%
  slice(1:20)
## # A tibble: 20 x 2
##              artist year_dist
##               <chr>     <dbl>
##  1             Cher        34
##  2 The Four Seasons        32
##  3  Aretha Franklin        31
##  4  Michael Jackson        30
##  5      Ben E. King        26
##  6       Elton John        26
##  7   The Beach Boys        26
##  8        Aerosmith        25
##  9    Aaron Neville        24
## 10     Eric Clapton        23
## 11          Madonna        22
## 12      Rod Stewart        22
## 13       Tim McGraw        22
## 14      James Brown        21
## 15          Chicago        20
## 16  Gary U.S. Bonds        20
## 17      Marvin Gaye        20
## 18    Stevie Wonder        20
## 19   Dionne Warwick        19
## 20      Herb Alpert        19

Cool! Cher has had a reign for 34 years - she’s literally been in the charts for longer than I’ve been alive.

But what about those who are only in the list twice? Let’s cut it down a bit, and only appear after 10 years.

billboard_clean %>%
  add_count(artist) %>%
  group_by(artist) %>%
  mutate(year_dist = max(year) - min(year)) %>%
  filter(year_dist > 0) %>%
  filter(n == 2) %>%
  arrange(-year_dist) %>%
  filter(year_dist > 10)
## # A tibble: 10 x 6
## # Groups:   artist [5]
##     rank                        title         artist  year     n year_dist
##    <dbl>                        <chr>          <chr> <dbl> <int>     <dbl>
##  1    76           Tell It Like It Is  Aaron Neville  1967     2        24
##  2    86     Everybody Plays the Fool  Aaron Neville  1991     2        24
##  3    49              The Loco-Motion  Kylie Minogue  1988     2        14
##  4    45 Can't Get You Out of My Head  Kylie Minogue  2002     2        14
##  5    91                   Roundabout            Yes  1972     2        12
##  6     8      Owner of a Lonely Heart            Yes  1984     2        12
##  7    69                    Evil Ways        Santana  1970     2        11
##  8    84                      Winning        Santana  1981     2        11
##  9    87          What Was I Thinkin' Dierks Bentley  2003     2        11
## 10    79             Drunk on a Plane Dierks Bentley  2014     2        11

Cool! Didn’t expect Kylie Minogue, or Santana to be in there!

Check out their music in the playlist billboard-reappearance.

Multiple number ones

What about bands with multiple number ones? How many are there?

billboard_clean %>%
  filter(rank == 1) %>%
  add_count(artist) %>%
  filter(n > 1)
## # A tibble: 2 x 5
##    rank                    title      artist  year     n
##   <dbl>                    <chr>       <chr> <dbl> <int>
## 1     1 I Want to Hold Your Hand The Beatles  1964     2
## 2     1                 Hey Jude The Beatles  1968     2

OK, so not surprising that this is the Beatles, but I swear that they have more than one number 1 hit, based on me growing up listening to their “1” Album.. But looking closely, this is a combination of number one hits in the USA and UK across various lists, whereas this dataset is just for “billboard”.

Artists who appear in the top 100 multiple times

OK, so, I’m still a bit shocked by the fact that they Beatles only have one number one, how many times do they appear in the list?

billboard_clean %>%
  filter(artist == "The Beatles")
## # A tibble: 26 x 4
##     rank                        title      artist  year
##    <dbl>                        <chr>       <chr> <dbl>
##  1     1     I Want to Hold Your Hand The Beatles  1964
##  2     2                She Loves You The Beatles  1964
##  3    13           A Hard Day's Night The Beatles  1964
##  4    14                   Love Me Do The Beatles  1964
##  5    16             Please Please Me The Beatles  1964
##  6    40              Twist and Shout The Beatles  1964
##  7    52            Can't Buy Me Love The Beatles  1964
##  8    55 Do You Want to Know a Secret The Beatles  1964
##  9    95     I Saw Her Standing There The Beatles  1964
## 10     7                        Help! The Beatles  1965
## # ... with 16 more rows

26 times!

That seems like a lot. Is it the most?

Let’s start by looking at the number of times each artist appears in the top 100, by grouping by artist and then using n() to count the number of artists, then arranging by the number of times that they appear.

billboard_clean %>%
  group_by(artist) %>%
  summarise(n_times_in_100 = n()) %>%
  arrange(-n_times_in_100)
## # A tibble: 2,768 x 2
##             artist n_times_in_100
##              <chr>          <int>
##  1         Madonna             35
##  2      Elton John             26
##  3     The Beatles             26
##  4   Janet Jackson             24
##  5    Mariah Carey             24
##  6 Michael Jackson             22
##  7   Stevie Wonder             22
##  8         Rihanna             20
##  9    Taylor Swift             20
## 10 Whitney Houston             19
## # ... with 2,758 more rows

Huh, I didn’t expect Madonna to be in the list the most number of times, but that’s cool!

Let’s plot this, and look at the top 20 artists

billboard_clean %>% 
  group_by(artist) %>%
  summarise(n_times_in_100 = n()) %>%
  arrange(-n_times_in_100) %>%
  top_n(wt = n_times_in_100,
        n = 20) %>%
  ggplot(aes(x = n_times_in_100,
           y = reorder(artist,n_times_in_100))) + 
  ggalt::geom_lollipop(horizontal = TRUE,
                       colour = "navy") + 
  labs(x = "# Times Appeared in top 100\nfrom 1960-2017",
       y = "Artist") +
  theme_minimal()

Now, I’m not exactly a musicologist, but I do enjoy my music. I gotta say, I wasn’t expecting:

And I gotta wonder, how did they get there? Was it a quick journey, a short one?

An artist’s rise to fame

Let’s see if we can view their rise to fame. We can keep a count of the number of times that an artist appeared in the top 100.

We then want to get a tally of the number of times that each artist appears. We can do this by grouping by the artist, arranging by artist and then year, and then creating a new variable that counts up from 1 to the number of times that artist appears. We also just chuck in a filter there to see what happens for just Madonna.

billboard_clean %>% 
  # add a grouping category for the growth
  arrange(artist,year) %>%
  group_by(artist) %>%
  mutate(rank_tally = 1:n()) %>%
  ungroup() %>%
  filter(artist == "Madonna")
## # A tibble: 35 x 5
##     rank             title  artist  year rank_tally
##    <dbl>             <chr>   <chr> <dbl>      <int>
##  1    35        Borderline Madonna  1984          1
##  2    66        Lucky Star Madonna  1984          2
##  3    79           Holiday Madonna  1984          3
##  4     2     Like a Virgin Madonna  1985          4
##  5     9     Crazy for You Madonna  1985          5
##  6    58     Material Girl Madonna  1985          6
##  7    81             Angel Madonna  1985          7
##  8    98      Dress You Up Madonna  1985          8
##  9    29 Papa Don't Preach Madonna  1986          9
## 10    35      Live to Tell Madonna  1986         10
## # ... with 25 more rows

OK let’s put this in a new dataset, “billboard_clean_growth”

billboard_clean_growth <- billboard_clean %>% 
  add_count(artist) %>%
  # add a grouping category for the growth
  arrange(artist,year) %>%
  group_by(artist) %>%
  mutate(rank_tally = 1:n()) %>%
  ungroup()

And now let’s visualise it, but let’s only look at those artists who appeared in the top 100 more than 20 times.

 billboard_clean_growth %>%
  filter(n >= 20) %>%
  ggplot(aes(x = year,
               y = rank_tally,
               group = artist,
               colour = artist)) +
    geom_line()

But let’s add some labels, and remove the legend.

 billboard_clean_growth %>%
  filter(n >= 21) %>%
  ggplot(aes(x = year,
               y = rank_tally,
               group = artist,
               colour = artist)) +
    geom_line() +
    geom_label(data = filter(billboard_clean_growth,
                             n >= 21,
                             rank_tally == n),
               aes(label = artist,
                   fill = artist),
               colour = "white") +
  theme_dark() + 
  expand_limits(x = c(1964,2011)) +
  theme(legend.position = "none") 

There you have it! The rise of artists over time.

Looking at this plot, it makes me want to investigate further with some kind of grouped/multi level growth model - identify those say with the most rapid growth, or perhaps the greatest predicted growth.

I guess the next logical step here is to look more at other kinds of music data - perhaps we can get information on the nationality of artists, and also combine together multiple databases - rolling stone, … and other labels.

There are also some other really cool datasets within the billboard package:

lyrics

A data set containing lyrics for songs on the Billboard Hot 100 over the past 57 years. The lyrics were identified and collected by webscraping so there might be some errors and mistakes - have that in mind.

spotify_track_data

A data set contaning 56 playlists from Spotify that were used to get the songs for the feature extraction of Billboard Hot 100 songs from 1960 to 2015 that you find in spotify_track_data.

spotify_playlists

Using the playlists in the spotify_playlists data set, this data contains the features of all of the tracks on the playlists.

Thanks again for making this R package, Mikkel Krogsholm - great stuff!

Conclusion

There you have it! Some (hopefully!) interesting data munging using tidyverse tools. I got to connect with some interesting music, and also learnt some cool stuff. Cool, right?

Spotify playlists: