Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2023-02-02

visdat Version 0.6.0

Nicholas Tierney

Categories: rstats data visualisation Missing Data rbloggers Tags: rstats data visualisation missing-data rbloggers

9 minute read

I’m please to say that visdat version 0.6.0 (codename: “Superman, Lazlo Bane”) is now on CRAN. This is the first release in nearly 3 years, there are a couple of new functions for visualising numeric and binary data, as well as some maintenance and bug fixes.

Let’s walk through some of the new features, bug fixes, and other misc changes.

New Features

options(tidyverse.quiet = TRUE)
library(tidyverse)
library(visdat)

vis_value() - visualise values

The idea of vis_value() is to visualise numeric data, so that you can get a quick idea of the values in your dataset. It does this by scaling all the data between 0 and 1, but it only works with numeric data.

vis_value(mtcars)

It can be fun and interesting to arrange by a variable and then show see how that changes the plot.

vis_cor(mtcars)

mtcars %>% arrange(cyl) %>% vis_value()

Although fair warning that there’s a whole set of statistics/data visualisation that focusses on how to arrange rows and columns - a technique called seriation. For a fun introduction, I’d recommend this lovely blogpost by Nicholas Kruchten. One day I will implement seriation in visdat.

Note that if you use vis_value() on a dataset that isn’t entirely numeric, you will get an error:

vis_value(diamonds)
#> Error in `test_if_all_numeric()` at visdat/R/vis-value.R:33:2:
#> ! Data input can only contain numeric values
#> Please subset the data to the numeric values you would like.
#> `dplyr::select(<data>, where(is.numeric))`
#> Can be helpful here!

vis_binary() visualise binary values

The vis_binary() function is for visualising datasets with binary values - similar to vis_value(), but just for binary data (0, 1, NA).

vis_binary(dat_bin)

Thank you to Trish Gilholm for her suggested use case for this.

Facetting in visdat

It is now possible to perform facetting for the following functions in visdat: vis_dat(), vis_cor(), and vis_miss() via the facet argument. This lead to some internal cleaning up of package code (always fun to revisit some old code and refactor!) Here’s an example of facetting:

vis_dat(airquality, facet = Month) 

vis_cor(airquality, facet = Month) 

vis_miss(airquality, facet = Month) 

Notably for vis_miss when using facetting, you don’t get column missingness summaries, as I couldn’t quite work out how to do this for each facet.

Thank you to Sam Firke’s initial tweet on this that inspired this, and Jonathan Zadra’s contributions in the issue thread.

The next release will implement facetting for vis_value(), vis_binary(), vis_compare(), vis_expect(), and vis_guess() - see #159 to keep track.

Data methods for plotting

Related to facetting, I have implemented methods that provide data methods for plots with data_vis_dat(), data_vis_cor(), and data_vis_miss():

data_vis_dat(airquality)
#> # A tibble: 918 × 4
#>     rows variable valueType value
#>    <int> <chr>    <chr>     <chr>
#>  1     1 Day      integer   41   
#>  2     1 Month    integer   190  
#>  3     1 Ozone    integer   7.4  
#>  4     1 Solar.R  integer   67   
#>  5     1 Temp     integer   5    
#>  6     1 Wind     numeric   1    
#>  7     2 Day      integer   36   
#>  8     2 Month    integer   118  
#>  9     2 Ozone    integer   8    
#> 10     2 Solar.R  integer   72   
#> # … with 908 more rows
data_vis_miss(airquality)
#> # A tibble: 918 × 4
#>     rows variable valueType value
#>    <int> <chr>    <chr>     <chr>
#>  1     1 Day      FALSE     FALSE
#>  2     1 Month    FALSE     FALSE
#>  3     1 Ozone    FALSE     FALSE
#>  4     1 Solar.R  FALSE     FALSE
#>  5     1 Temp     FALSE     FALSE
#>  6     1 Wind     FALSE     FALSE
#>  7     2 Day      FALSE     FALSE
#>  8     2 Month    FALSE     FALSE
#>  9     2 Ozone    FALSE     FALSE
#> 10     2 Solar.R  FALSE     FALSE
#> # … with 908 more rows
data_vis_cor(airquality)
#> # A tibble: 36 × 3
#>    row_1   row_2     value
#>    <chr>   <chr>     <dbl>
#>  1 Ozone   Ozone    1     
#>  2 Ozone   Solar.R  0.348 
#>  3 Ozone   Wind    -0.602 
#>  4 Ozone   Temp     0.698 
#>  5 Ozone   Month    0.165 
#>  6 Ozone   Day     -0.0132
#>  7 Solar.R Ozone    0.348 
#>  8 Solar.R Solar.R  1     
#>  9 Solar.R Wind    -0.0568
#> 10 Solar.R Temp     0.276 
#> # … with 26 more rows

The implementation of this works by providing these functions as S3 methods that have a .grouped_df method to facilitate plotting with facets.

airquality %>% group_by(Month) %>% data_vis_dat()
#> # A tibble: 765 × 5
#> # Groups:   Month [5]
#>    Month  rows variable valueType value
#>    <int> <int> <chr>    <chr>     <chr>
#>  1     5     1 Day      integer   41   
#>  2     5     1 Ozone    integer   190  
#>  3     5     1 Solar.R  integer   7.4  
#>  4     5     1 Temp     integer   67   
#>  5     5     1 Wind     numeric   1    
#>  6     5     2 Day      integer   36   
#>  7     5     2 Ozone    integer   118  
#>  8     5     2 Solar.R  integer   8    
#>  9     5     2 Temp     integer   72   
#> 10     5     2 Wind     numeric   2    
#> # … with 755 more rows

Missing values show up in list columns

vis_dat() vis_miss() and vis_guess() now render missing values in list-columns. Let’s demonstrate this with the star_wars dataset from dplyr, which has a few list columns.

starwars
#> # A tibble: 87 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 NA      gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 NA      white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 NA      white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
glimpse(starwars)
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
vis_dat(starwars)

vis_miss(starwars)

vis_guess(starwars)

As you can see, lists are now displayed in the visualisation. Unfortunately vis_guess has trouble guessing lists, but that is a limitation due to how it guesses variable types.

Thank you to github user cregouby for adding this in #138.

Abbreviation helpers

Long variable names can be annoying and can crowd a plot. The abbreviate_vars() function can be used to help with this:

long_data <- tibble(
  really_really_long_name = c(NA, NA, 1:8),
  very_quite_long_name = c(-1:-8, NA, NA),
  this_long_name_is_something_else = c(NA, NA,
                                       seq(from = 0, to = 1, length.out = 8))
)

vis_miss(long_data)

Ugh no good.

Use abbreviate_vars() to help:

long_data %>% abbreviate_vars() %>% vis_miss()

You can control the length of the abbreviation with min_length:

long_data %>% abbreviate_vars(min_length = 5) %>% vis_miss()

Under the hood this uses the base R function, abbreviate() - a gem of a function.

Thank you to chapau3, and ivanhanigan for requesting this feature (#140 and #9).

Missingness percentages are now integers.

The vis_miss() function shows a percentage missing of the missing values in each column - I’ve decided to make this round to integers, as it is only a guide and I found them to be a bit cluttered. I also do not like the idea of extracting summary statistics from graphics, so they now look like this:

vis_miss(airquality)

For more accurate representation of missingness summaries please use the naniar R package functions like miss_var_summary():

library(naniar)
miss_var_summary(airquality)
#> # A tibble: 6 × 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <dbl>
#> 1 Ozone        37    24.2 
#> 2 Solar.R       7     4.58
#> 3 Wind          0     0   
#> 4 Temp          0     0   
#> 5 Month         0     0   
#> 6 Day           0     0

Which also works with dplyr::group_by():

airquality %>% 
  group_by(Month) %>% 
  miss_var_summary()
#> # A tibble: 25 × 4
#> # Groups:   Month [5]
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <dbl>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0  
#>  4     5 Temp          0      0  
#>  5     5 Day           0      0  
#>  6     6 Ozone        21     70  
#>  7     6 Solar.R       0      0  
#>  8     6 Wind          0      0  
#>  9     6 Temp          0      0  
#> 10     6 Day           0      0  
#> # … with 15 more rows

If you find that this really bothers you and you want to have control over the percentage missingness in the columns, please file an issue on visdat and I will look into adding more user control. Ideally I would have added control for users in the first place, but it just wasn’t something I was certain users wanted, and would require more arguments to a function, which require more tests…and so on. So with the mindset of keeping this package easy to maintain, I figured this might be the easiest way forward.

Bug Fixes and Other changes

Here is a quick listing of the other changes in this release of visdat:

Thanks

Thank you to all the users who contributed to this release!

muschellij2, chapau3, ivanhanigan, cregouby, jzadra.