Credibly Curious

Nick Tierney's (mostly) rstats blog

2023-02-07

naniar Version 1.0.0

Nicholas Tierney

naniar 1.0.0

I’m very pleased to announce that naniar version 1.0.0 is now on CRAN!

Version 1.0.0 of naniar signifies that this release accompanies the publication of the associated JSS paper, doi:10.18637/jss.v105.i07 (!!!). This paper has been a lot of work for Di Cook and me, and I am very excited to be able to share it.

There is still a lot to do in naniar, and this release does not mean that no further changes are coming. Rather, the 1.0.0 version number establishes this as a stable release: any upcoming changes will go through a more formal deprecation process.

Here’s a brief description of some of the changes in this release:

New things

JSS publication

You can now retrieve a citation for naniar with citation():

citation("naniar")
#> 
#> To cite naniar in publications use:
#> 
#>   Tierney N, Cook D (2023). "Expanding Tidy Data Principles to
#>   Facilitate Missing Data Exploration, Visualization and Assessment of
#>   Imputations." _Journal of Statistical Software_, *105*(7), 1-31.
#>   doi:10.18637/jss.v105.i07 <https://doi.org/10.18637/jss.v105.i07>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations},
#>     author = {Nicholas Tierney and Dianne Cook},
#>     journal = {Journal of Statistical Software},
#>     year = {2023},
#>     volume = {105},
#>     number = {7},
#>     pages = {1--31},
#>     doi = {10.18637/jss.v105.i07},
#>   }
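
If only the BibTeX entry is needed, for example to save into a .bib file, base R's toBibtex() can convert the citation object. This is a minimal sketch, assuming naniar is installed; the file name naniar.bib is purely illustrative:

```r
# Retrieve the citation object for naniar
naniar_citation <- citation("naniar")

# Convert it to a BibTeX entry (a character vector of class "Bibtex")
bibtex_entry <- toBibtex(naniar_citation)
print(bibtex_entry)

# Write the entry to a file; a temporary directory is used here,
# and "naniar.bib" is a hypothetical file name
bib_path <- file.path(tempdir(), "naniar.bib")
writeLines(bibtex_entry, bib_path)
```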

Set missing values with set_n_miss() and set_prop_miss()

These functions set values to missing at randomly chosen positions, where the amount of missingness is given either as a number of values or as a proportion:

library(naniar)
vec <- 1:10
# different each time
set_n_miss(vec, n = 1)
#>  [1] NA  2  3  4  5  6  7  8  9 10
set_n_miss(vec, n = 1)
#>  [1]  1  2  3  4  5  6  7  8  9 NA

set_prop_miss(vec, prop = 0.2)
#>  [1] NA  2  3 NA  5  6  7  8  9 10
set_prop_miss(vec, prop = 0.6)
#>  [1]  1 NA NA  4 NA NA NA  8  9 NA
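
Since the positions differ on every call, a reproducible example needs a seed. The sketch below assumes that set_n_miss() and set_prop_miss() draw their positions from R's standard random number generator, so that set.seed() fixes them; that is an assumption rather than something stated above. The seed value 2023 is arbitrary.

```r
library(naniar)

vec <- 1:10

# Assumption: positions come from R's RNG, so the same seed
# should give the same missing positions
set.seed(2023)
first <- set_n_miss(vec, n = 2)
set.seed(2023)
second <- set_n_miss(vec, n = 2)

# Compare the two results
identical(first, second)

# The number of missing values is fixed by `n` or `prop`;
# only the positions are random
sum(is.na(first))
sum(is.na(set_prop_miss(vec, prop = 0.2)))
```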

I suggest using these functions on the columns of a data frame. Here are a few examples using dplyr. For just one variable, you could set missingness like so:

library(tidyverse)
#> ── Attaching packages ───────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.0      ✔ purrr   1.0.1
#> ✔ tibble  3.1.8      ✔ dplyr   1.1.0
#> ✔ tidyr   1.3.0      ✔ stringr 1.5.0
#> ✔ readr   2.1.3      ✔ forcats 1.0.0
#> ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
mtcars_df <- as_tibble(mtcars)

vis_miss(mtcars_df)


mtcars_miss_mpg <- mtcars_df %>% 
  mutate(mpg = set_prop_miss(mpg, 0.5))

vis_miss(mtcars_miss_mpg)

Or add missingness to a few variables:

mtcars_miss_some <- mtcars_df %>% 
  mutate(
    across(
      c(mpg, cyl, disp),
      \(x) set_prop_miss(x, 0.5)
    )
  )

mtcars_miss_some
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  NA      NA   NA    110  3.9   2.62  16.5     0     1     4     4
#>  2  21      NA   NA    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4    NA  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  NA      NA   NA    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6   NA    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8   NA    245  3.21  3.57  15.8     0     0     3     4
#>  8  NA      NA  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8    NA  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6   NA    123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

vis_miss(mtcars_miss_some)

Or you can add missingness to all variables like so:

mtcars_miss_all <- mtcars_df %>% 
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, 0.5)
    )
  )

mtcars_miss_all
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  NA      NA  160    110  3.9   2.62  16.5    NA    NA     4    NA
#>  2  21      NA   NA    110  3.9   2.88  17.0     0     1    NA    NA
#>  3  22.8     4   NA     NA NA    NA     18.6     1    NA     4    NA
#>  4  NA      NA   NA    110 NA    NA     19.4    NA    NA    NA     1
#>  5  NA       8   NA     NA NA     3.44  NA      NA    NA     3     2
#>  6  18.1     6  225     NA NA    NA     20.2     1     0     3     1
#>  7  NA      NA   NA     NA  3.21  3.57  NA       0    NA    NA     4
#>  8  24.4    NA  147.    NA  3.69  3.19  20      NA    NA     4     2
#>  9  NA       4  141.    95  3.92  3.15  22.9    NA     0    NA    NA
#> 10  NA      NA  168.   123  3.92 NA     NA      NA     0     4     4
#> # … with 22 more rows

vis_miss(mtcars_miss_all)


miss_var_summary(mtcars_miss_all)
#> # A tibble: 11 × 3
#>    variable n_miss pct_miss
#>    <chr>     <int>    <dbl>
#>  1 mpg          16       50
#>  2 cyl          16       50
#>  3 disp         16       50
#>  4 hp           16       50
#>  5 drat         16       50
#>  6 wt           16       50
#>  7 qsec         16       50
#>  8 vs           16       50
#>  9 am           16       50
#> 10 gear         16       50
#> 11 carb         16       50
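
The two functions can also be mixed within a single mutate() call to give each variable its own amount of missingness. This is a minimal sketch using the same mtcars data; the object name mtcars_miss_mixed and the amounts chosen are illustrative only:

```r
library(naniar)
library(dplyr)

mtcars_miss_mixed <- as_tibble(mtcars) %>%
  mutate(
    # exactly 4 values of mpg set to missing
    mpg = set_n_miss(mpg, n = 4),
    # a quarter of the 32 values of cyl set to missing
    cyl = set_prop_miss(cyl, prop = 0.25)
  )

# Summarise the missingness in each variable
miss_var_summary(mtcars_miss_mixed)

# Count the missing values in the two modified variables
n_miss(mtcars_miss_mixed$mpg)
n_miss(mtcars_miss_mixed$cyl)
```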

This resolves #298.

Bug Fixes and other small changes

Some thank yous

Thank you to everyone who has contributed to this release! Especially the following people: @ddauber, @davidgohel.

I am also excited to announce that I have been supported by the R Consortium to improve how R handles missing values! Through this grant, I will be improving the R packages naniar and visdat. I will be posting more details about this soon, but what this means for you, the user, is that there will be more updates and improvements to both of these packages in the coming months. Stay tuned.