Credibly Curious

Nick Tierney's (mostly) rstats blog

2020-09-17

The Many Flavours of Missing Values

Nicholas Tierney

Categories: Missing Data rstats Tags: rstats missing-data

NA values represent missing values in R. They’re awesome, because they’re baked right into R natively. There are actually many different flavours of NA values in R: NA (logical), NA_integer_, NA_real_ (double), NA_character_, and NA_complex_.

This means that these NA values have different properties: even though they all print as NA, underneath they are character, or complex, or whatnot.

is.na(NA)

#> [1] TRUE

is.na(NA_character_)

#> [1] TRUE

is.character(NA_character_)

#> [1] TRUE

is.double(NA_character_)

#> [1] FALSE

is.integer(NA_integer_)

#> [1] TRUE

is.logical(NA)

#> [1] TRUE
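To see all the flavours side by side, here is a small sketch in base R. The names `na_flavours`, `na_types`, and `all_missing` are just made up for this example.

```r
# One value of each NA flavour; a list lets each element keep its own type
na_flavours <- list(
  logical   = NA,
  integer   = NA_integer_,
  double    = NA_real_,
  character = NA_character_,
  complex   = NA_complex_
)

# typeof() reports the underlying type of each flavour
na_types <- vapply(na_flavours, typeof, character(1))
na_types

# Whatever the type, every flavour still counts as missing
all_missing <- all(vapply(na_flavours, is.na, logical(1)))
all_missing
#> [1] TRUE
```

Each name in `na_types` should line up with the type reported next to it.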

Uhhh-huh. So, neat? Right? NA values are this dual entity: missing, but still carrying a class? Yup! And NA, NA_integer_, NA_real_, and NA_character_ are among the special reserved words in R. That’s a fun fact.

OK, so why care about this? Well, in R, when you create a vector, every element has to resolve to the same class. Not sure what I mean?

Well, imagine you want to have the values 1 to 3:

c(1,2,3)

#> [1] 1 2 3

And then you add one that is in quotes, “hello there”:

c(1,2,3, "hello there")

#> [1] "1"           "2"           "3"           "hello there"

They all get converted to “character”, because a number can always be written as text, but text can’t always be turned into a number. For more on this, see Hadley Wickham’s vctrs talk.
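Here is a quick sketch of that coercion with a few more types. When types are mixed, `c()` picks the most general one, roughly in the order logical, integer, double, character. The variable name `mixed_types` is just for illustration.

```r
# Each c() call mixes two types; typeof() shows which one wins
mixed_types <- c(
  lgl_int = typeof(c(TRUE, 1L)),           # logical + integer
  int_dbl = typeof(c(1L, 2.5)),            # integer + double
  dbl_chr = typeof(c(2.5, "hello there"))  # double + character
)

mixed_types[["lgl_int"]]
#> [1] "integer"

mixed_types[["int_dbl"]]
#> [1] "double"

mixed_types[["dbl_chr"]]
#> [1] "character"
```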

Well, it turns out that NA values need to have that feature as well: they aren’t some amorphous value that magically takes on the class. Well, they kind of are, actually, and that’s kind of the point. A plain NA is logical, and it quietly gets converted to match the rest of the vector. We don’t notice it, and it’s one of the great things about R: it has native support for NA values.

So, imagine this tiny vector, then:

vec <- c("a", NA)
vec

#> [1] "a" NA

is.character(vec[1])

#> [1] TRUE

is.na(vec[1])

#> [1] FALSE

is.character(vec[2])

#> [1] TRUE

is.na(vec[2])

#> [1] TRUE
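The same thing happens for other vector types. This sketch (with illustrative names `chr_vec`, `int_vec`, and `dbl_vec`) shows a plain NA starting out as logical and then becoming the matching flavour once it sits inside a vector:

```r
# On its own, a plain NA is logical
typeof(NA)
#> [1] "logical"

chr_vec <- c("a", NA)
int_vec <- c(1L, NA)
dbl_vec <- c(1.5, NA)

# Inside a vector, the NA becomes the typed flavour for that vector
identical(chr_vec[2], NA_character_)
#> [1] TRUE

identical(int_vec[2], NA_integer_)
#> [1] TRUE

identical(dbl_vec[2], NA_real_)
#> [1] TRUE
```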

OK, so, what’s the big deal? What’s the deal with this long lead up? Stay with me, we’re nearly there:

vec <- c(1:5)
vec

#> [1] 1 2 3 4 5

Now, let’s say we want to replace values greater than 4 with the next line in the song by Feist.

If we use base R’s ifelse:

ifelse(vec > 4, yes = "tell me that you love me more", no = vec)

#> [1] "1"                             "2"                            
#> [3] "3"                             "4"                            
#> [5] "tell me that you love me more"

It converts everything to character. We get what we want here.

Now, if we use dplyr::if_else:

dplyr::if_else(vec > 4, true = "tell me that you love me more", false = vec)

#> Error: `false` must be a character vector, not an integer vector.

Ooo, an error? This is useful, because it protects you in a case where you accidentally mix types, like this:

dplyr::if_else(vec > 4, true = "5", false = vec)

#> Error: `false` must be a character vector, not an integer vector.

Base R wouldn’t protect you against this. It quietly converts the whole vector to character:

ifelse(vec > 4, yes = "5", no = vec)

#> [1] "1" "2" "3" "4" "5"
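To make the idea of a type check concrete, here is a minimal sketch in base R. `strict_ifelse` is a hypothetical helper written for this post, not how dplyr implements `if_else`; it simply refuses to run when `yes` and `no` have different types.

```r
# Hypothetical helper: ifelse() with a type check bolted on the front
strict_ifelse <- function(test, yes, no) {
  if (typeof(yes) != typeof(no)) {
    stop(
      "`yes` is ", typeof(yes), " but `no` is ", typeof(no),
      call. = FALSE
    )
  }
  ifelse(test, yes, no)
}

vec <- 1:5

# Matching types (integer and integer) pass straight through
ok_result <- strict_ifelse(vec > 4, yes = 0L, no = vec)
ok_result
#> [1] 1 2 3 4 0

# Mismatched types (character and integer) are caught;
# tryCatch() captures the error message instead of stopping the script
err_message <- tryCatch(
  strict_ifelse(vec > 4, yes = "5", no = vec),
  error = function(e) conditionMessage(e)
)
err_message
```

The check is deliberately blunt: it compares `typeof()` only, so it would also reject mixing integer and double.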

So why does that matter for NA values?

Well, because if you try to replace values greater than 4 with NA, you’ll get the same kind of error:

dplyr::if_else(vec > 4, true = NA, false = vec)

#> Error: `false` must be a logical vector, not an integer vector.

That’s because a plain NA is logical, while vec is integer. This can be resolved by using the appropriate NA type:

dplyr::if_else(vec > 4, true = NA_integer_, false = vec)

#> [1]  1  2  3  4 NA

And that’s why the flavours of NA are important to know about.

It’s one of these somewhat annoying things that you can come across in the tidyverse, but it’s also kind of great. It’s opinionated, and it means that you will almost certainly save yourself a whole world of pain later.

What is kind of fun is that using base R you can get some interesting results playing with the different types of NA values, like so:

ifelse(vec > 4, yes = NA, no = vec)

#> [1]  1  2  3  4 NA

ifelse(vec > 4, yes = NA_character_, no = vec)

#> [1] "1" "2" "3" "4" NA
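What’s going on there is that the flavour of NA you hand to `ifelse` decides the type of the whole result. A small sketch, checking the result with `typeof()` (the `res_*` names are just for illustration):

```r
vec <- 1:5

# A plain (logical) NA gives way to the integers in vec
res_lgl <- ifelse(vec > 4, yes = NA, no = vec)
typeof(res_lgl)
#> [1] "integer"

# A double NA pulls the whole result up to double
res_dbl <- ifelse(vec > 4, yes = NA_real_, no = vec)
typeof(res_dbl)
#> [1] "double"

# A character NA pulls the whole result up to character
res_chr <- ifelse(vec > 4, yes = NA_character_, no = vec)
typeof(res_chr)
#> [1] "character"
```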

It’s also worth knowing that you’ll get the same error appearing in case_when:

dplyr::case_when(
  vec > 4 ~ NA,
  TRUE ~ vec
  )

#> Error: must be a logical vector, not an integer vector.

But this can be resolved by using the appropriate NA value:

dplyr::case_when(
  vec > 4 ~ NA_integer_,
  TRUE ~ vec
  )

#> [1]  1  2  3  4 NA

Lesson learnt?

Remember: if you are replacing values with NA when using dplyr::if_else or dplyr::case_when, consider which flavour of NA to use!
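If you don’t want to remember which flavour goes with which vector, one possible trick is to let the vector tell you. This is a sketch only: `na_like` is a hypothetical helper, and it relies on the base R behaviour that indexing a vector with `NA_integer_` returns a single NA of that vector’s type.

```r
# Hypothetical helper: return one NA of the same type as x
na_like <- function(x) x[NA_integer_]

vec <- 1:5

identical(na_like(vec), NA_integer_)
#> [1] TRUE

identical(na_like(c("a", "b")), NA_character_)
#> [1] TRUE

# The typed NA can then be used as the replacement value
replaced <- ifelse(vec > 4, yes = na_like(vec), no = vec)
replaced
#> [1]  1  2  3  4 NA

typeof(replaced)
#> [1] "integer"
```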

Happy travels!