
2023-11-10

How to Get Good with R?

Nicholas Tierney

Categories: rstats Tags: rstats functions


[Photo: A stop sign in Melbourne. Film, Pentax K1000, Nick Tierney]

Someone recently asked me, “How do you get good with R?”. It’s a good question. I’m not sure I have a good answer. But I’ve been thinking about it, and I’ve got some ideas. Hopefully they are helpful, or can at least prompt discussion that leads to better ideas. I think you can view “getting good” through two lenses: coding, and non-coding. I’ll write about the coding aspects in this blog post, and the non-coding aspects in a future one.

This post assumes that you’re somewhat familiar with R: you can open it, do some analysis, and so on. I should probably first write a blog post on “How to get started with R”. But writing introductory materials takes a bit more time, and I’d rather put out some kind of blog post than continue to add to the pile of drafts that have been sitting there since 2017.

How to get good, from the coding side:

  1. Try to give things OK names
  2. Use consistent naming
  3. Stick to a style guide
  4. Clean up your code as you go (refactor)
  5. Learn how to create a reproducible example (a reprex), and use it a lot
  6. Use other R packages
  7. Write functions
  8. Learn how to use debugging tools

How to get good, from the non-coding side: as I explain below, that will be its own blog post.

Let’s delve a bit deeper

Let’s give examples of all of these things. I’ve realised as I’ve been writing this that I won’t have time to write about the non-coding aspects, so that will be its own blog post. But I thought it might be useful to outline the points above.

From the coding side

Try to give things OK names

There is a quote by Phil Karlton: “There are only two hard things in Computer Science: cache invalidation and naming things”. We won’t talk about cache invalidation (although Yihui has a good post about it if you are interested), but naming things well…is hard. Giving things OK names is a bit easier, and getting in the habit of giving things an OK name will lead to you giving better and better names as you practice. Why should we care about this? Well, compare:

weight_mean <- mean(ChickWeight$weight)
weight_sd <- sd(ChickWeight$weight)

to

x <- mean(ChickWeight$weight)
y <- sd(ChickWeight$weight)

If I see weight_mean and weight_sd later, I’ve got a clue what they are - the mean and the standard deviation of a thing called weight. I don’t know what x and y are when I see them. A small counterexample to this is very short functions, where a single-letter name can be fine.
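For example (my illustration, not from the original post):

# the whole function fits on one line, so x doesn't need a longer name
square <- function(x) x^2
square(4)
#> [1] 16

But let’s not get into the weeds.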

Consistent naming

Building on the previous point, I think it is useful to pick a way to name something, and stick to it.

Compare:

weight_mean <- mean(ChickWeight$weight)
sd_weight <- sd(ChickWeight$weight)

and

weight_mean <- mean(ChickWeight$weight)
weight_sd <- sd(ChickWeight$weight)

The first one changes the order of the naming. The order I use is variable_operation: the first word describes the variable we are measuring, weight, and the second tells us what operation we’ve done to it. It isn’t the particular order that matters, necessarily, but the consistency.

Why? Because if it is consistent we get benefits in things like tab completion - I can type weight_ and hit tab, and it shows me all the things related to weight that I’ve done. Similarly, it could be useful to know all the things I’ve taken the sd of - typing sd_ could show me those. E.g.,

sd_weight <- sd(ChickWeight$weight)
sd_time <- sd(ChickWeight$Time)

It depends what you are doing - do you care more about finding things related to sd, or things related to weight or Time? My point is that it’s worthwhile to stick to one consistent approach. It means you only have to remember the more important thing, whether that’s things related to weight, or things related to sd.

Stick to a style guide

Style guides are a thing - they help define a set of rules to keep your code easy to read. To quote Hadley Wickham:

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

For a comprehensive guide, there is the Tidyverse Style Guide. For something shorter, see the style guide from “Advanced R”, 1st edition.

I think it is worth your time to read through a style guide once. You’ll notice things that you didn’t before in your code and you’ll see things differently. It is similar to learning about kerning. Once you see it, you’ll notice it everywhere.

So, pick something like snake_case or camelCase or CamelCase, and stick to it. I prefer snake_case because I find it easier to read. But you can do whatever you want! Just be consistent.

I find it confusing to go through code that mixes or combines different approaches - it means you need to spend time remembering which way something was named. Here’s a “good” vs “bad”:

# good
sdWeight
sdTime
weightSd
timeSd
# bad
sd_Weight
SD_weight
Long_variable_Name

If you want a quick way to help style your code, check out Lorenz Walthert’s styler package.
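For example (a minimal sketch: style_text() restyles a string of code, following the tidyverse style guide by default, and style_file() does the same for whole files):

library(styler)

# restyle a messy line of code
style_text("my_mean<-mean( x ,na.rm=TRUE )")
#> my_mean <- mean(x, na.rm = TRUE)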

Clean up your code as you go (refactor)

Get into the habit of quickly going back over your code to tidy it up. A kitchen analogy is useful here, I think: professional kitchens clean as they go, and keeping a tidy workspace helps keep your mind clear. I rarely write my best code on the first pass - it gets better with iteration. So, iterate on your code briefly as you go. Clean up those random comments that don’t mean anything. Remove the bits of wilted lettuce - the commented-out chunks of code you tell yourself you’ll come back to. You probably won’t, and if they really are important, write a comment explaining why they might be relevant. Clean up the style. Check the names. Consider tidying up the implementation - refactoring it. You might not have time to do all of these things, or to do them thoroughly, but doing some of them will help you write better code, and you’ll get better at them the more you practice.

A word worth explaining: “refactor”. It means changing the code while keeping its behaviour the same. Kind of like giving a car a service, or replacing key parts that have worn out. For a great talk on a related topic, see Jenny Bryan’s “Code Smells and Feels”.
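To make “refactor” concrete, here is a tiny, hypothetical before-and-after (my example, not from that talk). The behaviour is identical; the duplicated condition just gets a name:

# before: the same filter is written out twice, and easy to get out of sync
heavy_chicks <- ChickWeight[ChickWeight$Time == 21 & ChickWeight$weight > 200, ]
light_chicks <- ChickWeight[ChickWeight$Time == 21 & ChickWeight$weight <= 200, ]

# after: same behaviour, but the shared step is named
chicks_at_end <- subset(ChickWeight, Time == 21)
heavy_chicks <- subset(chicks_at_end, weight > 200)
light_chicks <- subset(chicks_at_end, weight <= 200)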

Learn how to create a reproducible example (a reprex), and use it a lot

When you run into a problem, or an error, if you can’t work out the answer after some tinkering about, it can be worthwhile spending some time to construct a small example of the code that breaks. This takes a bit of time, and could be its own little blog post. It takes practice. But in the process of reducing the problem down to its core components, I often can solve the problem myself. It’s kind of like that experience of when you talk to someone to try and describe a problem that you are working on, and in talking about it, you arrive at a solution.

There is a great R package that helps you create these reproducible examples, called reprex, by Jenny Bryan. I’ve written about the reprex package here.
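In case it helps, basic usage looks something like this (a minimal sketch: reprex() renders the code on your clipboard - or code you pass to it - into a nicely formatted, shareable output):

library(reprex)

# option 1: copy your small example to the clipboard, then run:
reprex()

# option 2: pass the code in directly:
reprex({
  x <- c(1, 2, NA)
  mean(x)
})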

For the purposes of illustration, let’s briefly tear down a small example using the somewhat large diamonds dataset:

library(tidyverse)
#> ── Attaching core tidyverse packages ────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.3      readr     2.1.4
#>  forcats   1.0.0      stringr   1.5.0
#>  ggplot2   3.4.4      tibble    3.2.1
#>  lubridate 1.9.2      tidyr     1.3.0
#>  purrr     1.0.2     
#> ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
diamonds
#> # A tibble: 53,940 × 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # ℹ 53,930 more rows

Let’s say we had a few steps involved in the data summary of diamonds data:

diamonds %>%
  mutate(
    price_per_carat = price / carat
  ) %>% 
  group_by(
    cut
    ) %>% 
  summarise(
    price_mean = mean(price_per_carat),
    price_sd = sd(price_per_carat),
    mean_color = mean(color)
  )
#> Warning: There were 5 warnings in `summarise()`.
#> The first warning was:
#>  In argument: `mean_color = mean(color)`.
#>  In group 1: `cut = Fair`.
#> Caused by warning in `mean.default()`:
#> ! argument is not numeric or logical: returning NA
#>  Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
#> # A tibble: 5 × 4
#>   cut       price_mean price_sd mean_color
#>   <ord>          <dbl>    <dbl>      <dbl>
#> 1 Fair           3767.    1540.         NA
#> 2 Good           3860.    1830.         NA
#> 3 Very Good      4014.    2037.         NA
#> 4 Premium        4223.    2035.         NA
#> 5 Ideal          3920.    2043.         NA

We get a clue that the problem is in the mean_color line, so let’s try just that line:

diamonds %>%
  mutate(
    mean_color = mean(color)
  )
#> Warning: There was 1 warning in `mutate()`.
#>  In argument: `mean_color = mean(color)`.
#> Caused by warning in `mean.default()`:
#> ! argument is not numeric or logical: returning NA
#> # A tibble: 53,940 × 11
#>    carat cut       color clarity depth table price     x     y     z mean_color
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <dbl>
#>  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43         NA
#>  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31         NA
#>  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31         NA
#>  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63         NA
#>  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75         NA
#>  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48         NA
#>  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47         NA
#>  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53         NA
#>  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49         NA
#> 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39         NA
#> # ℹ 53,930 more rows

We still get that warning, so what if we just do

mean(diamonds$color)
#> Warning in mean.default(diamonds$color): argument is not numeric or logical: returning NA
#> [1] NA

OK, same warning. What is in color?

head(diamonds$color)
#> [1] E E E I J J
#> Levels: D < E < F < G < H < I < J

Does it really make sense to take the mean of some letters? Ah, of course not!
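So the warning was telling us something sensible: color is an ordered factor, not a number. One possible fix (my suggestion, not part of the original summary) is to summarise a factor by counting its levels rather than averaging them:

# a count of each colour within each cut - one row per combination
diamonds %>%
  count(cut, color)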

Use other R packages

You don’t need to write everything yourself from scratch. There are thousands of R packages that people have written to solve problems. Not all of them will suit, but if you find yourself trying to solve a problem, or writing a statistical model from scratch, it is worth your time to search Google to see if anyone has tried to solve this problem.

At worst, you’ll see that no one has thought about this - an opportunity for you to do something cool! Or you might see the ways people have tried to solve it, which can help you get started, and you’ll learn the words other people use to describe the problem you are facing, which will help you search for it better in the future. At best, you’ll find an exact answer.

Write functions

I don’t think I can overstate this: learning how to write functions changed how I think about code and how I think about solving problems. It is a skill that takes some time to develop, but the payoff is enormous. A function is an abstraction of a problem - kind of like a shortcut. Save yourself writing many, many lines of code and replace them with a function. But that’s an oversimplification.

Here’s a short primer on why functions are good, adapted from the functions chapter of R4DS, which I recommend you read.

What does the following code do?

dat <- tibble::tibble(
  weight = rnorm(10, mean = 80, sd = 6),
  height = rnorm(10, mean = 165, sd = 10),
)

dat
#> # A tibble: 10 × 2
#>    weight height
#>     <dbl>  <dbl>
#>  1   78.0   167.
#>  2   76.0   152.
#>  3   81.2   173.
#>  4   71.9   156.
#>  5   88.1   160.
#>  6   80.2   180.
#>  7   67.1   168.
#>  8   82.7   180.
#>  9   81.7   160.
#> 10   81.6   157.

dat$weight_0 <- (dat$weight - mean(dat$weight, na.rm = TRUE)) / sd(dat$weight, na.rm = TRUE)

dat$height_0 <- (dat$height - mean(dat$height, na.rm = TRUE)) / sd(dat$weight, na.rm = TRUE)

This is a kind of transformation called mean centering and scaling. It can be useful in some statistical models for your data to have a mean of 0 and a standard deviation of 1.
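As a quick sanity check (my addition), after centering and scaling the values should have a mean of (effectively) 0 and a standard deviation of 1:

z <- (dat$weight - mean(dat$weight)) / sd(dat$weight)
mean(z) # effectively 0, up to floating point error
sd(z)   # 1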

But reading through that code, it can be hard to see that this is what is happening, and it also invites errors - you have to repeat dat$weight or dat$height many times, and if you copy and paste this code you might end up with a mistake. Which there is: for height_0, I divide by the standard deviation of weight, not height.

A function helps clarify this by abstracting away the need to write the variable in each time. We can break this into two steps:

mean_center <- function(variable){
  variable - mean(variable, na.rm = TRUE)
}

scale_sd <- function(variable){
  variable / sd(variable, na.rm = TRUE)
}

Now we can express this as follows:

dat$height %>% 
  mean_center() %>% 
  scale_sd()
#>  [1]  0.1832022 -1.3362374  0.7525296 -0.9848021 -0.4937227  1.4924404
#>  [7]  0.3021602  1.4477151 -0.5125610 -0.8507242

or, we can write a function that does both of these steps:

center_scale <- function(variable){
  centered <- mean_center(variable)
  centered_scaled <- scale_sd(centered)
  centered_scaled
}

What is nice about this is that each function describes what it does in its name. So when we look at the code for center_scale, we can see that it does a thing called mean_center first, then scale_sd next. Another useful principle when writing functions is to aim for each one to do one task, and to return one thing. There are many exceptions to this rule, but it’s a useful practice to stick to.

So compare:


dat$weight_0 <- (dat$weight - mean(dat$weight, na.rm = TRUE)) / sd(dat$weight, na.rm = TRUE)

dat$height_0 <- (dat$height - mean(dat$height, na.rm = TRUE)) / sd(dat$weight, na.rm = TRUE)

dat
#> # A tibble: 10 × 4
#>    weight height weight_0 height_0
#>     <dbl>  <dbl>    <dbl>    <dbl>
#>  1   78.0   167.   -0.138    0.308
#>  2   76.0   152.   -0.476   -2.25 
#>  3   81.2   173.    0.394    1.27 
#>  4   71.9   156.   -1.18    -1.66 
#>  5   88.1   160.    1.56    -0.831
#>  6   80.2   180.    0.224    2.51 
#>  7   67.1   168.   -1.97     0.508
#>  8   82.7   180.    0.644    2.44 
#>  9   81.7   160.    0.482   -0.862
#> 10   81.6   157.    0.456   -1.43

to


dat$weight_0 <- center_scale(dat$weight)

dat$height_0 <- center_scale(dat$height)

dat
#> # A tibble: 10 × 4
#>    weight height weight_0 height_0
#>     <dbl>  <dbl>    <dbl>    <dbl>
#>  1   78.0   167.   -0.138    0.183
#>  2   76.0   152.   -0.476   -1.34 
#>  3   81.2   173.    0.394    0.753
#>  4   71.9   156.   -1.18    -0.985
#>  5   88.1   160.    1.56    -0.494
#>  6   80.2   180.    0.224    1.49 
#>  7   67.1   168.   -1.97     0.302
#>  8   82.7   180.    0.644    1.45 
#>  9   81.7   160.    0.482   -0.513
#> 10   81.6   157.    0.456   -0.851

It is worth noting that base R has a function called scale that does this. However, it returns a matrix, so you’ll need to wrap it in as.numeric:

dat$weight_0 <- as.numeric(scale(dat$weight))
dat$height_0 <- as.numeric(scale(dat$height))
dat
#> # A tibble: 10 × 4
#>    weight height weight_0 height_0
#>     <dbl>  <dbl>    <dbl>    <dbl>
#>  1   78.0   167.   -0.138    0.183
#>  2   76.0   152.   -0.476   -1.34 
#>  3   81.2   173.    0.394    0.753
#>  4   71.9   156.   -1.18    -0.985
#>  5   88.1   160.    1.56    -0.494
#>  6   80.2   180.    0.224    1.49 
#>  7   67.1   168.   -1.97     0.302
#>  8   82.7   180.    0.644    1.45 
#>  9   81.7   160.    0.482   -0.513
#> 10   81.6   157.    0.456   -0.851

Which is why it is worthwhile looking to see if a solution to a problem exists :)

To summarise writing functions:

  1. Functions help us reason with our code by abstracting away details we don’t need to care about
  2. Give functions good names - they should describe what they do
  3. Functions should do one task
  4. Functions should return one thing

Learn how to use debugging tools

R is interactive: you can run some code and see the output immediately. It’s a really nice way to code for data analysis. It means that when you encounter a bug or an error, you can poke around and see what the problem is. But if you write a function and want to see how it behaves, or what goes on inside it, you’ll need to use a special tool called a debugger.

If you’ve found yourself copying the internals of a function over to another script and hard coding the arguments at the top of the script and running the code line by line, then this is especially for you. I say this as someone who did that for years. Ahem.

Three debugging tools to know about:

  1. browser()
  2. debug() and undebug()
  3. debugonce()

browser() gets placed inside a function, and then when you run the function, you land at that point in the code. For example:

mean_center <- function(variable){
  browser()
  variable - mean(variable, na.rm = TRUE)
}

If you run this code, and then use the function later, you will land where the browser() line is. Try running the code above and then calling mean_center(1:10). An important thing to remember with browser() is that you need to remove it when you are done.
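Once you land in the browser, you can inspect objects and step through the function. A few of R’s standard debugger commands:

# at the Browse[1]> prompt:
#   n        run the next line
#   c        continue running the rest of the function
#   Q        quit the debugger
#   where    show where you are in the call stack
#   ls()     list the objects inside the function's environment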

debug() and undebug() are like browser(), but you run them on a function to switch debugging mode on and off. Calling debug() is just like putting browser() on the first line of a function.

debug(mean_center)
mean_center(1:10)
## enter debug mode here, tinker around with the outputs
## then turn off the debugger
undebug(mean_center)

debugonce() does the debug() step, but just one time. It’s handy because it means you don’t need to run undebug() - sometimes I’ve forgotten that I needed to run undebug(), and I’ve felt like a madness has descended upon me.

So what to remember about this? If you want to see where a function is erroring, then you probably want to use debugonce() on it.
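For example (a small sketch; I’m redefining mean_center without the browser() call from earlier so it behaves normally):

mean_center <- function(variable){
  variable - mean(variable, na.rm = TRUE)
}

debugonce(mean_center)
mean_center(1:10) # opens the debugger, this one time only
mean_center(1:10) # runs normally - no undebug() needed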

(Not) The End

There’s more to this, and I don’t have all the answers. I’m a decent R programmer, but I’m still learning all the time, and I make mistakes a lot. I’m not perfect, far from it! But maybe these things can help you on your journey to improving your R programming.

I didn’t go into as much detail as I would have liked - there are a lot of topics here. What did I miss? What do you think are the ways to get good with R? Write a comment below :)