Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2021-08-21

How Much One Shapefile Overlaps Another?

Nicholas Tierney

Categories: rstats spatial gis sf Tags: rstats spatial gis sf

5 minute read

Photo By Nick Tierney

Last week I needed to understand how to calculate how much two spatial areas overlapped. I found a nice thread on GIS stack exchange that explained an answer (from user Sandy AB). This pretty much gave me what I was looking for - but I was after something with more pictures, so here’s an example of that.

The Data

I started by getting some spatial areas from the ABS at this link, selecting, “statistical local areas level 4 2021”:

I then did a bit of processing to shrink the data down to just my hometown, Brisbane.

Let’s read this in with sf:

library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
brisbane_sf <- read_sf(here::here("content/post/2021-08-18-how-do-i-find-out-how-much-a-shapefile-overlaps-another/data/brisbane-sla4.shp"))

The Data Vis

Let’s plot my hometown, Brisbane with ggplot2 and geom_sf().

library(ggplot2)
ggplot() + 
  geom_sf(data = brisbane_sf,
          fill = "forestgreen") 

Let’s say we have a simpler shape file that we want to compare to this - which we create by simplifying the shapefile with rmapshaper.

library(rmapshaper)
#> Registered S3 method overwritten by 'geojsonlint':
#>   method         from 
#>   print.location dplyr
brisbane_simpler_sf <- brisbane_sf %>% ms_simplify(keep = 0.01)

With this simpler file, we want to know how much area we are losing. Here’s the “simpler” Brisbane in red.

ggplot() + 
  geom_sf(data = brisbane_simpler_sf, 
          fill = "firebrick")

We can overlay them to see how similar they are, full Brisbane in green, and new Brisbane in red:

ggplot() + 
  geom_sf(data = brisbane_sf,
          fill = "forestgreen", alpha = 0.5) + 
  geom_sf(data = brisbane_simpler_sf, 
          fill = "firebrick",
          alpha = 0.5)

So how do we calculate the difference between the two? We can use st_intersection to find where both shapes overlap. Here it is visualised:

st_intersection(brisbane_sf, brisbane_simpler_sf) %>% 
  ggplot() +
  geom_sf(fill = "brown", alpha = 0.5)
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries

Compare this again to the previous plot - these are the brown sections.

But say we want to calculate the ratio of how similar these two shape files are - how do we do that? Here are the steps:

  1. Find the area of the shapefile for the original Brisbane file
  2. Find the area of the overlap between the two files
  3. Calculate the ratio of the area in the original shapefile, and the area of the overlapping area
  4. Calculate the area of the overlap compared to the original Brisbane file, giving us the percentage of overlap!

The area of the shapefile for the original Brisbane file

Here are the steps:

  1. Calculate area with st_area
  2. Reduce the size of the data - then only keep the relevant data (just keeping the SA4 name (SA4_NAME21), and then area, then drop the geometry column).
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
brisbane_sf_areas <- brisbane_sf %>% 
  mutate(brisbane_original_area = st_area(.)) %>% 
  select(SA4_NAME21, brisbane_original_area) %>% 
  st_drop_geometry()
  
brisbane_sf_areas
#> # A tibble: 1 × 2
#>   SA4_NAME21          brisbane_original_area
#> * <chr>                                [m^2]
#> 1 Brisbane Inner City               81738079

The area of the overlap between the two files

  1. Calculate the intersection of these two shape files (st_intersection)
  2. Calculate that area (st_area)
  3. Then only keep the relevant data again
intersection_area <- st_intersection(brisbane_sf, brisbane_simpler_sf) %>% 
    mutate(intersect_area = st_area(.)) %>% 
    select(SA4_NAME21, intersect_area) %>% 
    st_drop_geometry()
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries

intersection_area
#> # A tibble: 1 × 2
#>   SA4_NAME21          intersect_area
#> * <chr>                        [m^2]
#> 1 Brisbane Inner City       76638525

The ratio of the area in the original shapefile, and the area of the overlapping areas

Now we have our pieces, now let us add these columns of the other data back to it:

intersection_area %>% 
    left_join(brisbane_sf_areas, 
              by = "SA4_NAME21") 
#> # A tibble: 1 × 3
#>   SA4_NAME21          intersect_area brisbane_original_area
#>   <chr>                        [m^2]                  [m^2]
#> 1 Brisbane Inner City       76638525               81738079

And then calculate the ratio - which will tell us how much the simpler shape file overlaps the original ratio.

intersection_area %>% 
    left_join(brisbane_sf_areas, 
              by = "SA4_NAME21") %>% 
    mutate(weight = intersect_area / brisbane_original_area)
#> # A tibble: 1 × 4
#>   SA4_NAME21          intersect_area brisbane_original_area   weight
#>   <chr>                        [m^2]                  [m^2]      [1]
#> 1 Brisbane Inner City       76638525               81738079 0.937611

And, because I think it’s good practice, all together as a function:

calculate_spatial_overlap <- function(shape_new,
                                      shape_old) {
  
  
  intersection_area <- st_intersection(shape_new, shape_old) %>% 
    mutate(intersect_area = st_area(.)) %>% 
    select(SA4_NAME21, intersect_area) %>% 
    st_drop_geometry()
  
  # Create a fresh area variable for counties
  shape_old_areas <- shape_old %>% 
    mutate(original_area = st_area(.)) %>% 
    select(original_area, SA4_NAME21) %>% 
    st_drop_geometry()
  
  intersection_area %>% 
    left_join(shape_old_areas, 
              by = "SA4_NAME21") %>% 
    mutate(weight = intersect_area / original_area)
  
}
calculate_spatial_overlap(
    brisbane_simpler_sf, brisbane_sf
  )
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
#> # A tibble: 1 × 4
#>   SA4_NAME21          intersect_area original_area   weight
#>   <chr>                        [m^2]         [m^2]      [1]
#> 1 Brisbane Inner City       76638525      81738079 0.937611

So this tells us:

The new shapefile covers about 93% of the old shape file.

That is, the red areas cover 93% of the green area:

Why do this?

Our use case in this example was to calculate the difference between shapefiles so we could then use this overalapping difference as a weight in subsequent measurements.

But you can do more with this - the example from the GIS stack exchange thread was trying to calculate the amount of overlap of a lot of smaller shape files on another shapefile - a measurement often referred to as “coverage”.

And that’s it - hope that helps someone!

Thanks

Thank you again to user Sandy AB from Stack Exchange, who posted the answer!