Efficient spatial joining for large datasets in R

First of all, making a function more efficient can mean speeding up how quickly the computer carries out that task (algorithmic efficiency). And...

Efficient R programming is the implementation of efficient programming practices in R. All languages are different, so efficient R code does not look like efficient code in another language. Many packages have been optimised for performance so, for some operations, achieving maximum computational efficiency may simply be a case of selecting the appropriate package and using it correctly. There are many ways to get the same result in R, and some are very slow. Therefore not writing slow code should be prioritized over writing fast code. (Taken from the book Efficient R Programming.)
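A tiny illustration of that last point (not spatial yet, just a sketch you can run): the two lines below compute exactly the same row sums, but the specialised function is typically much faster than the general-purpose apply() call.

m <- matrix(runif(1e6), nrow = 1000)
system.time(a <- apply(m, 1, sum))  # general-purpose, slower
system.time(b <- rowSums(m))        # specialised, much faster
all.equal(a, b)                     # same result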

So, performing more efficient spatial joins on large data sets could simply mean writing faster code. In this case, I assume that the over function from the sp package has already been optimized for performance, so I won't write my own function or look for another R package.

Instead, I will show how to go parallel in R and speed up the process by using all the CPU cores in your computer. Please try the commented, reproducible example below:

Load data

# Load libraries
library(sp)    # provides coordinates(), proj4string() and over(); also attached by rgdal
library(rgdal) # readOGR() for reading the shapefile

# Load data
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")

# Read ecoregions shapefiles
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land",
                           verbose = FALSE)

# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
df <- expand.grid(lon = lon, lat = lat)
df$area <- NA # Add an empty attribute called "area" to hold the ecoregion assigned later

# Create a SpatialPointsDataFrame object from dataframe
coordinates(df) <- c("lon", "lat")

# Add projection to SpatialPointsDataFrame object
proj4string(df) <- ices_eco@proj4string

Aim: get the Ecoregion from the ices_eco SpatialPolygonsDataFrame for each position in the df SpatialPointsDataFrame
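For reference, the straightforward single-core version of this join is one big over() call on all 361,722 points (the same call used per chunk below, and in the control check further down). It is shown here only as a baseline; on this many points it can take a long time, which is exactly what the parallel approach addresses.

# Single-core baseline (slow on large data sets)
baseline <- as.character(over(df, ices_eco)$Ecoregion)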

# Parallel process: using multiple CPU cores

# Load the 'parallel' package, which supports parallel computation in R
library('parallel')

# Calculate the number of cores (leave one core free for your operating system)
no_cores <- detectCores() - 1

# Cut df into "n" parts
# Higher "n": less memory required but slower
# Lower "n": more memory required but faster
n <- 1000
parts <- split(x = 1:length(df), f = cut(1:length(df), n))
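
# Optional sanity check: 'parts' should be a list of n index vectors that
# together cover every point exactly once, each holding roughly length(df)/n
# (~362) point indices
length(parts)            # number of chunks (= n)
summary(lengths(parts))  # chunk sizes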

Initiate the cluster like this if you are on macOS or Linux


# Initiate Cluster of CPU cores
# Note: all the objects used in the parallel process (e.g. ices_eco, df, n, parts)
# must be defined before making the cluster
cl <- makeCluster(no_cores, type = "FORK")

print(cl) # summary of the cluster
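
# Optional check (FORK only): forked workers inherit the master's workspace,
# so objects such as 'ices_eco' are already visible on every core
clusterEvalQ(cl = cl, expr = exists("ices_eco"))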

Initiate the cluster like this if you are on Windows


# Initiate Cluster of CPU cores
# Note: all the objects used in the parallel process (e.g. ices_eco, df, n, parts)
# must be defined before making the cluster
cl <- makeCluster(no_cores, type = "PSOCK")

# Load the required libraries on the workers
clusterEvalQ(cl = cl, expr = library('sp'))

# Export all the objects required by the parallel function to the workers
clusterExport(cl = cl, varlist = c("ices_eco", "df", "parts", "n"))

print(cl) # summary of the cluster
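
# Optional check (PSOCK only): confirm that 'sp' is attached and the exported
# objects are present on every worker before starting the heavy computation
clusterEvalQ(cl = cl, expr = c("package:sp" %in% search(), exists("ices_eco")))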

Continue running the parallel function

# Parallelize the sp::over() call
# Returns a list with the overlays (one data.frame per chunk)
system.time(

  overParts <- parLapply(cl = cl, X = 1:n, fun = function(x) {
    res <- over(df[parts[[x]], ], ices_eco)
    gc() # release memory
    return(res)
  })
)

# user  system elapsed
# 1.050   1.150 627.111
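
# Sketch of an alternative: if some chunks take much longer than others (e.g.
# chunks falling mostly on land), parallel::parLapplyLB() is a drop-in,
# load-balanced replacement for the parLapply() call above:
# overParts <- parLapplyLB(cl = cl, X = 1:n, fun = function(x) {
#   res <- over(df[parts[[x]], ], ices_eco)
#   gc()
#   return(res)
# })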

# Stop Cluster of CPU cores
stopCluster(cl)

# Merge df with ecoregions
for (i in 1:n) {

  message(paste("Merging part", i, "of", n))
  df$area[parts[[i]]] <- as.character(overParts[[i]]$Ecoregion)

}
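Because cut() produces contiguous chunks in order, the same merge can also be written as one vectorized assignment (equivalent to the loop above):

df$area <- unlist(lapply(overParts, function(p) as.character(p$Ecoregion)),
                  use.names = FALSE)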

Control check

# Control check by random sampling of 20 elements
randomSampling <- sample(x = 1:length(df), size = 20)

chkA <- as.character(over(df[randomSampling,], ices_eco, returnList = FALSE)$Ecoregion) # direct method
chkB <- df$area[randomSampling] # sample df

# chkA should be equal to chkB
print(cbind(chkA, chkB))

# chkA                         chkB
# [1,] NA                           NA
# [2,] "Baltic Sea"                 "Baltic Sea"
# [3,] NA                           NA
# [4,] "Baltic Sea"                 "Baltic Sea"
# [5,] "Baltic Sea"                 "Baltic Sea"
# [6,] NA                           NA
# [7,] NA                           NA
# [8,] "Celtic Seas"                "Celtic Seas"
# [9,] NA                           NA
# [10,] NA                           NA
# [11,] "Baltic Sea"                 "Baltic Sea"
# [12,] "Oceanic Northeast Atlantic" "Oceanic Northeast Atlantic"
# [13,] "Greater North Sea"          "Greater North Sea"
# [14,] NA                           NA
# [15,] NA                           NA
# [16,] "Faroes"                     "Faroes"
# [17,] NA                           NA
# [18,] NA                           NA
# [19,] NA                           NA
# [20,] "Baltic Sea"                 "Baltic Sea"

Note: if you have access to more than one computer, you can also build a cluster across several machines and parallelize over all of them.
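For example, a PSOCK cluster can be started across machines by passing a vector of host names (a sketch; "node1" and "node2" are placeholders for machines reachable over SSH with R and the required packages installed):

cl <- makeCluster(c("node1", "node2"), type = "PSOCK")
# ...then clusterEvalQ(), clusterExport() and parLapply() exactly as above
stopCluster(cl)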

More information here:

  • how-to-go-parallel-in-r-basics-tips
  • parallel-computing-in-r
  • running-r-jobs-quickly-on-many-machines
  • Parallelizing and Clustering example in R

If you can live with some slight "inaccuracies" on the borders of the polygons, you can achieve very fast processing by rasterizing the shapefile and then extracting the raster value at each point with raster::extract. Something like this works and outputs a SpatialPointsDataFrame.

# Load libraries (tibble and data.table are also needed, but are only called with the :: prefix)
library(rgdal)
library(gdalUtils)  # gdal_rasterize()
library(sf)         # fast reading/writing of shapefiles
library(raster)     # crop() and extract()
library(dplyr)      # %>%, group_by(), first(), left_join()
library(lazyeval)   # interp(), used in the summarize_() call below
library(tidyr)
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")

# read eco region shapefiles
ices_eco <- sf::read_sf("ICES_ecoregions_20150113_no_land.shp")
# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
coords <- expand.grid(lon = lon, lat = lat)

# Get the Ecoregion for each position
pings <- SpatialPointsDataFrame(coords,
                                data = data.frame(id = 1:nrow(coords)),
                                proj4string = CRS(st_crs(ices_eco)$proj4string))

extr_data = list()
temprast  <- tempfile(fileext = ".tif")
tempshape <- tempfile(fileext = ".shp")
counter = 1
for (poly_n in seq_along(ices_eco$Ecoregion)) {

  message("Working on polygon: ", poly_n)
  # extract one polygon and rasterize it at 0.01 deg resolution
  subshape    <- ices_eco[poly_n,]
  st_write(subshape, tempshape, update = TRUE, quiet = TRUE)

  # crop the points dataframe on the extent of the rasterized polygon
  # to save time
  subpoints   <- raster::crop(pings, extent(as.numeric(st_bbox(subshape))[c(1,3,2,4)]))

  # if there are points "left", extract the value of the rasterized polygon, for each point
  if (!is.null(subpoints)) {
    rast <- gdal_rasterize(tempshape, temprast, burn = poly_n,
                           tr = c(0.01,0.01), a_nodata = -999,
                           output_Raster = TRUE)
    extr_points <- raster::extract(rast, subpoints, sp = TRUE)

    extr_points <- extr_points[which(extr_points@data[ , 2] == poly_n), ] %>%
      as("sf")

    # If any points fall inside the polygon, keep them and store them in the
    # list; otherwise, skip to the next polygon

    if (nrow(extr_points) != 0) {
      extr_data[[counter]] <- extr_points
      counter <- counter + 1
    }
  }
  unlink(temprast)
  unlink(tempshape)
}

# Do some juggling to remove duplicates and go back to a SpatialPointsDataFrame
out <- data.table::rbindlist(extr_data) %>%
  tibble::as_tibble() %>% 
  # The following two lines are needed to remove a few "duplicated" points. I think you
  # get those for points falling on the boundary between two cells; increasing the
  # rasterization resolution could avoid this
  dplyr::group_by(id) %>% 
  dplyr::summarize_(ecoregion = interp(~first(var), var = as.name(names(extr_data[[1]])[2])),
                    geometry  = interp(~vargeo[1],  vargeo = as.name("geometry"))) %>% 
  # transform back to a SpatialPointsDataFrame
  sf::st_as_sf() %>%
  as("Spatial") 

# Add the extracted info to the original SpatialPointsDataFrame
pings@data <- dplyr::left_join(pings@data, out@data) 
head(pings@data)
summary(pings)

>     head(pings@data)
  id ecoregion
1  1        17
2  2        17
3  3        17
4  4        17
5  5        17
6  6        17
>     summary(pings)
Object of class SpatialPointsDataFrame
Coordinates:
        min    max
lon -18.025 32.025
lat  48.025 66.025
Is projected: FALSE 
proj4string :
[+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0]
Number of points: 361722
Data attributes:
       id           ecoregion     
 Min.   :     1   Min.   : 9.00   
 1st Qu.: 90431   1st Qu.: 9.00   
 Median :180862   Median :11.00   
 Mean   :180862   Mean   :11.86   
 3rd Qu.:271292   3rd Qu.:15.00   
 Max.   :361722   Max.   :17.00   
                  NA's   :180720  

On my PC, this completes in about 1 minute. Note that in the code above I'm also processing one "polygon" at a time, to further increase speed and avoid creating a huge temporary raster file. Additionally, I'm using the sf functions for reading and writing shapefiles, which are much faster than the rgdal ones.
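If you want to verify that last point on your own machine, a quick timing sketch (results will vary with your disk and GDAL version):

system.time(eco_sf <- sf::read_sf("ICES_ecoregions_20150113_no_land.shp"))
system.time(eco_sp <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land",
                                     verbose = FALSE))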

Also, I'm rasterizing the shapefile at a 0.01 x 0.01 degree resolution. You can reduce the possibility of "errors" on the borders by increasing the resolution of the rasterization, i.e. by changing the tr parameter in rast <- gdal_rasterize(tempshape, temprast, burn = poly_n, tr = c(0.01, 0.01), a_nodata = -999, output_Raster = TRUE).
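For example, halving the cell size (at the cost of a larger temporary raster and a longer run time) would look like this:

rast <- gdal_rasterize(tempshape, temprast, burn = poly_n,
                       tr = c(0.005, 0.005), a_nodata = -999,
                       output_Raster = TRUE)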

HTH !

Tags:

R

Spatial Join