how to overlay a polygon over SpatialPointsDataFrame and preserving the SPDF data?

From the sp::over help:

 x = "SpatialPoints", y = "SpatialPolygons" returns a numeric
      vector of length equal to the number of points; the number is
      the index (number) of the polygon of ‘y’ in which a point
      falls; NA denotes the point does not fall in a polygon; if a
      point falls in multiple polygons, the last polygon is
      recorded.

So if you convert your SpatialPolygonsDataFrame to SpatialPolygons you get back a vector of indexes and you can subset your points on NA:

> over(pts,as(ply,"SpatialPolygons"))
  [1] NA  1  1 NA  1  1 NA NA  1  1  1 NA NA  1  1  1  1  1 NA NA NA  1 NA  1 NA
 [26]  1  1  1 NA NA NA NA NA  1  1 NA NA NA  1  1  1 NA  1  1  1 NA NA NA  1  1
 [51]  1 NA NA NA  1 NA  1 NA  1 NA NA  1 NA  1  1 NA  1  1 NA  1 NA  1  1  1  1
 [76]  1  1  1  1  1 NA NA NA  1 NA  1 NA NA NA NA  1  1 NA  1 NA NA  1  1  1 NA

> nrow(pts)
[1] 100
> pts = pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),]
> nrow(pts)
[1] 54
> head(pts@data)
         var1 var2
2  0.04001092    v
3  0.58108350    v
5  0.85682609    q
6  0.13683264    y
9  0.13968804    m
10 0.97144627    o
> 

For the doubters, here's evidence that the conversion overhead is not a problem:

Two functions - first Jeffrey Evans' method, then my original, then my hacked conversion, then a version based on gIntersects based on Josh O'Brien's answer:

evans <- function(pts,ply){
  prid <- over(pts,ply)
  ptid <- na.omit(prid) 
  pt.poly <- pts[as.numeric(as.character(row.names(ptid))),]
  return(pt.poly)
}

rowlings <- function(pts,ply){
  return(pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),])
}

rowlings2 <- function(pts,ply){
  class(ply) <- "SpatialPolygons"
  return(pts[!is.na(over(pts,ply)),])
}

obrien <- function(pts,ply){
pts[apply(gIntersects(columbus,pts,byid=TRUE),1,sum)==1,]
}

Now for a real-world example, I've scattered some random points over the columbus data set:

require(spdep)
example(columbus)
pts=data.frame(
    x=runif(100,5,12),
    y=runif(100,10,15),
    z=sample(letters,100,TRUE))
coordinates(pts)=~x+y

Looks good

plot(columbus)
points(pts)

Check the functions are doing the same thing:

> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] TRUE

And run 500 times for benchmarking:

> system.time({for(i in 1:500){evans(pts,columbus)}})
   user  system elapsed 
  7.661   0.600   8.474 
> system.time({for(i in 1:500){rowlings(pts,columbus)}})
   user  system elapsed 
  6.528   0.284   6.933 
> system.time({for(i in 1:500){rowlings2(pts,columbus)}})
   user  system elapsed 
  5.952   0.600   7.222 
> system.time({for(i in 1:500){obrien(pts,columbus)}})
  user  system elapsed 
  4.752   0.004   4.781 

As per my intuition, its not a great overhead, in fact it might be less of an overhead than converting all the row indexes to character and back, or running na.omit to get missing values. Which incidentally leads to another failure mode of the evans function...

If a row of the polygons data frame is all NA (which is perfectly valid), then the overlay with SpatialPolygonsDataFrame for points in that polygon will produce an output data frame with all NAs, which evans() will then drop:

> columbus@data[1,]=rep(NA,20)
> columbus@data[5,]=rep(NA,20)
> columbus@data[17,]=rep(NA,20)
> columbus@data[15,]=rep(NA,20)
> set.seed(123)
> pts=data.frame(x=runif(100,5,12),y=runif(100,10,15),z=sample(letters,100,TRUE))
> coordinates(pts)=~x+y
> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] FALSE
> dim(evans(pts,columbus))
[1] 27  1
> dim(rowlings(pts,columbus))
[1] 28  1
> 

BUT gIntersects is faster, even with having to sweep the matrix to check intersections in R rather than in C code. I suspect its the prepared geometry skills of GEOS, creating spatial indexes - yeah, with prepared=FALSE it takes a bit longer, about 5.5 seconds.

I'm surprised there isn't a function to either straight return the indices or the points. When I wrote splancs 20 years ago the point-in-polygon functions had both...


sp provides a shorter form to select features based on spatial intersection, following the OP example:

pts[ply,]

as of:

points(pts[ply,], col = 'red')

Behind the scenes this is short for

pts[!is.na(over(pts, geometry(ply))),]

The thing to note is that there is a geometry method that drops attributes: over changes behaviour if its second argument has attributes or not (this was OP's confusion). This works accross all Spatial* classes in sp, although some over methods require rgeos, see this vignette for details, e.g. the case of multiple matches for overlapping polygons.


You were on the right track with over. The rownames of the returned object correspond to the row index of the points. You can implement your exact approach with just a few addition lines of code.

library(sp)
set.seed(357)

pts <- data.frame(x=rnorm(100), y=rnorm(100), var1=runif(100), 
                  var2=sample(letters, 100, replace=TRUE))
  coordinates(pts) <- ~ x + y

ply <- matrix(c(-1,-1, 1,-1, 1,1, -1,1, -1,-1), ncol=2, byrow=TRUE)
  ply <- SpatialPolygons(list(Polygons(list(Polygon(ply)), ID=1)))
    ply <- SpatialPolygonsDataFrame(Sr=ply, data=data.frame(polyvar=357))

# Subset points intersecting polygon
prid <- over(pts,ply)
  ptid <- na.omit(prid) 
    pt.poly <- pts[as.numeric(as.character(row.names(ptid))),]  

plot(pts)
  axis(1); axis(2)
    plot(ply, add=TRUE, border="red")
      plot(pt.poly,pch=19,add=TRUE) 

Tags:

Overlay

R