RTree spatial index does not result in faster intersection computation

My answer is essentially based on another answer by @gene here:

More Efficient Spatial join in Python without QGIS, ArcGIS, PostGIS, etc

He proposed the same solution using two differents methods, with or without a spatial index.

He (correctly) stated:

What is the difference ?

  • Without the index, you must iterate through all the geometries (polygons and points).
  • With a bounding spatial index (Spatial Index RTree), you iterate only through the geometries which have a chance to intersect with your current geometry ('filter' which can save a considerable amount of calculations and time...)
  • but a Spatial Index is not a magic wand. When a very large part of the dataset has to be retrieved, a Spatial Index cannot give any speed benefit.

These sentences are self-explanatory, but I preferred quoting @gene instead of proposing the same conclusions as mine (so, all the credits go to its brilliant work!).

For a better understanding of the Rtree spatial index, you may retrieve some useful informations following these links:

  • Rtree: Spatial indexing for Python;
  • Using Rtree spatial indexing with OGR.

Another great introduction about the using of spatial indexes may be this article by @Nathan Woodrow.


Just to add to mgri answer.

It is important to understand what is a spatial index (How To Properly Implement a Bounding Box for Shapely & Fiona?). With my example in How to efficiently determine which of thousands of polygons intersect with a linestring

enter image description here

You can create a spatial index with the polygons

idx = index.Index()
for feat in poly_layer:
    geom = shape(feat['geometry'])
    id = int(feat['id'])
    idx.insert(id, geom.bounds,feat)

Limits of the spatial index (polygon bounds in green)

enter image description here

Or with the LineStrings

  idx = index.Index()
  for feat in line_layer:
      geom = shape(feat['geometry'])
      id = int(feat['id'])
      idx.insert(id, geom.bounds,feat)

Limits of the spatial index (LineString bound in red)

enter image description here

Now, you iterate only through the geometries which have a chance to intersect with your current geometry (in yellow)

enter image description here

I use here the LineStrings spatial index (the results are the same but with your example of 4000 polygons and 4 lines ....).

for feat1 in poly_layer:
    geom1 = shape(feat1['geometry'])
    for id in idx.intersection(geom1.bounds):
        feat2 = line_layer[id]
        geom2 = shape(feat2['geometry'])
        if geom2.intersects(geom1):
            print 'Polygon {} intersects line {}'.format(feat1['id'], feat2['id'])

  Polygon 2 intersects line 0
  Polygon 3 intersects line 0
  Polygon 6 intersects line 0
  Polygon 9 intersects line 0

You can also use a generator (example.py)

def build_ind():
     with fiona.open('polygon_layer.shp') as shapes:
         for s in shapes:
             geom = shape(s['geometry'])
             id = int(s['id'])
             yield (id, geom.bounds, s)

 p= index.Property()
 tree = index.Index(build_ind(), properties=p)
 # first line of line_layer
 first = shape(line_layer.next()['geometry'])
 # intersection of first with polygons
 tuple(tree.intersection(first.bounds))
 (6, 2, 3, 9)

You can examine the GeoPandas script sjoin.py to understand the use of Rtree.

There are many solutions but don't forget that

  • Spatial Index is not a magic wand...

Edit: To clarify this answer, I incorrectly believed that all the intersection tests took approximately the same amount of time. This is not the case. The reason I did not get the speed up I expected from using a spatial index is that the selection of intersection tests are the ones that took the longest to do in the first place.

As gene and mgri have already said, a spatial index is not a magic wand. Although the spatial index pared down the number of intersection tests that needed to be performed from 12936 to only 1816, the 1816 tests are the test that took the vast majority of time to compute in the first place.

The spatial index is being used correctly, but the assumption that each intersection test takes roughly the same amount of time is what is incorrect. The time required by the intersection test can vary greatly (0.05 seconds versus 0.000007 seconds).