Processing large geotiff using python

A great feature of raster data is that it often allows block-wise processing. You can "break" the raster up into rectangular windows to reduce the memory footprint of your process, or to process blocks in parallel and get results faster.

The documentation for GDAL's Python bindings is thin and examples of getting raster windows using ReadAsArray() are scarce, but I found this in the GDAL tests: https://github.com/OSGeo/gdal/blob/77544764f51420e42468641fc3d5a087f8ea6d8f/autotest/gcore/numpy_rw.py#L115. Pass in some offsets and width and height of your window and you have a band subset as a numpy array.

To apply this in your program, loop over blocks of the raster, and within that loop over the bands of the raster.


When you ReadAsArray(), you're creating a numpy array, which essentially means you're loading it to memory. For a 30GB TIF, expect something around that ballpark. Meaning, you'll only even be able to work with it if you have a 64GB+ workstation or server. And even then, it's not desirable.

For larger rasters, consider working with blocks. Here is a good tutorial on this (starting from page 24). Notice, however, that this document was writen for an older version of GDAL. Since version 2.0, getting the custom block size of a raster is done with GDALRasterBand::GetBlockSize() method.

Also take in mind that a raster is most efficiently iterated over its default block size (each raster format will have its own). However, the block type of a TIFF can be tile or strip. Tile means a rectangular block, and TIFFs created with GDAL default to that (a 256x256 tile, specifically), but TIFFs with strip type have blocks that span the entire width of the raster, and may even be only one row high. This not only difficults certain processes (such as kernel), but may also be too memory-intensive. Check your block size beforehand, and, if a strip, choose a personalized block size.