Spatial clustering with PostGIS?

There are at least two good clustering methods for PostGIS: k-means (via kmeans-postgresql extension) or clustering geometries within a threshold distance (PostGIS 2.2)


1) k-means with kmeans-postgresql

Installation: You need to compile and install this from source code, which is easier to do on *NIX than Windows (I don't know where to start). If you have PostgreSQL installed from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS).

Download, extract, build and install:

wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
make USE_PGXS=1
sudo make install

Enable the extension in a database (using psql, pgAdmin, etc.):

CREATE EXTENSION kmeans;

Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo random points in QGIS). Here is an example with what I did:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

the 5 I provided in the second argument of the kmeans window function is the K integer to produce five clusters. You can change this to whatever integer you want.

Below is the 31 pseudo random points I drew and the five centroids with the label showing the count in each cluster. This was created using the above SQL query.

Kmeans


You can also attempt to illustrate where these clusters are with ST_MinimumBoundingCircle:

SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
  SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
  FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

Kmeans2


2) Clustering within a threshold distance with ST_ClusterWithin

This aggregate function is included with PostGIS 2.2, and returns an array of GeometryCollections where all the components are within a distance of each other.

Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters:

SELECT row_number() over () AS id,
  ST_NumGeometries(gc),
  gc AS geom_collection,
  ST_Centroid(gc) AS centroid,
  ST_MinimumBoundingCircle(gc) AS circle,
  sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
  SELECT unnest(ST_ClusterWithin(geom, 100)) gc
  FROM rand_point
) f;

ClusterWithin100

The largest middle cluster has a enclosing circle radius of 65.3 units or about 130, which is larger than the threshold. This is because the individual distances between the member geometries is less than the threshold, so it ties it together as one larger cluster.


I've written function that calculates clusters of features based on distance between them and build convex hull over this features:

CREATE OR REPLACE FUNCTION get_domains_n(lname varchar, geom varchar, gid varchar, radius numeric)
    RETURNS SETOF record AS
$$
DECLARE
    lid_new    integer;
    dmn_number integer := 1;
    outr       record;
    innr       record;
    r          record;
BEGIN

    DROP TABLE IF EXISTS tmp;
    EXECUTE 'CREATE TEMPORARY TABLE tmp AS SELECT '||gid||', '||geom||' FROM '||lname;
    ALTER TABLE tmp ADD COLUMN dmn integer;
    ALTER TABLE tmp ADD COLUMN chk boolean DEFAULT FALSE;
    EXECUTE 'UPDATE tmp SET dmn = '||dmn_number||', chk = FALSE WHERE '||gid||' = (SELECT MIN('||gid||') FROM tmp)';

    LOOP
        LOOP
            FOR outr IN EXECUTE 'SELECT '||gid||' AS gid, '||geom||' AS geom FROM tmp WHERE dmn = '||dmn_number||' AND NOT chk' LOOP
                FOR innr IN EXECUTE 'SELECT '||gid||' AS gid, '||geom||' AS geom FROM tmp WHERE dmn IS NULL' LOOP
                    IF ST_DWithin(ST_Transform(ST_SetSRID(outr.geom, 4326), 3785), ST_Transform(ST_SetSRID(innr.geom, 4326), 3785), radius) THEN
                    --IF ST_DWithin(outr.geom, innr.geom, radius) THEN
                        EXECUTE 'UPDATE tmp SET dmn = '||dmn_number||', chk = FALSE WHERE '||gid||' = '||innr.gid;
                    END IF;
                END LOOP;
                EXECUTE 'UPDATE tmp SET chk = TRUE WHERE '||gid||' = '||outr.gid;
            END LOOP;
            SELECT INTO r dmn FROM tmp WHERE dmn = dmn_number AND NOT chk LIMIT 1;
            EXIT WHEN NOT FOUND;
       END LOOP;
       SELECT INTO r dmn FROM tmp WHERE dmn IS NULL LIMIT 1;
       IF FOUND THEN
           dmn_number := dmn_number + 1;
           EXECUTE 'UPDATE tmp SET dmn = '||dmn_number||', chk = FALSE WHERE '||gid||' = (SELECT MIN('||gid||') FROM tmp WHERE dmn IS NULL LIMIT 1)';
       ELSE
           EXIT;
       END IF;
    END LOOP;

    RETURN QUERY EXECUTE 'SELECT ST_ConvexHull(ST_Collect('||geom||')) FROM tmp GROUP by dmn';

    RETURN;
END
$$
LANGUAGE plpgsql;

Example of using this function:

SELECT * FROM get_domains_n('poi', 'wkb_geometry', 'ogc_fid', 14000) AS g(gm geometry)

'poi' - name of layer, 'wkb_geometry' - name of geometry column, 'ogc_fid' - primary key of table, 14000 - cluster distance.

The result of using this function:

enter image description here


So far, the most promising I found is this extension for K-means clustering as a window function: http://pgxn.org/dist/kmeans/

However I haven't been able to install it successfully yet.


Otherwise, for basic grid clustering, you could use SnapToGrid.

SELECT
    array_agg(id) AS ids,
    COUNT( position ) AS count,
    ST_AsText( ST_Centroid(ST_Collect( position )) ) AS center,
FROM mytable
GROUP BY
    ST_SnapToGrid( ST_SetSRID(position, 4326), 22.25, 11.125)
ORDER BY
    count DESC
;