pgr_createTopology with large datasets

The following is what I am using. Some of it is specific to our deployment environment, since we use Docker and some bash scripts to deploy and set up the server. You could easily drop the argparse/os.getenv parts and hardcode the connection if you wanted.

import argparse
from os import getenv
import psycopg2

parser = argparse.ArgumentParser()
parser.add_argument("-H", "--host", help="host location of postgres database", type=str)
parser.add_argument("-U", "--user", help="username to connect to the database", type=str)
parser.add_argument("-d", "--dbname", help="database name", type=str)
parser.add_argument("-p", "--port", help="port to connect to postgres", type=str)
args = parser.parse_args()
password = getenv('POSTGRES_PASSWORD')

conn = psycopg2.connect(
    f"dbname={args.dbname} user={args.user} host={args.host} port={args.port} password={password}"
)
cur = conn.cursor()
print("connected to database")

cur.execute("SELECT MIN(id), MAX(id) FROM ways;")
min_id, max_id = cur.fetchone()
print(f"there are {max_id - min_id + 1} edges to be processed")
cur.close()

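# Build the topology in batches so that no single call has to process the whole table.
interval = 200000
for x in range(min_id, max_id+1, interval):
    cur = conn.cursor()
    # Note: the batches are filtered on "id" while 'gid' is passed as the key column;
    # adjust both names to match your ways table.
    cur.execute(
        f"select pgr_createTopology('ways', 0.000001, 'the_geom', 'gid', rows_where:='id>={x} and id<{x+interval}');"
    )
    # Commit after every batch so finished work is kept if a later batch fails.
    conn.commit()
    cur.close()
    x_max = min(x + interval - 1, max_id)
    print(f"edges {x} - {x_max} have been processed")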

cur = conn.cursor()
cur.execute("""ALTER TABLE ways_vertices_pgr
  ADD COLUMN IF NOT EXISTS lat float8,
  ADD COLUMN IF NOT EXISTS lon float8;""")

cur.execute("""UPDATE ways_vertices_pgr
  SET lat = ST_Y(the_geom),
      lon = ST_X(the_geom);""")

conn.commit()
cur.close()
conn.close()
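
As a side note, the batching arithmetic from the loop can be pulled out into a small helper, which makes it easy to check without a database. This is only a sketch: batch_ranges and rows_where are names I made up for illustration, not part of psycopg2 or pgRouting, and the filter assumes the same integer id column as the script above.

```python
def batch_ranges(min_id, max_id, interval):
    """Yield inclusive (start, end) id ranges that cover min_id..max_id."""
    for start in range(min_id, max_id + 1, interval):
        # Clamp the last batch so it never runs past max_id.
        yield start, min(start + interval - 1, max_id)


def rows_where(start, end):
    """Build the rows_where filter for one batch (hypothetical helper)."""
    return f"id>={start} and id<={end}"


if __name__ == "__main__":
    # Small numbers just to show the shape of the batches.
    for start, end in batch_ranges(1, 10, 4):
        print(rows_where(start, end))
```

For integer ids, id<=end is equivalent to the id<x+interval condition used in the script, so the batches neither overlap nor leave gaps.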

Thank you @James for sharing this. It helped a lot.

For those of you who want to rebuild the whole topology: normally, this is achieved by clean:=true.

Since the ways table gets processed step by step (in our case 200,000 rows at a time), you cannot use the clean flag here: every batch would reset the source and target columns and truncate the vertices table again, wiping out the results of the previous batches.

See https://github.com/pgRouting/pgrouting/blob/master/sql/topology/pgrouting_topology.sql for more information about the clean flag.

Instead, you can reset them manually once, before running the batches (change the geo schema to match your DB):

UPDATE geo.ways SET source = NULL, target = NULL;
TRUNCATE TABLE geo.ways_vertices_pgr RESTART IDENTITY;

After that, you can use the script provided by @James.
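
If you would rather do the reset from the same Python script, something along these lines should work. It is a rough sketch under a few assumptions: conn is an open psycopg2 connection like the one in the script, reset_topology is a name I made up, and the vertices table follows the usual <table>_vertices_pgr naming.

```python
import re


def reset_topology(conn, schema="geo", table="ways"):
    """Clear source/target and empty the vertices table (a manual 'clean')."""
    # The names are interpolated into SQL, so only accept plain identifiers.
    for name in (schema, table):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unexpected identifier: {name!r}")
    cur = conn.cursor()
    cur.execute(f"UPDATE {schema}.{table} SET source = NULL, target = NULL;")
    cur.execute(f"TRUNCATE TABLE {schema}.{table}_vertices_pgr RESTART IDENTITY;")
    conn.commit()
    cur.close()
```

Call reset_topology(conn) once before the batch loop, not inside it, otherwise you are back to the same problem as with clean:=true.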


I had the same problem using PostgreSQL 11.6 + PostGIS 2.5 + pgRouting 3.0.0 beta on an Ubuntu VPS.

Updating PostGIS to version 3.0.0 fixed the issue.