scikit-learn DBSCAN memory usage

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.

DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).

The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two choices:

  1. You may want to try the DBSCAN implementation in ELKI instead, which when used with an R*-tree index usually is substantially faster than a naive implementation.

  2. Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.

Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.


You can do this using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm. You do not need to precompute a distance matrix.

This example clusters over a million GPS latitude-longitude points with DBSCAN/haversine and avoids memory usage problems:

df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed, which blows up your RAM real quick. But if you use Anaconda, you can quickly set this up with:

conda install scikit-learn=0.15

Or, create a clean virtual environment for this clustering task:

conda create -n clusterenv python=3.4 scikit-learn=0.15 matplotlib pandas jupyter
activate clusterenv