Finding clusters of numbers in a list

import itertools
import numpy as np

numbers = np.array([123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430])
nd = [0] + list(np.where(np.diff(numbers) > 15)[0] + 1) + [len(numbers)]

a, b = itertools.tee(nd)
next(b, None)
res = {}
for j, (f, b) in enumerate(itertools.izip(a, b)):
    res[j] = numbers[f:b]

If you can use itertools and numpy. Adapted pairwise for the iterator tricks. The +1 is needed to shift the index, adding the 0 and len(numbers) onto the list makes sure the first and last entries are included correctly.

You can obviously do this with out itertools, but I like tee.


Using the generator to separate the logic: (one function does one thing)

numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]

def cut_indices(numbers):
    # this function iterate over the indices that need to be 'cut'
    for i in xrange(len(numbers)-1):
        if numbers[i+1] - numbers[i] > 15:
            yield i+1

def splitter(numbers):
    # this function split the original list into sublists.
    px = 0
    for x in cut_indices(numbers):
        yield numbers[px:x]
        px = x
    yield numbers[px:]

def cluster(numbers):
    # using the above result, to form a dict object.
    cluster_ids = xrange(1,len(numbers))
    return dict(zip(cluster_ids, splitter(numbers)))

print cluster(numbers)

The above codes give me

{1: [123, 124, 128], 2: [160, 167], 3: [213, 215, 230, 245, 255, 257], 4: [400, 401, 402], 5: [430]}

Not strictly necessary if your list is small, but I'd probably approach this in a "stream-processing" fashion: define a generator that takes your input iterable, and yields the elements grouped into runs of numbers differing by <= 15. Then you can use that to generate your dictionary easily.

def grouper(iterable):
    prev = None
    group = []
    for item in iterable:
        if prev is None or item - prev <= 15:
            group.append(item)
        else:
            yield group
            group = [item]
        prev = item
    if group:
        yield group

numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
dict(enumerate(grouper(numbers), 1))

prints:

{1: [123, 124, 128],
 2: [160, 167],
 3: [213, 215, 230, 245, 255, 257],
 4: [400, 401, 402],
 5: [430]}

As a bonus, this lets you even group your runs for potentially-infinite lists (as long as they're sorted, of course). You could also stick the index generation part into the generator itself (instead of using enumerate) as a minor enhancement.


You can achieve that with no (explicit) loops using numpy / pandas:

import pandas as pd    
import numpy as np

n = 15
numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
nnumbers = np.array(numbers)
clusters = pd.DataFrame({
    'numbers': numbers,
    'segment': np.cumsum([0] + list(1*(nnumbers[1:] - nnumbers[0:-1] > n))) + 1
}).groupby('segment').agg({'numbers': set}).to_dict()['numbers']

The trick is to shift the list of numbers and compare the difference with your threshold (15) to find 'breaks' between segments. Of course, first element will not be a break. Then use cumsum function to get the segments and do the group by using a set function (in case there are duplicates). Hope this is helpful even though many years have passed since posting this question.

Tags:

Python

List