Find clusters of bacteria

Detecting features is easily done with the scipy ndimage measurements module. It has the added advantage of speed, if you go this way.

import numpy as np
from scipy.ndimage.measurements import label, find_objects

q = np.genfromtxt('bacteria.txt', dtype='S1', comments=':', delimiter=1)
arr = (q == b'#')  # convert to boolean mask because ' ' evaluates to True

labelled, num_features = label(arr)

def count_edge_objects(labelled):
    hulls = find_objects(labelled)
    nbr_edgeobjects = 0
    for rowslice, colslice in hulls:
        if (rowslice.start == 0 or rowslice.stop == labelled.shape[0] or
            colslice.start == 0 or colslice.stop == labelled.shape[1]):
            nbr_edgeobjects += 1
    return nbr_edgeobjects

print('{} objects'.format(num_features - count_edge_objects(labelled)))
# output: 
# 4 objects

In the dataset you've shown there are 2 objects near the edge: the one at the top and the one at the bottom. Remark that I'm currently assuming the dataset has an equal amount of characters on each line (if not, check out the missing_values option of np.genfromtxt)


The problem you had is that from the moment you formed two clusters, you couldn't join them. Even if eventually the two clusters were meant to be joined by the addition of intermediate nodes.

This can be solved by an application of a union-find data structure. An unoptimized python version is:

s = """\
                     ###                    \
                     #####                  \
                     #######                \
                    #######                 \
                     ######                 \
                     ###### ##              \
                     ####  #####            \
                       ## ######        ####\
                    #    ######       ####  \
                  ###  ##########    #####  \
                #######  ####  ##   ######  \
                ######### ##   #      ##### \
      #           ####   ###          ###   \
     #####        ####    #     ##     ##   \
     #####                    ######    #   \
    ######                   ########       \
     ####                     ########      \
                              #######       \
                              #######       \
"""
representatives = {i: i for i, c in enumerate(s) if c == '#'}
nrows, ncols = 19, 44

def neighbours(idx):
    i, j = divmod(idx, ncols)
    if i > 0: yield idx - ncols
    if i < nrows - 1: yield idx + ncols
    if j > 0: yield idx - 1
    if j < ncols - 1: yield idx + 1

def representative(a):
    while representatives[a] != a: a = representatives[a]
    return a

def join(a, b):
    repr_a, repr_b = representative(a), representative(b)
    if repr_a != repr_b: representatives[repr_a] = repr_b

for idx in representatives:
    for n in neighbours(idx):
        if s[n] == '#': join(idx, n)

cluster_count = len(set(map(representative, representatives)))

Result:

6

You could also have created also a graph and used depth first search to find the connected components. The advantage of the above method is that it's incremental and you can update easily the clusters with the addition of new points.

Tags:

Python