Numpy: get the index of the elements of a 1d array as a 2d array

Although the request is for a numpy solution, I decided to see whether there is an interesting numba-based solution. And indeed there is! Here's an approach that represents the partitioned list as a ragged array stored in a single preallocated buffer. This takes some inspiration from the argsort approach proposed by Paul Panzer. (For an older version that didn't do as well, but was simpler, see below.)

@numba.jit(numba.void(numba.int64[:], 
                      numba.int64[:], 
                      numba.int64[:]), 
           nopython=True)
def enum_bins_numba_buffer_inner(ints, bins, starts):
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] += 1

@numba.jit(nopython=False)  # Not 100% sure this does anything...
def enum_bins_numba_buffer(ints):
    ends = np.bincount(ints).cumsum()
    starts = np.empty(ends.shape, dtype=np.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = np.empty(ints.shape, dtype=np.int64)
    enum_bins_numba_buffer_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]

This processes a ten-million item list in 75ms, which is nearly a 50x speedup from a list-based version written in pure Python.

For a slower but somewhat more readable version, here's what I had before, based on recently added experimental support for dynamically sized "typed lists," which allow us to fill up each bin in an out-of-order way much more quickly.

This wrestles with numba's type inference engine a bit, and I'm sure there's a better way to handle that part. This also turns out to be almost 10x slower than the above.

@numba.jit(nopython=True)
def enum_bins_numba(ints):
    bins = numba.typed.List()
    for i in range(ints.max() + 1):
        inner = numba.typed.List()
        inner.append(0)  # An awkward way of forcing type inference.
        inner.pop()
        bins.append(inner)

    for x, i in enumerate(ints):
        bins[i].append(x)

    return bins

I tested these against the following:

def enum_bins_dict(ints):
    enum_bins = defaultdict(list)
    for k, v in enumerate(ints):
        enum_bins[v].append(k)
    return enum_bins

def enum_bins_list(ints):
    enum_bins = [[] for i in range(ints.max() + 1)]
    for x, i in enumerate(ints):
        enum_bins[i].append(x)
    return enum_bins

def enum_bins_sparse(ints):
    M, N = ints.max() + 1, ints.size
    return sparse.csc_matrix((ints, ints, np.arange(N + 1)),
                             (M, N)).tolil().rows.tolist()

I also tested them against a precompiled cython version similar to enum_bins_numba_buffer (described in detail below).

On a list of ten million random ints (ints = np.random.randint(0, 100, 10000000)) I get the following results:

enum_bins_dict(ints)
3.71 s ± 80.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_list(ints)
3.28 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_sparse(ints)
1.02 s ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_numba(ints)
693 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_cython(ints)
82.3 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

enum_bins_numba_buffer(ints)
77.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Impressively, this way of working with numba outperforms a cython version of the same function, even with bounds-checking turned off. I don't yet have enough familiarity with pythran to test this approach using it, but I'd be interested to see a comparison. It seems likely based on this speedup that the pythran version might also be quite a bit faster with this approach.

Here's the cython version for reference, with some build instructions. Once you have cython installed, you'll need a simple setup.py file like this:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        'enum_bins_cython',
        ['enum_bins_cython.pyx'],
    )
]

setup(
    ext_modules=cythonize(ext_modules),
    include_dirs=[numpy.get_include()]
)

And the cython module, enum_bins_cython.pyx:

# cython: language_level=3

import cython
import numpy
cimport numpy

@cython.boundscheck(False)
@cython.cdivision(True)
@cython.wraparound(False)
cdef void enum_bins_inner(long[:] ints, long[:] bins, long[:] starts) nogil:
    cdef long i, x
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] = starts[i] + 1

def enum_bins_cython(ints):
    assert (ints >= 0).all()
    # There might be a way to avoid storing two offset arrays and
    # save memory, but `enum_bins_inner` modifies the input, and
    # having separate lists of starts and ends is convenient for
    # the final partition stage.
    ends = numpy.bincount(ints).cumsum()
    starts = numpy.empty(ends.shape, dtype=numpy.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = numpy.empty(ints.shape, dtype=numpy.int64)
    enum_bins_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]

With these two files in your working directory, run this command:

python setup.py build_ext --inplace

You can then import the function using from enum_bins_cython import enum_bins_cython.


Here is an O(max(x)+len(x)) approach using scipy.sparse:

import numpy as np
from scipy import sparse

x = np.array("1 2 2 0 0 1 3 5".split(),int)
x
# array([1, 2, 2, 0, 0, 1, 3, 5])


M,N = x.max()+1,x.size
sparse.csc_matrix((x,x,np.arange(N+1)),(M,N)).tolil().rows.tolist()
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]

This works by creating a sparse matrix with entries at positions (x[0],0), (x[1],1), ... Using the CSC (compressed sparse column) format this is rather simple. The matrix is then converted to LIL (linked list) format. This format stores the column indices for each row as a list in its rows attribute, so all we need to do is take that and convert it to list.

Note that for small arrays argsort based solutions are probably faster but at some not insanely large size this will cross over.

EDIT:

argsort-based numpy-only solution:

np.split(x.argsort(kind="stable"),np.bincount(x)[:-1].cumsum())
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]

If the order of indices within groups doesn't matter you can also try argpartition (it happens to make no difference in this small example but this is not guaranteed in general):

bb = np.bincount(x)[:-1].cumsum()
np.split(x.argpartition(bb),bb)
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]

EDIT:

@Divakar recommends against the use of np.split. Instead, a loop is probably faster:

A = x.argsort(kind="stable")
B = np.bincount(x+1).cumsum()
[A[B[i-1]:B[i]] for i in range(1,len(B))]

Or you could use the brand new (Python3.8+) walrus operator:

A = x.argsort(kind="stable")
B = np.bincount(x)
L = 0
[A[L:(L:=L+b)] for b in B.tolist()]

EDIT(EDITED):

(Not pure numpy): As an alternative to numba (see @senderle's post) we can also use pythran.

Compile with pythran -O3 <filename.py>

import numpy as np

#pythran export sort_to_bins(int[:],int)

def sort_to_bins(idx, mx):
    if mx==-1: 
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    for i in range(3, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return [res[cnts[i]:cnts[i+1]] for i in range(mx)]

Here numba wins by a whisker performance-wise:

repeat(lambda:enum_bins_numba_buffer(x),number=10)
# [0.6235917090671137, 0.6071486569708213, 0.6096088469494134]
repeat(lambda:sort_to_bins(x,-1),number=10)
# [0.6235359431011602, 0.6264424560358748, 0.6217901279451326]

Older stuff:

import numpy as np

#pythran export bincollect(int[:])

def bincollect(a):
    o = [[] for _ in range(a.max()+1)]
    for i,j in enumerate(a):
        o[j].append(i)
    return o

Timings vs. numba (old)

timeit(lambda:bincollect(x),number=10)
# 3.5732191529823467
timeit(lambda:enumerate_bins(x),number=10)
# 6.7462647299980745

One potential option depending on the size of your data is to just drop out of numpy and use collections.defaultdict:

In [248]: from collections import defaultdict

In [249]: d = defaultdict(list)

In [250]: l = np.random.randint(0, 100, 100000)

In [251]: %%timeit
     ...: for k, v in enumerate(l):
     ...:     d[v].append(k)
     ...:
10 loops, best of 3: 22.8 ms per loop

Then you end up with a dictionary of {value1: [index1, index2, ...], value2: [index3, index4, ...]}. The time scaling is pretty close to linear with the size of the array, so 10,000,000 takes ~2.7s on my machine, which seems reasonable enough.