Python: How to obtain a random subset

Which subset you obtain depends largely on the criterion you specify for including or excluding elements. If you have a function criterion that accepts an element and returns a Boolean indicating whether to include it, creating the subset becomes simply:

from random import randrange

def random_subset(s, criterion=lambda x: randrange(2)):
    return set(filter(criterion, s))

filter returns a lazy iterator, so the returned set is the only place the selection gets stored. The default criterion is a fair coin flip: each element is kept independently with probability 0.5, so every subset is equally likely. randrange is similar to randint except that its upper bound is exclusive. As of Python 3.2, both functions produce fairly uniform results regardless of range size.
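As a quick usage sketch (the sample set s and the even-number criterion are just illustrative assumptions), the same function works with the default coin flip or with any other predicate:

```python
from random import randrange

def random_subset(s, criterion=lambda x: randrange(2)):
    # Same definition as above: keep each element for which criterion is truthy.
    return set(filter(criterion, s))

s = set(range(20))

# Default criterion: a fair coin flip per element, so the size varies per run.
subset = random_subset(s)
print(subset <= s, len(subset))

# Any predicate works, including a deterministic one (illustrative only).
evens = random_subset(s, lambda x: x % 2 == 0)
print(sorted(evens))
```

The first call gives a different result on most runs; the second always keeps exactly the even numbers, which shows that the randomness lives entirely in the criterion.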

You can further refine the criterion by using random:

from random import random

criterion = lambda x: random() < 0.5

Applying a threshold like that may seem like overkill, but it lets you adjust the distribution. You can have a function that generates criteria for whatever threshold you like:

def make_criterion(threshold=0.5):
    return lambda x: random() < threshold

You could use it to get a smaller subset:

random_subset(s, make_criterion(0.1))
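To see the effect of the threshold, here is a small sketch on an assumed 1000-element set. The thresholds 0.0 and 1.0 are included because random() returns values in [0.0, 1.0), which makes those two cases deterministic:

```python
from random import random

def make_criterion(threshold=0.5):
    # Keep an element with probability equal to threshold.
    return lambda x: random() < threshold

def random_subset(s, criterion):
    return set(filter(criterion, s))

s = set(range(1000))

small = random_subset(s, make_criterion(0.1))
large = random_subset(s, make_criterion(0.9))
# Sizes are roughly 100 and 900, but vary from run to run.
print(len(small), len(large))

# Edge cases: random() < 0.0 is never true, random() < 1.0 is always true.
nothing = random_subset(s, make_criterion(0.0))
everything = random_subset(s, make_criterion(1.0))
print(len(nothing), len(everything))
```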

In fact, you can make the criterion as complicated as you would like. The following example is a contrived callable class that operates on sets of strings. If a string with the same first character has already been seen (whether or not it was kept), it automatically rejects the current element. Otherwise, if the second character has been seen already, it sets the probability of inclusion to 0.25. In all other cases, it flips a coin:

class WeirdCriterion:

    def __init__(self):
        self.first = set()
        self.second = set()

    def __call__(self, x):
        n = len(x)
        if n > 0:
            if x[0] in self.first:
                return False
            self.first.add(x[0])
            if n > 1:
                if x[1] in self.second:
                    return not randrange(4)
                self.second.add(x[1])
        return randrange(2)

This example wouldn't be very good in practice because sets are unordered, and a set of strings can iterate in a different order between runs of the same script. What it shows, however, is a method for creating a criterion that is random, but is adjusted in response to the elements it has already seen.
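If you need an order-dependent criterion like this to be repeatable, one option is to fix the iteration order and the random source yourself. The sketch below is an assumption on my part, not part of the approach above: FirstLetterOnce is a simplified, hypothetical stand-in for WeirdCriterion, and reproducible_subset is a made-up helper that sorts the input and uses a seeded random.Random instance.

```python
import random

class FirstLetterOnce:
    """Hypothetical, simplified stateful criterion: once a first letter has
    been seen, every later string starting with it is rejected."""

    def __init__(self, rng):
        self.rng = rng
        self.first = set()

    def __call__(self, x):
        if not x or x[0] in self.first:
            return False
        self.first.add(x[0])
        # Coin flip from the supplied generator rather than the global one.
        return self.rng.random() < 0.5

def reproducible_subset(s, seed):
    # Sorting fixes the order in which the criterion sees the elements;
    # the seeded generator fixes the coin flips.
    rng = random.Random(seed)
    return set(filter(FirstLetterOnce(rng), sorted(s)))

words = {"apple", "avocado", "banana", "blueberry", "cherry", "date"}
a = reproducible_subset(words, seed=1)
b = reproducible_subset(words, seed=1)
print(a == b)
```

Sorting costs O(n log n) and requires the elements to be comparable, so this only makes sense when repeatability matters more than speed.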

Avoiding NumPy

Now that I have a better understanding of your original intent: you can leverage the fact that Python 3 has arbitrary-precision integers and that choices accepts a weights parameter to pick the subset length with the correct probability, and then sample that many elements. I don't recommend this approach though:

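from random import choices, sample
from math import factorial

def random_subset(s):
    n = len(s)
    nf = factorial(n)
    # yes, there are better ways of doing this, even in pure python
    # integer division keeps the binomial coefficients exact for large n
    weights = [nf // (factorial(k) * factorial(n - k)) for k in range(n + 1)]
    # pick the subset length k with probability proportional to C(n, k)
    length = choices(range(n + 1), weights, k=1)[0]
    # sample needs a sequence; passing a set directly fails on Python 3.11+
    return sample(list(s), length)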

A better solution for computing the binomial coefficients could be something like:

def pascal(n):
    result = [1] * (n + 1)
    if n < 2:
        return result
    for i in range(2, n + 1):
        for j in range(i - 1, 0, -1):
            result[j] += result[j - 1]
    return result
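For what it's worth, on Python 3.8+ math.comb returns the same coefficients directly. Here is a hedged sketch of the length-first approach using it; random_subset_by_size is a hypothetical name, and it returns a set only to match the other versions. As far as I can tell, choices converts the total weight (2**n here) to a float, so I would only expect this to work for inputs up to roughly a thousand elements.

```python
from math import comb
from random import choices, sample

def random_subset_by_size(s):
    # Hypothetical helper: choose a size k with probability proportional
    # to C(n, k), then draw k distinct elements uniformly at random.
    items = list(s)  # sample requires a sequence, not a set
    n = len(items)
    weights = [comb(n, k) for k in range(n + 1)]
    size = choices(range(n + 1), weights, k=1)[0]
    return set(sample(items, size))

row = [comb(4, k) for k in range(5)]
print(row)

picked = random_subset_by_size({"a", "b", "c", "d"})
print(picked)
```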

I just realized I can simply go through each element in s and decide independently whether to keep it. Something like this:

from random import randint

def random_subset(s):
    out = set()
    for el in s:
        # random coin flip
        if randint(0, 1) == 0:
            out.add(el)
    return out

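This has the correct distribution: each element is kept independently with probability 1/2, so each of the 2^n subsets of an n-element set is equally likely.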
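A rough empirical check, as a sketch: the three-element set, the trial count, and the fixed seed are arbitrary choices of mine. It tallies how often each of the 8 possible subsets comes up.

```python
from collections import Counter
from random import randint, seed

def random_subset(s):
    out = set()
    for el in s:
        # random coin flip
        if randint(0, 1) == 0:
            out.add(el)
    return out

seed(0)  # fixed seed only so the check is repeatable
trials = 8000
counts = Counter(frozenset(random_subset({1, 2, 3})) for _ in range(trials))

# Each of the 8 subsets should appear about trials / 8 = 1000 times.
for subset, count in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(subset), count)
```

This is only a sanity check, not a proof; the uniformity follows from the independent coin flips.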
