Calculate the mean of the values in a Python collections.Counter

collections.Counter() is a subclass of dict. Just use Counter().values() to get a list of the counts:

from collections import Counter
import numpy

counts = Counter(some_iterable_to_be_counted)
mean = numpy.mean(counts.values())  # Python 2: values() returns a list

Note that I did not call Counter.most_common() here, which would produce the list of (key, count) tuples you posted in your question.

If you must use the output of Counter.most_common(), you can extract just the counts with a list comprehension:

mean = numpy.mean([count for key, count in most_common_list])

If you are using Python 3 (where dict.values() returns a dictionary view rather than a list), you can either pass in list(counts.values()), or use the standard library statistics.mean() function, which accepts any iterable (including a dict.values() dictionary view).
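
As a rough Python 3 illustration of the statistics.mean() route, here is a minimal sketch. The input string "abracadabra" and the name mean_count are made up for the example and are not from the question.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample data: count the letters of a word.
counts = Counter("abracadabra")  # a: 5, b: 2, r: 2, c: 1, d: 1

# statistics.mean() accepts the dict view directly, no list() needed.
mean_count = mean(counts.values())  # (5 + 2 + 2 + 1 + 1) / 5
```

With numpy you would instead write numpy.mean(list(counts.values())), since numpy does not treat a dictionary view as a sequence of numbers.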

If you meant to calculate the mean of the keys, weighted by their counts, you'd compute it directly from the counter's items. In Python 2 that'd be:

from __future__ import division

mean = sum(key * count for key, count in counter.iteritems()) / sum(counter.itervalues())

The from __future__ import must be at the top of your module; it makes / perform true division, so the result is not floored to an integer when the keys and counts are all integers. In Python 3, where true division is the default, that simplifies to:

mean = sum(key * count for key, count in counter.items()) / sum(counter.values())
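
To make the weighted mean concrete, here is a small Python 3 sketch. The sample numbers and the names total, n and weighted_mean are chosen only for illustration.

```python
from collections import Counter

# Hypothetical sample: the value 1 twice, 2 once, 3 three times.
counter = Counter([1, 1, 2, 3, 3, 3])

total = sum(key * count for key, count in counter.items())  # 1*2 + 2*1 + 3*3
n = sum(counter.values())                                   # number of observations
weighted_mean = total / n
```

The result is the same as the plain mean of the expanded list [1, 1, 2, 3, 3, 3], but it is computed without building that list.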

The median could be calculated with bisection: sort the (key, count) pairs by key, build an accumulated sum of the counts, and bisect the half-way point of the total count into that accumulated sum. The resulting insertion index points to the median key in the sorted list of keys.
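
The bisection idea could be sketched as follows in Python 3. The function name counter_median is hypothetical, the counter is assumed to be non-empty with positive counts, and for an even number of observations this sketch returns the lower of the two middle values rather than their average.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

def counter_median(counter):
    """Return the (lower) median key of a non-empty Counter of sortable keys."""
    pairs = sorted(counter.items())                  # (key, count) sorted by key
    keys = [key for key, _ in pairs]
    cumulative = list(accumulate(count for _, count in pairs))
    total = cumulative[-1]
    # 1-based position of the lower median in the expanded, sorted data.
    midpoint = (total + 1) // 2
    # First key whose running count reaches the midpoint.
    index = bisect_left(cumulative, midpoint)
    return keys[index]

even_case = counter_median(Counter([1, 1, 2, 3, 3, 3]))  # middle values are 2 and 3
odd_case = counter_median(Counter([1, 2, 2, 5, 5]))      # middle value is 2
```

If you need the conventional median for even totals, you would look up the key at position midpoint + 1 as well and average the two.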


While you can offload everything to numpy after expanding the counter into a list of values, that does more work than necessary, because every repeated value is processed separately. Instead, you can compute what you need directly from the definitions.

The mean is just the sum of all numbers divided by their count, so that's very simple:

sum_of_numbers = sum(number*count for number, count in counter.items())
count = sum(count for n, count in counter.items())
mean = sum_of_numbers / count

Standard deviation is a bit more complex. It's the square root of the variance, and the (population) variance in turn can be computed as "mean of squares minus the square of the mean" for your collection. Soooo...

import math

# Iterate over .items(); iterating the counter itself yields only the keys.
total_squares = sum(number*number * count for number, count in counter.items())
mean_of_squares = total_squares / count
variance = mean_of_squares - mean * mean
std_dev = math.sqrt(variance)

A little bit more manual work, but should also be much faster if the number sets have a lot of repetition.
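
Putting the pieces together, a self-contained Python 3 sketch might look like this. The function name counter_mean_std and the sample data are made up for illustration, and the max(..., 0.0) guard is my own addition to protect against a tiny negative variance caused by floating point rounding.

```python
import math
from collections import Counter

def counter_mean_std(counter):
    """Return (mean, population standard deviation) of the counted numbers."""
    n = sum(counter.values())
    mean = sum(x * c for x, c in counter.items()) / n
    mean_of_squares = sum(x * x * c for x, c in counter.items()) / n
    variance = mean_of_squares - mean * mean
    # Guard against a slightly negative result from rounding error.
    return mean, math.sqrt(max(variance, 0.0))

# Hypothetical sample data.
sample = Counter([2, 4, 4, 4, 5, 5, 7, 9])
sample_mean, sample_std = counter_mean_std(sample)
```

This is the population standard deviation (dividing by n), which matches what numpy.std() computes by default.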

Tags:

Python

Numpy