Repeatedly taking mean values of non-empty subsets of a set: $2,\,3,\,5,\,15,\,875,\,...$

Out of curiosity, I computed this directly. Apparently $a_5 = |S_5| = 603919253973 \approx 6\cdot 10^{11}$.

I wrote a program to calculate all subset sums $s$ for each subset size $k$. (There are just under $10^{12}$ such $s$'s.) The values $\frac s k$ are then the members of $S_5$. As a check on the algorithm, I also obtained $|S_{5,125}| = 33947876$ and $|S_{5,200}| = 1088970851$, i.e. the counts restricted to the smallest 125 and 200 elements of $S_4$.
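For the smaller terms this brute-force approach is easy to reproduce. A minimal sketch (my own, assuming the sequence starts from $\{0,1\}$), using exact rational arithmetic, recovers the first five terms $2, 3, 5, 15, 875$:

```python
from fractions import Fraction
from itertools import combinations

def mean_set(s):
    """Return the set of means of all non-empty subsets of s."""
    elems = list(s)
    out = set()
    for k in range(1, len(elems) + 1):
        for c in combinations(elems, k):
            out.add(sum(c, Fraction(0)) / k)
    return out

S = {Fraction(0), Fraction(1)}   # assumed starting set
sizes = [len(S)]
for _ in range(4):               # last round enumerates 2^15 - 1 subsets
    S = mean_set(S)
    sizes.append(len(S))
print(sizes)                     # [2, 3, 5, 15, 875]
```

The next step would require $2^{875}-1$ subsets, which is exactly why the program above works with the subset sums of $S_4$ directly instead.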

This value for $|S_5|$ is fairly close to Greg Martin's upper bound, which isn't surprising to me: there are $2^{875}-1$ subsets and fewer than $7\cdot 10^{12}$ places for their means to go. Most of the gap between $|S_5|$ and that upper bound is due to the fact that, for whatever reason, only about $10^{12}$ of the possible sums actually occur. Part of the remaining gap can be explained by divisibility: the naïve model in which $\frac{\phi(d)}{k}$ of the $\frac s k$'s have reduced denominator $d$ (which holds if the $s$'s are evenly distributed mod $k$) appears to fit $S_5$ quite accurately. I suspect that $|S_6|$ is also not far from its corresponding upper bound, the trivial bound being $$\operatorname{lcm}(1,\dots,875)\cdot \sum_{k=1}^{|S_5|} k \approx 10^{405}.$$

I also used a birthday method to guess $|S_5|$: take averages of random subsets of $S_4$ until there is a duplicate. Since the take-a-random-subset operation does not produce uniformly random averages, this yields an underestimate rather than an unbiased estimator. For $S_5$ in particular, it suggests a lower bound of $10^{10}$ (improvable to $10^{11}$ using some ad-hoc methods to reduce the non-uniformity). Unfortunately it takes $\Theta(\sqrt{N})$ expected samples for a set of size $N$, which makes it quite hopeless for $S_6$.
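A minimal sketch of such a birthday estimator (my reconstruction, not the author's actual code): for uniform draws from a set of size $N$, the first collision appears after about $\sqrt{\pi N/2}$ samples, so $\hat N = 2t^2/\pi$ is a natural point estimate from the collision time $t$; the non-uniformity of subset means biases it low, as noted above.

```python
import random
from fractions import Fraction
from math import pi

def birthday_estimate(elements, rng=random):
    """Draw means of random non-empty subsets of `elements` until a mean
    repeats, then estimate the number of distinct means from the collision
    time.  Biased low: subset means are far from uniformly distributed."""
    elements = list(elements)
    seen = set()
    t = 0
    while True:
        t += 1
        k = rng.randint(1, len(elements))
        m = sum(rng.sample(elements, k), Fraction(0)) / k
        if m in seen:
            # for uniform draws, E[t] ~ sqrt(pi*N/2), hence N ~ 2*t^2/pi
            return round(2 * t * t / pi)
        seen.add(m)
```

Running this on the 875 elements of $S_4$ (and averaging several runs) is what produces the $\sim 10^{10}$ figure mentioned above.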


Not a complete answer, but some ideas:

The number of distinct means of $k$-element subsets of an $n$-element set is at least $k(n-k)+1$. For example, when $k=3$ and $n=9$, the following subsets all have different means: $\{x_1,x_2,x_3\}$, $\{x_1,x_2,x_4\}$, ..., $\{x_1,x_2,x_9\}$, $\{x_1,x_3,x_9\}$, ..., $\{x_1,x_8,x_9\}$, $\{x_2,x_8,x_9\}$, ..., $\{x_7,x_8,x_9\}$. (Each subset is obtained from the previous one by replacing an element with a strictly larger one, so the means strictly increase.) Applying this with $n=875$ and $k=438$ already gives $438\cdot437+1 = $ 191,407 distinct means.
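The chain is easy to make concrete (my own illustration): sort the set, start from the $k$ smallest elements, and slide one index at a time toward the top, so that the mean strictly increases at each of the $k(n-k)$ steps.

```python
from fractions import Fraction

def chain_means(xs, k):
    """Means along a chain of k-subsets of sorted(xs), where each subset
    replaces one element of the previous subset by a strictly larger one.
    The means strictly increase: k*(n-k)+1 distinct values in total."""
    xs = sorted(xs)
    n = len(xs)
    idx = list(range(k))              # indices of the current subset
    s = sum(xs[:k])
    means = [Fraction(s, k)]
    for pos in range(k - 1, -1, -1):  # slide each position up in turn
        while idx[pos] < n - k + pos:
            s += xs[idx[pos] + 1] - xs[idx[pos]]
            idx[pos] += 1
            means.append(Fraction(s, k))
    return means

ms = chain_means(range(9), 3)         # the n=9, k=3 example above
print(len(set(ms)))                   # 19 = 3*(9-3)+1
```

The same function with `chain_means(range(875), 438)` produces the 191,407 distinct means claimed above (for any 875 distinct values, not just `range(875)`).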

We can build on this, though. Of the 875 means counted by $a_4$, 52 have a factor of 13 in their (reduced) denominator, while the other 823 do not. Taking $k=412$ and $n=823$, we get $412\cdot411+1 = $ 169,333 subsets with distinct means. Furthermore, among those 52 there are numerators in every nonzero residue class modulo 13. Therefore we can take each of the 169,333 subsets and form 13 variants of it with different means: the subset itself, together with the 12 subsets obtained by appending a single element whose denominator is divisible by 13, one for each nonzero residue class of the numerator modulo 13. That gives $13 \cdot 169{,}333 = $ 2,201,329 means, which a little thought verifies are distinct.
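These counts can be checked directly (my own sketch, assuming the sequence starts from $\{0,1\}$): compute $S_4$ exactly and inspect the reduced denominators and numerators.

```python
from fractions import Fraction
from itertools import combinations

S = {Fraction(0), Fraction(1)}        # assumed starting set
for _ in range(4):                    # after 4 rounds, S is S_4
    elems = list(S)
    S = {sum(c, Fraction(0)) / k
         for k in range(1, len(elems) + 1)
         for c in combinations(elems, k)}

with13 = [x for x in S if x.denominator % 13 == 0]
# the answer above reports 875 means, 52 of them with 13 in the denominator
print(len(S), len(with13))
# numerators of those 52 should hit every nonzero residue class mod 13
print(sorted({x.numerator % 13 for x in with13}))
```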

One could experiment with denominator factors other than 13 (perhaps composite ones) to squeeze more out of this argument.

Finally, note that the mean of a $p$-element subset and the mean of a $q$-element subset, if $p$ and $q$ are relatively prime, are quite likely to be distinct from each other. (All the factors of $p$ and of $q$ would have to be cancelled from the denominators by the sums of the elements in the subsets.) So one should be able to combine various collections of means in this way and improve the lower bound. (Of course, taking $p$ and $q$ near $875/2$ seems the best place to explore.)

(added later) As for the upper bound, let's bound the number of $k$-element means separately and mostly forget about whether they could coincide. Obviously there are 875 $1$-element means. For $2\le k\le 875$, there are obviously at most $\binom{875}k$ $k$-element means. However, we can get a different upper bound as follows: The largest of the 875 elements is $1$ of course, and the least common denominator of the 875 elements is 17,297,280. Therefore every single $k$-element mean is a rational number between $0$ and $1$ whose denominator divides $17\text{,}297\text{,}280k$, and there are at most $17\text{,}297\text{,}280k-1$ of them (not counting $0$ and $1$ themselves, which are already counted by the $1$-element means). Therefore an upper bound for $a_5$ is $$ 875 + \sum_{k=2}^{875} \min\bigg\{ \binom{875}k, 17\text{,}297\text{,}280k-1 \bigg\} = 6\text{,}568\text{,}806\text{,}008\text{,}597. $$ So at least we know that $a_5$ is between $2\times10^6$ and $7\times10^{12}$.
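The final sum is a one-liner to reproduce (using the least common denominator 17,297,280 of the 875 elements, as above):

```python
from math import comb

LCD = 17_297_280   # least common denominator of the 875 elements of S_4
bound = 875 + sum(min(comb(875, k), LCD * k - 1) for k in range(2, 876))
print(bound)       # 6568806008597, i.e. 6,568,806,008,597 as stated
```

(The binomial term is the smaller one only at $k=2$ and for $k \ge 872$; the linear term dominates the sum in between.)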


First, I would like to confirm Japhet's results for $|S_{5,125}|$ and $|S_{5,200}|$. I cannot fit the large amount of disk (17 TB?) on a single system, and NFS volumes are not fast enough for intensive computation. But that is not the purpose of this answer; if I find a fast middle ground between brute force and soft computation, I will post a second answer.


The values of $A(n)$ are very big, and we cannot expect to compute them by brute force; computing $A(5)$ was already a feat. I would like to share some observations; perhaps someone can take this further (before I do).

First, let's come back to the question: how do we compute $A(n)$? It is difficult because many elements of $S(n)$ occur many times (multiple occurrences), and we cannot merely add them up.

Let $D(n)$ be the multiset of all the means produced at step $n$ (it surjects onto $S(n)$), and let $F(i)$ be the frequency (multiplicity) of each result $i$.

Then we can write: $|D(n)| = \sum_{i \in S(n)} F(i)$,

while

$A(n) = \sum_{i \in S(n)} 1$

$|D(n)| = 2^{A(n-1)}-1$ is known in advance, so if we find all the elements of $S(n)$ together with their frequencies $F(i)$, the identity above gives a way to check the result against $|D(n)|$.

This may seem like extra work, but the goal is to find effective optimization operators for the induction from one $n$ to the next, so as to make the results computable. It is not a useless complication.

Let's compute the couples $C(n) = \{ (s, F(s)) : s \in S(n) \}$ for $n=4$.
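As a sanity check of these definitions (my own sketch, again assuming the starting set $\{0,1\}$): the multiset $D(4)$ has $2^{15}-1 = 32767$ members but only $A(4) = 875$ distinct values, and $\sum_i F(i)$ indeed recovers $|D(4)|$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

S = {Fraction(0), Fraction(1)}        # assumed starting set
for _ in range(3):                    # after 3 rounds, S is S(3): 15 elements
    elems = list(S)
    S = {sum(c, Fraction(0)) / k
         for k in range(1, len(elems) + 1)
         for c in combinations(elems, k)}

# C(4): every (mean, frequency) couple at step 4
elems = list(S)
C4 = Counter(sum(c, Fraction(0)) / k
             for k in range(1, len(elems) + 1)
             for c in combinations(elems, k))
print(len(C4), sum(C4.values()))      # A(4) = 875 and |D(4)| = 32767
```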


Don't ask symmetry lovers whether they are happy ...

Frequencies of the means on the segment


Zoom : Frequencies of the means on the segment

Note: the repetition width appears to be constant. For $n=4$, the width is $1/48$.


another Zoom : Frequencies of the means on the segment

Yes: within each cyclic window you see an image of the whole curve from the previous step. There are duplicates.


Note: many numbers of high ABC quality, and many primes, appear among the remarkable values involved in this question. I am intrigued by a vague picture of a semi-cyclic function pointing at high-quality ABC numbers and primes. In any case, it's an excellent end-of-summer puzzle.

I like this question very much because it connects with simulations I am running around thermodynamics, QM, and information propagation. I hope other answers will help and that this one will evolve.

Feel free to ask for my scripts. The tools are (the very efficient) Linux sort, C++, and Grapher. The latest version mixes brute force with tricks from the fractal approach. The next version is in preparation.