EGF of rooted minimal directed acyclic graphs

I would like to present a script that computes this statistic, in order to generate activity on this question and to motivate research into fast algorithms, the goal being to compute enough terms for an OEIS entry. The linked paper shows that this will probably not be all that easy.

The Perl script below can be used to compute this statistic up to $a_7.$ While computing $a_n$ it outputs, for every $k\le 2^n$, the distribution of the number of distinct subtrees over the binary trees on $k$ nodes.

Here is the output for $a_6,$ which confirms the values presented in the comments.

1: u (1)
2: 2 u^2 (2)
3: u^2 + 4 u^3 (5)
4: 6 u^3 + 8 u^4 (14)
5: 4 u^3 + 22 u^4 + 16 u^5 (42)
6: 32 u^4 + 68 u^5 + 32 u^6 (132)
7: u^3 + 20 u^4 + 152 u^5 + 192 u^6 (365)
8: 10 u^4 + 196 u^5 + 584 u^6 (790)
9: 12 u^4 + 158 u^5 + 1140 u^6 (1310)
10: 160 u^5 + 1436 u^6 (1596)
11: 6 u^4 + 96 u^5 + 1692 u^6 (1794)
12: 68 u^5 + 1568 u^6 (1636)
13: 88 u^5 + 1284 u^6 (1372)
14: 24 u^5 + 1256 u^6 (1280)
15: u^4 + 36 u^5 + 1112 u^6 (1149)
16: 6 u^5 + 760 u^6 (766)
17: 24 u^5 + 854 u^6 (878)
18: 408 u^6 (408)
19: 18 u^5 + 504 u^6 (522)
20: 308 u^6 (308)
21: 416 u^6 (416)
22: 48 u^6 (48)
23: 8 u^5 + 246 u^6 (254)
24: 92 u^6 (92)
25: 160 u^6 (160)
26: 32 u^6 (32)
27: 144 u^6 (144)
28: 0 (0)
29: 72 u^6 (72)
30: 0 (0)
31: u^5 + 52 u^6 (53)
32: 6 u^6 (6)
33: 12 u^6 (12)
34: 0 (0)
35: 42 u^6 (42)
36: 0 (0)
37: 0 (0)
38: 0 (0)
39: 24 u^6 (24)
40: 0 (0)
41: 0 (0)
42: 0 (0)
43: 0 (0)
44: 0 (0)
45: 0 (0)
46: 0 (0)
47: 10 u^6 (10)
48: 0 (0)
49: 0 (0)
50: 0 (0)
51: 0 (0)
52: 0 (0)
53: 0 (0)
54: 0 (0)
55: 0 (0)
56: 0 (0)
57: 0 (0)
58: 0 (0)
59: 0 (0)
60: 0 (0)
61: 0 (0)
62: 0 (0)
63: u^6 (1)
64: 0 (0)
-
u + 3 u^2 + 15 u^3 + 111 u^4 + 1119 u^5 + 14487 u^6

The script is quite compact and might benefit from being rewritten in C. This is the code:

#! /usr/bin/perl -w
#
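# Usage: perl <this script> n   (default n = 1)
#
# For each k <= 2^n this prints the distribution (a polynomial in u) of the
# number of distinct subtrees over the binary trees on k nodes, discarding
# trees with more than n distinct subtrees; the final polynomial sums these
# distributions, so its coefficient of u^n is a_n.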

sub gf2str {
    my ($gf) = @_;

    return "0" if scalar(keys(%$gf)) == 0;

    my @terms;
    foreach my $exp (sort { $a <=> $b } keys %$gf){
        my $contr = $gf->{$exp};

        if($contr == 1 && $exp == 1){
            push @terms, "u";
        }
        elsif($contr == 1){
            push @terms, "u^$exp";
        }
        elsif($exp == 1){
            push @terms, 
            sprintf "%d u", $contr;
        }
        else{
            push @terms,
            sprintf "%d u^%d", $contr, $exp;
        }
    }

    join(' + ', @terms);
}


MAIN: {
    my $mx = shift || 1;

    my %grand;


    my $memo = [];
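    # $memo->[$n][$d] lists the trees on $n nodes that have $d distinct
    # subtrees.  Each tree is a hash whose keys identify its distinct
    # subtrees; a structurally identical subtree is always the same hash
    # object, so shared subtrees of the left and right branches collapse
    # when the key sets are merged.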
    push @{ $memo->[0]->[0] }, {};

    for(my $n=1; $n <= 2**$mx; $n++){
        for(my $m=0; $m<=$n-1; $m++){
            for(my $dst1 = 0; $dst1 < $mx; $dst1++){
                for(my $dst2 = 0; $dst2 < $mx; $dst2++){
                    if(exists($memo->[$m]->[$dst1]) &&
                       exists($memo->[$n-1-$m]->[$dst2])){
                        my $t1 = $memo->[$m]->[$dst1];
                        my $t2 = $memo->[$n-1-$m]->[$dst2];

                        for my $ta (@$t1){
                            for my $tb (@$t2){
                                my $tree = {};

                                @$tree{ keys %$ta } = 
                                    (1) x scalar(keys %$ta);
                                @$tree{ keys %$tb } = 
                                    (1) x scalar(keys %$tb);

                                $tree->{$tree} = 1;

                                my $count = scalar(keys %$tree);
                                if($count <= $mx){
                                    push @{ $memo->[$n]->[$count] },
                                    $tree;
                                }
                            }
                        }
                    }
                }
            }
        }

        my %gf = (); my $total = 0;
        for(my $dst = 0; $dst <= $mx; $dst++){
            if(exists($memo->[$n]->[$dst])){
                my $val = scalar(@{ $memo->[$n]->[$dst] });
                $gf{$dst} = $val;

                $total += $val;
                $grand{$dst} += $val;
            }
        }


        print "$n: ";
        print gf2str(\%gf);
        print " ($total)\n";
    }

    print "-\n";

    print gf2str(\%grand);
    print "\n";
}

I'm now quite confident the sequence starts

(1,1)
(2,3)
(3,15)
(4,111)
(5,1119)
(6,14487)
(7,230943)
(8,4395855)
(9,97608831)
(10,2482988079)
(11,71321533887)
(12,2286179073663)
(13,80984105660415)
(14,3144251526824991)
(15,132867034410319359)

and that's computed within a few seconds, using the following approach:

It is based on a function count :: [[Bool]] -> Int, where count xss is the number of DAGs with map length xss nodes at the respective levels; in each level, coded by an element xs :: [Bool] of xss, the entries of xs mark whether the corresponding node should have a predecessor.

In more detail, here's the specification of count:

We define a function shape :: DAG -> [[Bool]] (just for the specification; it is not in the source below) that takes a DAG (any DAG, possibly with several roots), computes the list of level sets, then for each set a canonical ordering (a list) of its nodes (lexicographic by left child, then right child, using the ordering of the lower levels), and then records for each node whether it has a predecessor (a node higher up that points to it). Now count s gives the number of DAGs d with shape d == s.

The point is that we can define count recursively (by induction on the number of levels), and we never really construct the DAGs - we just count them.

And while we count, we avoid recomputation using memoFix (really a fixpoint combinator with a cache). You may simply think of it as count arg = case arg ... return $ count ...
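
For readers unfamiliar with memoFix, here is a rough Python analogue of the idea (just an illustration of a memoizing fixpoint combinator; the real memoFix comes from Data.Function.Memoize and works on the typed Haskell side):

def memo_fix(f):
    """Memoizing fixpoint combinator: f receives a 'self' callback for
    recursive calls, and every argument is evaluated at most once."""
    cache = {}
    def self_(arg):
        if arg not in cache:
            cache[arg] = f(self_, arg)
        return cache[arg]
    return self_

# a naive doubly-recursive definition, made fast by the cache:
fib = memo_fix(lambda self, n: n if n < 2 else self(n - 1) + self(n - 2))
print(fib(90))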

To run this with ghc, you need the packages lens and memoize. You can load the source code in ghci and evaluate expressions like count [[False],[True],[True]].

import Control.Monad ( guard, forM_ )
import Control.Applicative
import Control.Lens
import Data.List (tails, sort)
import Data.Function.Memoize
import System.IO

type Shape = [[Bool]]

main = forM_ [ 1 .. ] $ \ s -> do
    print ( s, sum $ map count $ shapes s )
    hFlush stdout

shapes s = do  sh <- deep_shapes (s-1) ;  return $ [False] : sh

deep_shapes :: Int -> [Shape]
deep_shapes 0 = return []
deep_shapes s = do
  x <- [ 1 .. s ] ; xs <- deep_shapes (s-x)
  return $ (replicate x True) : xs

count :: Shape -> Int
count = memoFix $ \ self arg -> case arg of
    [] -> 1
    (sh : ape) -> sum $ do
        guard $ and $ map not sh
        top <- pairs (length sh) ape
        return $ self $ apply top ape

type Node = (Int,Int)
type Pair = (Node,Node)

apply :: [Pair] -> Shape -> Shape
apply top shape = 
    foldr ( \ (h,k) sh -> sh & ix (length shape - h) . ix k .~ False ) 
      shape $ do (p,q) <- top ; [p,q]

pairs s shape = pick s $ sort $ do
    let cs = candidates shape
        lower = concat $ drop 1 cs
        top = concat $ take 1 cs
    (left,right) <- [(lower,top),(top,top),(top,lower)]
    (,) <$> left <*> right

candidates :: [[Bool]] -> [[(Int,Int)]]
candidates shape = ( do
   (h,ops) <- zip [length shape, length shape-1 ..] shape
   return $ do (n, _) <- zip [0..] ops ; return (h,n) ) ++ [[(0,0)]]

pick :: Int -> [a] -> [[a]]
pick 0 _ = return []
pick s xs = do
    z : ys <- tails xs ; guard $ length ys >= s-1
    zs <- pick (s-1) ys ; return $ z : zs

Update: an $\mathcal{O}(n^6)$ dynamic programming solution found on the Project Euler forums reaches $T(100)$; a faster recurrence discovered in [Genetrini 2017] reaches $T(350)$.

I'm hoping that this post will serve as a full exposition of our ideas thus far - so feel free to edit or add to it. We are trying to count trees by their number of distinct subtrees. Here is the compiled list of values, now recorded as OEIS A254789 (offset 1):

$$ \begin{array}{rr} 1 & 1 \\ 2 & 1 \\ 3 & 3 \\ 4 & 15 \\ 5 & 111 \\ 6 & 1119 \\ 7 & 14487 \\ 8 & 230943 \\ 9 & 4395855 \\ 10 & 97608831 \\ 11 & 2482988079 \\ 12 & 71321533887 \\ 13 & 2286179073663 \\ 14 & 80984105660415 \\ 15 & 3144251526824991 \\ 16 & 132867034410319359 \\ 17 & 6073991827274809407 \\ 18 & 298815244349875677183 \\ 19 & 15746949613850439270975 \\ 20 & 885279424331353488224511 \\ 21 & 52902213099156326247243519 \\ \end{array} $$

$$ \; \\ $$

Motivation

The question of counting such trees is a natural extension of the results in $[1]$, neatly summarized by the paper's abstract:

paper abstract

$$ \; \\ $$

Problem Statement

In this thread we will be concerned with one type of tree, namely unlabeled plane rooted full binary trees. For convenience and clarity we drop the descriptive titles and simply call them "trees". In such trees each internal node has exactly two children (full binary) and the order of the children matters (plane). Instead of approximating the number of distinct subtrees of a given tree, we want to count the number of trees with $k$ distinct subtrees, or equivalently the number of trees whose compacted DAG has $k$ nodes. More precisely, let $\mathcal{T}$ be the set of all trees, and for a given tree $\tau \in \mathcal{T}$, let $S(\tau)$ be the set of subtrees of $\tau$. Then we want to count $\tilde{\mathcal{T}_k} = |\mathcal{T}_k|$ where

$$ \mathcal{T}_k = \left\{ \tau \in \mathcal{T}, \; \left| \, S(\tau) \, \right| = k \right\} $$

Note that $\left| \, S(\tau) \, \right| = \left| \, \text{dag}(\tau) \, \right|$ is the number of nodes in the compacted DAG of $\tau$.
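
To make the definition concrete, here is a tiny Python sketch (an illustration only, not any of the implementations referenced below) that computes $\left| \, S(\tau) \, \right|$ for a tree encoded as nested tuples:

# A tree is either the leaf () or a pair (left, right) of trees.

def subtrees(tau):
    """The set of distinct subtrees of tau, tau itself and the leaf included."""
    if tau == ():
        return {()}
    left, right = tau
    return {tau} | subtrees(left) | subtrees(right)

cherry = ((), ())
tau = (cherry, cherry)        # both children are the same subtree
print(len(subtrees(tau)))     # 3: the leaf, the cherry, and tau itself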

$$ \; \\ $$

Examples

To understand what we are counting and how the compaction works, here is an enumeration of $\mathcal{T}_k$ for $k\leq4$. The trees are colored black, their compacted DAGs blue, and their subtrees red. Note how each node in the DAGs corresponds to a particular subtree. [Higher Quality]

$ \qquad \qquad \qquad \qquad $ trees

You may notice that the OEIS entry is offset by 1 and that the example trees are not full (some internal nodes have fewer than two children). By removing the leaves from each of our trees we obtain a bijection between full binary trees with $k$ subtrees and binary trees with $k-1$ subtrees, hence the offset. All of the counting methods developed here have analogs slightly altered to fit this interpretation (e.g. in the canonical form of the DAGs each node may have fewer than two children). Since this doesn't provide any speedup in computation, we will ignore this interpretation henceforth.
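
Here is a small sketch of that bijection (again just an illustration): stripping the leaves of a full binary tree with $k$ distinct subtrees yields a binary tree with $k-1$ distinct subtrees.

# Trees are nested tuples: () is a leaf, (L, R) an internal node (full binary).
# Stripping the leaves turns () into None and keeps the internal nodes,
# giving a (not necessarily full) binary tree.

def subtrees(t):
    return {()} if t == () else {t} | subtrees(t[0]) | subtrees(t[1])

def strip(t):
    return None if t == () else (strip(t[0]), strip(t[1]))

def subtrees_stripped(t):
    return set() if t is None else {t} | subtrees_stripped(t[0]) | subtrees_stripped(t[1])

cherry = ((), ())
tau = (cherry, ((), cherry))               # a full binary tree with 4 distinct subtrees
print(len(subtrees(tau)))                  # 4
print(len(subtrees_stripped(strip(tau))))  # 3: one less, as the bijection predicts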

$$ \; \\ $$

Preliminary Observations

  • If a tree has $n$ internal nodes, then it has $n\!+\!1$ leaves. The number of trees with $n$ internal nodes is the Catalan number $C_n = \frac{1}{n+1}{2n \choose n} \approx \frac{4^n}{\sqrt{\pi n^3}} $.

  • If a tree $\tau$ has height $h$, then $\left| \, \tau \, \right| \in [2h\!-\!1,2^h\! -\! 1]$ and $\left| \, S (\tau) \, \right| \geq h $. The number of trees with height $h$ satisfies the recurrence $T_{(h)} = T_{(h-1)}^2 + 2 T_{(h-1)} \sum_{k=1}^{h-2} T_{(k)}$. As seen on OEIS A001699, $T_{(h)} \approx 1.5^{2^h}$.

  • The result of $[1]$ gives us a rough estimate for $\tilde{\mathcal{T}_k}$. It tells us that the expected number of subtrees of a tree with $n$ internal nodes is $$\tilde{K}_n = 2 \sqrt{\frac{\log 4}{\pi}} \frac{n}{\sqrt{\log n}} \left( 1 + \mathcal{O}\left(\tfrac{1}{\log n} \right) \right)$$ Suppose we fix $\tilde{K}_n=k$ and solve for $n$. Then we get the expected number of internal nodes of a tree with $k$ subtrees. Given this estimate of $n$, we can guess that there will be roughly $C_n$ different trees with $k$ subtrees (see the numerical sketch after this list).$$ \; $$ $ \qquad \qquad \qquad \qquad \qquad $ rough estimate $$ \; $$ The fact that this underestimates the true values can perhaps be attributed to the $\mathcal{O}(\frac{1}{\log n})$ term, which I took to be zero. In any case we see exponential growth, which means that we will need to develop a method to count rather than enumerate trees.
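
Here is the numerical sketch referred to above (Python; an illustrative sketch that drops the $\mathcal{O}(1/\log n)$ term and inverts $\tilde{K}_n$ by simple bisection, which is just a convenient choice and not part of any of the computations below):

import math

def expected_subtrees(n):
    # K~_n with the O(1/log n) correction dropped
    return 2 * math.sqrt(math.log(4) / math.pi) * n / math.sqrt(math.log(n))

def catalan_estimate(n):
    # C_n ~ 4^n / sqrt(pi n^3)
    return 4.0 ** n / math.sqrt(math.pi * n ** 3)

def rough_estimate(k):
    # invert expected_subtrees(n) = k by bisection, then estimate C_n
    lo, hi = 2.0, 10.0 ** 6
    for _ in range(200):
        mid = (lo + hi) / 2
        if expected_subtrees(mid) < k:
            lo = mid
        else:
            hi = mid
    return catalan_estimate((lo + hi) / 2)

for k in (6, 11, 16, 21):
    print(k, rough_estimate(k))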

$$ \; \\ $$

A Method of Enumeration

Each tree $\tau = x\,(L,R)$ is uniquely defined by its set of subtrees, which can be written in terms of those of its left and right subtrees $L$ and $R$: $$ S(\tau) = \{\tau\} \cup S(L) \cup S(R) $$ This leads to a natural algorithm for computing $\tilde{\mathcal{T}_k}$ by enumeration. We build $\mathcal{T}$ by gluing trees together under an added root node. The resulting set of subtrees is the union of the subtree sets of the two glued trees, plus a new element that denotes the entire tree. Letting $\imath$ denote the singleton tree, the explicit trees we find after each gluing iteration are

$$ \begin{array}{lll} T^{(1)} &= \{\imath\} & = \{1\} \\ T^{(2)} &= \{\imath, \imath(\imath,\imath)\} & = \{1,2\}\\ T^{(3)} &= \{\imath, \imath(\imath,\imath), \imath(\imath(\imath,\imath),\imath), \imath(\imath,\imath(\imath,\imath)), \imath(\imath(\imath,\imath),\imath(\imath,\imath))\} & = \{1, 2, 3, 4, 5\} \\ \end{array} $$

$$ \begin{align} T^{(1)} &= \{ \{1\} \} \\ T^{(2)} &= \{ \{1\},\{1,2\} \} \\ T^{(3)} &= \{ \{1\},\{1,2\}, \{1,2,3\}, \{1,2,4\}, \{1,2,5\} \} \\ \end{align} $$

Thus we find $\tilde{\mathcal{T}_1}=1, \; \tilde{\mathcal{T}_2}=1, \; \tilde{\mathcal{T}_3}=3$. It is clear that after the $k$th iteration, you will have enumerated all trees in $\mathcal{T}_k$. Of course, you will also enumerate trees with more than $k$ subtrees, so it is prudent to prune any such trees along the way. The downside to this algorithm is that it enumerates every tree. Since the number of trees grows exponentially, we find that computing $\tilde{\mathcal{T}_k}$ becomes intractable for $k>9$. For larger $k$, we will need to develop a method of counting. An implementation was written in Python: code. User Marko Riedel also posted various implementations in (B1) (B2) (B3).
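
For illustration, here is a toy Python sketch of the gluing-and-pruning idea (not the linked implementation or Marko Riedel's): trees are nested tuples, and each tree is stored together with its set of distinct subtrees.

from itertools import product

def count_by_subtrees(kmax):
    """Number of trees with exactly k distinct subtrees, for each k <= kmax,
    found by repeatedly gluing known trees under a new root and pruning
    anything with more than kmax distinct subtrees."""
    leaf = ()
    trees = {leaf: frozenset([leaf])}           # tree -> its set of subtrees
    changed = True
    while changed:                              # iterate until no new trees appear
        changed = False
        for left, right in product(list(trees), repeat=2):
            tau = (left, right)
            if tau in trees:
                continue
            subs = trees[left] | trees[right] | {tau}
            if len(subs) <= kmax:               # prune: too many distinct subtrees
                trees[tau] = subs
                changed = True
    counts = {}
    for subs in trees.values():
        counts[len(subs)] = counts.get(len(subs), 0) + 1
    return counts

print(count_by_subtrees(5))   # expect {1: 1, 2: 1, 3: 3, 4: 15, 5: 111}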

$$ \; \\ $$

A Method of Counting

This method was developed by @d8d0d65b3f7cf42 in his posts (A1) (A2) and then slightly optimized here. Instead of enumerating trees, we count DAGs. We start by characterizing the DAGs in a unique way. Every node $v$ in the DAG represents a unique subtree. We let the height of $v$ be the height of the tree it represents. Note that this equals the length of the longest chain from $v$ to the node $\imath$ that represents the singleton tree. By grouping nodes of the same height we form layers and the graph takes shape.

$\qquad \qquad \qquad \qquad \qquad \qquad\qquad \qquad$ dag shape

Noting the natural ordering of the nodes, we can represent each DAG in canonical form:

  1. There is a unique root with no parents and a unique sink with no children. Note the root and sink correspond to the entire tree and the singleton tree, respectively.
  2. Every node except the sink has exactly two children below it, one of which is on the adjacent layer. The children are ordered, meaning we make the distinction $(a,b) \neq (b,a)$.
  3. Every node except the root has at least one parent.
  4. If $u,v$ are nodes on the same layer with $u < v$, then the ordered children of $u$ must be lexically smaller than those of $v$. That is, if $(u_1, u_2)$ and $(v_1,v_2)$ are the children of $u,v$ respectively, then either $u_1 < v_1$ or $u_1 = v_1$ and $u_2 < v_2$.

One interesting observation is that the number of canonical DAGs with shape $(1,1,1,\ldots,1)$ is $(2k-3)!!$. There is a correspondence between these DAGs and a restricted set of trees constructed in the recursive enumeration method: if at each iteration you only glue two trees when one is a subtree of the other, then you end up with exactly these DAGs. However, this doesn't play a role in our counting computation.

To count $\tilde{\mathcal{T}_k}$ we do the following. First we generate the possible shapes that the DAGs can take; the possible shapes are a subset of the compositions of $k$. With a great (respectively small) amount of effort you can enumerate these exactly (respectively approximately) by pruning any shapes that have layers with too many nodes (e.g. a layer cannot be wider than the number of possible children pairs below it); a sketch of this pruning follows. Next, the idea is to keep track of the nodes that have parents. For each shape there will be many different boolean coverings, where each node is assigned a value of $\mathtt{True}$ if it has at least one parent and $\mathtt{False}$ otherwise. We can count the number of DAGs that have a given $\mathtt{shape}$ and $\mathtt{covering}$ by inducting on the height of the graph. This leads to a recursive algorithm in which we attach the top-layer nodes to those below them in every valid way.
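
Here is a minimal Python sketch of the shape generation with this pruning rule (an illustration only, not part of the linked implementation); it generates only root-topped shapes, whereas the recursion below also visits subshapes with wider top layers.

def shapes(k):
    """Candidate DAG shapes on k nodes, written top layer first: compositions
    of k whose bottom layer is the single sink, whose top layer is the single
    root, and in which no layer is wider than the number of ordered children
    pairs available strictly below it.  (This is the approximate pruning
    described above, so a few listed shapes may still be unachievable.)"""
    def build(remaining, adjacent, rest):
        # Build layers bottom-up (sink excluded).  The next layer sits on top
        # of a layer of width `adjacent`, with `rest` further nodes underneath.
        if remaining == 0:
            yield []
            return
        max_width = adjacent ** 2 + 2 * adjacent * rest
        for width in range(1, min(remaining, max_width) + 1):
            for upper in build(remaining - width, width, adjacent + rest):
                yield [width] + upper
    for partial in build(k - 1, 1, 0):
        shape = tuple(reversed(partial)) + (1,)   # read top-down, append the sink
        if shape[0] == 1:                          # canonical DAGs have one root
            yield shape

print(list(shapes(5)))   # expect [(1, 1, 1, 1, 1), (1, 2, 1, 1)]

For $k=4$ the only surviving shape is $(1,1,1,1)$, consistent with the $(2k-3)!!$ observation above and $\tilde{\mathcal{T}_4}=15$.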

Another optimization manifests when attaching the top layer. If there are $\ell_1$ nodes in the top layer of the DAG, $\ell_2$ in the second layer and $\ell_{3}$ nodes below, then there are ${\ell_2^2 + 2\ell_2\ell_3 \choose \ell_1}$ ways of connecting the top layer to the rest of the DAG. This number grows fast, becoming unwieldy for even modest shapes. An alternative idea is to choose the children first and then count the number of ways to assign the top layer to those children. Suppose we connect the top layer to exactly $\alpha$ nodes in the second layer and $\beta$ nodes below. Using inclusion-exclusion, we find that the number of ways to connect the top layer in such a manner is $$ M(\ell_1, \alpha, \beta) = \sum_{j=0}^{\alpha + \beta} (-1)^j \sum_{i=0}^{j}{\alpha \choose i} {\beta \choose j-i} {(\alpha-i)^2 + 2(\alpha-i)(\beta-j+i) \choose \ell_1}$$

Each summand can be read as follows: we choose $\ell_1$ of the possible children pairs, given that we leave $i$ children uncovered in the second layer and $j-i$ children uncovered in the lower layers. Finally, a shape $S$ and covering $c$ are canonical if the top layer of $S$ has one node (the DAG contains a root) and if $c$ assigns every other node $\mathtt{True}$ (requirement 3). Below is pseudocode to count the number of DAGs for each covering of a given shape.

$\qquad$Shape $S$ is represented by a tuple of integers
$\qquad$Covering $c$ is represented by a binary string
$\qquad$Let $M(\ell_1,\alpha,\beta)$ be the inclusion exclusion formula as above
$\qquad$Let $D[S,c]$ represent the number of DAGs with shape $S$ and covering $c \\$
$\qquad\mathtt{Count}(S):$
$\qquad\qquad \mathtt{Count}(S[2\colon \!])$ recurse on the subshape
$\qquad\qquad \ell_1 \leftarrow S[1]$ the number of nodes in the top layer
$\qquad\qquad$ for each set of children $\varsigma$
$\qquad \qquad \qquad \alpha \leftarrow$ the number of children in $\varsigma$ on the second layer $S[2]$
$\qquad \qquad \qquad \beta \leftarrow$ the number of children in $\varsigma$ below the second layer
$\qquad\qquad \qquad $ for each covering $c$ of $S[2\colon\!]$
$\qquad\qquad \qquad \qquad D[S,\varsigma \vee c]$ += $M(\ell_1, \alpha, \beta) \,D[S[2\colon \! ], c]\\$
$\qquad \qquad$ if $S$ has a root node, i.e. $S[1]=1$:
$\qquad \qquad \qquad c^* \leftarrow$ full covering of $S$
$\qquad \qquad \qquad \tilde{\mathcal{T}_{|S|}}$ += $D[S,c^*]\\$

$ \; \\ $
$\qquad \qquad \qquad \qquad \qquad \qquad \qquad$ shape, covering, top layer assignment

$$ \begin{align} S &= (3,2,3,1,2,1,1) \\ \ell_1, \alpha, \beta &= 3,2,2 \\ c &= \;\;\;\;\;0010101011 \\ \varsigma &= 0001100100100 \\ \varsigma \vee c &= 0001110101111 \\ \end{align} $$
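
For concreteness, here is a direct Python transcription of $M(\ell_1,\alpha,\beta)$ (an illustrative sketch, not the linked implementation; math.comb requires Python 3.8+), together with a sanity check that summing over all possible covered sets recovers the unrestricted count ${\ell_2^2+2\ell_2\ell_3 \choose \ell_1}$:

from math import comb

def M(l1, alpha, beta):
    """Ways to give each of the l1 top-layer nodes a distinct ordered pair of
    children (at least one child on the adjacent layer) so that exactly alpha
    chosen second-layer nodes and beta chosen lower nodes are all covered."""
    total = 0
    for j in range(alpha + beta + 1):
        for i in range(j + 1):
            if i > alpha or j - i > beta:
                continue
            pairs = (alpha - i) ** 2 + 2 * (alpha - i) * (beta - j + i)
            total += (-1) ** j * comb(alpha, i) * comb(beta, j - i) * comb(pairs, l1)
    return total

print(M(3, 2, 2))   # the l1, alpha, beta of the example above

# Sanity check: summing over all possible covered sets recovers the
# unrestricted count of children-pair choices, here for l2 = 2, l3 = 3.
l1, l2, l3 = 3, 2, 3
assert sum(comb(l2, a) * comb(l3, b) * M(l1, a, b)
           for a in range(l2 + 1) for b in range(l3 + 1)) == comb(l2**2 + 2*l2*l3, l1)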

Here is a dirty Python implementation. This code confirms @d8d0d65b3f7cf42's values for $k\leq16$ and was used to obtain values for $k\leq 21$ -- though it takes about 14 hours for $k=21$. I fixed the memory issues by removing old values from the memoization table (being careful not to re-add any recomputed canonical shape). Excitingly, I think I have a way to directly count by shape (which would get us ~10 more values). It counts by an equivalence relation: a $k \times k$ matrix for which the $i,j$th entry is the number of trees of size $i$ that would add $j$ subtrees when taking the union.


The two programs can easily be altered to count trees where we don't care about the order of children (no longer "plane" trees). In each you simply need to comment out one line of code. In the enumeration method, you only perform one of the two gluings $x(\tau_1,\tau_2), x(\tau_2, \tau_1)$. In the counting method, the only thing that changes is the third binomial coefficient in the inclusion-exclusion formula. The list of values of this sister sequence for $k\leq19$ is [1, 1, 2, 6, 25, 137, 945, 7927, 78731, 906705, 11908357, 175978520, 2893866042, 52467157456, 1040596612520, 22425725219277, 522102436965475, 13064892459014192, 349829488635512316].