Definition of General Associativity for binary operations

This is a really good question and it comes up (among other places) in tensor category theory, where equality is replaced by canonical isomorphism and suddenly things cannot be handwaved away... I'll try to give an answer; unfortunately it is lacking in illustrations (due to math.stackexchange not having packages like xypic) and examples (due to my laziness). Sorry also for its length...

I will write $R$ for $\mathbb R$ because really there is nothing specific about $\mathbb R$ that you are using here (and, honestly, because I am too lazy to type in the \mathbb part all the time). Here are two ways of stating general associativity in $R$:

First way: Here is a mild variation on the way to state "general associativity" that you have suggested: We can define, for every positive integer $n$, a map $\operatorname{sum}_n : R^n \to R$; we define it by induction over $n$:

  • The map $\operatorname{sum}_1 : R^1 \to R$ is defined as the map sending each $\left(r\right) \in R^1$ to $r$.

  • If $\operatorname{sum}_n : R^n \to R$ is defined for some $n$, then $\operatorname{sum}_{n+1} : R^{n+1} \to R$ is defined by $\operatorname{sum}_{n+1} \left(a_1,a_2,...,a_{n+1}\right) = \operatorname{sum}_n \left(a_1,a_2,...,a_n\right) + a_{n+1}$.

Then, "general associativity" states that whenever $n$ and $k$ are integers such that $0 < k < n$, and whenever $\left(r_1,r_2,...,r_n\right)\in R^n$ is an $n$-tuple, we have

$\operatorname{sum}_n \left(r_1,r_2,...,r_n\right) = \operatorname{sum}_k \left(r_1,r_2,...,r_k\right) + \operatorname{sum}_{n-k} \left(r_{k+1},r_{k+2},...,r_n\right)$.
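If a computational restatement helps, here is a minimal Python sketch of this definition and of the identity it is claimed to satisfy (the name `sum_n` and the tuple representation are my own choices, not part of the answer):

```python
def sum_n(xs):
    """sum_1((a,)) = a, and
    sum_{n+1}(a_1,...,a_{n+1}) = sum_n(a_1,...,a_n) + a_{n+1}."""
    if len(xs) == 1:
        return xs[0]
    return sum_n(xs[:-1]) + xs[-1]

# "General associativity", first way:
# sum_n(r_1,...,r_n) = sum_k(r_1,...,r_k) + sum_{n-k}(r_{k+1},...,r_n)
# for every 0 < k < n. Check it on a sample tuple.
r = (1, 2, 3, 4, 5)
assert all(sum_n(r) == sum_n(r[:k]) + sum_n(r[k:]) for k in range(1, len(r)))
```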

So you don't like this formulation, because rather than explicitly formalizing the concept of a bracketing, it only formalizes its "first step" (i.e., it allows "moving the outermost brackets"). Before I continue, let me say that for most applications this is perfectly sufficient, as you can use induction instead of referring to some obscure notion of "bracketings", and your proofs will likely only gain in clarity. But sometimes one really has to go the whole way. This is more complicated:

Second way: The best way to formalize the concept of a bracketing (for a binary operation) is the notion of a binary tree. The Wikipedia page on this is quite confusing, and the other relevant Wikipedia pages (associahedron, Tamari lattice) are not very elementary, so let me try to provide a definition. The simplest way to define an (unlabelled) binary tree with $n$ leaves (for $n$ a positive integer) is to say that:

  • the empty set $\emptyset$ is a binary tree with $1$ leaf, called the empty binary tree (don't think of the leaves as any actual objects -- all we care about here is the number of leaves of a tree, which is just a number that we keep track of);

  • if $A$ and $B$ are two binary trees with $a$ and $b$ leaves, respectively, then the pair $\left(A,B\right)$ is a binary tree with $a+b$ leaves; this tree is said to have left child $A$ and right child $B$ (bad terminology if you ask me, since these children are used to create the tree...).

This is an inductive definition, meaning that we consider all objects which are forced to be binary trees by these axioms to be binary trees, and no others.

So each of $\emptyset$, $\left(\emptyset, \emptyset\right)$, $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$, $\left(\emptyset,\left(\left(\emptyset,\left(\emptyset,\emptyset\right)\right),\emptyset\right)\right)$ is a binary tree. The triple $\left(\emptyset,\emptyset,\emptyset\right)$ is not a binary tree, because we have no axiom that could make a triple a binary tree.
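For readers who like to see such inductive definitions as data, here is a hedged Python sketch (the representation is my own choice: `None` plays the role of $\emptyset$, and ordinary pairs play the role of pairs):

```python
def leaves(t):
    """Number of leaves: the empty tree has 1 leaf; a pair (A, B) has
    leaves(A) + leaves(B) leaves, exactly as in the definition above."""
    if t is None:
        return 1
    A, B = t
    return leaves(A) + leaves(B)

# The four examples of binary trees given above:
assert leaves(None) == 1
assert leaves((None, None)) == 2
assert leaves(((None, None), None)) == 3
assert leaves((None, ((None, (None, None)), None))) == 5
```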

(Caveat lector: Binary trees are not trees in the sense of graph theory, and actually not trees in the sense of computer science either. Normally, "trees" are not allowed to be empty, and there is no distinction between left and right children. Here the empty tree exists and so does the left-right distinction: (for example) the binary trees $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$ and $\left(\emptyset,\left(\emptyset, \emptyset\right)\right)$ are not the same.)

Of course, binary trees are so called because they can be drawn as trees. To draw the empty binary tree, draw a single dot, which is called the "root" of the tree. To draw a tree of the form $\left(A,B\right)$, you need to draw a dot, which stands for the "root" of this tree, and then one edge from this "root" that goes southwest and another edge from this "root" that goes southeast. Now, draw the tree $A$ (this is a recursive algorithm) in such a way that its "root" is placed at the end of the southwest edge. Also, draw the tree $B$ in such a way that its "root" is placed at the end of the southeast edge.

Sadly I cannot draw here, but if you look at this image and pretend that $\alpha$, $\beta$, $\gamma$ and the circles are dots, then the left binary tree is $\left(\emptyset, \left(\emptyset, \emptyset\right)\right)$, and the right binary tree is $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$. Also, in this picture (think of the halfmoons/bananas as dots too) you see all five possible binary trees with $4$ leaves. (Yes, the "leaves" are the halfmoons/bananas; they are dots from which no edges go down.)

Now, to any positive integer $n$ and any binary tree $T$ with $n$ leaves, we can assign a map $s_T : R^n \to R$; it is defined (recursively) as follows:

  • If $T$ is the empty binary tree (so that $n=1$), then $s_T : R^1 \to R$ is the map sending each $\left(r\right)$ to $r$.

  • If $T$ has the form $\left(A,B\right)$ for two binary trees $A$ and $B$ having respectively $a$ and $b$ leaves, then we already have maps $s_A : R^a \to R$ and $s_B : R^b \to R$ (this is an inductive definition), and we need to define a map $s_T : R^{a+b} \to R$. Define it by $s_T\left(r_1,r_2,...,r_{a+b}\right) = s_A\left(r_1,r_2,...,r_a\right) + s_B\left(r_{a+1},r_{a+2},...,r_{a+b}\right)$.

With the maps $s_T$ thus defined, "general associativity" says that for any fixed positive integer $n$, the maps $s_T : R^n \to R$ for all possible binary trees $T$ with $n$ leaves are the same.
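To see this claim mechanically, here is a hedged Python sketch in the same `None`/pair representation as above (the names `s_T` and `trees` are mine): it computes $s_T$ by the recursion just given, enumerates all binary trees with $n$ leaves by brute force, and checks that they all produce the same value.

```python
def leaves(t):
    """Leaf count, as in the previous sketch."""
    return 1 if t is None else leaves(t[0]) + leaves(t[1])

def s_T(t, rs):
    """If T is empty, send (r,) to r; if T = (A, B) with a and b leaves,
    send (r_1,...,r_{a+b}) to s_A(r_1,...,r_a) + s_B(r_{a+1},...,r_{a+b})."""
    if t is None:
        return rs[0]
    A, B = t
    a = leaves(A)
    return s_T(A, rs[:a]) + s_T(B, rs[a:])

def trees(n):
    """All binary trees with n leaves; there are Catalan(n-1) of them."""
    if n == 1:
        return [None]
    return [(A, B) for k in range(1, n)
                   for A in trees(k) for B in trees(n - k)]

# For n = 4 there are 5 trees, and (since + is associative) all five
# maps s_T agree on any input.
r = (1, 2, 3, 4)
assert len(trees(4)) == 5
assert len({s_T(t, r) for t in trees(4)}) == 1
```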

Let us see what this means for $n = 1$. There is only one binary tree with $1$ leaf, namely the empty tree, and the corresponding map sends each $\left(r_1\right) \in R^1$ to $r_1$. So the statement we are making is trivial for $n = 1$.

Let us see what "general associativity" means for $n = 2$. There is only one binary tree with $2$ leaves, namely the tree $\left(\emptyset,\emptyset\right)$, and the corresponding map sends each $\left(r_1,r_2\right) \in R^2$ to $r_1 + r_2$. Again the statement is trivial.

The first nontrivial consequence is what we get for $n = 3$. There are two binary trees with $3$ leaves, namely $\left(\emptyset, \left(\emptyset, \emptyset\right)\right)$ and $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$. The $s_T$ for the first of these trees is the map sending each $\left(r_1,r_2,r_3\right) \in R^3$ to $r_1 + \left(r_2+r_3\right)$. The $s_T$ for the second of these trees is the map sending each $\left(r_1,r_2,r_3\right) \in R^3$ to $\left(r_1+r_2\right) + r_3$. So "general associativity" for $n=3$ yields that $r_1 + \left(r_2+r_3\right) = \left(r_1+r_2\right) + r_3$ for all $r_1,r_2,r_3 \in R$. This is just the usual associativity law.

For $n = 4$, there are five binary trees with $4$ leaves, which I list here along with the corresponding maps $s_T$:

$\left(\emptyset, \left(\emptyset, \left(\emptyset,\emptyset\right)\right)\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $r_1+\left(r_2+\left(r_3+r_4\right)\right)$;

$\left(\emptyset, \left(\left(\emptyset,\emptyset\right), \emptyset\right)\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $r_1+\left(\left(r_2+r_3\right)+r_4\right)$;

$\left(\left(\emptyset, \emptyset\right), \left(\emptyset,\emptyset\right)\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $\left(r_1+r_2\right)+\left(r_3+r_4\right)$;

$\left(\left(\emptyset, \left(\emptyset,\emptyset\right)\right), \emptyset\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $\left(r_1+\left(r_2+r_3\right)\right)+r_4$;

$\left(\left(\left(\emptyset,\emptyset\right), \emptyset\right), \emptyset\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $\left(\left(r_1+r_2\right)+r_3\right)+r_4$.

So "general associativity" for $n=4$ says that

$r_1+\left(r_2+\left(r_3+r_4\right)\right) = r_1+\left(\left(r_2+r_3\right)+r_4\right) = \left(r_1+r_2\right)+\left(r_3+r_4\right) = \left(r_1+\left(r_2+r_3\right)\right)+r_4 = \left(\left(r_1+r_2\right)+r_3\right)+r_4$.

In general, the "bracketings" of $n$ symbols using a fixed binary operation correspond to the binary trees with $n$ leaves.

Now this got damn long and nowhere near readable. Can any LaTeX wizard step in and suggest a simple drawing package that does work on math.stackexchange? It doesn't take much to draw a tree, and it should clear up a lot...

EDIT: Here is a third way, which really is completely equivalent to the second way, but has the advantage of a less confusing definition. The idea is to use Dyck words instead of trees. (The Wikipedia page on Catalan numbers actually provides a good synopsis of what I am saying -- and some good pictures of binary trees. It calls "full binary tree" what I call "binary tree".)

Third way: What is a Dyck word? In combinatorics, "word" is just a fancy word (oops) for "tuple" (usually finite), and the "letters" of the word simply mean the entries of the tuple. Unless there is risk of confusion with actual products (or numbers), one abbreviates a tuple (or word) $\left(a_1,a_2,...,a_k\right)$ as $a_1a_2...a_k$. So the word $\left(0,1,1,1,0\right)$ is written as $01110$.

Now, let $n \in \mathbb{N}$. A Dyck word of length $2n$ is defined to be a word whose letters are $0$'s and $1$'s, each appearing $n$ times in total (so the word has altogether $2n$ letters), such that for every $i$, there are at least as many $0$'s among the first $i$ letters as there are $1$'s among them. So, for example, $0110$ is not a Dyck word, because among the first $3$ letters there are fewer $0$'s than $1$'s (namely, one $0$ and two $1$'s). Also, $10001$ is not a Dyck word, because among the first $1$ letter there are fewer $0$'s than $1$'s (no $0$ at all and one $1$). Also, $0100$ is not Dyck because its total number of $0$'s does not equal its total number of $1$'s. But $001011$ is a Dyck word (of length $2\cdot 3=6$), as you can easily check: read the word from left to right, keeping track of how many $0$'s and how many $1$'s you have encountered; if the count for $1$'s overtakes the count for $0$'s, then your word is not Dyck; if both counts are equal at the end, then it is Dyck; otherwise it is not Dyck.
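The checking procedure at the end of that paragraph is easy to state as code; here is a hedged Python sketch (representing a word as a tuple of $0$'s and $1$'s, which is my own choice):

```python
def is_dyck(w):
    """Check the Dyck condition by the left-to-right scan described above:
    the running count of 1's must never overtake the running count of 0's,
    and the two counts must be equal at the end."""
    balance = 0                 # (number of 0's seen) - (number of 1's seen)
    for letter in w:
        balance += 1 if letter == 0 else -1
        if balance < 0:         # the 1's have overtaken the 0's
            return False
    return balance == 0

assert not is_dyck((0, 1, 1, 0))       # 0110: fails after the first 3 letters
assert not is_dyck((1, 0, 0, 0, 1))    # 10001: fails at the first letter
assert not is_dyck((0, 1, 0, 0))       # 0100: counts unequal at the end
assert is_dyck((0, 0, 1, 0, 1, 1))     # 001011: a Dyck word
```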

(Wikipedia uses the two letters "X" and "Y" instead of $0$ and $1$; other than this, there is no difference.)

Now, to any nonnegative integer $n$ and any Dyck word $w$ of length $2n$, we can assign a map $s_w : R^{n+1} \to R$; it is defined (recursively) as follows:

  • If $n = 0$ (so that the word $w$ is empty -- a $0$-tuple), then $s_w : R^1 \to R$ is the map sending each $\left(r\right)$ to $r$.

  • Otherwise, let $j$ be the smallest positive $i$ such that the number of $0$'s among the first $i$ letters of $w$ equals the number of $1$'s among the first $i$ letters of $w$. (Such an $i$ exists, because $i = 2n$ does the trick; thus, the smallest such $i$ also exists.) Notice that this $j$ must be even, because otherwise the number of $0$'s among the first $j$ letters of $w$ could not equal the number of $1$'s among these letters for parity reasons. Write $w$ as $w_1w_2...w_{2n} = \left(w_1,w_2,...,w_{2n}\right)$. (Then, it is easy to see that $w_1 = 0$ and $w_j = 1$.) Let $u$ be the word $w_2w_3...w_{j-1}$, and let $v$ be the word $w_{j+1}w_{j+2}...w_{2n}$. (It is easy to see that both $u$ and $v$ are Dyck words.) Then, we define a map $s_w : R^{n+1} \to R$ by $s_w\left(r_1,r_2,...,r_{n+1}\right) = s_u\left(r_1,r_2,...,r_{j/2}\right) + s_v\left(r_{j/2+1},r_{j/2+2},...,r_{n+1}\right)$.

With the maps $s_w$ thus defined, "general associativity" says that for any fixed positive integer $n$, the maps $s_w : R^{n+1} \to R$ for all possible Dyck words of length $2n$ are the same.
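Here is a hedged Python sketch of this recursion (again with a Dyck word represented as a tuple of $0$'s and $1$'s, a choice of mine):

```python
def s_w(w, rs):
    """Split w as 0 u 1 v at the first balance point j; then
    s_w(r_1,...,r_{n+1}) = s_u(r_1,...,r_{j/2}) + s_v(r_{j/2+1},...,r_{n+1})."""
    if not w:                       # n = 0: the empty word sends (r,) to r
        return rs[0]
    balance = 0
    for j, letter in enumerate(w, start=1):
        balance += 1 if letter == 0 else -1
        if balance == 0:            # j is now the smallest balance point
            break
    u, v = w[1:j - 1], w[j:]
    return s_w(u, rs[:j // 2]) + s_w(v, rs[j // 2:])

# The two Dyck words of length 4, computed below by hand in the answer:
assert s_w((0, 0, 1, 1), (1, 2, 3)) == (1 + 2) + 3
assert s_w((0, 1, 0, 1), (1, 2, 3)) == 1 + (2 + 3)
```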

Let us see what this means for $n = 1$. We have only one Dyck word of length $2\cdot 1 = 2$, namely $01$. Let this word be $w$. Looking up the definition of $s_w$, we see that $u$ and $v$ both are the empty word. Hence, every $\left(r_1,r_2\right) \in R^2$ satisfies $s_w\left(r_1,r_2\right) = s_u\left(r_1\right) + s_v\left(r_2\right) = r_1+r_2$. Of course, "general associativity" says nothing insightful, because there is only one $w$.

Let us see what we get for $n = 2$. We have two Dyck words of length $2\cdot 2 = 4$, namely $0011$ and $0101$.

Let $w$ be the Dyck word $0011$. Then, in our above definition of $s_w$, we have $j = 4$, $u = 01$ and $v = \left(\text{empty word}\right)$. Hence, every $\left(r_1,r_2,r_3\right) \in R^3$ satisfies

$s_w\left(r_1,r_2,r_3\right) = \underbrace{s_u\left(r_1,r_2\right)}_{=r_1+r_2} + \underbrace{s_v\left(r_3\right)}_{=r_3} = \left(r_1+r_2\right) + r_3$.

Similarly, if $w$ is the Dyck word $0101$, then $s_w\left(r_1,r_2,r_3\right) = r_1 + \left(r_2+r_3\right)$. So "general associativity" for $n = 2$ claims $\left(r_1+r_2\right) + r_3 = r_1 + \left(r_2+r_3\right)$, which is exactly what you would expect (except that the $n$ here is shifted by $1$ compared to the second way above).

You asked for references. Unfortunately I know no better than to google for "bracketings and associativity", and it is very hard to find reader-friendly treatments of these formalization issues. There is Huang and Tamari's famous Problems of associativity paper, but this is not about formalizing associativity; it is about the combinatorics of the bracketings (or Dyck words, or binary trees). A whole book dedicated to this kind of combinatorics (Associahedra, Tamari Lattices and Related Structures) was published in 2012, and you can find all (or most?) of its chapters on the internet (for example, Ross Street's Parenthetic Remarks) if you are interested. Sadly, this is again on a level where questions like "how is a bracketing defined?" appear too trivial.

These foundational issues are usually ignored when they arise in elementary algebra, because authors are too lazy to discuss them or do not want to burden their readers with the large amount of formalism and pedantry involved in such a discussion. But people sometimes do care about them when they reappear in the theory of tensor categories (aka monoidal categories), because they are less easy to argue away in that situation (and the First way shown above is often not enough for tensor categories); many nevertheless try. Mac Lane, in his Natural associativity and commutativity, uses "iterates" of the associativity morphism in a category to encode our bracketings. It is not very explicit, but his paper seems fairly well written (though you have to know some category theory...).

I should also say that a few good algebra texts prove "general associativity" formalized according to the First way I sketched above, or in a similar vein. See, for instance, Theorem 2 in §I.1 of Claude Chevalley, Fundamental Concepts of Algebra, 1956, which gives a slightly stronger version of the First way.


The meaning of “associativity of the binary operation $\cdot$ on the set $A$ holds for $k$ (items)” is (as Wardlaw writes) that “[If $a_1, a_2, \dots,a_k$ are elements of $A$, then] any bracketing of $a_1,a_2,\dots,a_k$ equals the standard form.”

When we say a binary operation is associative (without mentioning “for $k$”), we mean it’s associative for all positive integers $k$. That means we can write $\prod\limits_{i=1}^k a_i$ and know that (regardless of how big $k$ is) it’s well-defined, because the way in which we evaluate it doesn’t matter.

The standard form in the definition is (or can be taken to be) Wardlaw’s left-associative product, $$\left(\dots\left(\left(a_1\cdot a_2\right)\cdot a_3\right)\dots\cdot a_k\right),$$

which is the element of $A$ obtained by finding $a_1\cdot a_2$, multiplying it by $a_3$, and so on, up to a final product with $a_k$.
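In programming terms, this standard form is exactly a left fold. A minimal sketch in Python (`functools.reduce` folds from the left):

```python
from functools import reduce
from operator import mul

# ((...((a1*a2)*a3)...)*ak): reduce combines a running result with the
# next element, left to right -- precisely the left-associative product.
a = [2, 3, 4, 5]
assert reduce(mul, a) == (((2 * 3) * 4) * 5)
```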

For the definition to be entirely clear, one should also know what’s meant by any bracketing.

“Any bracketing” means the result of any one of the possible sequences of steps that transform $a_1,a_2,\dots,a_k$ into an element of $A$ using $k-1$ applications of $\cdot$.

Bracketings are usually expressed with parenthesization. Example: for $k=5$, here is one of the possible bracketings:

$$\left(\left(a_1\cdot a_2\right)\cdot \left(\left(a_3\cdot a_4\right)\cdot a_5\right)\right).$$

This is the element of $A$ obtained by applying $\cdot\,$ according to the parenthesization. (Where two applications are performed on the same line, the result doesn’t depend on the order in which those evaluations are made.)

$$ \begin{align} \left(\left(a_1\cdot a_2\right)\cdot \left(\left(a_3\cdot a_4\right)\cdot a_5\right)\right)&= \overbrace{(a_1\cdot a_2)}^{\mathrm{Let}\, v=a_1\cdot a_2} \cdot \left(\overbrace{(a_3\cdot a_4)}^{\mathrm{Let}\, w=a_3\cdot a_4}\cdot a_5\right)\\ &= v\cdot\overbrace{(w\cdot a_5)}^{\mathrm{Let}\, x=w\cdot a_5}\\ &=\overbrace{\left(v\cdot x\right)}^{\mathrm{Let}\, z=v\cdot x}\\ &=z. \end{align} $$

By the way, so long as a “standard form” for bracketing $a_1,a_2,\dots,a_k$ is well-defined, its specific form is irrelevant: if all bracketings of $a_1,a_2,\dots,a_k$ are equal to the same specific one of the possible bracketings, then associativity holds.

Does that help?


To put the question in a broader perspective, it all comes down to an unfortunate notational convention.

A binary operation is just a function $f:X \times X \rightarrow X$, where $X$ is a set, EXCEPT that it comes with the peculiar convention of being written in "infix notation", like $1+2$, rather than in ordinary function notation like, say, $\operatorname{add}(1,2)$.

"Bracketing" is the price we pay for using infix notation, since without bracketing, expressions like 12 / 6 / 2 are ambiguous $-$ does that mean (12/6)/2, which equals 1, or 12/(6/2), which equals 4? And since we don't like to bother writing so many parentheses all the time, we even go through the trouble of establishing conventional rules about order of operations: The rules that say that 1+3*4 implicitly means 1+(3*4) and 5 - 4 - 3 means (5-4)-3. In elementary school you had to pay the price of learning all these additional rules.

If we all used ordinary function notation instead, we wouldn't need bracketing or conventions about precedence: Instead of (9 - 15) / 3 we just write $\operatorname{divide}( \operatorname{subtract}(9,15), 3)$. Instead of a complicated "bracketing" like $$(a_0+(a_1+a_2))+((a_3+a_4)+(a_5+a_6)),$$ we would just write $$\operatorname{add}(\operatorname{add}(a_0,\operatorname{add}(a_1,a_2)),\operatorname{add}(\operatorname{add}(a_3,a_4),\operatorname{add}(a_5,a_6))).$$

You can treat this as the formal definition. (You can even create an algorithm to do this kind of conversion automatically; algorithms of this kind are well-known.) In fact, to take things a little further, the parentheses and commas aren't even needed. We could just write

/ - 9 15 3
+ + a0 + a1 a2 + + a3 a4 + a5 a6

for the two examples above. There is no ambiguity. (Make sure you see why.) This kind of notation, which requires no parentheses, is called "prefix" or "Polish" notation. Again, it's really just ordinary function notation except we recognized that we don't need to bother writing out all the parentheses and commas.

(There is also "postfix" or "reverse Polish" notation, where (9-15)/3 would be expressed as 9 15 - 3 /. For the purpose of algorithm computation, postfix notation is the much more natural choice. But we'll stick to prefix since it closely corresponds to familiar function notation.)

In the following examples, the first two constitute all the possible "bracketings" of a binary operation $f$ over $a_0,a_1,a_2$; the third is not legal:

f a0 f a1 a2
f f a0 a1 a2
f a0 a1 f a2

Again, the first is equivalent to $f(a_0, f(a_1,a_2))$ and the second is equivalent to $f(f(a_0,a_1),a_2)$. The third cannot be parsed. (Aside: In general, what's an easy way to tell whether an expression with f's and a's is legal? Read the list from left to right. As you read, keep track of how many f's and how many a's you've seen so far. The running count of a's should be greater than the running count of f's precisely when you reach the end of the list, and no sooner.)
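The rule in the aside translates directly into code; a hedged Python sketch:

```python
def is_legal(tokens):
    """Legality check for a prefix expression over a binary f and atoms a:
    scanning left to right, the count of a's must exceed the count of f's
    exactly at the end of the list, and no sooner."""
    f_count = a_count = 0
    for i, tok in enumerate(tokens):
        if tok == 'f':
            f_count += 1
        else:
            a_count += 1
        if a_count > f_count:              # the expression is complete here...
            return i == len(tokens) - 1    # ...which must be the very end
    return False                           # never completed

assert is_legal("f a0 f a1 a2".split())
assert is_legal("f f a0 a1 a2".split())
assert not is_legal("f a0 a1 f a2".split())
```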

It's now clear what "any possible bracketing" means: It's any way to intersperse $n$ copies of $f$ among $a_0, a_1, ..., a_n$ (while keeping $a_0, a_1, ..., a_n$ in the same order) to form a legal expression in prefix notation.

Parentheses are just a clumsy way of notating what function notation expresses just as well (no, better).


We've shown that bracketing is just an awkward consequence of using infix notation. But infix notation isn't just a load of crap, as I've made it seem. There's more to it than that. In fact, infix notation is a terrific convention, in the following sense:

In prefix notation, if I want to write $A*B*C$, I have to choose between writing either "* A * B C" or "* * A B C". But if $*$ is an associative operation, then both ways are equal. In other words, the prefix notation is now forcing me to make a distinction between two orders of evaluation that I shouldn't have to distinguish between. Notation should not force me to make irrelevant distinctions. That's the advantage of writing $A*B*C$: it leaves unspecified whether I mean $(A*B)*C$ or $A*(B*C)$, which is a good thing because the distinction is irrelevant. Thus infix notation makes sense for $*$ and for many of our everyday operations too, which are often associative.

You're trying to prove general associativity, which, as you recognized, requires an understanding of what "bracketing" means. But bracketing is a relic of infix notation, which in turn is a notation intended to hide the bracketing for associative operations! That's why thinking in terms of prefix notation is the way to go here, where (a priori) we're not dealing with a general-associative operation.