Mathematical Explanation of the Difference between SQL Joins: Inner, Outer, Left, Right

Let $A, B$ be sets. We think of $A$ and $B$ as tables, and their elements as rows. Each element of $x\in A$ is a list of data entries, one for each column of $A$.

(Edit: WLOG assume $A$ and $B$ do not have duplicate entries. If they do, add a unique index column to each.)

Let $R$ be any relation, that is, a subset $R \subseteq A \times B$, where we write $a \sim \, b$ if $(a,b) \in R$. In SQL $R$ corresponds to the statement that appears after "ON", e.g., A.name = B.name corresponds to the relation $x \sim y$ if and only if the entry in the name column of for a row $x \in A$ is the same as the name column in a row of $y \in A$.

Then $$A \operatorname{ INNER JOIN } B \operatorname{ON} R = \{(a,b) \in A \times B \, |\, a \sim b\}\, (=R).$$

(Edit: Here $(a,b)$ represents the concatenation of the entries of rows $a$ and $b$, corresponding to SELECT * FROM A JOIN B ON R. Of course the actual output may differ depending on the implementation.)

But here, if $a \in A$ is such that there is no corresponding $b$ such that $a \sim b$, then $a$ will not appear in the join. If you take a left join, you want every $a$ to appear regardless. So you add a special element $\operatorname{NULL}$ and add it to your relation. $\operatorname{NULL}$ obeys the rules

$a \sim \operatorname{NULL}$ iff there is no $b \in B$ with $a \sim b$

$\operatorname{NULL} \sim b$ iff there is no $a \in A$ with $a \sim b$

Now let $$\hat{A} = A \cup \{\operatorname{NULL}\},$$ $$\hat{B} = B \cup \{\operatorname{NULL}\}.$$

Then we have

$$A \operatorname{ INNER JOIN } B \operatorname{ON} R = \{(a,b) \in A \times B \, | a \sim b\}$$ $$A \operatorname{ LEFT JOIN } B \operatorname{ON} R = \{(a,b) \in A \times \hat{B} \, | a \sim b\}$$ $$A \operatorname{ RIGHT JOIN } B \operatorname{ON} R = \{(a,b) \in \hat{A} \times B \, | a \sim b\}$$ $$A \operatorname{ OUTER JOIN } B \operatorname{ON} R = \{(a,b) \in \hat{A} \times \hat{B} \, | a \sim b\}.$$

Thus we'll have the pairs $(a, \operatorname{NULL})$ appear on the left join whenever $a$ doesn't match any $b$, and $(\operatorname{NULL}, b)$ whenever $b$ doesn't match any $a$ in the right join. (note that we don't have $\operatorname{NULL} \sim \operatorname{NULL}$, so we never have $(\operatorname{NULL}, \operatorname{NULL})$.)

The reason that Venn diagrams are used to depict joins is that usually joins are usually done on relations as simple as the one given above, $R$ corresponding to A.name = B.name. In that case, if $\text{names}(T)$ is the set of names that appear in a table $T$, that is, $\text{names}(T)$ = SELECT DISTINCT names FROM T, then

\begin{align*}\text{names}(A\operatorname{ INNER JOIN } B \operatorname{ON} R) &= \text{names}(A)\cap \text{names}(B) \\ \text{names}(A\operatorname{ LEFT JOIN } B \operatorname{ON} R) &= \text{names}(A)\\ \text{names}(A\operatorname{ RIGHT JOIN } B \operatorname{ON} R) &= \text{names}(B)\\ \text{names}(A\operatorname{ OUTER JOIN } B \operatorname{ON} R) &= \text{names}(A)\cup \text{names}(B).\end{align*}

However, this completely loses sight of the fact that joins may be one-to-one, many-to-one, or many-to-many, and personally I've found those Venn diagrams more confusing than helpful when learning about joins.

Mathematical Explanation of the Difference between SQL Joins: Inner, Outer, Left, Right

Tags:

Computer Science

Discrete Mathematics

Related

Recent Posts