Is the Jaccard distance a distance?

The trick is to use a transform called the Steinhaus Transform. Given a metric $(X, D)$ and a fixed point $a \in X$, you can define a new distance $D'$ as $$D'(x,y) = \frac{2D(x,y)}{D(x,a) + D(y,a) + D(x,y)}.$$ It's known that this transformation produces a metric from a metric. Now if you take as the base metric $D$ the size of the symmetric difference between two sets, what you end up with is the Jaccard distance (which is also known by many other names).
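To see this concretely, here is a small Python sketch (function names are mine, purely illustrative): it applies the Steinhaus transform to the symmetric-difference-size metric $D(X,Y)=|X \,\triangle\, Y|$ with base point $a=\emptyset$, and compares the result with the usual Jaccard distance $1 - |X\cap Y|/|X\cup Y|$.

```python
def sym_diff(x, y):
    """Size of the symmetric difference: a metric on finite sets."""
    return len(x ^ y)

def steinhaus(d, a):
    """Steinhaus transform of metric d with base point a."""
    def dprime(x, y):
        denom = d(x, a) + d(y, a) + d(x, y)
        return 2 * d(x, y) / denom if denom else 0.0
    return dprime

def jaccard_distance(x, y):
    union = x | y
    return 1 - len(x & y) / len(union) if union else 0.0

# Transform with base point a = empty set, so d(X, a) = |X|.
d = steinhaus(sym_diff, frozenset())
X, Y = frozenset({1, 2, 3}), frozenset({2, 3, 4, 5})
print(d(X, Y), jaccard_distance(X, Y))  # both 0.6
```

Here $|X\triangle Y|=3$ and the denominator is $|X|+|Y|+|X\triangle Y| = 3+4+3 = 10$, matching $1-2/5 = 0.6$; in general the denominator simplifies to $2|X\cup Y|$, which is exactly why the transform of the symmetric-difference metric is the Jaccard distance.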

For more information and references, check out Section 2.3 of Ken Clarkson's survey "Nearest-neighbor searching and metric space dimensions".


Here is an elementary proof that the Steinhaus transform preserves metricity (from which the metricity of the Jaccard distance follows as a special case, as noted in Suresh's answer).

Lemma. Let $p,q > 0$ and $r\geq 0$ such that $p \le q$. Then, $\frac{p}{q} \le \frac{p+r}{q+r}.$

Proof. Cross-multiplying, the claim is equivalent to $p(q+r) \le q(p+r)$, i.e., $pr \le qr$, which holds since $p \le q$ and $r \ge 0$. $\square$

Corollary. Let $d(x,y)$ be a metric. Then, for arbitrary (but fixed) $a$, the map $\delta$ defined by \begin{equation*} \delta(x,y) := \frac{2d(x,y)}{d(x,a)+d(y,a)+d(x,y)} \end{equation*} (and $\delta(a,a)=0$) is a metric.

Proof. Only the triangle inequality for $\delta$ is nontrivial. Let $p=d(x,y)$, $q=d(x,y)+d(x,a)+d(y,a)$, and $r=d(x,z)+d(y,z)-d(x,y)$; note that $r \ge 0$ by the triangle inequality for $d$, and $p \le q$. Applying the lemma (and doubling), we obtain \begin{eqnarray*} \delta(x,y) &=& \frac{2d(x,y)}{d(x,a)+d(y,a)+d(x,y)} \le \frac{2d(x,z)+2d(y,z)}{d(x,a)+d(y,a)+d(x,z)+d(y,z)}\\ &=& \frac{2d(x,z)}{d(x,a)+d(z,a)+d(x,z)+d(y,z)+d(y,a)-d(z,a)} + \frac{2d(y,z)}{d(y,a)+d(z,a)+d(y,z)+d(x,z)+d(x,a)-d(z,a)}\\ &\le& \delta(x,z)+\delta(y,z), \end{eqnarray*} where the last inequality again uses the triangle inequality for $d$: the leftover terms $d(y,z)+d(y,a)-d(z,a)$ and $d(x,z)+d(x,a)-d(z,a)$ in the denominators are nonnegative, so dropping them can only increase each fraction. $\square$
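As a numerical sanity check of the corollary (not part of the proof, and with illustrative names of my own), the sketch below builds the transform $\delta$ from the absolute-difference metric on the reals with a fixed base point, then tests the triangle inequality on random triples:

```python
import itertools
import random

def steinhaus(d, a):
    """Steinhaus transform of metric d with base point a."""
    def delta(x, y):
        denom = d(x, a) + d(y, a) + d(x, y)
        return 2 * d(x, y) / denom if denom else 0.0
    return delta

random.seed(0)
d = lambda x, y: abs(x - y)     # a metric on the reals
delta = steinhaus(d, 0.5)       # fixed base point a = 0.5
pts = [random.random() for _ in range(40)]

# Check delta(x, y) <= delta(x, z) + delta(z, y) on all ordered triples.
for x, y, z in itertools.permutations(pts, 3):
    assert delta(x, y) <= delta(x, z) + delta(z, y) + 1e-12
print("triangle inequality holds on all sampled triples")
```

This is only evidence on sampled points, of course; the corollary is what guarantees it in general.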


Possibly the simplest proof of the triangle inequality for the Jaccard distance comes from the fact that the Jaccard similarity is the collision probability of the MinHash algorithm, and that is all we need. Let $H(X) = \text{argmin}_{i\in X} \pi(i)$, where $\pi$ is a uniformly random permutation.

\begin{align*} J(X,Y) &= \Pr\left[H(X) = H(Y)\right] \\ 1 - J(X,Y) &= \Pr\left[H(X) \neq H(Y)\right]. \end{align*} So for any $Z$, \begin{align*} \Pr\left[H(X) = H(Y)\right] &\ge \Pr\left[H(X) = H(Z) \land H(Y) = H(Z)\right] \\ \Pr\left[H(X) \neq H(Y)\right] &\le \Pr\left[H(X) \neq H(Z) \lor H(Y) \neq H(Z)\right]. \end{align*} But by the union bound, \begin{align*} \Pr\big[H(X) \neq H(Z) \lor H(Y) \neq H(Z)\big] &\le \Pr\big[H(X) \neq H(Z)\big] + \Pr\big[H(Y) \neq H(Z)\big]. \end{align*} Combining the displays gives $1 - J(X,Y) \le \big(1 - J(X,Z)\big) + \big(1 - J(Y,Z)\big)$, which is exactly the triangle inequality for the Jaccard distance.
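A quick simulation makes the collision-probability fact tangible. This Python sketch (my names, not from the answer above) draws random permutations of a small universe, computes $H$ as the argmin of $\pi$ over each set, and compares the empirical collision rate with the exact Jaccard similarity:

```python
import random

def minhash(s, perm):
    """H(S) = argmin_{i in S} pi(i), with pi given as a dict."""
    return min(s, key=perm.__getitem__)

def jaccard(x, y):
    return len(x & y) / len(x | y)

random.seed(1)
universe = list(range(20))
X, Y = {0, 1, 2, 3}, {2, 3, 4, 5}

trials = 20000
hits = 0
for _ in range(trials):
    # A uniformly random permutation of the universe, as element -> rank.
    ranks = random.sample(range(len(universe)), len(universe))
    perm = dict(zip(universe, ranks))
    hits += minhash(X, perm) == minhash(Y, perm)

print(f"collision rate {hits / trials:.3f}  vs  exact J = {jaccard(X, Y):.3f}")
```

With $|X\cap Y| = 2$ and $|X\cup Y| = 6$, the exact similarity is $1/3$, and the empirical rate should land close to it.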

My co-author used this argument to prove that a particular Jaccard generalization is a metric, after I had been struggling with the proof for a month, and I couldn't believe it.