Why it is more accurate to evaluate $x^2-y^2$ as $(x+y)(x-y)$ in floating point system?

Each single elementary operation has a truncation error of at most $1/2$ ulp (unit in the last place), which is about a relative error of less than $\mu=2^{-53}$. Let's express the floating point realizations of the two expressions using these relative errors $|δ_i|\le \mu$.

The errors are \begin{align} fl(fl(x^2)-fl(y^2))&=(x^2(1+δ_1)-y^2(1+δ_2))(1+δ_3) \\ &=(x^2-y^2)+x^2δ_1-y^2δ_2+(x^2-y^2)δ_3+\text{higher order terms} \end{align} which has as first order upper bound $[x^2+y^2+|x^2-y^2|]\mu=2\max(x^2,y^2)\mu$, and $$ fl(fl(x+y)fl(x-y))=((x+y)(1+δ_1)\,(x-y)(1+δ_2))(1+δ_3) $$ where the upper bound for the first order terms is $3 |x^2-y^2| μ$.

To demonstrate that this calculation is indeed reflected in actual floating point arithmetic, consider $x\in [1,2]$ and $y=x+ε$ with $ε$ fixed to some small value, here $\varepsilon=7\cdot 2^{-m}$ for some $m$ in the upper range of the mantissa length. Then the exact difference is $2εx+ε^2$. The error bound for the first formula now appears as $2x^2\mu$ and for the second formula as $6xε\mu$.

The plots show that the derived error bounds are valid and rather tight for both formulas, the error of the second formula is only really visible in the logarithmic plot and is zero for all practical purposes.

enter image description here

Why it is more accurate to evaluate $x^2-y^2$ as $(x+y)(x-y)$ in floating point system?

Tags:

Numerical Methods

Related

Recent Posts