Algorithm for pow(float, float)

Since IEEE-754 binary floating-point numbers are fractions, computing a^b is technically an algebraic operation. However the common approach to implement powf(float a, float b) is as e^{b * log a}, i.e. using transcendental functions.

There are a few caveats, however. log a is undefined for a < 0, while powf() allows computation with some negative a. Exponentiation, in the form of expf() suffers from error magnification, as I explained in my answer to this question. This requires us to compute log a with higher than single precision for an accurate powf() result. There are various techniques to achieve this, a simple way is to use limited amounts of double-float computation, references for which I provided in my answer to this question. The essence of double-float is that each floating-point operand is represented as a pair of float values called the "head" and the "tail", which satisfy the relation |tail| ≤ ½ * ulp (|head|) when properly normalized.

The c code below shows an exemplary implementation of this approach. It assumes that the IEEE 754-2008 operation FMA (fused multiply-add) is available, which is exposed in C as the standard math functions fma(), fmaf(). It does not provide for handling of errno or floating-point exceptions, but it does provide for the correct handling of all 18 special cases enumerated by the ISO C standard. Tests have been performed with denormal support enabled; the code may or may not work properly within a non-IEEE-754 flush-to-zero (FTZ) environment.

The exponentiation part employs a simple argument reduction based directly on the semi-logarithmic encoding of floating-point numbers, then applies a polynomial minimax approximation on the primary approximation interval. The logarithm computation is based on log(x) = 2 atanh ((x-1) / (x+1)), combined with selective use of double-float computation, to achieve a relative accuracy of 8.3e-10. The computation of b * log a is performed as a double-float operation, the accuracy of the final exponentiation is improved by linear interpolation, by observing that e^x is its own derivative, and that therefore e^x+y ≅ e^x + y ⋅ e^x, when |y| ≪ |x|.

Double-float computation becomes a bit iffy near overflow boundaries, there are two instances of this in the code. When the head portion of the input to exp is just causing the result to overflow to infinity, the tail portion may be negative, such that the result of powf() is actually finite. One way to address this is to decrease the value of the "head" by one ulp in such a case, and alternative is to compute the head via addition in round-to-zero mode where readily available, since this will ensure like signs for head and tail. The other caveat is that if the exponentiation does overflow, we cannot interpolate the result, as doing so would create a NaN.

It should be noted that the accuracy of the logarithm computation used here is not sufficient to ensure a faithfully-rounded powf() implementation, but it provides a reasonably small error (the maximum error I have found in extensive testing is less than 2 ulps) and it allows the code to be kept reasonably simple for the purpose of demonstrating relevant design principles.

#include <math.h> // for helper functions, e.g frexpf(), ldexpf(), isinff(), nextafterf()

/* Compute log(a) with extended precision, returned as a double-float value 
   loghi:loglo. Maximum relative error about 8.36545e-10.
*/
void my_logf_ext (float a, float *loghi, float *loglo)
{
    const float LOG2_HI =  0x1.62e430p-1f;  //  6.93147182e-1
    const float LOG2_LO = -0x1.05c610p-29f; // -1.90465421e-9

    float m, r, i, s, t, p, qhi, qlo;
    int e;

    /* Reduce argument to m in [181/256, 362/256] */
    m = frexpf (a, &e);
    if (m < 0.70703125f) { /* 181/256 */
        m = m + m;
        e = e - 1;
    }
    i = (float)e;

    /* Compute q = (m-1)/(m+1) as a double-float qhi:qlo */
    p = m + 1.0f;
    m = m - 1.0f;
    r = 1.0f / p;
    qhi = m * r;
    t = fmaf (qhi, -2.0f, m);
    s = fmaf (qhi, -m, t);
    qlo = r * s;

    /* Approximate atanh(q), q in [-75/437, 53/309] */ 
    s = qhi * qhi;
    r =             0x1.08c000p-3f;  // 0.1292724609
    r = fmaf (r, s, 0x1.22cde0p-3f); // 0.1419942379
    r = fmaf (r, s, 0x1.99a160p-3f); // 0.2000148296
    r = fmaf (r, s, 0x1.555550p-2f); // 0.3333332539
    t = fmaf (qhi, qlo + qlo, fmaf (qhi, qhi, -s)); // s:t = (qhi:qlo)**2
    p = s * qhi;
    t = fmaf (s, qlo, fmaf (t, qhi, fmaf (s, qhi, -p))); // p:t = (qhi:qlo)**3
    s = fmaf (r, p, fmaf (r, t, qlo));

    /* log(a) = 2 * atanh(q) + i * log(2) */
    t = fmaf ( LOG2_HI * 0.5f, i, qhi);
    p = fmaf (-LOG2_HI * 0.5f, i, t);
    s = (qhi - p) + s;                        
    s = fmaf ( LOG2_LO * 0.5f, i, s);
    r = t + t;
    *loghi = t = fmaf (2.0f, s, r);
    *loglo = fmaf (2.0f, s, r - t);
}

/* Compute exp(a). Maximum ulp error = 0.86565 */
float my_expf_unchecked (float a)
{
    float f, r;
    int i;

    // exp(a) = exp(i + f); i = rint (a / log(2))
    r = fmaf (0x1.715476p0f, a, 0x1.8p23f) - 0x1.8p23f; // 1.442695, 12582912.0
    f = fmaf (r, -0x1.62e400p-01f, a); // log_2_hi // -6.93145752e-1
    f = fmaf (r, -0x1.7f7d1cp-20f, f); // log_2_lo // -1.42860677e-6
    i = (int)r;
    // approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
    r =             0x1.694000p-10f; // 1.37805939e-3
    r = fmaf (r, f, 0x1.125edcp-7f); // 8.37312452e-3
    r = fmaf (r, f, 0x1.555b5ap-5f); // 4.16695364e-2
    r = fmaf (r, f, 0x1.555450p-3f); // 1.66664720e-1
    r = fmaf (r, f, 0x1.fffff6p-2f); // 4.99999851e-1
    r = fmaf (r, f, 0x1.000000p+0f); // 1.00000000e+0
    r = fmaf (r, f, 0x1.000000p+0f); // 1.00000000e+0
    // exp(a) = 2**i * exp(f);
    r = ldexpf (r, i);
    return r;
}

/* a**b = exp (b * log (a)), where a > 0, and log(a) is computed with extended 
   precision as a double-float.  The maximum ulp error found so far is 1.95083
   ulp for a = 0.698779583, b = 224.566528
*/
float my_powf_core (float a, float b)
{
    const float MAX_IEEE754_FLT = 0x1.fffffep127f;
    const float EXP_OVFL_BOUND = 0x1.62e430p+6f;
    const float EXP_OVFL_UNFL_F = 104.0f;
    const float MY_INF_F = 1.0f / 0.0f;
    float lhi, llo, thi, tlo, phi, plo, r;

    /* compute lhi:llo = log(a) */
    my_logf_ext (a, &lhi, &llo);
    /* compute phi:plo = b * log(a) */
    thi = lhi * b;
    if (fabsf (thi) > EXP_OVFL_UNFL_F) { // definitely overflow / underflow
        r = (thi < 0.0f) ? 0.0f : MY_INF_F;
    } else {
        tlo = fmaf (lhi, b, -thi);
        tlo = fmaf (llo, b, +tlo);
        /* normalize intermediate result thi:tlo, giving final result phi:plo */
#if FAST_FADD_RZ
        phi = __fadd_rz (thi, tlo);// avoid premature ovfl in exp() computation
#else // FAST_FADD_RZ
        phi = thi + tlo;
        if (phi == EXP_OVFL_BOUND){// avoid premature ovfl in exp() computation
            phi = nextafterf (phi, 0.0f);
        }
#endif // FAST_FADD_RZ
        plo = (thi - phi) + tlo;
        /* exp'(x) = exp(x); exp(x+y) = exp(x) + exp(x) * y, for |y| << |x| */
        r = my_expf_unchecked (phi);
        /* prevent generation of NaN during interpolation due to r = INF */
        if (fabsf (r) <= MAX_IEEE754_FLT) {
            r = fmaf (plo, r, r);
        }
    }
    return r;
}

float my_powf (float a, float b)
{
    const float MY_NAN_F = 0.0f / 0.0f;
    const float MY_INF_F = 1.0f / 0.0f;
    int expo_odd_int;
    float r;

    /* special case handling per ISO C specification */
    expo_odd_int = fmaf (-2.0f, floorf (0.5f * b), b) == 1.0f;
    if ((a == 1.0f) || (b == 0.0f)) {
        r = 1.0f;
    } else if (isnanf (a) || isnanf (b)) {
        r = a + b;  // convert SNaN to QNanN or trigger exception
    } else if (isinff (b)) {
        r = ((fabsf (a) < 1.0f) != (b < 0.0f)) ? 0.0f :  MY_INF_F;
        if (a == -1.0f) r = 1.0f;
    } else if (isinff (a)) {
        r = (b < 0.0f) ? 0.0f : MY_INF_F;
        if ((a < 0.0f) && expo_odd_int) r = -r;
    } else if (a == 0.0f) {
        r = (expo_odd_int) ? (a + a) : 0.0f;
        if (b < 0.0f) r = copysignf (MY_INF_F, r);
    } else if ((a < 0.0f) && (b != floorf (b))) {
        r = MY_NAN_F;
    } else {
        r = my_powf_core (fabsf (a), b);
        if ((a < 0.0f) && expo_odd_int) {
            r = -r;
        }
    }
    return r;
}

The general algorithm tends to be computing the float power as the combination of the integer power and the remaining root. The integer power is fairly straightforward, the root can be computed using either Newton - Raphson method or Taylor series. IIRC numerical recipes in C has some text on this. There are other (potentially better) methods for doing this too, but this would make a reasonable starting point for what is a surprisingly complex problem to implement. Note also that some implementations use lookup tables and a number of tricks to reduce the computation required.

Floating Point

Algorithm

Math

Algorithm for pow(float, float)

Tags:

Related

Recent Posts