How to convert a floating point algorithm to fixed point?

The basic idea for a lookup table is simple -- you use the fixed point value as an index into an array to look up the value. The problem is that if your fixed point values are large, your tables become huge. For a full table with a 32-bit FP type you need 4*2^32 bytes (16GB), which is impractically large. So what you generally do is use a smaller table (smaller by a factor of N) and then linearly interpolate between two values in the table to do the lookup.

In your case, you appear to want to use a 2^23 reduction, so you need a table with just 513 elements. To do the lookup, you then use the upper 9 bits as an index into the table and use the lower 23 bits to interpolate, e.g.:

FP32 cos_table[513] = { 268435456, ...
FP32 cosFP32(FP32 x) {
    int i = x >> 23;  // upper 9 bits to index the table
    int fract = x & 0x7fffff;  // lower 23 bits to interpolate
    return ((int64_t)cos_table[i] * ((1 << 23) - fract) + (int64_t)cos_table[i+1] * fract + (1 << 22)) >> 23;
}

Note that we have to do the multiplies in 64 bits to avoid overflows, same as any other multiplies of FP32 values.
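
The table contents are elided above. As a rough sketch of how they could be generated, assuming FP32 is a 32-bit integer holding a 4.28 fixed-point value (which matches the first entry, 268435456 = 2^28 = cos(0)) and that the input is in radians: index i then corresponds to an angle of i/32, because one table step is 2^23 / 2^28. The name init_cos_table is made up for this example.

```cpp
#include <cmath>
#include <cstdint>

typedef int32_t FP32;          // assumption: 4.28 fixed point, 1.0 == 1 << 28

static FP32 cos_table[513];    // entry i holds cos(i / 32.0) in 4.28 format

// Fill the table once at startup instead of hard-coding 513 constants.
void init_cos_table() {
    for (int i = 0; i <= 512; i++)
        cos_table[i] = (FP32)std::lround(std::cos(i / 32.0) * (1 << 28));
}
```

Calling init_cos_table() once before the first lookup is enough; the 513th entry exists only so that cos_table[i+1] is valid for the last index.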

Since cos is symmetric, you could use that symmetry to reduce the table size by another factor of 4, and use the same table for sin, but that is more work.
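
A minimal sketch of that symmetry trick follows. It assumes (this is not in the code above) that the angle is a 16-bit value with 65536 units per full turn, so a quarter turn is exactly 1 << 14 and the quadrant can be read from the top two bits. The names quarter_cos, cos_q and sin_q are invented for the example, and the output is again 4.28 fixed point.

```cpp
#include <cmath>
#include <cstdint>

// Assumption for this sketch: 65536 angle units per full turn.
static int32_t quarter_cos[65];   // cos over [0, 90 degrees], 4.28 format

void init_quarter_cos() {
    const double kPi = 3.14159265358979323846;
    for (int i = 0; i <= 64; i++)
        quarter_cos[i] = (int32_t)std::lround(std::cos(i * (kPi / 2) / 64) * (1 << 28));
}

// Interpolated lookup inside the first quadrant, a in [0, 1 << 14].
static int32_t cos_first_quadrant(uint32_t a) {
    uint32_t i = a >> 8;          // upper 6 bits index the table
    uint32_t fract = a & 0xff;    // lower 8 bits interpolate
    if (i == 64) return quarter_cos[64];   // exactly 90 degrees, nothing to interpolate
    return (int32_t)(((int64_t)quarter_cos[i] * (256 - fract)
                    + (int64_t)quarter_cos[i + 1] * fract + 128) >> 8);
}

int32_t cos_q(uint16_t angle) {
    uint32_t quadrant = angle >> 14;        // 0..3
    uint32_t a = angle & 0x3fff;            // position inside the quadrant
    switch (quadrant) {
    case 0:  return  cos_first_quadrant(a);                // cos(a)
    case 1:  return -cos_first_quadrant((1u << 14) - a);   // cos(90+a)  = -cos(90-a)
    case 2:  return -cos_first_quadrant(a);                // cos(180+a) = -cos(a)
    default: return  cos_first_quadrant((1u << 14) - a);   // cos(270+a) =  cos(90-a)
    }
}

// sin(x) = cos(x - 90 degrees); the uint16_t cast wraps around the circle.
int32_t sin_q(uint16_t angle) {
    return cos_q((uint16_t)(angle - (1u << 14)));
}
```

With radians the quadrant boundaries are multiples of pi/2, so the same idea needs an extra reduction step first; that is the "more work" part.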


If you're using C++, you can define a class with overloading to encapsulate your fixed point type:

class fixed4_28 {
    int32_t  val;
    static const int64_t fract_val = 1 << 28;
 public:
    fixed4_28 operator+(fixed4_28 a) const { a.val = val + a.val; return a; }
    fixed4_28 operator-(fixed4_28 a) const { a.val = val - a.val; return a; }
    fixed4_28 operator*(fixed4_28 a) const { a.val = ((int64_t)val * a.val) >> 28; return a; }
    fixed4_28 operator/(fixed4_28 a) const { a.val = ((int64_t)val << 28) / a.val; return a; }

    fixed4_28(double v) : val(v * fract_val + 0.5) {}
    operator double() { return (double)val / fract_val; }

    friend fixed4_28 cos(fixed4_28);
};

inline fixed4_28 cos(fixed4_28 x) {
    int i = x.val >> 23;  // upper 9 bits to index the table
    int fract = x.val & 0x7fffff;  // lower 23 bits to interpolate
    x.val = ((int64_t)cos_table[i] * ((1 << 23) - fract) + (int64_t)cos_table[i+1] * fract + (1 << 22)) >> 23;
    return x;
}

and then your code can use this type directly, and you can write equations just as if you were using float or double.
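
As an illustration, here is a small sketch with a trimmed copy of the class (cos left out so that it compiles on its own). The lerp helper is hypothetical; the point is only that its body is written exactly as it would be for double.

```cpp
#include <cstdint>

// Trimmed copy of the class above so this sketch stands alone.
class fixed4_28 {
    int32_t val;
    static const int64_t fract_val = 1 << 28;
 public:
    fixed4_28(double v) : val((int32_t)(v * fract_val + 0.5)) {}
    operator double() const { return (double)val / fract_val; }
    fixed4_28 operator+(fixed4_28 a) const { a.val = val + a.val; return a; }
    fixed4_28 operator-(fixed4_28 a) const { a.val = val - a.val; return a; }
    fixed4_28 operator*(fixed4_28 a) const {
        a.val = (int32_t)(((int64_t)val * a.val) >> 28);
        return a;
    }
};

// Hypothetical helper: linear blend a + t * (b - a), same text as the double version.
fixed4_28 lerp(fixed4_28 a, fixed4_28 b, fixed4_28 t) {
    return a + t * (b - a);
}
```

For example, `double r = lerp(fixed4_28(1.0), fixed4_28(3.0), fixed4_28(0.25));` blends a quarter of the way from 1.0 to 3.0. One thing to watch: because the class converts implicitly both from and to double, mixed expressions such as `x + 1.0` are ambiguous, so convert constants explicitly.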


For sin() and cos() the first step is range reduction, which looks like "angle = angle % degrees_in_a_circle". Sadly, these functions typically use radians, and radians are nasty because that range reduction becomes "angle = angle % (2 * PI)", which means that precision depends on the modulo of an irrational number (which is guaranteed to be "not fun").

With this in mind; you want to throw radians in the trash and invent a new "binary degrees" such that a circle is split into "powers of 2" pieces. This means that the range reduction becomes "angle = angle & MASK;" with no precision loss (and no expensive modulo). The rest of sin() and cos() (if you're using a table driven approach) is adequately described by existing answers so I won't repeat it in this answer.
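
A minimal sketch of the idea, assuming 2^16 units per circle (the unit count, the constant names and from_degrees are choices made for this example only):

```cpp
#include <cmath>
#include <cstdint>

// Assumption for this sketch: 2^16 angle units per full circle.
const uint32_t ANGLE_BITS = 16;
const uint32_t ANGLE_MASK = (1u << ANGLE_BITS) - 1;

// One-off conversion at the edge of the system, e.g. for a value given in degrees.
uint32_t from_degrees(double degrees) {
    double turns = degrees / 360.0;
    turns -= std::floor(turns);                      // wrap into [0, 1)
    return (uint32_t)std::lround(turns * (1u << ANGLE_BITS)) & ANGLE_MASK;
}

// Range reduction is a single AND, and the wrap-around is exact.
uint32_t reduce(uint32_t angle) {
    return angle & ANGLE_MASK;
}
```

After the initial conversion, angles can be added and subtracted as plain integers, and reduce() brings any sum back into one circle without rounding.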

The next step is to realize that "globally fixed point" is awful. Far better is what I'll call "moving point". To understand this, consider multiplication. For "globally fixed point" you might do "result_16_16 = (x_16_16 * y_16_16) >> 16" and throw away 16 bits of precision and have to worry about overflows. For "moving point" you might do "result_32_32 = x_16_16 * y_16_16" (where the decimal point is moved) and know that there is no precision loss, know that there can't be overflow, and make it faster by avoiding a shift.
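
To make the difference concrete, here is a small sketch of both multiplications for unsigned 16.16 inputs (the function names are made up for this example):

```cpp
#include <cstdint>

// "Globally fixed point": the result is forced back into 16.16, so the low
// 16 bits of the product are dropped and large products no longer fit.
uint32_t mul_fixed_16_16(uint32_t x_16_16, uint32_t y_16_16) {
    return (uint32_t)(((uint64_t)x_16_16 * y_16_16) >> 16);
}

// "Moving point": keep the full 64-bit product and just remember that the
// point moved; 16.16 * 16.16 is a 32.32 value.
uint64_t mul_moving_32_32(uint32_t x_16_16, uint32_t y_16_16) {
    return (uint64_t)x_16_16 * y_16_16;
}
```

Multiplying 1.5 (0x18000) by the smallest 16.16 value (1) shows the precision loss: the fixed version returns 1, while the moving version keeps 0x18000, which is exactly 1.5 * 2^-16 when read as 32.32. Multiplying 300.0 by 300.0 shows the overflow: 90000 does not fit in 16 integer bits, but it fits easily in 32.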

For "moving point", you'd start with the actual requirements of inputs (e.g. for a number from 0.0 to 100.0 you might start with "7.4 fixed point" with 5 bits of a uint16_t unused) and explicitly manage precision and range throughout a calculation to arrive at a result that is guaranteed to be unaffected by overflow and has the best possible compromise between "number of bits" and precision at every step.

For example:

 uint16_t inputValue_7_4 = 50 << 4;                   // inputValue is actually 50.0
 uint16_t multiplier_1_1 = 3;                         // multiplier is actually 1.5
 uint16_t k_0_5 = 28;                                 // k is actually 0.875
 uint16_t divisor_2_5 = 123;                          // divisor is actually 3.84375

 uint16_t x_8_5 = inputValue_7_4 * multiplier_1_1;    // Guaranteed no overflow and no precision loss
 uint16_t y_9_5 = x_8_5 + k_0_5;                      // Guaranteed no overflow and no precision loss
 uint32_t result_9_23 = ((uint64_t)y_9_5 << 23) / divisor_2_5;  // Shift done in 64 bits (14 + 23 > 32); no overflow for divisors >= 1.0, max. possible precision kept

I'd like to do it as "mechanically" as possible

There is no reason why "moving point" can't be done purely mechanically, if you specify the characteristics of the inputs and provide a few other annotations (the desired precision of divisions, plus either any intentional precision losses or the total bits of results); given that the rules that determine the size of the result of any operation and where the point will be in that result are easily determined. However; I don't know of an existing tool that will do this mechanical conversion, so you'd have to invent your own language for "annotated expressions" and write your own tool that converts it into another language (e.g. C). It's likely to cost less developer time to just do the conversion by hand instead.
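
As a rough sketch of what those rules look like for unsigned values, the following tracks only the format (integer bits and fraction bits) of each intermediate result. Format, mul_format and add_format are hypothetical names, and division is left out because its result format also depends on the chosen pre-shift.

```cpp
#include <algorithm>

// Hypothetical helper type: integer and fraction bit counts of an unsigned value.
struct Format {
    int int_bits;
    int frac_bits;
    int total() const { return int_bits + frac_bits; }
};

// a.b * c.d needs (a+c).(b+d) to hold every possible product exactly.
Format mul_format(Format x, Format y) {
    return { x.int_bits + y.int_bits, x.frac_bits + y.frac_bits };
}

// a.b + c.d (with the points aligned) needs one extra integer bit for the carry.
Format add_format(Format x, Format y) {
    return { std::max(x.int_bits, y.int_bits) + 1,
             std::max(x.frac_bits, y.frac_bits) };
}
```

Applied to the example above, mul_format on 7.4 and 1.1 gives 8.5, and add_format on 8.5 and 0.5 gives 9.5, which is 14 bits in total and therefore fits in a uint16_t. A conversion tool would apply the same rules to every node of the expression tree.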


/*
Very, very fast
float sqrt2(float);

float value = (-1)^s * (1 + n * 2^-23) * 2^(x - 127)
sxxxxxxxxnnnnnnnnnnnnnnnnnnnnnnn  float f      (32 bits)
000000000000sxxxxxxxxnnnnnnnnnnn  int index    (20 bits)
*/

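#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define LUT_SIZE2 0x00100000   // 1M entries (4 MB of floats), one per 20-bit index
float sqrt_tab[LUT_SIZE2];

// Look up sqrt(f) using the top 20 bits of f (sign, exponent, 11 mantissa bits).
// memcpy reads the bit pattern without violating strict aliasing.
static inline float sqrt2(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return sqrt_tab[bits >> 12];
}


int main()
{
    //init_luts();
    for (uint32_t i = 0; i < LUT_SIZE2; i++)
    {
        uint32_t bits = i << 12;   // index back to a float bit pattern
        float x;
        memcpy(&x, &bits, sizeof x);
        sqrt_tab[i] = sqrtf(x);    // entries for negative inputs hold NaN
    }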

    float f=1234.5678;
    printf("test\n");
    printf(" sqrt(1234.5678)=%12.6f\n", sqrt(f));
    printf("sqrt2(1234.5678)=%12.6f\n", sqrt2(f));


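    printf("\n\ntest milliseconds\n");
    clock_t begin;
    clock_t free;   // cost of the empty loop, subtracted from the other timings

    // Note: the results are unused, so an optimizing compiler may drop these loops.
    begin = clock();
    for (float f = 0; f < 10000000.f; f++)
        ;
    free = clock() - begin;
    printf("free        %4ld\n", (long)(free * 1000 / CLOCKS_PER_SEC));

    begin = clock();
    for (float f = 0; f < 10000000.f; f++)
        sqrt(f);
    printf("sqrt()      %4ld\n", (long)((clock() - begin - free) * 1000 / CLOCKS_PER_SEC));


    begin = clock();
    for (float f = 0; f < 10000000.f; f++)
        sqrt2(f);
    printf("sqrt2()     %4ld\n", (long)((clock() - begin - free) * 1000 / CLOCKS_PER_SEC));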


    return 0;

}

/*
 sqrt(1234.5678)   35.136416
sqrt2(1234.5678)  35.135452

test mili second
free       73
sqrt()    146
sqrt2()    7
*/