What's a proper way of type-punning a float to an int and vice-versa?

Forget casts. Use memcpy.

float xhalf = 0.5f*x;
uint32_t i;
assert(sizeof(x) == sizeof(i));
std::memcpy(&i, &x, sizeof(i));
i = 0x5f375a86 - (i>>1);
std::memcpy(&x, &i, sizeof(i));
x = x*(1.5f - xhalf*x*x);
return x;

The original code tries to initialize the int32_t by first accessing the float object through an int32_t pointer, which is where the rules are broken. The C-style cast is equivalent to a reinterpret_cast, so changing it to reinterpret_cast would not make much difference.

The important difference when using memcpy is that the bytes are copied from the float into the int32_t, but the float object is never accessed through an int32_t lvalue, because memcpy takes pointers to void and its insides are "magical" and don't break the aliasing rules.


There are a few good answers here that address the type-punning issue.

I want to address the "fast inverse square-root" part. Don't use this "trick" on modern processors. Every mainstream vector ISA has a dedicated hardware instruction to give you a fast inverse square-root. Every one of them is both faster and more accurate than this oft-copied little hack.

These instructions are all available via intrinsics, so they are relatively easy to use. In SSE, you want to use rsqrtss (intrinsic: _mm_rsqrt_ss( )); in NEON you want to use vrsqrte (intrinsic: vrsqrte_f32( )); and in AltiVec you want to use frsqrte. Most GPU ISAs have similar instructions. These estimates can be refined using the same Newton iteration, and NEON even has the vrsqrts instruction to do part of the refinement in a single instruction without needing to load constants.


Update

I no longer believe this answer is correct, due to feedback I've gotten from the committee. But I want to leave it up for informational purposes. And I am purposefully hopeful that this answer can be made correct by the committee (if it chooses to do so). I.e. there's nothing about the underlying hardware that makes this answer incorrect, it is just the judgement of a committee that makes it so, or not so.


I'm adding an answer not to refute the accepted answer, but to augment it. I believe the accepted answer is both correct and efficient (and I've just upvoted it). However I wanted to demonstrate another technique that is just as correct and efficient:

float InverseSquareRoot(float x)
{
    union
    {
        float as_float;
        int32_t as_int;
    };
    float xhalf = 0.5f*x;
    as_float = x;
    as_int = 0x5f3759df - (as_int>>1);
    as_float = as_float*(1.5f - xhalf*as_float*as_float);
    return as_float;
}

Using clang++ with optimization at -O3, I compiled plasmacel's code, R. Martinho Fernandes code, and this code, and compared the assembly line by line. All three were identical. This is due to the compiler's choice to compile it like this. It had been equally valid for the compiler to produce different, broken code.