gcc, strict-aliasing, and horror stories

SWIG generates code that depends on strict aliasing being off, which can cause all sorts of problems.

SWIGEXPORT jlong JNICALL Java_com_mylibJNI_make_1mystruct_1_1SWIG_12(
       JNIEnv *jenv, jclass jcls, jint jarg1, jint jarg2) {
  jlong jresult = 0 ;
  int arg1 ;
  int arg2 ;
  my_struct_t *result = 0 ;

  (void)jenv;
  (void)jcls;
  arg1 = (int)jarg1; 
  arg2 = (int)jarg2; 
  result = (my_struct_t *)make_my_struct(arg1,arg2);
  *(my_struct_t **)&jresult = result;              /* <<<< horror*/
  return jresult;
}

gcc, aliasing, and 2-D variable-length arrays: The following sample code copies a 2x2 matrix:

#include <stdio.h>

static void copy(int n, int a[][n], int b[][n]) {
   int i, j;
   for (i = 0; i < 2; i++)    // 'n' not used in this example
      for (j = 0; j < 2; j++) // 'n' hard-coded to 2 for simplicity
         b[i][j] = a[i][j];
}

int main(int argc, char *argv[]) {
   int a[2][2] = {{1, 2},{3, 4}};
   int b[2][2];
   copy(2, a, b);    
   printf("%d %d %d %d\n", b[0][0], b[0][1], b[1][0], b[1][1]);
   return 0;
}

With gcc 4.1.2 on CentOS, I get:

$ gcc -O1 test.c && a.out
1 2 3 4
$ gcc -O2 test.c && a.out
10235717 -1075970308 -1075970456 11452404 (random)

I don't know whether this is generally known, and I don't know whether this a bug or a feature. I can't duplicate the problem with gcc 4.3.4 on Cygwin, so it may have been fixed. Some work-arounds:

  • Use __attribute__((noinline)) for copy().
  • Use the gcc switch -fno-strict-aliasing.
  • Change the third parameter of copy() from b[][n] to b[][2].
  • Don't use -O2 or -O3.

Further notes:

  • This is an answer, after a year and a day, to my own question (and I'm a bit surprised there are only two other answers).
  • I lost several hours with this on my actual code, a Kalman filter. Seemingly small changes would have drastic effects, perhaps because of changing gcc's automatic inlining (this is a guess; I'm still uncertain). But it probably doesn't qualify as a horror story.
  • Yes, I know you wouldn't write copy() like this. (And, as an aside, I was slightly surprised to see gcc did not unroll the double-loop.)
  • No gcc warning switches, include -Wstrict-aliasing=, did anything here.
  • 1-D variable-length arrays seem to be OK.

Update: The above does not really answer the OP's question, since he (i.e. I) was asking about cases where strict aliasing 'legitimately' broke your code, whereas the above just seems to be a garden-variety compiler bug.

I reported it to GCC Bugzilla, but they weren't interested in the old 4.1.2, even though (I believe) it is the key to the $1-billion RHEL5. It doesn't occur in 4.2.4 up.

And I have a slightly simpler example of a similar bug, with only one matrix. The code:

static void zero(int n, int a[][n]) {
   int i, j;
   for (i = 0; i < n; i++)
   for (j = 0; j < n; j++)
      a[i][j] = 0;
}

int main(void) {
   int a[2][2] = {{1, 2},{3, 4}};
   zero(2, a);    
   printf("%d\n", a[1][1]);
   return 0;
}

produces the results:

gcc -O1 test.c && a.out
0
gcc -O1 -fstrict-aliasing test.c && a.out
4

It seems it is the combination -fstrict-aliasing with -finline which causes the bug.


No horror story of my own, but here are some quotes from Linus Torvalds (sorry if these are already in one of the linked references in the question):

http://lkml.org/lkml/2003/2/26/158:

Date Wed, 26 Feb 2003 09:22:15 -0800 Subject Re: Invalid compilation without -fno-strict-aliasing From Jean Tourrilhes <>

On Wed, Feb 26, 2003 at 04:38:10PM +0100, Horst von Brand wrote:

Jean Tourrilhes <> said:

It looks like a compiler bug to me... Some users have complained that when the following code is compiled without the -fno-strict-aliasing, the order of the write and memcpy is inverted (which mean a bogus len is mem-copied into the stream). Code (from linux/include/net/iw_handler.h) :

static inline char *
iwe_stream_add_event(char *   stream,     /* Stream of events */
                     char *   ends,       /* End of stream */
                    struct iw_event *iwe, /* Payload */
                     int      event_len)  /* Real size of payload */
{
  /* Check if it's possible */
  if((stream + event_len) < ends) {
      iwe->len = event_len;
      memcpy(stream, (char *) iwe, event_len);
      stream += event_len;
  }
  return stream;
}

IMHO, the compiler should have enough context to know that the reordering is dangerous. Any suggestion to make this simple code more bullet proof is welcomed.

The compiler is free to assume char *stream and struct iw_event *iwe point to separate areas of memory, due to strict aliasing.

Which is true and which is not the problem I'm complaining about.

(Note with hindsight: this code is fine, but Linux's implementation of memcpy was a macro that cast to long * to copy in larger chunks. With a correctly-defined memcpy, gcc -fstrict-aliasing isn't allowed to break this code. But it means you need inline asm to define a kernel memcpy if your compiler doesn't know how turn a byte-copy loop into efficient asm, which was the case for gcc before gcc7)

And Linus Torvald's comment on the above:

Jean Tourrilhes wrote: >

It looks like a compiler bug to me...

Why do you think the kernel uses "-fno-strict-aliasing"?

The gcc people are more interested in trying to find out what can be allowed by the c99 specs than about making things actually work. The aliasing code in particular is not even worth enabling, it's just not possible to sanely tell gcc when some things can alias.

Some users have complained that when the following code is compiled without the -fno-strict-aliasing, the order of the write and memcpy is inverted (which mean a bogus len is mem-copied into the stream).

The "problem" is that we inline the memcpy(), at which point gcc won't care about the fact that it can alias, so they'll just re-order everything and claim it's out own fault. Even though there is no sane way for us to even tell gcc about it.

I tried to get a sane way a few years ago, and the gcc developers really didn't care about the real world in this area. I'd be surprised if that had changed, judging by the replies I have already seen.

I'm not going to bother to fight it.

Linus

http://www.mail-archive.com/[email protected]/msg01647.html:

Type-based aliasing is stupid. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a "by-the-letter-of-the-law" thing that makes no sense.

...

I know for a fact that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that

unsigned long a;

a = 5;
*(unsigned short *)&a = 4;

could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say "nyaah, nyaah, the standards people said we can do this", with absolutely no introspection to ask whether it made any SENSE.