Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

There does not seem to be a straightforward runtime method to patch feature detection. This detection happens rather early in the dynamic linker (ld.so).

Binary patching the linker seems the easiest method at the moment. @osgx described one method where a jump is overwritten. Another approach is just to fake the cpuid result. Normally cpuid(eax=0) returns the highest supported function in eax while the manufacturer IDs are returned in registers ebx, ecx and edx. We have this snippet in glibc 2.25 sysdeps/x86/cpu-features.c:

__cpuid (0, cpu_features->max_cpuid, ebx, ecx, edx);

/* This spells out "GenuineIntel".  */
if (ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69)
  {
    /* feature detection for various Intel CPUs */
  }
/* another case for AMD */
else
  {
    kind = arch_kind_other;
    get_common_indeces (cpu_features, NULL, NULL, NULL, NULL);
  }
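As a quick check of the comment in that snippet, the three constants can be decoded back into the vendor string. This is only a minimal Python sketch; the helper name vendor_string is made up for illustration. CPUID leaf 0 lays the 12-byte string out across ebx, edx, ecx (in that order), each register holding four little-endian bytes.

```python
import struct

def vendor_string(ebx, ecx, edx):
    """Rebuild the CPUID leaf 0 vendor string from the three registers.

    vendor_string is a hypothetical helper, not part of glibc.
    """
    # The string is stored in ebx, edx, ecx order, four little-endian bytes each.
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")

# Constants taken from the glibc comparison above.
print(vendor_string(0x756e6547, 0x6c65746e, 0x49656e69))  # GenuineIntel
```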

The __cpuid line translates to these instructions in /lib/ld-linux-x86-64.so.2 (/lib/ld-2.25.so):

172a8:       31 c0                   xor    eax,eax
172aa:       c7 44 24 38 00 00 00    mov    DWORD PTR [rsp+0x38],0x0
172b1:       00 
172b2:       c7 44 24 3c 00 00 00    mov    DWORD PTR [rsp+0x3c],0x0
172b9:       00 
172ba:       0f a2                   cpuid  

So rather than patching branches, we could just as well replace the cpuid with a nop instruction, which results in the last else branch being taken (as the registers will not contain "GenuineIntel"). Since initially eax=0, cpu_features->max_cpuid will also be 0 and the if (cpu_features->max_cpuid >= 7) check will also be bypassed.

Replacing cpuid(eax=0) with a nop can be done with this utility (works for both x86 and x86-64):

#!/usr/bin/env python
import re
import sys

infile, outfile = sys.argv[1:]
d = open(infile, 'rb').read()
# Match CPUID(eax=0), "xor eax,eax" followed closely by "cpuid"
o = re.sub(b'(\x31\xc0.{0,32}?)\x0f\xa2', b'\\1\x66\x90', d)
assert d != o
open(outfile, 'wb').write(o)

An equivalent Perl variant follows; -0777 ensures that the file is read at once instead of being split into records at line feeds:

perl -0777 -pe 's/\x31\xc0.{0,32}?\K\x0f\xa2/\x66\x90/' < /lib/ld-linux-x86-64.so.2 > ld-linux-x86-64-patched.so.2
# Verify result, should display "Success"
cmp -s /lib/ld-linux-x86-64.so.2 ld-linux-x86-64-patched.so.2 && echo 'Not patched' || echo Success
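To sanity-check the pattern without touching a real linker, it can be run against the exact bytes from the disassembly listing. The following is a sketch, not part of the original utility; patch_cpuid0 is a name invented here, and the replacement is done with a callback instead of a template string purely for readability.

```python
import re

# Same pattern as the utility: "xor eax,eax", at most 32 bytes, then "cpuid".
PATTERN = re.compile(b'(\x31\xc0.{0,32}?)\x0f\xa2')

def patch_cpuid0(data):
    """Return (patched_bytes, number_of_replacements). Hypothetical helper."""
    # 66 90 is a two-byte nop, so the file size and all offsets stay the same.
    return PATTERN.subn(lambda m: m.group(1) + b'\x66\x90', data)

# Bytes copied from the disassembly at 172a8..172bb.
listing = bytes.fromhex(
    "31c0"              # xor eax,eax
    "c744243800000000"  # mov DWORD PTR [rsp+0x38],0x0
    "c744243c00000000"  # mov DWORD PTR [rsp+0x3c],0x0
    "0fa2"              # cpuid
)

patched, count = patch_cpuid0(listing)
print(count, patched[-2:].hex())  # 1 6690
```

Note that without re.DOTALL, "." does not match a 0x0a byte, so a gap containing one would not be matched; the listing above contains none. A cpuid that is not preceded by xor eax,eax within 32 bytes is left alone.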

That was the easy part. Now, I did not want to replace the system-wide dynamic linker, but only to run one particular program with this linker. Sure, that can be done with ./ld-linux-x86-64-patched.so.2 ./a, but the naive gdb invocations failed to set breakpoints:

$ gdb -q -ex "set exec-wrapper ./ld-linux-x86-64-patched.so.2" -ex start ./a
Reading symbols from ./a...done.
Temporary breakpoint 1 at 0x400502: file a.c, line 5.
Starting program: /tmp/a 
During startup program exited normally.
(gdb) quit
$ gdb -q -ex start --args ./ld-linux-x86-64-patched.so.2 ./a
Reading symbols from ./ld-linux-x86-64-patched.so.2...(no debugging symbols found)...done.
Function "main" not defined.
Temporary breakpoint 1 (main) pending.
Starting program: /tmp/ld-linux-x86-64-patched.so.2 ./a
[Inferior 1 (process 27418) exited normally]
(gdb) quit

A manual workaround is described in "How to debug program with custom elf interpreter?". It works, but it is unfortunately a manual action using add-symbol-file. It should be possible to automate it a bit using GDB Catchpoints though.

An alternative approach that does not involve binary patching is LD_PRELOADing a library that defines custom routines for memcpy, memmove, etc. These will then take precedence over the glibc routines. The full list of functions is available in sysdeps/x86_64/multiarch/ifunc-impl-list.c. Current HEAD has more symbols compared to the glibc 2.25 release, in total (grep -Po 'IFUNC_IMPL \(i, name, \K[^,]+' sysdeps/x86_64/multiarch/ifunc-impl-list.c):

memchr, memcmp, __memmove_chk, memmove, memrchr, __memset_chk, memset, rawmemchr, strlen, strnlen, stpncpy, stpcpy, strcasecmp, strcasecmp_l, strcat, strchr, strchrnul, strrchr, strcmp, strcpy, strcspn, strncasecmp, strncasecmp_l, strncat, strncpy, strpbrk, strspn, strstr, wcschr, wcsrchr, wcscpy, wcslen, wcsnlen, wmemchr, wmemcmp, wmemset, __memcpy_chk, memcpy, __mempcpy_chk, mempcpy, strncmp, __wmemset_chk
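For anyone without a grep that supports -P, the same extraction can be sketched in Python. The function name ifunc_names and the two-entry sample text are illustrative only (the sample is an abbreviated stand-in for the real file, not a verbatim excerpt); the regular expression mirrors the grep pattern, with a capture group in place of \K.

```python
import re

def ifunc_names(source_text):
    """List the function names that have IFUNC_IMPL entries (hypothetical helper)."""
    # "IFUNC_IMPL (" does not match "IFUNC_IMPL_ADD (", so only the
    # outer entries are picked up, one per multiarch function.
    return re.findall(r'IFUNC_IMPL \(i, name, ([^,]+)', source_text)

# Abbreviated sample in the style of ifunc-impl-list.c, for illustration.
sample = """
  IFUNC_IMPL (i, name, memchr,
              IFUNC_IMPL_ADD (array, i, memchr, 1, __memchr_sse2))
  IFUNC_IMPL (i, name, strcmp,
              IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))
"""

print(", ".join(ifunc_names(sample)))  # memchr, strcmp
```

To run it on the real file, pass the contents of sysdeps/x86_64/multiarch/ifunc-impl-list.c from a glibc checkout instead of the sample.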

It looks like there is a nice workaround for this implemented in recent versions of glibc: a "tunables" feature that guides selection of optimized string functions. You can find a general overview of this feature here and the relevant code inside glibc in ifunc-impl-list.c.

Here's how I figured it out. First, I took the address being complained about by gdb:

Process record does not support instruction 0xc5 at address 0x7ffff75c65d4.

I then looked it up in the table of shared libraries:

(gdb) info shared
From                To                  Syms Read   Shared Object Library
0x00007ffff7fd3090  0x00007ffff7ff3130  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ffff76366b0  0x00007ffff766b52e  Yes         /usr/lib/x86_64-linux-gnu/libubsan.so.1
0x00007ffff746a320  0x00007ffff75d9cab  Yes         /lib/x86_64-linux-gnu/libc.so.6
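To check which object the faulting address falls in, compare it against each From/To range. A small Python sketch of that comparison, with the rows copied from the table above (object_for is a made-up helper name):

```python
# (start, end, path) rows copied from the "info shared" output above.
SHARED = [
    (0x00007ffff7fd3090, 0x00007ffff7ff3130, "/lib64/ld-linux-x86-64.so.2"),
    (0x00007ffff76366b0, 0x00007ffff766b52e, "/usr/lib/x86_64-linux-gnu/libubsan.so.1"),
    (0x00007ffff746a320, 0x00007ffff75d9cab, "/lib/x86_64-linux-gnu/libc.so.6"),
]

def object_for(addr, table=SHARED):
    """Return the path of the object whose range contains addr, else None."""
    for start, end, path in table:
        if start <= addr <= end:
            return path
    return None

# Address from the "Process record does not support instruction" message.
print(object_for(0x7ffff75c65d4))  # /lib/x86_64-linux-gnu/libc.so.6
```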

You can see that this address is within glibc. But what function, specifically?

(gdb) disassemble 0x7ffff75c65d4
Dump of assembler code for function __strcmp_avx2:
   0x00007ffff75c65d0 <+0>:     mov    %edi,%eax
   0x00007ffff75c65d2 <+2>:     xor    %edx,%edx
=> 0x00007ffff75c65d4 <+4>:     vpxor  %ymm7,%ymm7,%ymm7

I can look in ifunc-impl-list.c to find the code that controls selecting the avx2 version:

  IFUNC_IMPL (i, name, strcmp,
          IFUNC_IMPL_ADD (array, i, strcmp,
                  HAS_ARCH_FEATURE (AVX2_Usable),
                  __strcmp_avx2)
          IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSE4_2),
                  __strcmp_sse42)
          IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSSE3),
                  __strcmp_ssse3)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2_unaligned)
          IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_sse2))

It looks like AVX2_Usable is the feature to disable. Let's rerun gdb accordingly:

GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable gdb ...

On this iteration it complained about __memmove_avx_unaligned_erms, which appeared to be enabled by AVX_Usable - but I found another path in ifunc-memmove.h enabled by AVX_Fast_Unaligned_Load. Back to the drawing board:

GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Fast_Unaligned_Load gdb ...

On this final round I discovered a rdtscp instruction in the ASAN shared library, so I recompiled without the address sanitizer and at last, it worked.

In summary: with some work it's possible to disable these instructions from the command line and use gdb's record feature without severe hacks.
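Since the list of features to mask tends to grow one round at a time, it can help to build the variable programmatically. A minimal Python sketch, assuming the glibc.cpu.hwcaps tunable and the feature names used above; tunables_env is a hypothetical helper, and it overwrites any GLIBC_TUNABLES value already present in the environment.

```python
import os

def tunables_env(disabled, base=None):
    """Return an environment dict with the given hwcaps features masked off."""
    env = dict(os.environ if base is None else base)
    # Each feature is prefixed with "-" to disable it; entries are comma-separated.
    mask = ",".join("-" + name for name in disabled)
    env["GLIBC_TUNABLES"] = "glibc.cpu.hwcaps=" + mask
    return env

env = tunables_env(["AVX2_Usable", "AVX_Fast_Unaligned_Load"], base={})
print(env["GLIBC_TUNABLES"])
# glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Fast_Unaligned_Load
```

The resulting dict can be passed as the env argument of subprocess.run when launching gdb, which keeps the masked features in one place while iterating.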

I encountered this problem recently as well, and ended up solving it using dynamic CPUID faulting to interrupt execution of the CPUID instruction and override its result, which avoids touching glibc or the dynamic linker. This requires processor support for CPUID faulting (Ivy Bridge+) as well as Linux kernel support (4.12+) for exposing it to userspace through the ARCH_GET_CPUID and ARCH_SET_CPUID subfunctions of arch_prctl(). When this feature is enabled, a SIGSEGV signal will be delivered on each execution of CPUID, allowing a signal handler to emulate execution of the instruction and override the result.
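Whether the kernel exposes this interface can be probed before going further. The following Python sketch only queries the current state through ARCH_GET_CPUID and never enables faulting; it is not taken from the linked project. The constants are the values from the Linux x86 asm/prctl.h header and the x86-64 syscall table, and cpuid_enabled is a made-up helper name. It returns None on anything other than 64-bit x86 Linux or when the kernel rejects the call.

```python
import ctypes
import platform
import sys

SYS_ARCH_PRCTL = 158     # arch_prctl syscall number on x86-64
ARCH_GET_CPUID = 0x1011
ARCH_SET_CPUID = 0x1012  # listed for reference; not called in this sketch

def cpuid_enabled():
    """Return 1 if CPUID executes normally, 0 if it faults, None if unknown."""
    if (not sys.platform.startswith("linux")
            or platform.machine() != "x86_64"
            or sys.maxsize <= 2**32):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    result = libc.syscall(ctypes.c_long(SYS_ARCH_PRCTL),
                          ctypes.c_long(ARCH_GET_CPUID),
                          ctypes.c_long(0))
    # Kernels without this subfunction return -1 (EINVAL).
    return result if result in (0, 1) else None

print(cpuid_enabled())
```

Enabling faulting would be the matching ARCH_SET_CPUID call with an argument of 0, but doing so without a SIGSEGV handler that emulates CPUID in place will crash the process at the next CPUID instruction.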

The full solution is a bit involved since I also needed to interpose the dynamic linker, because hardware capability detection was moved there starting with glibc 2.26. I've uploaded the full solution online at https://github.com/ddcc/libcpuidoverride .