128-bit values - from XMM registers to general-purpose registers

You cannot move the upper bits of an XMM register into a general-purpose register directly.
You have to follow a two-step process, which may or may not involve a round trip through memory or the destruction of a register.

in registers (SSE2)

movq rax,xmm0       ;lower 64 bits
movhlps xmm0,xmm0   ;move high 64 bits to low 64 bits.
movq rbx,xmm0       ;high 64 bits.

punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. On some CPUs, using it may avoid a cycle or two of bypass latency when xmm0 was last written by an integer instruction rather than an FP one.
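
A minimal sketch of that integer-domain version, keeping the same register choices as above:

movq rax,xmm0        ;low 64 bits
punpckhqdq xmm0,xmm0 ;copy the high qword to the low half (integer domain)
movq rbx,xmm0        ;high 64 bits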

via memory (SSE2)

movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]

slow, but does not destroy the xmm register (SSE4.1)

movq rax,xmm0            ;low 64 bits
pextrq rbx,xmm0,1        ;3 cycle latency on Ryzen! (and 2 uops)

A hybrid strategy is possible, e.g. store to memory, movd/q e/rax,xmm0 so it's ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU, though.) That gives you a balance of uops for different back-end execution units. Store/reload is especially good when you want lots of small elements. (mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.)
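
A sketch of that hybrid for the 64-bit case, assuming a spare 16-byte scratch slot at [mem] as in the examples above:

movq    rax,xmm0       ;low 64 bits via the ALU, ready quickly
movdqu  [mem],xmm0     ;store the whole vector in parallel
mov     rbx,[mem+8]    ;high 64 bits come back through store forwarding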


For 32 bits, the code is similar:

in registers

movd eax,xmm0
psrldq xmm0,4         ;shift 4 bytes to the right (byte shift of the whole register)
movd ebx,xmm0
psrldq xmm0,4         ; pshufd could copy-and-shuffle the original reg instead,
movd ecx,xmm0         ; not destroying the XMM and maybe creating some ILP (see the sketch below)
psrldq xmm0,4
movd edx,xmm0
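
A sketch of that pshufd variant (xmm0 is left intact; xmm1-xmm3 are arbitrary scratch registers):

movd   eax,xmm0      ;element 0
pshufd xmm1,xmm0,1   ;xmm1[0] = xmm0[1]
movd   ebx,xmm1
pshufd xmm2,xmm0,2   ;xmm2[0] = xmm0[2]
movd   ecx,xmm2
pshufd xmm3,xmm0,3   ;xmm3[0] = xmm0[3]
movd   edx,xmm3

The three pshufd instructions are independent of each other, so they can overlap instead of forming a serial shift chain.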

via memory

movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]

not destroying the xmm register (SSE4.1) (slow, like the psrldq / pshufd version)

movd eax,xmm0
pextrd ebx,xmm0,1        ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2        ;also 2 uops: like a shuffle(port5) + movd(port0)
pextrd edx,xmm0,3       

The 64-bit shift variant can run in 2 cycles; the pextrq version takes at least 4. For the 32-bit case, the numbers are 4 and 10, respectively.


On Intel SnB-family CPUs (including Skylake), shuffle+movq or movd has the same performance as pextrq/d. pextrq/d decodes to a shuffle uop plus a movd uop, so this is not surprising.

On AMD Ryzen, pextrq apparently has 1 cycle lower latency than shuffle + movq. pextrd/q is 3c latency, and so is movd/q, according to Agner Fog's tables. This is a neat trick (if it's accurate), since pextrd/q does decode to 2 uops (vs. 1 for movq).

Since shuffles have non-zero latency, shuffle+movq is always strictly worse than pextrq on Ryzen (except for possible front-end decode / uop-cache effects).

The major downside to a pure ALU strategy for extracting all elements is throughput: it takes a lot of ALU uops, and most CPUs only have one execution unit / port that can move data from XMM to integer. Store/reload has higher latency for the first element, but better throughput (because modern CPUs can do 2 loads per cycle). If the surrounding code is bottlenecked by ALU throughput, a store/reload strategy could be good. Maybe do the low element with a movd or movq so out-of-order execution can get started on whatever uses it while the rest of the vector data is going through store forwarding.
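
A sketch of that combination for the 32-bit case, again assuming a scratch slot at [mem]:

movd    eax,xmm0       ;element 0 via the ALU, ready first
movdqu  [mem],xmm0     ;store the whole vector
mov     ebx,[mem+4]    ;remaining elements reload through store forwarding
mov     ecx,[mem+8]
mov     edx,[mem+12]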


Another option worth considering (besides what Johan mentioned) for extracting 32-bit elements to integer registers is to do some of the "shuffling" with integer shifts:

movq rax,xmm0
# use eax now, before destroying it
shr  rax,32    

pextrq rcx,xmm0,1
# use ecx now, before destroying it
shr  rcx, 32

shr can run on p0 or p6 in Intel Haswell/Skylake. p6 has no vector ALUs, so this sequence is quite good if you want low latency but also low pressure on vector ALUs.


Or if you want to keep all four 32-bit elements in registers:

movq  rax,xmm0
rorx  rbx, rax, 32    # BMI2
# shld rbx, rax, 32  # alternative that has a false dep on rbx
# eax=xmm0[0], ebx=xmm0[1]

pextrq  rdx,xmm0,1
mov     ecx, edx     # the "normal" way, if you don't want rorx or shld
shr     rdx, 32
# ecx=xmm0[2], edx=xmm0[3]
