Order-preserving memcpy in C++

The ordering requirements you describe are exactly what release/acquire semantics provide. (http://preshing.com/20120913/acquire-and-release-semantics/).

The problem is that the unit of atomicity for efficient guaranteed-atomic loads/stores is at most 8 bytes on all x86 and some ARM. Otherwise only 4 bytes on other ARMs. (Why is integer assignment on a naturally aligned variable atomic on x86?). Some Intel CPUs probably in practice have atomic 32 or even 64-byte (AVX512) stores, but neither Intel nor AMD have ever made any guarantees official.

We don't even know if SIMD vector stores have a guaranteed order when they potentially break up a wide aligned store into multiple 8-byte aligned chunks. Or even if those chunks are individually atomic. Per-element atomicity of vector load/store and gather/scatter? There's every reason to believe that they are per-element atomic, even if the documentation doesn't guarantee it.

If having large "objects" is performance critical, you could consider testing vector load/store atomicity on a specific server that you care about, but you're totally on your own as far as guarantees and getting the compiler to use it. (There are intrinsics.) Make sure you test between cores on different sockets, to catch cases like SSE instructions: which CPUs can do atomic 16B memory operations? tearing at 8-byte boundaries because of HyperTransport between sockets on a K10 Opteron. This is probably a really bad idea; you can't guess what if any microarchitectural conditions could make a wide vector store non-atomic in rare cases even when it normally looks like it is atomic.

You can easily have release/acquire ordering for the elements of an array like
alignas(64) atomic<uint64_t> arr[1024];.
You just have to ask the compiler nicely:

copy_to_atomic(std::atomic<uint64_t> *__restrict dst_a, 
                      const uint64_t *__restrict src, size_t len) {
    const uint64_t *endsrc = src+len;
    while (src < src+len) {
        dst_a->store( *src, std::memory_order_release );
        dst_a++; src++;

On x86-64 it doesn't auto-vectorize or anything, because compilers don't optimize atomics, and because there's no documentation that it's safe to use vectors to store consecutive elements of an array of atomic elements. :( So this basically sucks. See it on the Godbolt compiler explorer

I'd consider rolling your own with volatile __m256i* pointers (aligned load/store), and compiler barriers like atomic_thread_fence(std::memory_order_release) to prevent compile-time reordering. Per-element ordering/atomicity should be ok (but again not guaranteed). And definitely don't count on the whole 32 bytes being atomic, just that higher uint64_t elements are written after lower uint64_t elements (and those stores become visible to other cores in that order).

On ARM32: even an atomic store of a uint64_t is not great. gcc uses a ldrexd / strexd pair (LL/SC), because apparently there is no 8-byte atomic pure store. (I compiled with gcc7.2 -O3 -march=armv7-a. With armv8-a in AArch32 mode, store-pair is atomic. AArch64 also has atomic 8-byte load/store of course.)

You must avoid using a normal C library memcpy implementation. On x86, it can use weakly-ordered stores for large copies, allowing reordering between its own stores (but not with later stores that weren't part of the memcpy, because that could break later release-stores.)

movnt cache-bypassing stores in a vector loop, or rep movsb on a CPU with the ERMSB feature, could both create this effect. Does the Intel Memory Model make SFENCE and LFENCE redundant?.

Or a memcpy implementation could simply choose to do the last (partial) vector first, before entering its main loop.

Concurrent write+read or write+write on non-atomic types in UB in C and C++; that's why memcpy has so much freedom to do whatever it wants, including use weakly-ordered stores as long as it uses sfence if necessary to make sure the memcpy as a whole respects the ordering the compiler expects when it emits code for later mo_release operations.

(i.e. current C++ implementations for x86 do std::atomic with the assumption that there are no weakly-ordered stores for them to worry about. Any code that wants their NT stores to respect the ordering of compiler-generated atomic<T> code must use _mm_sfence(). Or if writing asm by hand, the sfence instruction directly. Or just use xchg if you want to do a sequential-release store and give your asm function the effect of a atomic_thread_fence(mo_seq_cst) as well.)