How can the rep stosb instruction execute faster than the equivalent loop?

In modern CPUs, rep stosb's and rep movsb's microcoded implementation actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.

(Note this only applies to stos and movs, not repe cmpsb or repne scasb. They're still slow, unfortunately, like at best 2 cycles per byte compared on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.

See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)

rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See Intel's and AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets, and it's optimal to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).
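As a rough illustration of the two approaches being compared, here is a minimal sketch in GNU C. The function names are hypothetical, the inline asm assumes GCC or Clang on x86, and other targets fall back to the plain loop. It relies on the direction flag being clear, which the usual x86 calling conventions guarantee at function entry.

```c
#include <stddef.h>

/* Byte-at-a-time fill, the C counterpart of the loop in the question.
   volatile keeps the compiler from replacing the loop with a memset call. */
static void fill_byte_loop(void *dst, unsigned char value, size_t n)
{
    volatile unsigned char *p = dst;
    while (n--)
        *p++ = value;
}

/* The same fill done by one rep stosb: AL = value, RDI/EDI = destination,
   RCX/ECX = byte count. The microcode is free to use wide stores internally. */
static void fill_rep_stosb(void *dst, unsigned char value, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)   /* both registers are updated */
                     : "a"(value)
                     : "memory");
#else
    fill_byte_loop(dst, value, n);          /* portable fallback */
#endif
}
```

Both functions produce the same memory contents; only the time taken differs, and which one wins depends on the size and the CPU, as discussed in the rest of this answer.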

When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.

See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.

See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)

A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition was a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)

One major feature of so-called Fast String operations (rep movs and rep stos on Intel P6 and SnB-family CPUs) is that they avoid the read-for-ownership cache coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)

IDK how good AMD's implementation is.

(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)

See the optimization resources (esp. Agner Fog's guides) linked from the x86 tag wiki.

Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out-of-order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast string op, so any reordering is only visible between stores made by the string op, not between it and other stores. i.e. you still don't need sfence after rep movs.

So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.

For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.
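A size-based dispatch of the kind described above might look like the following sketch. The 128-byte cut-over is only the figure recalled in the paragraph above, not a tuned value, and the names REP_STOS_THRESHOLD, fill_path and fill_dispatch are hypothetical. The function returns which path it took so the choice can be observed.

```c
#include <stddef.h>

/* Assumed cut-over point, taken from the "under 128B" recollection above. */
enum { REP_STOS_THRESHOLD = 128 };

typedef enum { PATH_LOOP, PATH_REP_STOSB } fill_path;

static fill_path fill_dispatch(void *dst, unsigned char value, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (n >= (size_t)REP_STOS_THRESHOLD) {
        /* Large fill: pay the rep stosb startup cost once. */
        __asm__ volatile("rep stosb"
                         : "+D"(dst), "+c"(n)
                         : "a"(value)
                         : "memory");
        return PATH_REP_STOSB;
    }
#endif
    /* Small fill (or non-x86 build): a simple loop the compiler may vectorize. */
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = value;
    return PATH_LOOP;
}
```

A real memset would also special-case very small sizes and misaligned heads and tails; the right threshold has to be measured on the target CPU.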

Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.

See the code for an optimized memset/memcpy implementation for more info on what is done in practice. (e.g. Agner Fog's library).

If your CPU has the CPUID ERMSB bit set, then the rep movsb and rep stosb instructions are executed differently than on older processors.

See Intel Optimization Reference Manual, section 3.7.6 Enhanced REP MOVSB and REP STOSB operation (ERMSB).

Both the manual and my tests show that, on a Skylake CPU running 32-bit code, rep stosb beats generic 32-bit register moves only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code that you have shown (mov byte [edi],al; inc edi; dec ecx; jnz Clear) would be much faster, since the startup costs of rep stosb are very high - about 35 cycles. However, this speed difference has diminished on the Ice Lake microarchitecture, launched in September 2019, which introduced the Fast Short REP MOV (FSRM) feature. This feature can be tested by a CPUID bit. It was intended to make strings of 128 bytes and shorter quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented in 64-bit mode, not in 32-bit mode. At least on my i7-1065G7 CPU, rep movsb is quick for small strings only in 64-bit mode; in 32-bit mode, strings have to be at least 4KB for rep movsb to start outperforming other methods.
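For reference, both feature bits live in CPUID leaf 7, subleaf 0: ERMSB is EBX bit 9 and FSRM is EDX bit 4. The sketch below reads them with the __get_cpuid_count helper from the cpuid.h header shipped with GCC and Clang; the rep_features struct and both function names are hypothetical, and on non-x86 builds the query simply reports both features as absent.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

typedef struct {
    bool ermsb;  /* Enhanced REP MOVSB/STOSB */
    bool fsrm;   /* Fast Short REP MOV */
} rep_features;

/* Decode the two bits from the EBX and EDX values of CPUID leaf 7, subleaf 0. */
static rep_features decode_leaf7(uint32_t ebx, uint32_t edx)
{
    rep_features f;
    f.ermsb = ((ebx >> 9) & 1u) != 0;
    f.fsrm  = ((edx >> 4) & 1u) != 0;
    return f;
}

/* Ask the running CPU; reports "no features" if leaf 7 is unavailable. */
static rep_features query_rep_features(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return decode_leaf7(ebx, edx);
#endif
    return decode_leaf7(0, 0);
}
```

A memset or memcpy implementation would typically call something like query_rep_features() once at startup and cache the result, rather than executing CPUID on every call.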

To get the benefits of rep stosb on the processors with CPUID ERMSB bit, the following conditions should be met:

the destination buffer has to be aligned to a 16-byte boundary;
if the length is a multiple of 64, it can produce even higher performance;
the direction bit should be set "forward" (set by the cld instruction).
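The first two conditions are cheap to test before choosing a code path. Below is a minimal sketch with hypothetical helper names. The direction flag is not checked because it cannot be read from portable C; the System V and Microsoft x64 calling conventions both require it to be clear on function entry, so compiled code can normally assume "forward".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Condition 1: destination aligned to a 16-byte boundary. */
static bool dst_is_16_aligned(const void *dst)
{
    return ((uintptr_t)dst & 15u) == 0;
}

/* Condition 2 (optional bonus): length is a multiple of 64 bytes. */
static bool len_is_multiple_of_64(size_t len)
{
    return (len & 63u) == 0;
}

/* Both conditions together: the best case described in the list above. */
static bool ermsb_best_case(const void *dst, size_t len)
{
    return dst_is_16_aligned(dst) && len_is_multiple_of_64(len);
}
```

If the destination is not aligned, a common approach is to store the unaligned head separately and start rep stosb at the next 16-byte boundary.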

According to the Intel Optimization Manual, ERMSB begins to outperform memory stores via regular registers on Skylake when the length of the memory block is at least 128 bytes. As I wrote, ERMSB has a high internal startup cost - about 35 cycles. ERMSB begins to clearly outperform other methods, including AVX copy and fill, when the length is more than 2048 bytes. However, this mainly applies to the Skylake microarchitecture and is not necessarily the case for other CPU microarchitectures.

On some processors, but not on others, when the destination buffer is 16-byte aligned, REP STOSB using ERMSB can perform better than SIMD approaches, i.e., those using MMX or SSE registers. When the destination buffer is misaligned, memset() performance using ERMSB can degrade about 20% relative to the aligned case, for processors based on Intel microarchitecture code name Ivy Bridge. In contrast, a SIMD implementation of memset() will experience a smaller degradation when the destination is misaligned, according to Intel's optimization manual.

Benchmarks

I've done some benchmarks. The code was filling the same fixed-size buffer many times, so the buffer stayed in cache (L1, L2 or L3, depending on the size of the buffer). The number of iterations was chosen so that the total execution time would be about two seconds.
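The Megabytes/sec figures in the listings that follow can be reproduced from the block count, block size and elapsed time. The sketch below shows the arithmetic; it assumes 1 Megabyte = 2^20 bytes, which is an inference from the published numbers rather than something stated by the benchmark itself, and the function name is hypothetical.

```c
#include <stdint.h>

/* Throughput in Megabytes/sec, assuming 1 Megabyte = 1024 * 1024 bytes. */
static double megabytes_per_sec(uint64_t blocks, uint64_t block_bytes, double secs)
{
    double total_bytes = (double)(blocks * block_bytes);
    return total_bytes / secs / (1024.0 * 1024.0);
}
```

For example, megabytes_per_sec(1297920000, 16, 13.6022) comes out at about 1456, in line with the 1455.9909 Megabytes/sec listed for 16-byte blocks with REP STOSB.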

Skylake

On an Intel Core i5 6600 processor, released in September 2015 and based on the Skylake-S quad-core microarchitecture (3.30 GHz base frequency, 3.90 GHz Max Turbo frequency) with 4 x 32K L1 cache, 4 x 256K L2 cache and 6MB L3 cache, I could obtain ~100 GB/sec on REP STOSB with 32K blocks.

The memset() implementation that uses `REP STOSB`:

1297920000 blocks of 16 bytes: 13.6022 secs 1455.9909 Megabytes/sec
0648960000 blocks of 32 bytes: 06.7840 secs 2919.3058 Megabytes/sec
1622400000 blocks of 64 bytes: 16.9762 secs 5833.0883 Megabytes/sec
817587402 blocks of 127 bytes: 8.5698 secs 11554.8914 Megabytes/sec
811200000 blocks of 128 bytes: 8.5197 secs 11622.9306 Megabytes/sec
804911628 blocks of 129 bytes: 9.1513 secs 10820.6427 Megabytes/sec
407190588 blocks of 255 bytes: 5.4656 secs 18117.7029 Megabytes/sec
405600000 blocks of 256 bytes: 5.0314 secs 19681.1544 Megabytes/sec
202800000 blocks of 512 bytes: 2.7403 secs 36135.8273 Megabytes/sec
101400000 blocks of 1024 bytes: 1.6704 secs 59279.5229 Megabytes/sec
3168750 blocks of 32768 bytes: 0.9525 secs 103957.8488 Megabytes/sec (!), i.e., over 100 GB/s
2028000 blocks of 51200 bytes: 1.5321 secs 64633.5697 Megabytes/sec
413878 blocks of 250880 bytes: 1.7737 secs 55828.1341 Megabytes/sec
19805 blocks of 5242880 bytes: 2.6009 secs 38073.0694 Megabytes/sec

The memset() implementation that uses `MOVDQA [RCX],XMM0`:

1297920000 blocks of 16 bytes: 3.5795 secs 5532.7798 Megabytes/sec
0648960000 blocks of 32 bytes: 5.5538 secs 3565.9727 Megabytes/sec
1622400000 blocks of 64 bytes: 15.7489 secs 6287.6436 Megabytes/sec
817587402 blocks of 127 bytes: 9.6637 secs 10246.9173 Megabytes/sec
811200000 blocks of 128 bytes: 9.6236 secs 10289.6215 Megabytes/sec
804911628 blocks of 129 bytes: 9.4852 secs 10439.7473 Megabytes/sec
407190588 blocks of 255 bytes: 6.6156 secs 14968.1754 Megabytes/sec
405600000 blocks of 256 bytes: 6.6437 secs 14904.9230 Megabytes/sec
202800000 blocks of 512 bytes: 5.0695 secs 19533.2299 Megabytes/sec
101400000 blocks of 1024 bytes: 4.3506 secs 22761.0460 Megabytes/sec
3168750 blocks of 32768 bytes: 3.7269 secs 26569.8145 Megabytes/sec (!) i.e., 26 GB/s
2028000 blocks of 51200 bytes: 4.0538 secs 24427.4096 Megabytes/sec
413878 blocks of 250880 bytes: 3.9936 secs 24795.5548 Megabytes/sec
19805 blocks of 5242880 bytes: 4.5892 secs 21577.7860 Megabytes/sec

Please note that a drawback of using the XMM0 register is that it is only 128 bits (16 bytes) wide, while I could have used the 256-bit (32-byte) YMM0 register. Anyway, stosb uses the non-RFO protocol. Intel x86 CPUs have had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a non-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge. See https://stackoverflow.com/a/33905887/6910868 for more details and the source.

Anyway, even if you compare just the two methods that I have provided, and even though the second method is far from ideal, you can see that on 64-byte blocks rep stosb is slower, but starting from 128-byte blocks rep stosb begins to outperform the other method, and the difference is very significant for blocks of 512 bytes and longer, provided that you are clearing the same memory block again and again within the cache.

Therefore, for REP STOSB, the maximum speed was 103957 (one hundred three thousand nine hundred fifty-seven) Megabytes per second, while with MOVDQA [RCX],XMM0 it was just 26569 (twenty-six thousand five hundred sixty-nine) Megabytes per second.

As you see, the highest performance was on 32K blocks, which is equal to 32K L1 cache of the CPU on which I've made the benchmarks.

Ice Lake

REP STOSB vs AVX-512 store

I have also done tests on an Intel i7 1065G7 CPU, released in August 2019 (Ice Lake/Sunny Cove microarchitecture), Base frequency: 1.3 GHz, Max Turbo frequency 3.90 GHz. It supports AVX512F instruction set. It has 4 x 32K L1 instruction cache and 4 x 48K data cache, 4x512K L2 cache and 8 MB L3 cache.

Destination alignment

On 32K blocks zeroized by rep stosb, performance started at 175231 MB/s for a destination misaligned by 1 byte (e.g. $7FF4FDCFFFFF), quickly rose to 219464 MB/s for a destination aligned to 64 bytes (e.g. $7FF4FDCFFFC0), and then gradually rose to 222424 MB/s for a destination aligned to 256 bytes (e.g. $7FF4FDCFFF00). After that, the speed did not rise noticeably: even with the destination aligned to 32KB (e.g. $7FF4FDD00000), it was still 224850 MB/s.

There was no difference in speed between rep stosb and rep stosq.

On buffers aligned by 32K, the speed of AVX-512 store was exactly the same as for rep stosb, for loops starting from 2 stores in a loop (227777 MB/sec) and didn't grow for loops unrolled for 4 and even 16 stores. However, for a loop of just 1 store the speed was a little bit lower - 203145 MB/sec.

However, if the destination buffer was misaligned by just 1 byte, the speed of AVX512 store dropped dramatically, i.e. more than 2 times, to 93811 MB/sec, in contrast to rep stosb on similar buffers, which gave 175231 MB/sec.

Buffer Size

For 1K (1024 bytes) blocks, AVX-512 (205039 MB/s) was 3 times faster than rep stosb (71817 MB/s).
And for 512-byte blocks, AVX-512 performance was about the same as for larger block sizes (194181 MB/s), while rep stosb dropped to 38682 MB/s. At this block size, the difference was 5 times in favor of AVX-512.
For 2K (2048) blocks, AVX-512 had 210696 MB/s, while for rep stosb it was 123207 MB/s, almost twice as slow. Again, there was no difference between rep stosb and rep stosq.
For 4K (4096) blocks, AVX-512 had 225179 MB/s, while rep stosb: 180384 MB/s, almost catching up.
For 8K (8192) blocks, AVX-512 had 222259 MB/s, while rep stosb: 194358 MB/s, close!
For 32K (32768) blocks, AVX-512 had 228432 MB/s, rep stosb: 220515 MB/s - now at last! We are approaching the L1 data cache size of my CPU - 48KB! This is 220 Gigabytes per second!
For 64K (65536) blocks, AVX-512 had 61405 MB/s, rep stosb: 70395 MB/s!
Such a huge drop when we ran out of the L1 data cache! And it was evident that, from this point, rep stosb begins to outperform AVX-512 stores.
Now let's check the L2 cache size. For 512K blocks, AVX-512 made 62907 MB/s and rep stosb made 70653 MB/s. Here rep stosb still outperforms AVX-512. The difference is not yet significant, but the bigger the buffer, the bigger the difference.
Now let's take a huge buffer of 1GB (1073741824). With AVX-512, the speed was 14319 MB/s, while with rep stosb it was 27412 MB/s, i.e. about twice as fast as AVX-512!

I've also tried to use non-temporal instructions for filling the 32K buffers (vmovntdq [rcx], zmm31), but the performance was about 4 times slower than with plain vmovdqa64 [rcx], zmm31. How can I benefit from vmovntdq when filling memory buffers? Does the buffer have to be of some specific size for vmovntdq to have an advantage?

Also, if the destination buffers are aligned by at least 64 bytes, there is no performance difference between vmovdqa64 and vmovdqu64. Therefore, I do have a question: is the instruction vmovdqa64 only needed for debugging and safety, given that we have vmovdqu64?

Figure 1: Speed of iterative store to the same buffer, MB/s

block     AVX   stosb
-----   -----  ------
 0.5K  194181   38682
   1K  205039   71817
   2K  210696  123207
   4K  225179  180384
   8K  222259  194358
  32K  228432  220515
  64K   61405   70395
 512K   62907   70653
   1G   14319   27412

Summary on performance of multiple clearing the same memory block within the cache

rep stosb on Ice Lake CPUs begins to outperform AVX-512 stores only when repeatedly clearing the same memory buffer if that buffer is larger than the L1 data cache, i.e. 48K on the Intel i7 1065G7 CPU. On small memory buffers, AVX-512 stores are much faster: 3 times faster for 1KB, 5 times faster for 512 bytes.

However, the AVX-512 stores are susceptible to misaligned buffers, while rep stosb is not as sensitive to misalignment.

Therefore, I have figured out that rep stosb begins to outperform AVX-512 stores only on buffers that exceed the L1 data cache size, or 48KB in the case of the Intel i7 1065G7 CPU. This conclusion is valid at least on Ice Lake CPUs. An earlier Intel recommendation that string copy begins to outperform AVX copy starting from 2KB buffers should also be re-tested for newer microarchitectures.

Clearing different memory buffers, each only once

My previous benchmarks were filling the same buffer many times in a row. A better benchmark might be to allocate many different buffers and fill each buffer only once, so that the cache does not skew the result.

In this scenario, there is not much difference at all between rep stosb and AVX-512 stores. The only thing that matters is that the total data size does not come close to the physical memory limit, at least under Windows 10 64-bit. In the following benchmarks, the total data size was at most 8 GB, with 16 GB of total physical RAM. When I allocated about 12 GB, performance dropped about 20 times, regardless of the method. Windows began to discard purged memory pages, and probably did some other work when the memory was about to be full. The L3 cache size of 8MB on the i7 1065G7 CPU did not seem to matter for the benchmarks at all. All that matters is that you don't come close to the physical memory limit, and how such situations are handled depends on your operating system. As I said, under Windows 10, if I took just half of the physical memory, it was OK, but if I took 3/4 of the available memory, my benchmark slowed down 20 times. I didn't even try to take more than 3/4. As I said, the total memory size is 16 GB. The amount available, according to the task manager, was 12 GB.

Here is the benchmark of the speed of filling various blocks of memory totalling 8 GB with zeros (in MB/sec) on the i7 1065G7 CPU with 16 GB total memory, single-threaded. By "AVX" I mean "AVX-512" normal stores, and by "stosb" I mean "rep stosb".

Figure 2: Speed of store to the multiple buffers, once each, MB/s

block    AVX  stosb
-----   ----   ----
 0.5K   3641   2759
   1K   4709   3963
   2K  12133  13163
   4K   8239  10295
   8K   3534   4675
  16K   3396   3242
  32K   3738   3581
  64K   2953   3006
 128K   3150   2857
 256K   3773   3914
 512K   3204   3680
1024K   3897   4593
2048K   4379   3234
4096K   3568   4970
8192K   4477   5339

Conclusion on clearing the memory within the cache

If your memory is not already in the cache, then the performance of AVX-512 stores and rep stosb is about the same when you need to fill memory with zeros. It is the cache that matters, not the choice between these two methods.

The use of non-temporal store to clear the memory which is not in the cache

I was zeroizing 6-10 GB of memory split into a sequence of buffers aligned by 64 bytes. No buffers were zeroized twice. Smaller buffers had some overhead, and I had only 16 GB of physical memory, so I zeroized less memory in total with smaller buffers. I ran the tests with buffer sizes from 256 bytes up to 8 GB per buffer. I used 3 different methods:

Normal AVX-512 store by vmovdqa64 [rcx+imm], zmm31 (a loop of 4 stores and then compare the counter);
Non-temporal AVX-512 store by vmovntdq [rcx+imm], zmm31 (same loop of 4 stores);
rep stosb.
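The non-temporal variant can be sketched in C with intrinsics. To keep the example buildable without AVX-512 support, it swaps in 16-byte SSE2 non-temporal stores (_mm_stream_si128, i.e. movntdq on xmm registers) for the 64-byte vmovntdq zmm stores used in the benchmark; the structure is the same, a loop of 4 stores followed by the counter check. The function name is hypothetical, and non-SSE2 builds fall back to memset. Because non-temporal stores are weakly ordered, the sketch ends with sfence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Zero n bytes at dst with non-temporal stores.
   Preconditions: dst is 16-byte aligned and n is a multiple of 64. */
static void zero_nontemporal(void *dst, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0 && (n & 63u) == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    const size_t vectors = n / sizeof(__m128i);
    for (size_t i = 0; i < vectors; i += 4) {   /* 4 stores = 64 bytes per iteration */
        _mm_stream_si128(p + i + 0, zero);
        _mm_stream_si128(p + i + 1, zero);
        _mm_stream_si128(p + i + 2, zero);
        _mm_stream_si128(p + i + 3, zero);
    }
    _mm_sfence();   /* order the NT stores before any later stores */
#else
    memset(dst, 0, n);   /* fallback for targets without SSE2 */
#endif
}
```

Non-temporal stores bypass the cache, which is why they only pay off when the data is not going to be read again soon and the buffers are too large to stay cached anyway.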

For small buffers, the normal AVX-512 store was the winner. Then, starting from 4KB, the non-temporal store took the lead, while rep stosb still lagged behind.

Then, from 256KB, rep stosb outperformed the normal AVX-512 store, but not the non-temporal store, and from then on the situation mostly didn't change: the winner was the non-temporal AVX-512 store, then came rep stosb, and then the normal AVX-512 store.

Figure 3. Speed of store to the multiple buffers, once each, MB/s by three different methods: normal AVX-512 store, nontemporal AVX-512 store and rep stosb.

Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 2.90s, 2.30 GB/s by normal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by nontemporal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by rep stosb

Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.06s, 2.62 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.02s, 2.65 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.66s, 2.18 GB/s by rep stosb

Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.10s, 2.87 GB/s by normal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.37s, 2.64 GB/s by nontemporal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 4.85s, 1.83 GB/s by rep stosb

Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.45s, 2.73 GB/s by normal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.79s, 2.48 GB/s by nontemporal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 4.83s, 1.95 GB/s by rep stosb

Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by normal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 3.46s, 2.81 GB/s by nontemporal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by rep stosb

Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.24s, 3.04 GB/s by normal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 2.65s, 3.71 GB/s by nontemporal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.35s, 2.94 GB/s by rep stosb

Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.37s, 2.94 GB/s by normal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 2.73s, 3.63 GB/s by nontemporal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.53s, 2.81 GB/s by rep stosb

Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.19s, 3.12 GB/s by normal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 2.64s, 3.77 GB/s by nontemporal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.44s, 2.90 GB/s by rep stosb

Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.08s, 3.24 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 2.58s, 3.86 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.29s, 3.03 GB/s by rep stosb

Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.22s, 3.10 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 2.49s, 4.01 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.26s, 3.07 GB/s by rep stosb

Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.52s, 3.97 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 1.98s, 5.06 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.43s, 4.11 GB/s by rep stosb

Zeroized 10.00 GB: 20475 blocks of 512 KB for 2.15s, 4.65 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.70s, 5.87 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.81s, 5.53 GB/s by rep stosb

Zeroized 10.00 GB: 10238 blocks of 1 MB for 2.18s, 4.59 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.50s, 6.68 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.63s, 6.13 GB/s by rep stosb

Zeroized 10.00 GB: 5119 blocks of 2 MB for 2.02s, 4.96 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.59s, 6.30 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.54s, 6.50 GB/s by rep stosb

Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.90s, 5.26 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.37s, 7.29 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.47s, 6.81 GB/s by rep stosb

Zeroized 9.99 GB: 1279 blocks of 8 MB for 2.04s, 4.90 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.51s, 6.63 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.56s, 6.41 GB/s by rep stosb

Zeroized 9.98 GB: 639 blocks of 16 MB for 1.93s, 5.18 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.37s, 7.30 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.45s, 6.89 GB/s by rep stosb

Zeroized 9.97 GB: 319 blocks of 32 MB for 1.95s, 5.11 GB/s by normal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.41s, 7.06 GB/s by nontemporal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.42s, 7.02 GB/s by rep stosb

Zeroized 9.94 GB: 159 blocks of 64 MB for 1.85s, 5.38 GB/s by normal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.33s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.40s, 7.09 GB/s by rep stosb

Zeroized 9.88 GB: 79 blocks of 128 MB for 1.99s, 4.96 GB/s by normal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.42s, 6.97 GB/s by nontemporal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.55s, 6.37 GB/s by rep stosb

Zeroized 9.75 GB: 39 blocks of 256 MB for 1.83s, 5.32 GB/s by normal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.32s, 7.38 GB/s by nontemporal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.64s, 5.93 GB/s by rep stosb

Zeroized 9.50 GB: 19 blocks of 512 MB for 1.89s, 5.02 GB/s by normal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.31s, 7.27 GB/s by nontemporal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.42s, 6.71 GB/s by rep stosb

Zeroized 9.00 GB: 9 blocks of 1 GB for 1.76s, 5.13 GB/s by normal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.26s, 7.12 GB/s by nontemporal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.29s, 7.00 GB/s by rep stosb

Zeroized 8.00 GB: 4 blocks of 2 GB for 1.48s, 5.42 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.07s, 7.49 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.15s, 6.94 GB/s by rep stosb

Zeroized 8.00 GB: 2 blocks of 4 GB for 1.48s, 5.40 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.08s, 7.40 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.14s, 7.00 GB/s by rep stosb

Zeroized 8.00 GB: 1 blocks of 8 GB for 1.50s, 5.35 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.07s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.21s, 6.63 GB/s by rep stosb

Avoiding AVX-SSE transition penalties

For all the AVX-512 code, I've used the ZMM31 register, because SSE registers are numbered from 0 to 15, so the AVX-512 registers 16 to 31 do not have SSE counterparts and thus do not incur the transition penalty.
