Getting fast performance from an STM32 MCU

The question here really is: what is the machine code you're generating from the C program, and how does it differ from what you'd expect.

If you didn't have access to the original code, this would've been an exercise in reverse engineering (basically something starting with: radare2 -A arm image.bin; aaa; VV), but you've got the code so this makes it all easier.

First, compile it with the -g flag added to the CFLAGS (same place where you also specify -O1). Then, look at the generated assembly:

arm-none-eabi-objdump -S yourprog.elf

Note that both the name of the objdump binary and the name of your intermediate ELF file might be different on your system.

Usually, you can also skip the step where GCC invokes the assembler and look at the assembly file directly. Just add -S to the GCC command line. That will normally break your build, though, so you'd most probably do it outside your IDE.

I compiled a slightly patched version of your code to assembly:

# -O1: your optimization level
# -S:  stop after generating assembly, i.e. don't run `as`
arm-none-eabi-gcc \
    -O1 \
    -S \
    -I/path/to/CMSIS/ST/STM32F3xx/ -I/path/to/CMSIS/include \
    test.c

and got the following (excerpt, full code under link above):

.L5:
    ldr r2, [r3, #24]
    orr r2, r2, #1024
    str r2, [r3, #24]
    ldr r2, [r3, #40]
    orr r2, r2, #1024
    str r2, [r3, #40]
    b   .L5

This is a loop (note the .L5 label at the beginning and the unconditional jump back to .L5 at the end).

What we see here is that we

  • first ldr (load register): load r2 with the value at the memory address r3 + 24 bytes. Offset 24 (0x18) should be BSRR in the STM32F3 GPIO register map.
  • Then OR r2 with the constant 1024 == (1<<10), which sets bit 10 of that value, and write the result back to r2.
  • Then str (store) the result to the memory location we read from in the first step,
  • and then repeat the same for a different memory location at offset 40 (0x28), which should be BRR.
  • Finally b (branch) back to the first step.
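
The offsets in the listing can be checked on a PC with a minimal sketch like the following. The struct `gpio_regs_t` is a hypothetical stand-in for the CMSIS `GPIO_TypeDef`, not the real header; it follows the register order of the STM32F3 GPIO block and assumes BSRR is declared as one 32-bit word (the header used in the question splits it into BSRRL/BSRRH). `pulse_rmw` is the kind of C that would plausibly compile to the two ldr/orr/str groups above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the GPIO register block; not the real CMSIS header. */
typedef struct {
    volatile uint32_t MODER;   /* 0x00 */
    volatile uint32_t OTYPER;  /* 0x04 */
    volatile uint32_t OSPEEDR; /* 0x08 */
    volatile uint32_t PUPDR;   /* 0x0C */
    volatile uint32_t IDR;     /* 0x10 */
    volatile uint32_t ODR;     /* 0x14 */
    volatile uint32_t BSRR;    /* 0x18 = 24 (assumed 32-bit here) */
    volatile uint32_t LCKR;    /* 0x1C */
    volatile uint32_t AFR[2];  /* 0x20, 0x24 */
    volatile uint32_t BRR;     /* 0x28 = 40 */
} gpio_regs_t;

/* Byte offsets that the compiler would use as ldr/str immediates. */
static size_t bsrr_offset(void) { return offsetof(gpio_regs_t, BSRR); }
static size_t brr_offset(void)  { return offsetof(gpio_regs_t, BRR); }

/* One loop body: two read-modify-write accesses, as in the listing. */
static void pulse_rmw(gpio_regs_t *port)
{
    port->BSRR |= 1u << 10; /* ldr, orr #1024, str at offset 24 */
    port->BRR  |= 1u << 10; /* ldr, orr #1024, str at offset 40 */
}
```

On the host the struct is ordinary memory, so this only checks the layout and the shape of the accesses, not the hardware behaviour.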

So we have 7 instructions, not three, to start with. Every instruction except the b occurs twice, so together those contribute an even number of cycles; the b occurs only once and is thus very likely where the odd cycle count comes from (the total is 13, so something must be odd). The odd numbers below 13 are 1, 3, 5, 7, 9 and 11, and we can rule out anything larger than 13-6 = 7 (the other six instructions need at least one cycle each), so the b takes 1, 3, 5, or 7 CPU cycles.

Being who I am, I looked at ARM's documentation of the instructions and how many cycles they take on the M3:

  • ldr takes 2 cycles (in most cases)
  • orr takes 1 cycle
  • str takes 2 cycles
  • b takes 2 to 4 cycles. We know it must be an odd number, so it must take 3 here.

That all lines up with your observation:

$$\begin{align} 13 &= 2\cdot(&c_\mathtt{ldr}&+c_\mathtt{orr}&+c_\mathtt{str})&+c_\mathtt{b}\\ &= 2\cdot(&2&+1&+2)&+3\\ &= 2\cdot &5 &&&+3 \end{align}$$
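
The same bookkeeping as a small C sketch, in case you want to play with other instruction mixes. The cycle counts are the ones quoted above for the M3 (with 3 for the branch, as inferred here); the function name `loop_cycles` is made up for illustration.

```c
/* Cycle counts as quoted above for the Cortex-M3; CYC_B is the inferred value. */
enum { CYC_LDR = 2, CYC_ORR = 1, CYC_STR = 2, CYC_B = 3 };

/* Total cycles for one loop iteration, given how often each instruction occurs. */
static int loop_cycles(int n_ldr, int n_orr, int n_str, int n_b)
{
    return n_ldr * CYC_LDR + n_orr * CYC_ORR + n_str * CYC_STR + n_b * CYC_B;
}
```

`loop_cycles(2, 2, 2, 1)` reproduces the 13 cycles of the loop above. Treat results for other mixes as rough estimates only; real timing also depends on things like flash wait states and bus behaviour.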


As the above calculation shows, there will hardly be a way of making your loop any faster – the output pins on ARM processors are usually memory mapped, not CPU core registers, so you have to go through the usual load – modify – store routine if you want to do anything with those.

What you could of course do is avoid reading the pin's value on every loop iteration (|= implicitly has to read) and instead write the value of a local variable to it, toggling that variable on every iteration.
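
A minimal sketch of that idea, assuming the target is a plain read/write output register such as ODR. The names `toggle_step`, `odr` and `shadow` are hypothetical. Note the catch: the store overwrites the whole register, so this only works if the local copy holds the desired state of every pin on that port.

```c
#include <stdint.h>

/* Flip the bits in `mask` in a local shadow copy, then store the copy to
 * the output register. The register is only written, never read. */
static void toggle_step(volatile uint32_t *odr, uint32_t *shadow, uint32_t mask)
{
    *shadow ^= mask;
    *odr = *shadow;
}
```

In the loop you would call something like `toggle_step(&GPIOE->ODR, &shadow, 1u << 10);` once per iteration, with `shadow` initialised to the port's current output state.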

I get the impression that you might be familiar with 8-bit micros and would therefore try to read only 8-bit values, store them in local 8-bit variables, and write them in 8-bit chunks. Don't. ARM is a 32-bit architecture, and extracting 8 bits of a 32-bit word might take additional instructions. If you can, just read the whole 32-bit word, modify what you need, and write it back as a whole. Whether that is possible of course depends on what you're writing to, i.e. the layout and functionality of your memory-mapped GPIO. Consult the STM32F3 datasheet/user's guide for info on what is stored in the 32-bit word containing the bit you want to toggle.
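
As a rough sketch of "read the whole word, modify, write back" (the helper names `set_field` and `update_reg` are made up, and whether a given register tolerates this depends on its layout, as said above):

```c
#include <stdint.h>

/* Replace the bits selected by `mask` with `value`, leaving the rest alone. */
static uint32_t set_field(uint32_t word, uint32_t mask, uint32_t value)
{
    return (word & ~mask) | (value & mask);
}

/* Read the whole 32-bit register once, modify it, write it back once. */
static void update_reg(volatile uint32_t *reg, uint32_t mask, uint32_t value)
{
    uint32_t tmp = *reg;
    tmp = set_field(tmp, mask, value);
    *reg = tmp;
}
```

Everything stays in full 32-bit words, so the compiler has no reason to emit extra byte-extraction or masking instructions beyond the ones you asked for.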


Now, I tried to reproduce your issue with the "low" period getting longer, but I simply couldn't – the loop looks exactly the same with -O3 as with -O1 with my compiler version. You'll have to do that yourself! Maybe you're using some ancient version of GCC with suboptimal ARM support.


The BSRR and BRR registers are for setting and resetting individual port bits:

GPIO port bit set/reset register (GPIOx_BSRR)

...

(x = A..H) Bits 15:0

BSy: Port x set bit y (y= 0..15)

These bits are write-only. A read to these bits returns the value 0x0000.

0: No action on the corresponding ODRx bit

1: Sets the corresponding ODRx bit

As you can see, reading these registers always gives 0, therefore what your code

GPIOE->BSRRL |= GPIO_BSRR_BS_10;
GPIOE->BRR |= GPIO_BRR_BR_10;

does effectively is GPIOE->BRR = 0 | GPIO_BRR_BR_10, but the optimizer doesn't know that, so it generates a sequence of LDR, ORR, STR instructions instead of a single store.

You can avoid the expensive read-modify-write operation by simply writing

GPIOE->BSRRL = GPIO_BSRR_BS_10;
GPIOE->BRR = GPIO_BRR_BR_10;
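
To see why the two forms are equivalent, here is a small host-side model of the behaviour quoted from the manual. Everything in it (`mock_port_t` and the `mock_*` functions) is hypothetical and only models bits 15:0 of BSRR and BRR; it is a sketch of the register semantics, not of the real hardware.

```c
#include <stdint.h>

/* Hypothetical host-side model of one GPIO port; only ODR holds state. */
typedef struct { uint32_t odr; } mock_port_t;

/* BSRR is write-only: a read always returns 0. */
static uint32_t mock_bsrr_read(const mock_port_t *p) { (void)p; return 0u; }

/* Writing 1 bits to BSRR[15:0] sets the matching ODR bits. */
static void mock_bsrr_write(mock_port_t *p, uint32_t v) { p->odr |= v & 0xFFFFu; }

/* Writing 1 bits to BRR[15:0] clears the matching ODR bits. */
static void mock_brr_write(mock_port_t *p, uint32_t v) { p->odr &= ~(v & 0xFFFFu); }

/* What `BSRR |= bit` amounts to: read (always 0), OR, write. */
static void set_pin_rmw(mock_port_t *p, uint32_t bit)
{
    mock_bsrr_write(p, mock_bsrr_read(p) | bit);
}

/* What `BSRR = bit` amounts to: just the write. */
static void set_pin_store(mock_port_t *p, uint32_t bit)
{
    mock_bsrr_write(p, bit);
}
```

Both `set_pin_rmw` and `set_pin_store` leave the port in the same state; the read in the first one contributes nothing, which is why the plain assignment is safe here.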

You might get some further improvement by aligning the loop to an address evenly divisible by 8. Try putting one or more asm("nop"); instructions before the while(1) loop.