Fastest polling loop - how can I trim 1 CPU cycle?

If I understand the question correctly, it's not necessarily the number of cycles per loop iteration that needs to be reduced, but the number of cycles between consecutive samples (i.e. LDR instructions). Nothing says there can be only one LDR per iteration, so you can try something like this:

    ldrb    r1, [r0]

loop:
    cbz     r1, out
    ldrb    r2, [r0]
    cbz     r2, out
    ldrb    r1, [r0]
    b       loop

out:

The spacing between the two LDRB instructions varies so the samples aren't uniformly spaced.

Unrolling may also delay the exit from the loop slightly, but from the problem description I can't tell whether that matters.

I happen to have access to a cycle-accurate M7 model. Once the pipeline stabilises, your original loop runs on M7 in 3 cycles per iteration (meaning one LDR every 3 cycles), while the proposed loop above runs in 4 cycles per iteration but contains two LDRs (so one LDR every 2 cycles on average). The sampling rate is definitely improved.

To give credit, unrolling with CBZ as a break was proposed by @Peter Cordes in a comment.

Admittedly, the M3 will be slower, but it's still worth a shot if it's the sampling rate you're after.

You can also check whether the load width changes anything (LDRB, as in the code above, versus LDR), although I don't expect it to.
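
For that comparison, the word-load counterpart of the first loop would look like the sketch below. It assumes that r0 holds a word-aligned address and that the whole 32-bit word reads as zero when the exit condition is met; neither is stated in the question, so adjust as needed. I haven't measured this variant.

    ldr     r1, [r0]        /* initial sample (32-bit load) */

loop:
    cbz     r1, out         /* exit if the previous sample was zero */
    ldr     r2, [r0]        /* second sample of this iteration */
    cbz     r2, out
    ldr     r1, [r0]        /* sample tested at the top of the next iteration */
    b       loop

out: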

UPD: Here is another 2-LDR version of the loop, which completes in 3 cycles per iteration on M7; you can try it out of interest. The separate CBZ exits also make it easy to balance the paths after the loop:

    ldr     r1, [r0]

loop:
    ldr     r2, [r0]
    cbz     r1, out_slow
    cbz     r2, out_fast
    ldr     r1, [r0]
    b       loop

out_fast:
    /* NOPs as required */

out_slow: