What are flash memory wait states?

Wait states are added to the memory access cycle initiated by the CPU, so it is indeed the CPU that has to wait for the slower Flash. The memory controller signals "not ready" to the CPU for a number of cycles (0 to 3), and while it does so the CPU remains in its current state, i.e. it has put the Flash address on the bus but has not yet read the data. Only when the memory controller signals "data ready" will the CPU read from the data bus and continue its instruction (latching the data into a register or into RAM).
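This handshake can be modelled in a few lines. The toy C model below (names are mine, not from any vendor API) simply counts how many CPU cycles one read costs when the controller inserts wait states:

```c
#include <stdint.h>

/* Toy model of a CPU read with wait states: after the address is
   driven, the memory controller holds "not ready" for `wait_states`
   cycles, so the access costs 1 + wait_states CPU cycles in total. */
typedef struct {
    unsigned wait_states;  /* 0..3, as in the answer above */
    uint32_t data;         /* value the flash will eventually drive */
} flash_model_t;

static uint32_t cpu_read(const flash_model_t *f, unsigned *cycles)
{
    *cycles = 1;                      /* cycle that drives the address */
    for (unsigned i = 0; i < f->wait_states; i++)
        (*cycles)++;                  /* CPU stalls while "not ready" */
    return f->data;                   /* latched once "data ready" */
}
```

With 3 wait states a single read costs 4 CPU cycles; with 0 wait states it costs 1, which is the zero-wait-state case fast memories allow.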


To amplify stevenvh's answer, any type of logic, when given an input signal, will take some time to produce an output signal; memory is often very slow compared with other logic. Often, there will be a guarantee that the output signal will become valid within a certain amount of time, but that's it. In particular, it's possible that the signal might change several times within that interval, and there will be no indication, prior to the end of that interval, that the signal has achieved its final "correct" value.

When a typical microcontroller or microprocessor reads a byte (or word, or whatever unit) of memory, it generates an address and, some time later, looks at the value output by the memory and acts upon it. Between the time the controller generates the address and the time it looks at the value from memory, it doesn't care when or whether the output signals from the memory change. On the other hand, if the signal from memory hasn't stabilized to its final value by the time the controller looks at it, the controller will misread the memory as having held whatever value was being output at the moment it looked.

Normally the controller would look at the value from memory as soon as it was ready to do something with it, but if the memory's value wouldn't be ready then, that might not work. Consequently, many controllers have an option to wait a bit longer after they're ready to process data from memory, to ensure that the output from memory is actually valid. Note that adding such delay will slow things down (the controller would have been happy to act on the data from memory sooner), but will not affect correctness of operation (unless things are slowed down so much that other timing obligations cannot be met).
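As a sketch of that trade-off, the minimum number of wait states follows directly from the memory's guaranteed access time and the clock period. The helper below is illustrative (the names are mine, not a vendor API):

```c
/* Minimal sketch: how many wait states a given flash access time
   forces at a given CPU clock period. A zero-wait-state access
   already takes one cycle, so anything beyond that first cycle
   must be covered by wait states. */
static unsigned wait_states_needed(unsigned t_access_ns, unsigned t_clk_ns)
{
    /* cycles needed to cover t_access, rounded up */
    unsigned cycles = (t_access_ns + t_clk_ns - 1) / t_clk_ns;
    return cycles > 1 ? cycles - 1 : 0;
}
```

For example, a flash with a 40 ns access time on a 100 MHz bus (10 ns period) needs 4 cycles per access, i.e. 3 wait states; the same flash on a 25 MHz bus (40 ns period) needs none.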


The processor might need to stall on memory, but a clever design wouldn't need to.

I think the key technology you're not aware of is burst/page mode access. That allows the bandwidth of memory accesses to come very close to the processor speed (though Flash is probably still the bottleneck, since I've never seen a Flash-based MCU that runs at more than 200 MHz).

However, the latency stays the same. For example, for the STM32F4 MCUs that I'm using, the number of wait states = floor(clockSpeed / 30 MHz). That means the latency is always about 33 ns, regardless of clock speed. There's a saying: "Money can buy bandwidth, but latency is forever..."
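That rule of thumb is easy to encode. A sketch (note that ST's actual table in RM0090 uses inclusive 30 MHz bands, so a clock sitting exactly on a band edge needs one wait state fewer than a plain floor suggests):

```c
/* Rule of thumb from the answer above for STM32F4 at 3.3 V:
   one wait state per full 30 MHz of HCLK. Exact multiples of
   30 MHz are a boundary case in ST's table. */
static unsigned stm32f4_flash_ws(unsigned hclk_hz)
{
    return hclk_hz / 30000000u;   /* floor(clockSpeed / 30 MHz) */
}
```

At the part's maximum of 168 MHz this gives 5 wait states, i.e. each flash access spans 6 clock cycles, which is where the roughly constant ~33 ns latency comes from.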

Even if the Flash bandwidth weren't sufficient to keep the CPU busy, you can easily design a code cache that stores and prefetches instructions that are expected to execute. ST gives a hint about this for their STM32F4 MCUs (168 MHz):

Thanks to the ART accelerator and the 128-bit Flash memory, the number of wait states given here does not impact the execution speed from Flash memory since the ART accelerator allows to achieve a performance equivalent to 0 wait state program execution.
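On the STM32F4 this boils down to a few bits in the FLASH_ACR register. The sketch below uses the bit positions from ST's RM0090 but operates on a caller-supplied word rather than the real register (which lives at 0x40023C00), so the bit manipulation can run anywhere:

```c
#include <stdint.h>

/* FLASH_ACR bit layout per ST's RM0090 (STM32F405/407). */
#define FLASH_ACR_LATENCY_MSK 0x7u       /* wait-state count, bits 2:0 */
#define FLASH_ACR_PRFTEN (1u << 8)       /* prefetch enable */
#define FLASH_ACR_ICEN   (1u << 9)       /* ART instruction cache enable */
#define FLASH_ACR_DCEN   (1u << 10)      /* ART data cache enable */

/* Returns the ACR value configuring `wait_states` and enabling the
   ART accelerator features the quote refers to. On hardware you
   would write this back to the volatile register. */
static uint32_t flash_acr_setup(uint32_t acr, unsigned wait_states)
{
    acr &= ~FLASH_ACR_LATENCY_MSK;
    acr |= wait_states & FLASH_ACR_LATENCY_MSK;
    acr |= FLASH_ACR_PRFTEN | FLASH_ACR_ICEN | FLASH_ACR_DCEN;
    return acr;
}
```

The latency bits must be raised before increasing the clock and lowered only after decreasing it, otherwise the CPU would sample flash data before it is valid.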

Actually, the statement also suggests that burst mode isn't necessary and that a very wide memory interface is sufficient. But the idea is the same (using parallelism to hide latency): on chip, wires are free, so a 128-bit memory makes sense.
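A back-of-envelope check of that parallelism argument (all numbers here are illustrative assumptions, not datasheet figures):

```c
/* Sequential fetch bandwidth of a wide flash line, assuming one
   line every (1 + wait_states) cycles with no prefetch overlap. */
static unsigned flash_bytes_per_sec(unsigned hclk_hz, unsigned ws,
                                    unsigned line_bytes)
{
    return hclk_hz / (1u + ws) * line_bytes;
}
```

At 168 MHz with 5 wait states, a 16-byte (128-bit) line yields 448 MB/s, against a worst-case core demand of about 672 MB/s for pure 32-bit instructions, so the wide interface alone nearly closes the gap, and the ART prefetcher overlapping the next line fetch with execution covers the rest.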