FPGA CPUs, how to find the max speed?

The speed of a design is limited by several things. The biggest will most likely be the propagation delay through the combinatorial logic in your design; the slowest such register-to-register path is called the critical path. If you use a fast FPGA and write your HDL very carefully, you could probably hit 700 MHz on something like a Virtex UltraScale+. On a lower-end FPGA, for example a Spartan-6, a reasonable figure is probably more like 250 MHz. Getting there requires pipelining everywhere so you have the absolute minimum amount of combinatorial logic between stateful components (minimize levels of logic), low fan-outs (minimize loading on logic elements), and no congested rat's nests (efficient routing paths).

The fabric logic of different FPGAs will have different timing parameters. Faster, more expensive FPGAs will have smaller delays and as a result can achieve higher clock frequencies with the same design, or run a more complex or less heavily pipelined design at the same frequency. Performance within a particular process can be similar: for example, Kintex UltraScale and Virtex UltraScale are made on the same process and have similar cell and routing delays. It is impossible to say how fast a given design will be without running it through the toolchain and looking at the timing reports from the static timing analysis.

When doing toolchain runs to determine maximum clock speed, bear in mind that the tools are timing-driven: they will try to meet the specified timing constraints. If no timing constraints are specified, the result can be very poor, as the tools will not try to optimize the design for speed. Generally, the tools will have to be run several times with different clock period constraints to find the maximum achievable clock frequency.
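As a rough sketch of that "run it several times" loop, here is a bisection over the clock period constraint. The `meets_timing` callable is hypothetical: it stands in for whatever script launches your toolchain with a given period and reports whether timing closed. It also assumes that pass/fail is monotonic in the period, which real tools only approximately satisfy, so treat the result as a starting point rather than a guarantee.

```python
from typing import Callable


def find_min_period(meets_timing: Callable[[float], bool],
                    lo_ns: float, hi_ns: float,
                    resolution_ns: float = 0.05) -> float:
    """Bisect for the shortest clock period that still closes timing.

    lo_ns is a period assumed to fail, hi_ns a period assumed to pass.
    Each call to meets_timing() represents one full toolchain run.
    """
    if not meets_timing(hi_ns):
        raise ValueError("design fails timing even at the relaxed period")
    while hi_ns - lo_ns > resolution_ns:
        mid = (lo_ns + hi_ns) / 2.0
        if meets_timing(mid):
            hi_ns = mid   # passed: try a tighter constraint
        else:
            lo_ns = mid   # failed: back off
    return hi_ns


# Stand-in for a real toolchain run: pretend the design closes at 3.7 ns.
def fake_run(period_ns: float) -> bool:
    return period_ns >= 3.7


best = find_min_period(fake_run, lo_ns=1.0, hi_ns=10.0)
print(f"best period {best:.2f} ns, about {1000.0 / best:.1f} MHz")
```

Since each probe is a full synthesis and place-and-route run, the resolution is worth keeping coarse; a handful of runs is usually enough to bracket the answer.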

If you can optimize your design so that the critical path is not the limit, then you'll run into limitations in the clock generation and distribution (PLLs, DCMs, clock buffers, and global clock nets). These limits can be found in part datasheets, but getting near them with a non-trivial design is difficult. I have run stuff on a Virtex UltraScale at 500 MHz, but this was only a handful of counters to provide triggering signals to other components.

You synthesize your design in the target technology (a particular FPGA) and let the static timing analysis tools tell you what the minimum clock period is.

Or, you add constraints to the design in the first place, and then the tools will let you know whether they're met or not.
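If you take the constraint route, the timing report gives you a worst-case setup slack for the constrained clock, and you can turn that into an estimated maximum frequency. A minimal sketch, assuming a single clock domain and that only setup slack matters (the function name and the example numbers are made up for illustration):

```python
def fmax_from_slack_mhz(constraint_period_ns: float,
                        worst_slack_ns: float) -> float:
    """Estimate Fmax from the constrained period and the worst setup slack.

    Positive slack means the design passed with margin; negative slack
    means it failed by that amount.
    """
    achievable_period_ns = constraint_period_ns - worst_slack_ns
    if achievable_period_ns <= 0:
        raise ValueError("slack is not smaller than the period; check inputs")
    return 1000.0 / achievable_period_ns


# Constrained to 5 ns (200 MHz) and passed with 0.8 ns to spare.
print(f"{fmax_from_slack_mhz(5.0, 0.8):.1f} MHz")
# Constrained to 5 ns and failed by 0.5 ns.
print(f"{fmax_from_slack_mhz(5.0, -0.5):.1f} MHz")
```

This is only an estimate for that particular run: re-running with the tighter constraint may place and route the design differently.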

The speed at which your CPU will run is set by the longest flop-to-flop delay in your synthesized design. The flop-to-flop delay includes clock-to-Q, routing, logic/LUT, and flop setup time. Added together along the slowest path, these form the critical path of your design, which you can inspect in the timing report output by the place-and-route tool.
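To make the arithmetic concrete, here is a small sketch that sums those four components for a couple of paths and picks the slowest one. The path names and delay values are invented for illustration, and the model ignores clock skew and clock uncertainty, which a real timing report would include.

```python
from dataclasses import dataclass


@dataclass
class PathDelay:
    """Delay components of one flop-to-flop path, in nanoseconds."""
    clock_to_q_ns: float
    logic_ns: float
    routing_ns: float
    setup_ns: float

    def total_ns(self) -> float:
        return (self.clock_to_q_ns + self.logic_ns
                + self.routing_ns + self.setup_ns)


# Hypothetical paths with made-up numbers.
paths = {
    "alu_carry": PathDelay(0.4, 1.8, 1.5, 0.1),
    "regfile_read": PathDelay(0.4, 0.9, 1.1, 0.1),
}

# The critical path is simply the path with the largest total delay.
critical_name, critical = max(paths.items(), key=lambda kv: kv[1].total_ns())
fmax_mhz = 1000.0 / critical.total_ns()
print(f"critical path: {critical_name}, "
      f"{critical.total_ns():.2f} ns, about {fmax_mhz:.0f} MHz")
```

Note that speeding up any path other than the critical one does nothing for the clock rate, which is why the timing report lists the worst paths first.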

There are entire design disciplines devoted to making architectures that minimize this delay to get the most out of a given process: pipelining, parallel execution, speculative execution, and so forth. It's a fascinating, involving task, wringing that last ounce of performance out of an FPGA (or, for that matter, an ASIC).

That said, FPGA vendors sell their parts in different speed grades, each of which corresponds roughly to a maximum clock rate. For example, a -2 Xilinx Artix is, roughly speaking, a '250 MHz' part, although it's capable of higher clock rates for highly pipelined designs.

When you interact with the FPGA synthesis and place-and-route tools, you will need to give constraints for your design. These tell the tool flow the target flop-to-flop delay you're trying to achieve. In Quartus (Altera) these constraints are written in SDC, which stands for Synopsys Design Constraints; Vivado (Xilinx) uses XDC, which is based on SDC. SDC came initially from the ASIC world and has been adopted by the FPGA industry as well. Get to know SDC: it will help you get the results you want.
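The most basic constraint of this kind is a `create_clock` line giving the target period in nanoseconds. As a small sketch, this helper formats one from a target frequency; the port name `clk` and clock name `sys_clk` are assumptions, so substitute whatever your top-level design actually uses.

```python
def create_clock_constraint(port: str, freq_mhz: float,
                            clock_name: str = "sys_clk") -> str:
    """Format an SDC-style create_clock line for a target frequency."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    period_ns = 1000.0 / freq_mhz  # SDC periods are in nanoseconds
    return (f"create_clock -name {clock_name} "
            f"-period {period_ns:.3f} [get_ports {port}]")


# Target 250 MHz on a top-level port assumed to be called "clk".
print(create_clock_constraint("clk", 250.0))
# create_clock -name sys_clk -period 4.000 [get_ports clk]
```

Generating the line from a script is handy when sweeping several target frequencies; for a one-off run you would just write the line directly in your constraints file.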

Altera and Xilinx have online communities for help with how to use SDC syntax and many other topics.

That all said, if you care about speed you should consider an FPGA that has a CPU hard macro in it, such as Zynq.