Steps to creating an NFA from a regular expression

In the GitHub repository below, you can find a Java implementation of Thompson's construction where first an NFA is being created from the regex and then an input string is being matched against that NFA:

https://github.com/meghdadFar/regex


Short version for general approach.
There's an algo out there called the Thompson-McNaughton-Yamada Construction Algorithm or sometimes just "Thompson Construction." One builds intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then Kleene Star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).

Here's an in-depth walkthrough for building (b|a)*b(a|b)'s NFA

Building the top level

  1. Handle parentheses. Note: In actual implementation, it can make sense to handling parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.

  2. Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a). Intermediate result:
    Non-Deterministic Finite Automata for P*

  3. Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain (a|b). Intermediate result:
    Non-Deterministic Finite Automata for P*bQ

  4. There's no alternation outside of parentheses, so we skip it.

Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.


Building P

  1. Skip. No parens.

  2. Skip. No Kleene stars.

  3. Skip. No contatenation.

  4. Build the alternation machine for b|a. Intermediate result:
    NFA for b or a


Integrating P

Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this:
enter image description here


Building Q

  1. Skip. No parens.

  2. Skip. No Kleene stars.

  3. Skip. No contatenation.

  4. Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)


Integrating Q

We do what we did with P above, except replacing the Q edge with the intermedtae b|a machine we constructed. This is the result:
enter image description here

Tada! Er, I mean, QED.


Want to know more?

All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.

The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson Construction in C by the excellent Russ Cox, who described it some detail in a popular article about regular expression matching.