How do I generate sentences from a formal grammar?

Here is a Python example using NLTK:

from nltk import CFG, ChartParser
from random import choice

def produce(grammar, symbol):
    # Expand `symbol` by picking one of its productions at random.
    words = []
    productions = grammar.productions(lhs=symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            # Terminals are plain strings: emit them as-is.
            words.append(sym)
        else:
            # Nonterminals are expanded recursively.
            words.extend(produce(grammar, sym))
    return words

# CFG.fromstring is the NLTK 3 name for the older parse_cfg.
grammar = CFG.fromstring('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

parser = ChartParser(grammar)

gr = parser.grammar()
print(' '.join(produce(gr, gr.start())))

The example is adapted from the NLTK book. The generated sentences are syntactically correct, but they are still semantic gibberish.


Your solution should follow the inductive structure of the grammar. How do you generate a random utterance for each of the following?

  • Terminal symbol
  • Nonterminal symbol
  • Sequence of right-hand sides
  • Choice of right-hand sides
  • Star closure of right-hand sides

This will all be much clearer if you write down the data structure you use to represent a grammar. The structure of your set of mutually recursive generator functions will mirror that data structure very closely.
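
As one possible way of writing that data structure down, here is a minimal Python sketch. The class names (`Terminal`, `NonTerminal`, `Seq`, `Choice`, `Star`), the `generate` function, the toy grammar, and the `max_repeat` bound on star closure are all assumptions made up for illustration. Note how `generate` has exactly one branch per kind of node.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Terminal:
    text: str

@dataclass(frozen=True)
class NonTerminal:
    name: str

@dataclass(frozen=True)
class Seq:
    items: tuple

@dataclass(frozen=True)
class Choice:
    options: tuple

@dataclass(frozen=True)
class Star:
    body: object

def generate(node, rules, rng, max_repeat=3):
    """Return a list of words for `node`; one branch per node type."""
    if isinstance(node, Terminal):
        return [node.text]
    if isinstance(node, NonTerminal):
        # Look the nonterminal up and generate from its right-hand side.
        return generate(rules[node.name], rules, rng, max_repeat)
    if isinstance(node, Seq):
        words = []
        for item in node.items:
            words.extend(generate(item, rules, rng, max_repeat))
        return words
    if isinstance(node, Choice):
        return generate(rng.choice(node.options), rules, rng, max_repeat)
    if isinstance(node, Star):
        # Assumption: repeat the body between 0 and max_repeat times.
        words = []
        for _ in range(rng.randint(0, max_repeat)):
            words.extend(generate(node.body, rules, rng, max_repeat))
        return words
    raise TypeError(f"unknown grammar node: {node!r}")

# Hypothetical toy grammar: S -> 'the' Adj* Noun
rules = {
    "S": Seq((Terminal("the"), Star(NonTerminal("Adj")), NonTerminal("Noun"))),
    "Adj": Choice((Terminal("big"), Terminal("old"))),
    "Noun": Choice((Terminal("cat"), Terminal("dog"))),
}

sentence = generate(NonTerminal("S"), rules, random.Random())
print(" ".join(sentence))
```

This toy grammar has no recursive rules, so generation always terminates. A recursive grammar would additionally need some kind of cutoff.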

Dealing with infinite recursion is a bit dicey. The easiest way is to generate a stream of utterances and keep a depth cutoff. Or if you're using a lazy language like Haskell you can generate all utterances and peel off as many finite ones as you like (a trickier problem than the original question, but very entertaining).
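
The stream-with-a-depth-cutoff idea could look roughly like the sketch below, written with Python generators. The dictionary grammar format, the names `GRAMMAR`, `expand` and `utterances`, and the default depth of 4 are assumptions for illustration. Any branch that would need to expand a nonterminal after the depth budget runs out is simply dropped.

```python
from itertools import islice

# Hypothetical grammar: nonterminal -> list of alternatives; each alternative
# is a list of symbols. Any symbol that is not a key is a terminal.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["I"], ["my", "N"], ["my", "N", "PP"]],
    "VP": [["shot", "NP"]],
    "PP": [["in", "NP"]],
    "N": [["elephant"], ["pajamas"]],
}

def expand(symbols, depth):
    """Lazily yield every word list derivable from `symbols` within `depth`."""
    if not symbols:
        yield []
        return
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        # Terminal: keep it and expand the remaining symbols.
        for tail in expand(rest, depth):
            yield [head] + tail
    elif depth > 0:
        # Nonterminal: try each alternative with one less level of depth.
        for alternative in GRAMMAR[head]:
            for front in expand(alternative, depth - 1):
                for tail in expand(rest, depth):
                    yield front + tail
    # depth == 0 on a nonterminal: yield nothing, which prunes this branch.

def utterances(start="S", depth=4):
    """Stream of sentences whose derivations fit within `depth` levels."""
    for words in expand([start], depth):
        yield " ".join(words)

# Peel off the first few sentences from the stream.
for sentence in islice(utterances(), 5):
    print(sentence)
```

Because `NP` and `PP` refer to each other, this grammar describes infinitely many sentences; the cutoff is what keeps the stream finite. Raising `depth` admits longer sentences.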


I don't know that there's a "common" algorithm for doing this. Random program generation is used in genetic programming, so you could look for a grammar-based GP system and see how it handles program generation. I would write a recursive rule-generation algorithm along the lines of this pseudo-code:

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

This assumes that you've read all of the parts into some data structure. You'd also need to handle repetitions (randomly generate the number of times they occur) and optional rules (flip a coin to see whether they are there or not).


Edit: Oh, and if the rule has more than one option, you'd just pick one of the options to go with, and process it the same way. So if some rule was (Literal|Variable), you'd randomly pick between the two.
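
For what it's worth, here is a rough Python rendering of the pseudo-code with the repetition, optional, and multiple-option cases filled in. The `RULES` table, the `(kind, value)` part encoding, the `generate_identifier` helper, and the `max_repeat` limit are all made up for illustration; a real grammar reader would produce its own structure.

```python
import random
import string

# Hypothetical rule table: each rule is a list of options, and each option
# is a list of (kind, value) parts.
RULES = {
    "stmt": [[("rule", "target"), ("literal", "="), ("rule", "expr")]],
    "target": [[("identifier", None)]],
    "expr": [[("rule", "atom"), ("repeat", "tail")]],
    "tail": [[("literal", "+"), ("rule", "atom")]],
    "atom": [[("optional", "sign"), ("rule", "value")]],
    "sign": [[("literal", "-")]],
    "value": [[("literal", "1")], [("identifier", None)]],
}

def generate_identifier(rng):
    """Assumed helper: a random lowercase name of one to three letters."""
    length = rng.randint(1, 3)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def generate_rule(name, rng, out, max_repeat=3):
    """Append tokens for rule `name` to the list `out`."""
    option = rng.choice(RULES[name])  # more than one option: pick one
    for kind, value in option:
        if kind == "literal":
            out.append(value)
        elif kind == "identifier":
            out.append(generate_identifier(rng))
        elif kind == "rule":
            generate_rule(value, rng, out, max_repeat)
        elif kind == "optional":
            # Flip a coin to decide whether the optional rule appears.
            if rng.random() < 0.5:
                generate_rule(value, rng, out, max_repeat)
        elif kind == "repeat":
            # Randomly choose how many times the rule occurs.
            for _ in range(rng.randint(0, max_repeat)):
                generate_rule(value, rng, out, max_repeat)
        else:
            raise ValueError(f"unknown part kind: {kind!r}")

tokens = []
generate_rule("stmt", random.Random(), tokens)
print(" ".join(tokens))
```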