How to build a Plinko board of words from a dictionary better than brute force?

Pretend the V W X Y Z on the bottom here actually complete words.

    B
   A O
  I R N
 T N E D
V W X Y Z

We can implement a backtracking search with heuristics so strict, it seems unlikely any wrong path would go very far.

Insert all the n sized words that begin with the same letter in a simple tree as below. Now perform a depth first search, asserting the following: each successive level needs one additional "shared" letter, meaning p(letter) instances of it on that level, with the additional requirement that their two children are the same letters (e.g., the two Rs in parentheses on level 2 could be a "shared" letter because their children are the same).

What is p(letter)? Pascal's triangle of course! n choose r is exactly the number of instances of the letter needed on the relevant level of this simple tree, according to the Plinko board. On level 3, if we've chosen R and R, we'll need 3 Ns and 3 Es to express the "shared" letters on that level. And each of the 3 Ns must have the same children letters (W,X in this case), and each of the 3 Es must also (X,Y).

                     B
            /                 \
          A                     O
      /       \             /       \   
     I        (R)         (R)        N
    / \       / \         / \       / \
   T  (N)   (N)  E      (N)  E     E   D
  V W W X   W X X Y     W X X Y   X Y Y Z

4 W's, 6 X's, 4 Y's 

UPDATE

Out of curiosity, here's some Python code :)

from itertools import combinations
from copy import deepcopy

# assumes words all start
# with the same letter and
# are of the same length
def insert(word, i, tree):
  if i == len(word):
    return
  if word[i] in tree:
    insert(word, i + 1, tree[word[i]])
  else:
    tree[word[i]] = {}
    insert(word, i + 1, tree[word[i]])

# Pascal's triangle
def get_next_needed(needed):
  next_needed = [[1, None, 0]] + [None] * (len(needed) - 1) + [[1, None, 0]]

  for i, _ in enumerate(needed):
    if i == len(needed) - 1:
      next_needed[i + 1] = [1, None, 0]
    else:
      next_needed[i + 1] = [needed[i][0] + needed[i+1][0], None, 0]
  return next_needed

def get_candidates(next_needed, chosen, parents):
  global log
  if log:
    print "get_candidates: parents: %s" % parents
  # For each chosen node we need two children.
  # The corners have only one shared node, while
  # the others in each group are identical AND
  # must have all have a pair of children identical
  # to the others' in the group. Additionally, the
  # share sequence matches at the ends of each group.
  #    I       (R)     (R)      N
  #   / \      / \     / \     / \
  #  T  (N)  (N)  E  (N)  E   E   D

  # Iterate over the parents, choosing
  # two nodes for each one
  def g(cs, s, seq, i, h):
    if log:
      print "cs, seq, s, i, h: %s, %s, %s, %s, %s" % (cs, s, seq, i, h)

    # Base case, we've achieved a candidate sequence
    if i == len(parents):
      return [(cs, s, seq)]
    # The left character in the corner is
    # arbitrary; the next one, shared.
    # Left corner:
    if i == 0:
      candidates = []
      for (l, r) in combinations(chosen[0].keys(), 2):
        _cs = deepcopy(cs)
        _cs[0] = [1, l, 1]
        _cs[1][1] = r
        _cs[1][2] = 1
        _s = s[:]
        _s.extend([chosen[0][l], chosen[0][r]])
        _h = deepcopy(h)
        # save the indexes in cs of the
        # nodes chosen for the parent 
        _h[parents[1]] = [1, 2]
        candidates.extend(g(_cs, _s, l+r, 1, _h))
        _cs = deepcopy(cs)
        _cs[0] = [1, r, 1]
        _cs[1][1] = l
        _cs[1][2] = 1
        _s = s[:]
        _s.extend([chosen[0][r], chosen[0][l]])
        _h = deepcopy(h)
        # save the indexes in cs of the
        # nodes chosen for the parent
        _h[parents[1]] = [1, 2]
        candidates.extend(g(_cs, _s, r+l, 1, _h))
      if log:
        print "returning candidates: %s" % candidates
      return candidates
    # The right character is arbitrary but the
    # character before it must match the previous one.
    if i == len(parents)-1:
      l = cs[len(cs)-2][1]
      if log:
        print "rightmost_char: %s" % l
      if len(chosen[i]) < 2 or (not l in chosen[i]):
        if log:
          print "match not found: len(chosen[i]) < 2 or (not l in chosen[i])"
        return []
      else:
        result = []
        for r in [x for x in chosen[i].keys() if x != l]:
          _cs = deepcopy(cs)
          _cs[len(cs)-2][2] = _cs[len(cs)-2][2] + 1
          _cs[len(cs)-1] = [1, r, 1]
          _s = s[:] + [chosen[i][l], chosen[i][r]]
          result.append((_cs, _s, seq + l + r))
        return result

    parent = parents[i]
    if log:
      print "get_candidates: g: parent, i: %s, %s" % (parent, i)
    _h = deepcopy(h)
    if not parent in _h:
      prev = _h[parents[i-1]]
      _h[parent] = [prev[0] + 1, prev[1] + 1]
    # parent left and right children
    pl, pr = _h[parent]
    if log:
      print "pl, pr: %s, %s" % (pl, pr)
    l = cs[pl][1]
    if log:
      print "rightmost_char: %s" % l
    if len(chosen[i]) < 2 or (not l in chosen[i]):
      if log:
        print "match not found: len(chosen[i]) < 2 or (not l in chosen[i])"
      return []
    else:
      # "Base case," parent nodes have been filled
      # so this is a duplicate character on the same
      # row, which needs a new assignment
      if cs[pl][0] == cs[pl][2] and cs[pr][0] == cs[pr][2]:
        if log:
          print "TODO"
        return []
      # Case 2, right child is not assigned
      if not cs[pr][1]:
        candidates = []
        for r in [x for x in chosen[i].keys() if x != l]:
          _cs = deepcopy(cs)
          _cs[pl][2] += 1
          _cs[pr][1] = r
          _cs[pr][2] = 1
          _s = s[:]
          _s.extend([chosen[i][l], chosen[i][r]])
          # save the indexes in cs of the
          # nodes chosen for the parent
          candidates.extend(g(_cs, _s, seq+l+r, i+1, _h))
        return candidates
      # Case 3, right child is already assigned
      elif cs[pr][1]:
        r = cs[pr][1]
        if not r in chosen[i]:
          if log:
            print "match not found: r ('%s') not in chosen[i]" % r
          return []
        else:
          _cs = deepcopy(cs)
          _cs[pl][2] += 1
          _cs[pr][2] += 1
          _s = s[:]
          _s.extend([chosen[i][l], chosen[i][r]])
          # save the indexes in cs of the
          # nodes chosen for the parent
          return g(_cs, _s, seq+l+r, i+1, _h)
    # Otherwise, fail 
    return []

  return g(next_needed, [], "", 0, {})

def f(words, n):
  global log
  tree = {}
  for w in words:
    insert(w, 0, tree)

  stack = []
  root = tree[words[0][0]]
  head = words[0][0]
  for (l, r) in combinations(root.keys(), 2):
    # (shared-chars-needed, chosen-nodes, board)
    stack.append(([[1, None, 0],[1, None, 0]], [root[l], root[r]], [head, l + r], [head, l + r]))

  while stack:
    needed, chosen, seqs, board = stack.pop()
    if log:
      print "chosen: %s" % chosen
      print "board: %s" % board
    # Return early for demonstration
    if len(board) == n:
      # [y for x in chosen for y in x[1]]
      return board

    next_needed = get_next_needed(needed)
    candidates = get_candidates(next_needed, chosen, seqs[-1])
    for cs, s, seq in candidates:
      if log:
        print "  cs: %s" % cs
        print "  s: %s" % s
        print "  seq: %s" % seq
      _board = board[:]
      _board.append("".join([x[1] for x in cs]))
      _seqs = seqs[:]
      _seqs.append(seq)
      stack.append((cs, s, _seqs, _board))

"""
    B
   A O
  I R N
 T N E D
Z Y X W V
"""
words = [
  "BONDV",
  "BONDW",
  "BONEW",
  "BONEX",
  "BOREW",
  "BOREX",
  "BAREW",
  "BAREX",
  "BORNX",
  "BORNY",
  "BARNX",
  "BARNY",
  "BAINX",
  "BAINY",
  "BAITY",
  "BAITZ"]
N = 5
log = True

import time
start_time = time.time()
solution = f(list(words), N)
print ""
print ""
print("--- %s seconds ---" % (time.time() - start_time))
print "solution: %s" % solution
print ""
if solution:
  for i, row in enumerate(solution):
    print " " * (N - 1 - i) + " ".join(row)
  print ""
print "words: %s" % words

I find this a quite interesting problem.

The first attempt was a random solver; in other words it just fills the triangle with letters and then counts how many "errors" are present (words not in the dictionary). Then a hill-climbing is performed by changing a one or more letter randomly and seeing if the error improves; if the error remains the same the changes are still accepted (so making a random-walk on plateau areas).

Amazingly enough this can solve in reasonable time non-obvious problems like 5-letter words starting with 'b':

    b
   a u
  l n r
 l d g s
o y s a e

I then tried a full-search approach to be able to answer also "no-solution" part and the idea was to write a recursive search:

First step

Just write down all acceptable words on the left side; e.g.

    b
   a ?
  l ? ?
 l ? ? ?
o ? ? ? ?

and call recursively until we find an acceptable solution or fail

Step 2

Write down all acceptable words on the right side if the second letter is greater than the second letter of the first word, e.g.

    b
   a u
  l ? r
 l ? ? k
o ? ? ? e

This is done to avoid searching symmetric solutions (for any given solution another one can be obtained by simply mirroring on the X axis)

Other steps

In the general case the first question mark is replaced with all letters in the alphabet if for all words that use the chosen question mark either

  1. the word has no question marks and is in the dictionary, or
  2. there are words in the dictionary that are compatible (all characters except question marks are a match)

If no solution is found for the specific question mark chosen there's no point in keeping searching so False is returned. Probably using some heuristics for choosing which question mark for fill first would speed up search, I didn't investigate that possibility.

For the case 2 (searching if there are compatible words) I am creating 26*(N-1) sets of words that have a prescribed character in a certain position (position 1 is not considered) and I'm using set intersection on all non-question-mark characters.

This approach is able to tell in about 30 seconds (PyPy) that there are no solution for 5-letter words starting with w (there are 468 words in the dictionary with that starting letter).

The code for this implementation can be seen at

https://gist.github.com/6502/26552858e93ce4d4ec3a8ef46100df79

(the program expects a file named words_alpha.txt containing all valid words and then must be called specifiying the initial letter and the size; as dictionary I used the file from https://github.com/dwyl/english-words)