Google Coding Challenge Question 2020 : Unspecified Words

I guess my first try would have been to replace the ? with a . in the query, i.e. change ?at to .at, and then use those as regular expressions and match them against all the words in the dictionary, something as simple as this:

import re
for q in queries:
    p = re.compile(q.replace("?", "."))
    print(sum(1 for w in words if p.match(w)))

However, seeing the input sizes as N up to 5x104 and Q up to 105, this might be too slow, just as any other algorithm comparing all pairs of words and queries.

On the other hand, note that M, the number of letters per word, is constant and rather low. So instead, you could create Mx26 sets of words for all letters in all positions and then get the intersection of those sets.

from collections import defaultdict
from functools import reduce

M = 3
words = ["cat", "map", "bat", "man", "pen"]
queries = ["?at", "ma?", "?a?", "??n"]

sets = defaultdict(set)
for word in words:
    for i, c in enumerate(word):
        sets[i,c].add(word)

all_words = set(words)
for q in queries:
    possible_words = (sets[i,c] for i, c in enumerate(q) if c != "?")
    w = reduce(set.intersection, possible_words, all_words)
    print(q, len(w), w)

In the worst case (a query that has a non-? letter that is common to most or all words in the dictionary) this may still be slow, but should be much faster in filtering down the words than iterating all the words for each query. (Assuming random letters in both words and queries, the set of words for the first letter will contain N/26 words, the intersection for the first two has N/26² words, etc.)

This could probably be improved a bit by taking the different cases into account, e.g. (a) if the query does not contain any ?, just check whether it is in the set (!) of words without creating all those intersections; (b) if the query is all-?, just return the set of all words; and (c) sort the possible-words-sets by size and start the intersection with the smallest sets first to reduce the size of temporarily created sets.

About time complexity: To be honest, I am not sure what time complexity this algorithm has. With N, Q, and M being the number of words, number of queries, and length of words and queries, respectively, creating the initial sets will have complexity O(N*M). After that, the complexity of the queries obviously depends on the number of non-? in the queries (and thus the number of set intersections to create), and the average size of the sets. For queries with zero, one, or M non-? characters, the query will execute in O(M) (evaluating the situation and then a single set/dict lookup), but for queries with two or more non-?-characters, the first set intersections will have on average complexity O(N/26), which strictly speaking is still O(N). (All following intersections will only have to consider N/26², N/26³ etc. elements and are thus negligible.) I don't know how this compares to The Trie Approach and would be very interested if any of the other answers could elaborate on that.


This question can be done by the help of Trie Data Structures. First add all words to trie ds. Then you have to see if the word is present in trie or not, there's a special condition of ' ?' So you have to take care for that condition also, like if the character is ? then simply go to next character of the word.

I think this approach will work, there's a similar Question in Leetcode.

Link : https://leetcode.com/problems/design-add-and-search-words-data-structure/