BFS five letter word chain

Let's assume n is the number of words in the dataset.

Firstly, we need to build an adjacency list over all the words according to the given condition: there is an edge from x to y if and only if each of the last four letters of x appears in y. Building this adjacency list is an O(n^2 * w) operation, where w is the average length of a word in the dataset.
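To make the edge condition concrete before diving into the full solution, here is a minimal standalone sketch (the class and method names are illustrative, and five-letter words are assumed):

```java
public class EdgeCheck {
    // Returns true if every one of the last four letters of `from`
    // occurs somewhere in `to` -- the given edge condition.
    static boolean hasEdge(String from, String to) {
        for (int i = 1; i < from.length(); i++) {
            if (to.indexOf(from.charAt(i)) == -1) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasEdge("fzcde", "zcdea")); // true: z, c, d, e all occur in "zcdea"
        System.out.println(hasEdge("zcdea", "fzcde")); // false: 'a' does not occur in "fzcde"
    }
}
```

Note the relation is directed: as the second pair shows, an edge from x to y does not imply an edge from y to x.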

Secondly, for each test pair, all we have to do is a traditional BFS over the graph.

Here's the main function:

    public static void main(String[] args) throws IOException {
        // get words from dataset
        List<String> words = readData();
        // get the word pairs to test
        List<List<String>> testData = getTestData();
        // form an adjacency list
        Map<String, List<String>> adj = getAdjacencyList(words);
        
        // for each test, do a traditional BFS
        for (List<String> test : testData) {
            System.out.println(bfs(adj, test));
        }
    }

Here's the function to build an adjacency list according to the given condition:

    public static Map<String, List<String>> getAdjacencyList(List<String> words) {
        Map<String, List<String>> adj = new HashMap<>();
        for (int i = 0; i < words.size(); ++i) {
            String word = words.get(i);
            adj.putIfAbsent(word, new ArrayList<>());
            for (int j = 0; j < words.size(); ++j) {
                if (i == j) continue;
                String other = words.get(j);
                // count how many of the last four letters of `word` occur in `other`
                int count = 0;
                for (int k = 1; k < 5; ++k) {
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                }
                // if all four are present, there exists an edge from `word` to `other`
                if (count >= 4)
                    adj.get(word).add(other);
            }
        }

        return adj;
    }

And here's the BFS:

    public static int bfs(Map<String, List<String>> adj, List<String> test) {
        Queue<String> q = new LinkedList<>();
        Set<String> visited = new HashSet<>(); // the graph may contain cycles, so track visited words
        String start = test.get(0);
        String end = test.get(1);
        // if `start` and `end` words are equal
        if (start.equals(end))
            return 0;

        q.add(start);
        visited.add(start);
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            int size = q.size();
            for (int i = 0; i < size; ++i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end))
                        return count; // return the number of edges
                    // Set.add returns false if the word was already visited,
                    // so each word is enqueued at most once
                    if (visited.add(val))
                        q.add(val);
                }
            }
        }
        return -1; // `end` is not reachable from `start`
    }
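To see the two functions working end to end, here is a minimal self-contained run on a hard-coded toy word list (readData and getTestData are replaced by literals, and words are marked visited as they are enqueued; this is a sketch, not the answer's exact I/O setup):

```java
import java.util.*;

public class WordChainDemo {
    public static void main(String[] args) {
        // "bcdef" -> "cdefg" -> "defgh" forms a chain under the rule.
        List<String> words = Arrays.asList("bcdef", "cdefg", "defgh");
        Map<String, List<String>> adj = getAdjacencyList(words);
        // shortest chain from "bcdef" to "defgh" uses 2 edges
        System.out.println(bfs(adj, Arrays.asList("bcdef", "defgh"))); // prints 2
    }

    public static Map<String, List<String>> getAdjacencyList(List<String> words) {
        Map<String, List<String>> adj = new HashMap<>();
        for (String word : words) {
            adj.putIfAbsent(word, new ArrayList<>());
            for (String other : words) {
                if (word.equals(other)) continue;
                // count how many of the last four letters of `word` occur in `other`
                int count = 0;
                for (int k = 1; k < 5; ++k)
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                if (count >= 4)
                    adj.get(word).add(other);
            }
        }
        return adj;
    }

    public static int bfs(Map<String, List<String>> adj, List<String> test) {
        String start = test.get(0), end = test.get(1);
        if (start.equals(end)) return 0;
        Queue<String> q = new LinkedList<>(List.of(start));
        Set<String> visited = new HashSet<>(List.of(start));
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            for (int i = q.size(); i > 0; --i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end)) return count;
                    if (visited.add(val)) q.add(val); // enqueue each word once
                }
            }
        }
        return -1; // unreachable
    }
}
```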

@The Room has provided a pretty good answer, but I want to suggest a simple modification for the adjacency-list construction, since the provided approach is O(n^2) and will lead to poor performance for large input files.

You can simply take every possible sorted 4-character pattern of each word and insert it into a hash map together with the word's id (its index, for example).

C++ code example:

map<string, vector<int>> mappings;

for (int i = 0; i < words.size(); i++) {
    // key built from the first four letters
    string word = words[i].substr(0, 4);
    sort(word.begin(), word.end());
    mappings[word].push_back(i);
    // keys where one of the first four letters is replaced by the fifth
    for (int j = 0; j < 4; j++) {
        word = words[i].substr(0, 4);
        word[j] = words[i][4];
        sort(word.begin(), word.end());
        mappings[word].push_back(i);
    }
}
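To make the key generation concrete, here is the same idea sketched in Java, matching the first answer's language (the class and method names are illustrative, not from either answer). Each five-letter word yields its five sorted 4-letter subsets, one per omitted letter:

```java
import java.util.*;

public class SortedKeys {
    // For a five-letter word, emit the five sorted 4-letter keys:
    // the sorted first four letters, plus each variant where one of
    // the first four letters is swapped for the fifth.
    static List<String> keys(String word) {
        List<String> out = new ArrayList<>();
        out.add(sorted(word.substring(0, 4)));
        for (int j = 0; j < 4; j++) {
            char[] c = word.substring(0, 4).toCharArray();
            c[j] = word.charAt(4);
            out.add(sorted(new String(c)));
        }
        return out;
    }

    static String sorted(String s) {
        char[] c = s.toCharArray();
        Arrays.sort(c);
        return new String(c);
    }

    public static void main(String[] args) {
        System.out.println(keys("abcde")); // prints [abcd, bcde, acde, abde, abce]
    }
}
```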

Now, for each sorted 4-letter key, you have a vector of the indices of the words that contain those four letters; every word whose last four letters match the key must have an edge to each of them.

Then you can simply build the graph, taking care not to create self-loops (avoid making an edge between a node and itself).

Code example:

// Building the graph with complexity of O(n * log(no. of edges))
const int N = 100000; // just an example
vector<int> graph[N];
for (int i = 0; i < words.size(); i++) {
    string tmp = words[i].substr(1, 4); // the last four letters
    sort(tmp.begin(), tmp.end());
    for (int j = 0; j < mappings[tmp].size(); j++) {
        // compare against the current word's index, i, to avoid self loops
        if (i == mappings[tmp][j])
            continue;

        graph[i].push_back(mappings[tmp][j]);
    }
}

Finally, you can loop over your test file and get the start and goal indices (when reading the file, store each word as a key with its index as the value), then apply the BFS function to calculate the number of edges, as described in @The Room's answer.
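The word-to-index lookup mentioned here can be sketched as follows (in Java, matching the first answer's language; buildIndex is an illustrative name):

```java
import java.util.*;

public class WordIndex {
    // Map each word to its index so that test pairs can be translated
    // into node ids for the index-based graph.
    static Map<String, Integer> buildIndex(List<String> words) {
        Map<String, Integer> indexOf = new HashMap<>();
        for (int i = 0; i < words.size(); i++)
            indexOf.put(words.get(i), i);
        return indexOf;
    }

    public static void main(String[] args) {
        Map<String, Integer> idx = buildIndex(Arrays.asList("bcdef", "cdefg", "defgh"));
        System.out.println(idx.get("cdefg")); // prints 1
    }
}
```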

I just wanted to suggest this answer for folks who may need a solution for a similar problem with large inputs, since it reduces the complexity of building the graph from O(N^2) to O(N * log(no. of edges)), where N is the number of words.


My approach was slightly different, and there is also a subtle nuance to the question which I will come to below:

First we create an adjacency list (@Volpe95 has a nice optimisation for this). A map of nodes is used, where the word is the key.

Map<String, Node> nodes = new HashMap<>();

List<String> words = new DataHelper().loadWords("src/main/wordsInput.dat");

for (int i = 0; i < words.size(); i++) {
    String l = words.get(i);
    nodes.put(l, new Node(l));
}

for (Map.Entry<String, Node> l : nodes.entrySet()) {
    for (Map.Entry<String, Node> r : nodes.entrySet()) {
        if (l.equals(r)) continue; // skip self-pairs
        if (isLinkPair(l.getKey(), r.getKey())) {
            nodes.get(l.getKey()).addChild(nodes.get(r.getKey()));
        }
    }
}

isLinkPair checks whether all of the last four letters of a word can be found in a possible child word.

private static boolean isLinkPair(String l, String r) {
    // last 4 chars only
    for (int i = 1; i < l.length(); i++) {
        if (r.indexOf(l.charAt(i)) == -1) {
            return false;
        }
    }
    return true;
}

A node stores its word and children, as well as edgeTo, which is used to calculate the shortest path: each node records its parent, and that child-parent link will always lie on the shortest path. (Sedgewick stores this data in separate arrays, but it's often easier to group it in a class, as it makes the code easier to understand.)

(Getters, setters, equals, etc. omitted for clarity.)

public class Node {
    private Set<Node> children;
    private String word;

    private Node edgeTo;

    private int visited;

    public Node(String word) {
        children = new HashSet<>();
        this.word = word;
        edgeTo = null;
    }
}

The BFS algorithm, based on Sedgewick's, searches each node, then its immediate children, then their children in turn, and so on, moving ever more distant from the origin. Note that a queue is used, implemented here by Java's LinkedList.

private boolean bfs(Map<String, Node> map, Node source, Node target) {
    if (source == null || target == null) return false;
    if (source.equals(target)) return true;
    Queue<Node> queue = new LinkedList<>();
    source.setVisited();
    queue.add(source);
    while (!queue.isEmpty()) {
        Node v = queue.poll();
        for (Node c : v.getChildren()) {
            if (c.getVisited() == 0) {
                c.setVisited();
                c.setEdgeTo(v); // record the parent on the shortest path
                if (c.equals(target)) {
                    return true;
                }
                queue.add(c);
            }
        }
    }

    return false;
}

Note that v is the parent and c its children; setEdgeTo is used to record a child's parent.

Finally we check the results where source and target are the source and target words respectively:

BreadthFirstPaths bfs = new BreadthFirstPaths(nodes,source,target);
int shortestPath = bfs.getShortestPath(nodes,source,target);

So what of the nuance I mentioned above? The shortest-path calculation is necessary because zcdea has two parents, fzcde and bcdez, and you need the one on the shortest path. To find it, use the edgeTo of a child to locate its parent, and repeat until the path is walked, as shown below. That child-parent relationship will always be on the shortest path because of the way the BFS searches outwards from the origin.

// get edgeTo on target (the parent), find that node, and get its parent;
// continue until the shortest path is walked or no path is found
public int getShortestPath(Map<String, Node> map, String source, String target) {
    Node node = map.get(target);
    int pathLength = 0;
    do {
        if (node == null || pathLength > map.size()) return NOPATH; // NOPATH is a sentinel constant, e.g. -1
        if (node.equals(map.get(source))) return pathLength;
        node = map.get(node.getWord()).getEdgeTo();
        pathLength++;
    } while (true);
}
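The two-parent situation can be reduced to a tiny standalone sketch (the graph and names here are illustrative, not from the answer): BFS records the parent that discovers each node first, which is necessarily a parent on a shortest path, so walking the parent chain back from the target yields the shortest length.

```java
import java.util.*;

public class EdgeToDemo {
    // BFS that records each node's parent (the edgeTo analogue); the
    // shortest path length is found by walking parents back from target.
    static int shortestPath(Map<String, List<String>> adj, String source, String target) {
        Map<String, String> edgeTo = new HashMap<>();
        Deque<String> q = new ArrayDeque<>(List.of(source));
        Set<String> visited = new HashSet<>(List.of(source));
        while (!q.isEmpty()) {
            String v = q.poll();
            for (String c : adj.getOrDefault(v, List.of())) {
                if (visited.add(c)) { edgeTo.put(c, v); q.add(c); }
            }
        }
        if (!visited.contains(target)) return -1;
        int len = 0;
        for (String n = target; !n.equals(source); n = edgeTo.get(n)) len++;
        return len;
    }

    public static void main(String[] args) {
        // Toy graph: T has two parents (A and C); BFS records A, the one
        // on the shorter path, so the walk back from T gives length 1.
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("A", List.of("B", "T"));
        adj.put("B", List.of("C"));
        adj.put("C", List.of("T"));
        System.out.println(shortestPath(adj, "A", "T")); // prints 1
    }
}
```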

There are always space-time complexity tradeoffs and optimisations to consider.