Trie-based dictionary

richierich · July 5, 2014, 11:06am

What kind of game is this?! Is there such a thing as an arcade action twich spelling game Is 15ms really too long to do the permutations search?

Wikipedia says SOWPODS has 267,751 words. Permutations of 9 means 362,880 lookups(*), most of which are quick failures a couple of letters down the trie. So is 15ms just a straight lookup for each permutation, or have you added more intelligence to the process?

E.g. maybe you could recode exists() to return a number which was the level of the trie at which the search failed, or 0 if the search was successful. Then e.g. with letters ACFFEKNSW you would test the first permutation “ACFFEKNSW” and get 3 back so you could skip testing all other permutations beginning with ACF. Just thinking aloud - maybe you did something like that already?

Another thought, did you try just whacking the SOWPODS words into a HashSet and forget the trie altogether?

{(*) Edit: OK before you say it I see that 9 letters yield more than 362,880 possible words, that would be if all words were 9 letters long. Obv. it’s more than that! :)}

Roquen · July 5, 2014, 11:29am

Yeah an AI for a scrabble like game isn’t straight forward.

ags1 · July 5, 2014, 5:23pm

I use the trie to discard branches of possible words, I will post the code later. I’m over the moon with 15 ms, although it would be longer than that in android. If i wanted to be a real hard case, I could multithread the search, it would be very easy, but possibly insane.

Roquen · July 5, 2014, 5:46pm

Brute force isn’t the answer. I don’t know the problem space but a ternary search trie (which can be flattened) will be exponentially faster. You have a set of letters and a known set of reg-exps which can be optimized. A single traversal can produce all possible results in very little time.

ags1 · July 5, 2014, 9:36pm

Well, there will be screen shake if you hit a triple combo on your words. And there are power-ups

Here is the new node class:


    private static class ArrayCharacterNode extends ArrayNode2 {
        private boolean completeWord = false;
        private static StringBuilder sb = new StringBuilder();
        public ArrayCharacterNode() {
            instanceCount++;
        }
        public void unscramble(Map<String, Object> results, ArrayList<Character> scrambled, ArrayList<Character> unscrambled) {
            if (completeWord) {
                outputWord(results, unscrambled);
            }
            for (int i = 0; i < scrambled.size(); i++) {
                Character c = scrambled.get(i);
                ArrayCharacterNode node = get(c);
                if (node != null) {
                    ArrayList<Character> newScrambled = new ArrayList<Character>(scrambled);
                    newScrambled.remove(c);
                    ArrayList<Character> newUnscrambled = new ArrayList<Character>(unscrambled);
                    newUnscrambled.add(c);
                    node.unscramble(results, newScrambled, newUnscrambled);
                }
            }
        }

        private static void outputWord(Map<String, Object> results, ArrayList<Character> unscrambled) {
            for (int i = 0; i < unscrambled.size(); i++) {
                sb.append(unscrambled.get(i));
            }
            results.put(sb.toString(), null);
            sb.delete(0, sb.length());
        }
    }

I’m not entirely happy with it, I feel the method is creating too many objects as it walks the tree.

HashSet is usually as fast as Trie (for simple word lookups), but Trie can pull ahead if I do a vast number of checks. But remember, if I did not have the trie, for word combinations I would have to check every combination of the 9 letters. With the trie, i can exclude branches in one go.

richierich · July 6, 2014, 10:50pm

Awesome! “Achievement unlocked … Sh-sh-sh-sh-Shakespeare!”

The recursive walk down the trie looks really elegant.

Would a String be simpler than ArrayList for unscrambled? It has to be turned into a string for output anyway.

For the scrambled pool of unused letters I wonder if you could pass the same object all the way down, replacing characters back into the pool after each recursive call to unscramble(), rather than building new ones at each level. I’m not an expert in Java linked lists so there’s probably a better way to code it but would something like this work…?


	public void unscramble(Map<String, Object> results, LinkedList<Character> scrambled, String unscrambled) {
        if (completeWord) {
            results.put(unscrambled, null);
        }
        for (int i = 0; i < scrambled.size(); i++) {
            Character c = scrambled.get(i);
            ArrayCharacterNode node = get(c);
            if (node != null) {
            	//Take character from unused pool and add to possible word
            	scrambled.remove(i);
            	String newUnscrambled = unscrambled + c;
                node.unscramble(results, scrambled, newUnscrambled);
                //Replace character where it was in the unused pool
            	scrambled.add(i, c);
            }
        }

ags1 · July 6, 2014, 10:57pm

That’s a neat idea to cut down on the object creation. I can also cut out the unscrambled variable by giving each node a reference to its parent… Then every complete word can write itself out by recursing to the root of the trie.

pjt33 · July 7, 2014, 9:19am

Not really, in the sense that because strings are immutable it would increase the amount of object creation. The way to reduce that is to use a single char[] (tracking the length of the subarray which is relevant) and then use the corresponding String constructor when a word is found.

richierich · July 7, 2014, 10:24am

Yes, fair point. I guess I was thinking that creating strings was at least simpler than creating ArrayLists. On reflection what’s happening with the unscrambled string over time is that it’s constantly having extra letters added and removed from its tail end. So a single Stack used throughout the whole trie walk might be the thing, although what you said about just having a char[] and a tail pointer is effectively a low-tech stack, and surely much more efficient…

Perhaps it’s irrelevant anyway if ags1’s parent pointer thing works out. I guess that will suffer from increasing the memory footprint, which was pointed out earlier may be the bottleneck.

ags1 · July 7, 2014, 6:59pm

Memory won’t be an issue as this is now only going to run on my build machine. The game will just have a couple of megabytes of precalculated 9-letter puzzles, which I will load into HashSets for the in-game spell checks.

ags1 · July 7, 2014, 8:55pm

I added parent references to the nodes and used richierich’s approach of fiddling with a single list object for the scrambled letters as it weaves its way through the tree. Those changes bring the unscrambling time down to 7 millis, a 50-100% improvement.

I think this is about done, so I will update the first post with the final code, after I have cleaned up the class and variable names a bit.

I never did the Riven 2-chars per node approach, but with my approach of dumbing down the app to use precalculated solutions on the SD card, I don’t really need the headache it would give me.

Jono · July 8, 2014, 12:44pm

Did you consider sorting the characters of each word in the dictionary before building the trie? Then each node would need to additionally contain a list/set of the valid, unsorted, words. You sort the query characters and you would be searching for all subsets rather than all permutations. That’s down to 2^9 queries.

I might have misunderstood what you’re doing. If so, ignore this.

ags1 · July 8, 2014, 2:42pm

I’m not doing it that way, but I am excluding most permutations from the search - this is the great advantage of a trie. Have a look at the implementation in the first post.

Jono · July 8, 2014, 3:22pm

Sure, but using the trie with sorted entries would also exclude most subsets in the same way, right?

ags1 · July 8, 2014, 6:33pm

I’ve thought about that… whether you need to follow the natural letter order of the word. But I think if that was the case it would make most sense to sort the words in order of least-frequent letters first. I also wondered if reversed words would go faster.

Why do you think an alphabetical sort will make things more efficient? You want more branches in the tree, not less. Also, many words would end up on the same node (e.g. dog and god), which can’t happen with the current structure. Also, you would have to store the whole word at a word node, I don’t like that.

Jono · July 9, 2014, 11:14am

The idea was just to sort both dictionary and query so that you’re working with sets of letters instead of series. Alphabetic order was just what first popped in to my head, but going in least or most popular order sounds like a good idea.

If keeping the words’ strings in memory as well as trie nodes is too much then I guess this wouldn’t work. I’m pretty sure it’s guaranteed to be no slower than the unsorted version (with the worst case being when there are no anagrams sharing leaves) and could be a lot quicker when there are anagrams sharing nodes.