TrieHard - A Trie implementation!

I’m not sure if someone will find a Trie implementation useful, but I just finished one here:

If you can think of another use case other than determining exclusion/inclusion or auto-complete… please let me know!

I imagine this would only be useful for game libraries, not so much games.

Nice name. I would assume the next build would be: “trie hard 2 - trie harder”.

So that’s you on reddit :).

Why are you using Boolean.TRUE and Boolean.FALSE rather than just true amd false? I don’t think I’ve ever seen a Boolean object used in production code before

When dealing with generic classes in Java (like Trie<String, Boolean>) you ARE dealing with the Boolean object. Sure you can pass true and false in there, but it’s auto-boxed into the Boolean.TRUE and BOOLEAN.FALSE. The same applies to using Integer, you can pass in ints (i.e. x), but it auto-boxes it to new Integer(x).

I should change the example so it looks a little nicer, even though what I have is perfectly correct.

TrieHard has been updated so a Trie can be used as a Map now. Insertion, removal, getting, and traversal have all had speed improvements.

TrieHard now outperforms all other Trie implementations I’ve found! (put+get)

Some Strings cause an exception. I obtained a Trie through Tries.forString() and am using the Moby word list names.txt file for keys/values. I put each trim()ed String as the key and value in order. The exception is thrown for the 771st String “Ameli”. :expressionless:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 5
	at java.lang.String.charAt(String.java:658)
	at org.magnos.trie.TrieSequencerCharSequence.hashOf(TrieSequencerCharSequence.java:54)
	at org.magnos.trie.TrieSequencerCharSequence.hashOf(TrieSequencerCharSequence.java:28)
	at org.magnos.trie.Trie.put(Trie.java:165)

I wrote a fancy trie once. It is readonly though, the encoder is separate. It only supports 75bit text but is ultra efficient disk- and memory-wise. Eg, it stores 293,365 words in 889,614 bytes.

public class Trie {
	static public final int START = 0xC0000000;
	static public final int INVALID = 0x800000;
	static public final int LEAF = 0x80000000;

	static private final int END_OF_WORD = 0x80;
	static private final int HAS_MORE = 0x40;
	static private final int IS_LEAF = 0x20;

	byte[] data;

	public Trie () {
	}

	public Trie (FileHandle file) {
		DataInputStream input = new DataInputStream(file.read());
		try {
			data = new byte[input.readInt()];
			int bytes = 0;
			while (true) {
				int count = input.read(data, bytes, data.length - bytes);
				if (count == 0) break;
				bytes += count;
			}
		} catch (IOException ex) {
			throw new RuntimeException("Error reading file: " + file, ex);
		}
	}

	public int get (int c, int offset) {
		if (offset == LEAF) return INVALID;

		byte[] data = this.data;
		c -= 97;

		if (offset == START) {
			offset = c * 3;
			int index = (data[offset++] & 0xFF) << 16;
			index |= (data[offset++] & 0xFF) << 8;
			index |= data[offset] & 0xFF;
			return index;
		}
		if (offset < 0) offset = -offset;

		// Find matching char.
		int indexOffset = 0, charIndex = 0;
		byte b;
		while (true) {
			b = data[offset++];
			if ((b & 31) == c) break;
			if ((b & HAS_MORE) == 0) return INVALID;
			if ((b & IS_LEAF) == 0) indexOffset++; // A leaf has no indexes.
			charIndex++;
		}
		if ((b & IS_LEAF) != 0) return LEAF;

		int nextIndex;
		if (charIndex == 0 && (b & HAS_MORE) == 0) {
			// A single child has no indexes, use the next byte.
			nextIndex = offset;
		} else {
			// Skip remaining chars.
			while ((b & HAS_MORE) != 0)
				b = data[offset++];
			// Skip indexes before the one we want.
			int indexesStart = offset;
			for (int i = 0; i < indexOffset; i++) {
				if ((data[offset++] & 0x80) == 0) continue;
				if ((data[offset++] & 0x80) == 0) continue;
				offset++;
			}
			// Decode the one we want.
			b = data[offset++];
			nextIndex = b & 0x7F;
			if ((b & 0x80) != 0) {
				b = data[offset++];
				nextIndex |= (b & 0x7F) << 7;
				if ((b & 0x80) != 0) nextIndex |= (data[offset] & 0x7F) << 14;
			}
			nextIndex += indexesStart;
		}

		if ((data[nextIndex] & END_OF_WORD) != 0) return -nextIndex;
		return nextIndex;
	}

	public boolean contains (String text) {
		int index = START;
		for (int i = 0, n = text.length(); i < n; i++) {
			// System.out.println(text.charAt(i) + ": "
			// + (get(text.charAt(i), index) == INVALID ? "INVALID" : get(text.charAt(i), index)));
			index = get(text.charAt(i), index);
			if (index == INVALID) return false;
		}
		return index < 0;
	}
}

Optimized read only structures are neat. Let me ask/confirm a few things:
Isn’t your implementation 5 bit? (And uses lowercase letters)
What do you use to encode the trie? Are there multiple encodings?
I wonder how if you did have a 7-bit tree if it would be relatively simple to encode UTF-8.
I don’t know if it qualifies as efficient for disk storage. It might be close to zip compression. Can the structure be compressed further?

Oops, is it 5 bit? Was so long ago that I wrote it and I just pasted. :smiley:

I have a tool somewhere that encodes the trie. IIRC it loads it up into inefficient data structures, does some processing, and then writes out the magic format the code above reads. Just one encoding.

You’re right, zipping my 293,365 word trie file goes from 889,614 to 512,862 bytes. The raw data is 2.77MB, down to 797,139 bytes zipped. :slight_smile:

@ClickerMonkey: I loaded the text file with the wrong encoding. (It was not UTF.) However loading the entire list changed the String for which the exception was thrown.

@Nate: I think you could use 7 bits per node if you used a terminator character. The structure would be bigger, but if the Strings were Huffman encoded you could fit more characters into fewer nodes. Not that I am suggesting that. Just pointing it out.

@Several Kilo-Bytes

So you are still having an exception? Can you post the full stack trace if so? Also, can you send me that list of words… the words inserted before the culprit are the ones to blame mostly. Also, make sure you are using the most up-to-date JAR, I fixed a NPE about two weeks ago. Thanks!

@Nate

That’s a pretty awesome Trie! Is there a way to efficiently query all words that start with “moo” or something? I skimmed through the code, but since you didn’t post the encoder it’s harder to just stare at and understand how it works =P

I use the Moby word lists because it’s public domain and already in plain text format. My JRE is up to date. I downloaded the source as a zip file. The stack trace is:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
	at java.lang.String.charAt
	at org.magnos.trie.TrieSequencerCharSequence.hashOf(TrieSequencerCharSequence.java:54)
	at org.magnos.trie.TrieSequencerCharSequence.hashOf(TrieSequencerCharSequence.java:28)
	at org.magnos.trie.Trie.put(Trie.java:165)
        ...

Awesome, found the bug and fixed it! It only happened in very specific scenarios… but it’s all better now. The builds and code has been updated on GitHub.

You could walk the trie and get out words that start with, but only a contains method is implemented. The encoder is junk, super inefficient in pretty much all ways, but since it runs offline it isn’t a huge deal. It’d be better to use something like your trie then write out the readonly encoding. I’d post the encoder but I’m not yet sure if I’ll do something with the trie.

I also just added TrieSet, it implements java.util.Set and is pretty easy to create:


TreeSet<String> set = new TreeSet<String>(Tries.forStrings());

set.add( "meow" );

assert ( set.contains( "meow" ) );

I can’t think of anything I’d need this for in a game but I have a non-game use for it. I was actually looking at trie a few weeks ago wondering if there was a java lib out there that would make it easy to pull off.
* omadawn heads over to github.

TY

No problem!