Creating a Text Parser

Greetings everyone :). I’d like to create a simple text adventure game in Java. I’ve done a lot of Java application programming, and I already have the GUI for the game set up, but I’m clueless about creating a text parser to interpret the typed commands. Can anyone point me in the right direction? Thanks :).

There is StringTokenizer, String.split (1.5), Scanner (1.5)… and there is String.equalsIgnoreCase…

I think that’s pretty much everything you need.
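For instance, a tiny dispatcher combining StringTokenizer and equalsIgnoreCase might look like this (the verbs and response strings are just made up for illustration):

```java
import java.util.StringTokenizer;

// Minimal 1.4-compatible command reader; names and messages are illustrative only.
public class SimpleParser {

    // Splits the input on whitespace and dispatches on the first word.
    public static String interpret(String input) {
        StringTokenizer st = new StringTokenizer(input);
        if (!st.hasMoreTokens()) {
            return "Beg your pardon?";
        }
        String verb = st.nextToken();
        if (verb.equalsIgnoreCase("look")) {
            return "You see nothing special.";
        } else if (verb.equalsIgnoreCase("open") && st.hasMoreTokens()) {
            return "You open the " + st.nextToken() + ".";
        }
        return "I don't understand \"" + verb + "\".";
    }

    public static void main(String[] args) {
        System.out.println(interpret("open door"));
        System.out.println(interpret("LOOK"));
    }
}
```

Obviously this doesn’t scale past a handful of verbs, but it shows the core idea: tokenize, then compare case-insensitively.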

So if you’re programming for a 1.4.2 target, then basically you’re left with StringTokenizer and a whole bunch of if statements?

The first step is making a String array out of that other String.

You can do that for example with something like:

String cmd = "foo bar \"foo bar\" foo";
[...]
ArrayList commands = new ArrayList(); // java.util.ArrayList; raw type, works on pre-1.5 targets
char[] chars = cmd.toCharArray();
boolean open = false;
StringBuffer sb = new StringBuffer();
for (int i = 0; i < chars.length; i++) {
	if (chars[i] == '\"') {
		open = !open;
	} else if (chars[i] == ' ' && !open) {
		commands.add(sb.toString());
		sb = new StringBuffer();
	} else {
		sb.append(chars[i]);
	}
}
if (sb.length() > 0) {
	commands.add(sb.toString());
}

String[] pieces = new String[commands.size()];
commands.toArray(pieces);

The pieces array then contains “foo”, “bar”, “foo bar”, “foo”… all that’s left is looping over it (for example).

You could use HashSets and HashMaps for interpretation. In a simple example there’s just one Set with all known verbs and one Set with all known objects. Use StringTokenizer, String.toLowerCase() and String.replaceAll() to create normalized tokens and test them against the verbs and objects to sort out unnecessary/unknown words. Now create a command-key in the form “verb/object1/object2” and use this to look up a command object in a HashMap. Such a HashMap could look like this:


HashMap commands = new HashMap();
commands.put("open/door", new Command() { public void execute(String[] args) { openDoor(); } });
commands.put("connect/camera/tv", new Command() { public void execute(String[] args) { connectCameraToTV(); } });
// etc.

Notice the args parameter to the execute method. In most cases you don’t want to create an extra method for each combination of objects, so you can test HashMap keys in a loop, starting with the longest key “verb/object1/object2”; if you had no match, test a key one segment shorter, “verb/object1”, and pass the dropped token as a parameter. This way you can make somewhat generic commands that take additional tokens as parameters. Instead of an inline Command instance you could also use method names as strings and execute them via reflection (java.lang.reflect.Method). That makes the mapping more compact, but I like the advantage of the compiler checking for typos when using inline objects.
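That longest-key-first lookup might be sketched like this (the Command interface, the result strings, and the class wrapper are assumptions for illustration, so it compiles standalone):

```java
import java.util.HashMap;

public class CommandLookup {

    // Hypothetical command interface, as in the mapping above.
    public interface Command {
        void execute(String[] args);
    }

    static String lastResult;

    // Raw HashMap as in the original example; keys are "verb/object..." strings.
    static final HashMap commands = new HashMap();
    static {
        commands.put("open/door", new Command() {
            public void execute(String[] args) { lastResult = "door opened"; }
        });
        commands.put("connect/camera/tv", new Command() {
            public void execute(String[] args) { lastResult = "camera connected to tv"; }
        });
    }

    // Builds keys from the full token list down to just the verb;
    // tokens that drop off the end of the key become the command's arguments.
    public static boolean dispatch(String[] tokens) {
        for (int len = tokens.length; len > 0; len--) {
            StringBuffer key = new StringBuffer();
            for (int i = 0; i < len; i++) {
                if (i > 0) key.append('/');
                key.append(tokens[i]);
            }
            Command c = (Command) commands.get(key.toString());
            if (c != null) {
                String[] args = new String[tokens.length - len];
                System.arraycopy(tokens, len, args, 0, args.length);
                c.execute(args);
                return true;
            }
        }
        return false; // no key matched, not even the bare verb
    }
}
```

So dispatch(new String[]{"open", "door", "quickly"}) first tries “open/door/quickly”, then matches “open/door” and passes “quickly” along in args.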

You can extend the above in arbitrary ways like using two sets for level1 objects and level2 objects to define some sort of argument order (you will have to sort the possible token order in some way to reduce combinations) or adding other grammar types than verbs and objects like adverbs, conjunctions, etc. Also some more parameters (e.g. the current room) might make sense. This is just an example to get you thinking.

Essentially you are creating a syntax tree by specifying the keys in the described form. Abusing a “flat” HashMap is just the quickest approach to create such a mapping; it is not the fastest or best approach to process it.

To map the structure more appropriately, you could create a nested HashMap tree, where all branching nodes are HashMaps and each leaf is a Command object. This way you do not have to append your tokens into a single string key, but instead use the first token (e.g. the verb) to query the root-level HashMap. If the result is a Command, call execute on it and pass the remaining tokens as parameters. If it is a HashMap, use the second token to query the next Command or HashMap one level deeper, and so on.
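Such a tree walk could look roughly like this (again, the Command interface, node layout and result strings are assumptions made up for the sketch):

```java
import java.util.HashMap;

public class CommandTree {

    // Hypothetical command interface, as before.
    public interface Command {
        void execute(String[] args);
    }

    static String lastResult;

    // Root node: verb -> (Command leaf | nested HashMap keyed by the next token).
    static final HashMap root = new HashMap();
    static {
        HashMap openNode = new HashMap();
        openNode.put("door", new Command() {
            public void execute(String[] args) { lastResult = "door opened"; }
        });
        root.put("open", openNode);
        root.put("look", new Command() {
            public void execute(String[] args) { lastResult = "looking"; }
        });
    }

    // Walks the tree one token at a time; leftover tokens become arguments.
    public static boolean dispatch(String[] tokens) {
        Object node = root;
        for (int i = 0; i < tokens.length; i++) {
            if (!(node instanceof HashMap)) break;
            node = ((HashMap) node).get(tokens[i]);
            if (node instanceof Command) {
                String[] args = new String[tokens.length - i - 1];
                System.arraycopy(tokens, i + 1, args, 0, args.length);
                ((Command) node).execute(args);
                return true;
            }
        }
        return false; // ran out of tokens or hit an unknown word
    }
}
```

Compared to the flat map, each input token costs one HashMap lookup instead of rebuilding and probing ever-shorter string keys.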

If you have regexes available, you can also “explode” a string this way.

This particular exploder will group anything between " " as a single element, though it doesn’t have logic for escaped quotes, so you cannot otherwise use quotes…


import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Either a quoted run (group 1, quotes stripped) or a run of non-whitespace (group 2).
static final Pattern splitPattern =
        Pattern.compile("(?:\"([^\"]*+)\")|(\\S+)");

public static String[] explode(String text) {
    Matcher m = splitPattern.matcher(text);
    List<String> list = new ArrayList<String>();
    while (m.find()) {
        String tok = m.group(1);   // quoted token
        if (tok == null) {
            tok = m.group(2);      // plain token
        }
        list.add(tok);
    }
    String[] out = new String[list.size()];
    return list.toArray(out);
}
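Just to show what it produces: wrapping the same pattern and method in a small standalone class and feeding it the quoted sample string from earlier yields four tokens, with the quoted pair kept together:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone wrapper around the exploder above, just to demonstrate its output.
public class ExplodeDemo {

    static final Pattern splitPattern =
            Pattern.compile("(?:\"([^\"]*+)\")|(\\S+)");

    public static String[] explode(String text) {
        Matcher m = splitPattern.matcher(text);
        List<String> list = new ArrayList<String>();
        while (m.find()) {
            String tok = m.group(1);   // quoted token
            if (tok == null) {
                tok = m.group(2);      // plain token
            }
            list.add(tok);
        }
        return (String[]) list.toArray(new String[list.size()]);
    }

    public static void main(String[] args) {
        String[] pieces = explode("foo bar \"foo bar\" foo");
        for (int i = 0; i < pieces.length; i++) {
            System.out.println(pieces[i]); // foo, bar, foo bar, foo
        }
    }
}
```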


Thanks for all of the replies so far. I see that String.split is actually part of 1.4, so it can still be used when you’re developing for a 1.4.2 platform.

I’m going to have to study all of these replies I’ve received so far, and if anyone would like to, please feel free to continue to reply. The more information the better. Thanks :).

[quote]see that String.split is actually part of 1.4[/quote]

Oops… my bad. I actually wrote “1.4” first, but changed it to “1.5” for some reason.

[quote]I’m going to have to study all of these replies I’ve received so far, and if anyone would like to, please feel free to continue to reply. The more information the better.
[/quote]
Tokenizing input and parsing tokens is part of what compilers do, so you might research that area: finite state machines, regular expressions, context-free grammars, BNF, syntax trees. I found this document to be very informative (it’s about C/lex/yacc, but it has a good theory discussion):

http://epaperpress.com/lexandyacc/

I think BNF is no good in this context.

For a text adventure you want a natural language parser.

Some resources linked here: http://web.syr.edu/~mdtaffet/nlp_sites.html

One that I’ve been wanting to port to Java is http://www.link.cs.cmu.edu/link/submit-sentence-4.html

While BNF and RDPs (recursive-descent parsers) are a very interesting (and fun) topic, I agree that here they are overkill.

His syntax is not likely to be more complex than verb-noun, or verb-noun-indirect object at most.

Most MUDs use one form or another of regexp if they need anything more varied.
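A rough sketch of that regexp approach (the pattern, the preposition list, and the “verb/object/indirect” result format are all invented for illustration, not taken from any particular MUD):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MudStyleParser {

    // verb, optional direct object, optional "with/to/on <indirect object>",
    // with optional "the" articles skipped. Grammar invented for this sketch.
    static final Pattern COMMAND = Pattern.compile(
            "(?i)^\\s*(\\w+)(?:\\s+(?:the\\s+)?(\\w+))?" +
            "(?:\\s+(?:with|to|on)\\s+(?:the\\s+)?(\\w+))?\\s*$");

    // Returns "verb/object/indirect" with empty slots for missing parts,
    // or null if the input doesn't fit the pattern at all.
    public static String parse(String input) {
        Matcher m = COMMAND.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String verb = m.group(1).toLowerCase();
        String obj = m.group(2) == null ? "" : m.group(2).toLowerCase();
        String ind = m.group(3) == null ? "" : m.group(3).toLowerCase();
        return verb + "/" + obj + "/" + ind;
    }
}
```

The resulting slash-separated string could then feed straight into the HashMap dispatch discussed earlier in the thread.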