[Renamed] Ye mighty lexer compiler!

Updated 20 July 2012 with Github link

I’ve done it! Ye mighty auto tokenizer lets you define a grammar and tokenize strings. I’ve never done this before, looked at other code, or read anything on how to do it, so there’s a chance I’ve used a pattern without ever knowing it, etc.

You feed it a token grammar and a string, and this son-of-a-gun will give you the next token. Warning: like any declarative language, it will do exactly what you tell it to do, so do not give it faulty grammars (like ones that will accept nothing)!


    // Grammar being built (EBNF):
    //   whitespace  = " " {" "} .
    //   word        = LETTER {LETTER} .
    //   anotherWord = whitespace word .
    //   sentence    = word {anotherWord} ["."] .
    ArrayList<Node> alphaNumericList = new ArrayList<Node>();
    ArrayList<Node> sentenceList = new ArrayList<Node>();
    ArrayList<Node> anotherWordList = new ArrayList<Node>();
    ArrayList<Node> wordList = new ArrayList<Node>();
    ArrayList<Node> whitespaceList = new ArrayList<Node>();

    Node whitespace;
    Node word;
    Node anotherWord;
    Node sentence;

    // whitespace = " " {" "} .
    whitespaceList.add(TerminalFactory.createTerminalString(" "));
    whitespaceList.add(RepetitionFactory.createRepetition(TerminalFactory.createTerminalString(" ")));
    whitespace = ListFactory.createList(whitespaceList);

    // Built but unused in this example.
    alphaNumericList.add(number());
    alphaNumericList.add(letter());

    // word = LETTER {LETTER} .
    wordList.add(letter());
    wordList.add(RepetitionFactory.createRepetition(letter()));
    word = ListFactory.createList(wordList);

    // anotherWord = whitespace word .
    anotherWordList.add(whitespace);
    anotherWordList.add(word);
    anotherWord = ListFactory.createList(anotherWordList);

    // sentence = word {anotherWord} ["."] .
    sentenceList.add(word);
    sentenceList.add(RepetitionFactory.createRepetition(anotherWord));
    sentenceList.add(OptionFactory.createOptional(TerminalFactory.createTerminalString(".")));
    sentence = ListFactory.createList(sentenceList);

    // parse returns how many characters form a valid token (see output below).
    System.out.println(Parser.parse("The dirty fairies are dead", sentence));
    System.out.println(Parser.parse("The dirty fairies are dead.", sentence));
    System.out.println(Parser.parse("The dirty fai.ries are dead.", sentence));

Output:

26
27
14

(26 is the length of “The dirty fairies are dead”; 27 includes the trailing period; 14 is the length of “The dirty fai.”, the longest prefix of the third string that the grammar accepts.)

Legal disclaimer:
By downloading the said file, knowingly or not, you agree to have no rights to its code or your knowledge of the knowledge gained upon mentally processing it. You have no copying rights, understanding rights, or right to process any thoughts derived from the knowledge of the said file. You are however given the right to live and breathe under the condition you do so without violation of any stated rule in this disclaimer.

Code available at https://github.com/keldon85-avasopht/mighty-parser

As for the approach, I’ve never done anything like this before; I just looked at the rules for grammars and created factories for them. The parser does most of the work, but you can see from the [poorly (un)commented] code how it works.

One more thing: I will eventually add some form of faulty grammar detection. Basically, if you can reach an end node without making a decision or entering another node, then that node is faulty (it accepts nothing). A node is also faulty if it contains a faulty grammar inside itself, i.e. node.getInside() is faulty, so I will conjure up the code for that at some time.
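
To make the idea concrete, here is a minimal sketch of what such a check could look like. It is hypothetical: isFaulty and acceptsNothing are assumed helper names; only getInside() comes from the actual code.

    // Hypothetical sketch of the planned faulty-grammar check.
    static boolean isFaulty(Node node) {
        // Faulty if an end node is reachable without consuming input or
        // making a decision, i.e. the node accepts nothing.
        if (acceptsNothing(node)) {
            return true;
        }
        // Faulty if the grammar nested inside the node is itself faulty.
        Node inside = node.getInside();
        return inside != null && isFaulty(inside);
    }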

That seems like a lot of code to recreate String.indexOf(). Perhaps a fuller example might help extol the benefits?

It does a lot more than String.indexOf; it allows you to define a grammar for parsing. That was just an example to show how to define a grammar for a simple sentence, which could end with a period. You can define any grammar you want: HTML, CSS, a command system in French; you can even define a programming language and have it parse that.

Next I will upgrade the code to return the parse tree too, which will make it “ye mighty auto lexer”, or something.

I’ve made a further update to have it select from a list of grammars, returning the grammar that gave the best match. Note that ambiguous grammars will give ambiguous results, such as one grammar accepting (only) whitespace and another that can begin by accepting whitespace. On another note, I’ve added detection of faulty grammars!
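
The idea behind the best-match selection is simple to picture; here is a minimal sketch, assuming only the Parser.parse(String, Node) call shown above (the actual selection API may differ), with input standing in for the string to tokenize:

    // Pick whichever grammar accepts the longest prefix of the input.
    // Ambiguous grammars can tie, as noted above.
    Node[] grammars = { sentence, whitespace };
    Node bestGrammar = null;
    int bestLength = -1;
    for (Node grammar : grammars) {
        int length = Parser.parse(input, grammar);
        if (length > bestLength) {
            bestLength = length;
            bestGrammar = grammar;
        }
    }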

Can’t make head nor tail of your example.

However, is this not something that is achievable with regular expressions?

The example is simple; I defined the following grammar …

word = LETTER {LETTER} .
anotherWord = whitespace word .
sentence = word {anotherWord} ["."] .

… and then passed the grammar the provided sentences to see how much of it is accepted by the grammar.

It is achievable using regular expressions (or JavaCC); however, figuring this out by yourself from scratch, without resorting to any textbook or reference on how it is (or could be) achieved, is something that not a lot of people can claim experience with. It also allows me to implement it anywhere, such as on an embedded device restricted to C/assembler, or just about anywhere else I can think of.

There is more to be added; for example, it currently tells you how much has been accepted, but does not quite tokenize to the last reasonable end for a grammar. I’ve modified the code to determine whether a string is acceptable in its entirety, but I will make one final modification so that it only returns a reasonable string that is entirely compatible (if you get what I mean).

Okay, I’ve made the next update: the parser will now return the largest valid token from a string, which it was not doing before. Previously the code could either determine whether a string completely satisfies a grammar, or extract the largest string that is accepted. Now it extracts the largest token that fully satisfies a grammar, and hence is a complete tokenizer & lexer.

Now, if I want, I could have some factories that load a grammar from a file or something, or even from a regular expression.

Sorry, I don’t understand anything; could you explain a simple test case, please?

—=== THIS POST CONTAINS A SIMPLE EXPLANATION AND TEST CASE ===—

Okay, basically this code can automatically tokenize any string based on the [CFG] rules you give it. And for free it also doubles up as an automatic lexer.

So as a test case, consider the following grammar (shown in the last post, and also in the code example) …

word = LETTER {LETTER} .
anotherWord = whiteSpace word .
sentence = word {anotherWord} ["."] .

… and also the following rule (also defined in the code) …

whitespace = " " {" "} .

… if you were to call Parser.parse(“This is a lovely valid sentence. This is the next sentence!”, sentence), it would return the number of characters that form a valid sentence, i.e. the length of “This is a lovely valid sentence.”

If you were then to call Parser.parse(“       This is not a valid sentence because sentences do not begin with whitespace.”, sentence), it would return -1 to tell you that it could not find a token. If you were to pass whitespace instead of sentence, it would return 7, i.e. the seven leading spaces!
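
In code, using the sentence and whitespace nodes built in the earlier example, that looks like the following (return values as described above):

    // sentence accepts the first 32 characters: "This is a lovely valid sentence."
    System.out.println(Parser.parse("This is a lovely valid sentence. This is the next sentence!", sentence));
    // No valid sentence token at the start of the string: -1
    System.out.println(Parser.parse("       This is not a valid sentence.", sentence));
    // The whitespace grammar accepts the seven leading spaces: 7
    System.out.println(Parser.parse("       This is not a valid sentence.", whitespace));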

My latest code update will also return a lexeme (as com.avasopht.mightyParser.traversing.Token).

Edit: I did a little speed test with a valid sentence of 12,031 characters; it searched 2,140,577 nodes and took 1.469 seconds to parse. Bear in mind that the grammar depicting sentences involves branching at each character, which increases the search space. If you have a specific word, it generates a list that does not require so many nodes, so if your grammar defines the text “some_reserved_keyword”, it will not need to branch at each letter.
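
For instance, a reserved keyword can be a single terminal node built with the factory from the earlier example, rather than a letter-by-letter grammar:

    // One terminal for the whole keyword: no per-letter branching, unlike
    // word = LETTER {LETTER}, which must branch at each character.
    Node keyword = TerminalFactory.createTerminalString("some_reserved_keyword");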

Something like…


digits = DIGIT {DIGIT} .
decimal = digits {["."]} {digits} {["e"] digits} .

?

Yes, although it should be …

decimal = digits ["." digits] ["e" digits] .

… as the one you wrote will accept “123...123e123e123e123”. And never put an option inside a repetition, where it can be accepted infinitely; take out my grammar check and turn on debugging in Parser.java to see the damage that could do :)
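
Built with the same factories as the sentence example, the corrected grammar might look like this; a sketch, assuming number() is the digit helper from the earlier code:

    // digits = DIGIT {DIGIT} .
    ArrayList<Node> digitsList = new ArrayList<Node>();
    digitsList.add(number());
    digitsList.add(RepetitionFactory.createRepetition(number()));
    Node digits = ListFactory.createList(digitsList);

    // "." digits
    ArrayList<Node> fractionList = new ArrayList<Node>();
    fractionList.add(TerminalFactory.createTerminalString("."));
    fractionList.add(digits);
    Node fraction = ListFactory.createList(fractionList);

    // "e" digits
    ArrayList<Node> exponentList = new ArrayList<Node>();
    exponentList.add(TerminalFactory.createTerminalString("e"));
    exponentList.add(digits);
    Node exponent = ListFactory.createList(exponentList);

    // decimal = digits ["." digits] ["e" digits] .
    ArrayList<Node> decimalList = new ArrayList<Node>();
    decimalList.add(digits);
    decimalList.add(OptionFactory.createOptional(fraction));
    decimalList.add(OptionFactory.createOptional(exponent));
    Node decimal = ListFactory.createList(decimalList);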

Edit: just saw the TinyCode challenges, interesting!

I added a few { and } because those parts are optional.

I wouldn’t know how I’d do it any other way.

123
123e456
123.456
123.456e789

Ah lol; I write {} as [optional] repeats and [] as options, whereas you write {} as options! And although I don’t do it on purpose, I [often] tend to write optional words in square brackets when writing [normal] sentences in chat :D

OK, I’ve written what appears to be the final [necessary] update. There was a possibility of terminating early and not getting the largest token: the parser could accept every character in the string and reach an end node without fully satisfying the grammar (i.e. an incomplete branch). Although it never occurred (in my presence), that bug has been fixed. The only expected updates (as of now) will be commenting.

Note that most of the work takes place in the Parser, which is just your everyday depth-first search, using iteration as opposed to recursion! The rest are simple factories, which I may update to be instances, removing the need for calling code to know about lists and collections.
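
The search pattern itself is the classic one; a minimal sketch of iterative depth-first search with an explicit stack is shown below. This is generic, not the actual Parser internals; State is a hypothetical search state tracking the position in the grammar and the characters consumed so far.

    // Generic iterative DFS over search states, keeping the longest
    // accepted prefix found at any end node.
    Deque<State> stack = new ArrayDeque<State>();
    stack.push(initialState);
    int longest = -1;
    while (!stack.isEmpty()) {
        State state = stack.pop();
        if (state.isEndNode()) {
            longest = Math.max(longest, state.charactersConsumed());
        }
        for (State next : state.successors()) {
            stack.push(next);
        }
    }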

That’s all nice, but a product without proper ‘marketing’ is worthless.

Could you show us something that might drop a few jaws?

Very true! Well, my purpose in writing it came down to two motivations: we don’t have anything like this in our code repository at work, especially for embedded [memory-starved] systems, and it was really nagging me that I knew it could be done with little difficulty. Apart from that it offers nothing above the others, although I am considering left-recursion removal, or at least detection of it.

If I can achieve removal of left recursion I’ll think about pushing it as a useful tool; second to that, I’m working on a C++ version. As for some ‘jaw-dropping’ value, I’ll think of something!

OK, thanks for the explanation you posted above; I have a better understanding now, but maybe I still missed something :(

What is better about it than directly using regular expressions? I guess it is more powerful, enabling more stuff?

It’s not really about it being more powerful or better; it’s all personal. I literally woke up at 6am Wednesday morning, after going to sleep at 1, thinking that this was possible. I had no idea I’d have a solution by Friday morning (or Thursday evening, when I completely figured it out). My first idea (on Wednesday morning) was terribly dumb; I’m quite surprised at how easy it was. It’s basically a really simple search for the longest valid placement of letters that leads to the end node!

Technically speaking, iterative deepening should not only find the solution but do so without getting caught in an infinite loop on left recursion! Source code is available online for anyone who wants to have a try; I’ll even race ya!! On second thoughts, it wouldn’t: the search is still infinite, so it would tell you whether the entire string is valid, but not return a token.

I’ve made a further advancement in the code, which I need to upload some time. Basically, the parser no longer needs to save the stack, as it is redundant in the presence of the “choice stack”. This [should] give it a significant speed improvement.