Tagged Human JSON (THJSON) Parser

princec · July 11, 2017, 2:51pm

There is a git repository here:

Sooner or later, all developers try to store some sort of structured data to configure their applications in some way. Perhaps it’s simple configuration files pointing to databases, perhaps it’s data files for game entities, perhaps it’s i18n translation files.

There are a million data formats that you might consider using. Probably the five most popular formats are:

INI files
XML files
YAML files
JSON files
HJSON files

Each has advantages and disadvantages for various use cases. But what if you wanted a format that comfortably handled ALL use cases?

INI files can’t handle any kind of nested structure. The only concept is blocks of key=value pairs where each value must fit on a single line. It allows comments.
XML files are huge and verbose, and the syntax is unnecessarily pernickety for human beings to write. It is well-specified but bloated with features that can tie you in knots. It does cope with arbitrarily complex data well, though this can be its undoing. You can comment it but often in a restricted manner. But seriously, it’s a faff and everybody hates typing it.
YAML is a hideous format designed by devils. It contains a kitchen sink, manufactured in Hell. It too is pernickety with syntax requiring actually properly formatted whitespace, which is a sure sign of Satan’s work. We shall not speak of it again.
JSON files are very simple to parse for a computer. Unfortunately the syntax is also unnecessarily pernickety for human beings to type and hurts the eyes somewhat with all the unnecessary quotes and commas, and the only data structures it can directly deal with are primitive values, arrays, and maps. What’s worse is it can’t have comments in it, which renders it pretty useless for humans.
HJSON is getting close to perfection. You can read all about HJSON at https://hjson.org and even try it out in the browser. HJSON removes all the unnecessary syntactical cruft from JSON, making mostly everything optional; and it adds a few new features such as comments in every style imaginable and multiline strings. HJSON is easily converted into JSON for feeding to other APIs. Unfortunately HJSON has one major flaw, which is that it cannot handle the concept of “classes”. We can have null, numbers, booleans, strings, arrays, and maps, all nested arbitrarily… but we cannot declare a map to be a structured type, or “class”. This is a big pain in the arse if you are dealing with modern OOP languages, because they deal a whole lot in classes.

Enter THJSON.

THJSON looks almost exactly like HJSON - it is in fact a superset of HJSON, which itself is a superset of JSON. The extra bit is the addition of a class name before a map (maps are objects that are enclosed in {} braces):


left_hand: sword {
    damage: 3
    weight: 1kg
}

Essentially, any value that is parsed as a string and is then followed by an opening curly brace becomes a class instead, whose type name is the string read so far. No whitespace is allowed in the type name unless it is quoted.

We can do the same for arrays - define a class type for the array. We don’t actually check that the elements themselves conform to the type - indeed we don’t do anything with the class type name other than pass it to the stream listener:


inventory: item [sword, axe, shoes, tea, "no tea"]

The streaming parser presented here is pretty high-speed and generates almost no garbage. In fact it only even has to allocate any objects if you start using escapes in string data. It otherwise accepts UTF-8 data as a byte[] array, which can be in either Windows, *nix or Mac line ending format, and will fire off a streamed sequence of events to a listener pointing at subsections of the input array as tokens.

In the git repository there’s an example listener that converts the stream of tokens into a Google Json object. Classes are converted into JSON objects by simply creating a property called “class”:


"left_hand":{
    "class":"sword",
    "damage":3,
    "weight":"1kg"
}

Lists create an object of class “array” with a property “elements”:


"inventory":{
    "class":"array",
    "elements":
        [
            "sword",
            "axe",
            "shoes",
            "tea",
            "no tea"
        ]
    }

That’s not the “definitive” way to do it but it’s what the example listener does. That’s about it really.
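For illustration, here is a minimal sketch of that conversion. It uses plain java.util collections rather than the Google JSON types so that it stays dependency-free, and the names ClassTagJson, taggedMap and array are hypothetical, not taken from the repository:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the THJSON repository. */
final class ClassTagJson {

    /** A tagged map becomes an object whose first property is "class". */
    static Map<String, Object> taggedMap(String className, Map<String, Object> members) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("class", className);
        out.putAll(members); // members keep their original order after "class"
        return out;
    }

    /** A list becomes an object of class "array" with an "elements" property. */
    static Map<String, Object> array(List<?> elements) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("class", "array");
        out.put("elements", elements);
        return out;
    }
}
```

A member that happens to be called "class" would overwrite the tag in this sketch; a real listener would have to decide how to handle that clash.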

Note that there isn’t a complementary THJSON writer in this repository… the stream reader presented here is “lossy” in that it discards comments and whitespace. It probably wouldn’t be too hard to create a writer. Maybe that’s next.

Cas

Note
I cannot stand git. I do not understand git. Whatever UI comes with Eclipse these days serves only to make it impossibly complex. I have no idea how to use it and whenever I try it literally refuses to do anything, at all. In the end I resorted to dragging and dropping my src folder onto the github webpage to add files. How they could make something so simple as source control so impossible in Eclipse is a mystery. Anyway… don’t ask me to do anything with this repo, because I can’t figure it out.

SHC · July 11, 2017, 2:55pm

Wait, a repost by mistake, I guess? It seems to have been removed just before I posted…

By the way, if you wanted to learn GIT, use the command line. I find myself lost in a different galaxy far far away when using IDE tools.

klaus · July 11, 2017, 3:03pm

Jup, I can confirm that. Learnt it by using the command line. I still do complex GIT related stuff on the command line, for the basic commit/push/pull I use the IDE.

EDIT: But to stay on-topic. Cool thing! I didn’t know of HJSON and THJSON certainly looks like a neat thing to use for structured config data. For the configuration of my game entities I use spring beans. For complex configurations I think they are a good choice, because you can inherit from other configured spring beans.

princec · July 11, 2017, 3:06pm

I couldn’t even get the basic stuff to work from Eclipse with github. I literally couldn’t figure it out, at all, even with the whole of the Internet to help me. I think maybe that’s indicative of some sort of major problem… I was sort of expecting them to streamline it for idiots like myself after a few years but here we are a few years later and it’s still ludicrous.

SVN FTW.

Cas

CJC · July 11, 2017, 4:08pm

IntelliJ makes it relatively easy. I’m no git expert, but like the others, I’d recommend looking at the command line; once you understand even the basics, it’ll be easier to use the tools.

This would definitely be more useful with a writer! Get on it Cas!

princec · July 11, 2017, 4:15pm

Hmm… exactly what input would the writer actually take?

Cas

CJC · July 11, 2017, 9:12pm

Good question. Not sure. Some form of JSON maybe, with the name of the maps (for instance) being surrounded by backslashes instead of quotes?

princec · July 11, 2017, 9:15pm

I suppose I could just do it in reverse… provide a consumer of events that a writer will just use to build up a big ol’ pretty output string.
Shouldn’t take that long…
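
A rough sketch of what such an event-consuming writer might look like. The class and method names (SketchThjsonWriter, beginMap, value, endMap) are made up for illustration, and it does no quoting or escaping at all, so it assumes keys and values never need quotes:

```java
/** Hypothetical event consumer that builds THJSON text; not the repository's writer. */
final class SketchThjsonWriter {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    /** Starts a map member; className may be null for an untagged map. */
    SketchThjsonWriter beginMap(String key, String className) {
        indent().append(key).append(": ");
        if (className != null) {
            out.append(className).append(' ');
        }
        out.append("{\n");
        depth++;
        return this;
    }

    /** Writes a simple member, one per line, with no quoting or escaping. */
    SketchThjsonWriter value(String key, Object value) {
        indent().append(key).append(": ").append(value).append('\n');
        return this;
    }

    /** Closes the most recently opened map. */
    SketchThjsonWriter endMap() {
        depth--;
        indent().append("}\n");
        return this;
    }

    private StringBuilder indent() {
        for (int i = 0; i < depth; i++) {
            out.append("    ");
        }
        return out;
    }

    @Override
    public String toString() {
        return out.toString();
    }
}
```

Feeding it beginMap("left_hand", "sword"), two value calls and endMap reproduces the sword example from the first post.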

Cas

nsigma · July 11, 2017, 9:49pm

For luddites everywhere

https://help.github.com/articles/support-for-subversion-clients/

dime26 · July 11, 2017, 11:25pm

Looks interesting! Plus you made me laugh with your comments… “YAML is a hideous format designed by devils”

I really love Git but as someone else stated just use the command line.

princec · July 11, 2017, 11:28pm

Nearly finished the output side of it too now… more work on it tomorrow.

Cas

princec · July 12, 2017, 2:32pm

There we go, all done. You can now write THJSON as well. It can be fed either JSON, a Map<String, Object>, or POJOs.

POJO output is rather simple and comes with some specific restrictions:

Object graph must be acyclic!
There are no “references”… everything is written out verbatim, every time
Though you might expect it to, the POJO writer does not natively understand Lists or Maps in your POJOs because it can’t make guarantees about the generic types or even the class involved. So it outputs them like it outputs any other POJO.
… however it can handle arrays and other nested POJOs.

In theory I could also make an annotation-controlled Java object serializer that would address some of these shortcomings, and introduce a proper object graph that can handle references and cycles, but that’s for another day.

I’ve not made the POJO deserializer yet… maybe tonight.

Also fixed numerous little bugs in parsing and output.

Cas

princec · July 18, 2017, 9:32am

So I rewrote the parser completely. It’s now hideously complicated, but it also now works for any input I’ve cared to throw at it, and isn’t going to blow up with a StackOverflowError if someone creates a rather arbitrarily complex thjson document.

I’ve also added a new feature: @directives and @functions.

At the root of the thjson document, @directives are parsed verbatim and sent to the listener, and you can then interpret the data after the @ in any way you like. For example, you could use it to do some sort of “include” system:


@include "/thingy.thjson"
@include http://www.thjson.org/test.thjson # Comments at end of line are not passed to listener though

You can include @functions in any place where a member value or array value is present. The contents of the @function are again passed to the listener to be interpreted in any way you like, and you must return a String which is then given to the parser as a substituted value. The value itself is then parsed to see if it’s a boolean literal, number, or string:


{ x: @getX, y: @getY, z: [@getZ] }
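
As a sketch of that last step, this is one way the substituted string could be classified as a boolean literal, a number, or a string. The class name SubstitutedValue is hypothetical, and the number grammar here is the plain JSON one, which is an assumption; the real parser may accept more or less than this:

```java
import java.util.regex.Pattern;

/** Hypothetical classifier for a value returned by an @function; not the repository's code. */
final class SubstitutedValue {
    // JSON-style number grammar (an assumption about what counts as a number).
    private static final Pattern INTEGER = Pattern.compile("-?(0|[1-9]\\d*)");
    private static final Pattern DECIMAL =
            Pattern.compile("-?(0|[1-9]\\d*)(\\.\\d+)?([eE][+-]?\\d+)?");

    /** Returns a Boolean, Long, Double, or the original String. */
    static Object parse(String text) {
        String t = text.trim();
        if (t.equals("true")) return Boolean.TRUE;
        if (t.equals("false")) return Boolean.FALSE;
        if (INTEGER.matcher(t).matches()) {
            try {
                return Long.valueOf(t);
            } catch (NumberFormatException tooBig) {
                // Too large for a long: fall through and treat it as a double.
            }
        }
        if (DECIMAL.matcher(t).matches()) return Double.valueOf(t);
        return text; // anything else, such as 1kg, stays a string
    }
}
```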

Cas

princec · July 19, 2017, 10:28am

Hmm so I think trying to be fancy and have a listener return events on a byte[] array with offsets and lengths is maybe a bit too much of a step too far in the quest for “efficiency” at the expense of “flexibility” and “utility”.

Simply accepting input only in the form of a byte[] array is way less flexible than an InputStream, even if it is “faster”.

Similarly, passing “strings” to the listener in the form of subsets of the original byte[] array with offsets and lengths, when world+dog is just going to turn it into a String anyway, also seems a bit onerous. And likewise the parsing of values which is an exercise rather pointlessly left to the implementation of the listener.

So I think I’m going to change it all around to use InputStream and return Strings and parsed values to the listener, super-efficiency be damned. It’s more useful if it’s … useful.

Cas

Riven · July 19, 2017, 3:27pm

Who’s demanding this super-efficiency anyway? Get it to be ‘fast enough’, like everything else in a library/engine/game.

princec · July 19, 2017, 4:12pm

Well, exactly. It will still be very bloody quick.

Cas

princec · July 21, 2017, 10:03am

Rewrote it all to be basically “user friendly” instead of “fiddly but fast”. It’s now layered slightly, using a special input stream, which feeds a tokenizer, which then feeds a parser, which then spits out actual parsed strings and numbers. Everything possible to slow it down but make it simple.

Did some benchmarks - meaningless of course relative to everyone else but relative to each other quite interesting:

Raw “getResourceAsStream” read() throughput is 17MB/sec on this machine
The THJSONInputStream read() reads at 8MB/sec, keeping track of line & column number and converting all line endings into Unix format (single \n)
The THJSONTokenizer also manages 8MB/sec despite all the extra thinking it does to read tokens and construct them
The THJSONReader parses thjson at a rate of 7MB/sec, despite all the extra thinking it does to ensure tokens are being read in the correct order

I’m happy with reading 7MB of data in a second, I think. What is actually being done with the data that is being read is likely to be far slower.
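
To give an idea of what the bottom layer is doing, here is a minimal sketch of a stream that folds \r\n and lone \r into \n while counting lines and columns. It is not the actual THJSONInputStream; the name LineTrackingInputStream and its accessors are invented for this example, and the column here counts bytes, not characters:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical line-ending normalizer with position tracking; not the repository's class. */
final class LineTrackingInputStream extends InputStream {
    private final PushbackInputStream in;
    private int line = 1;   // 1-based line of the next byte to be read
    private int column = 0; // bytes read so far on the current line

    LineTrackingInputStream(InputStream source) {
        this.in = new PushbackInputStream(source, 1);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c == '\r') {
            // Windows (\r\n) or old Mac (\r) ending: swallow a following \n, report a single \n.
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next);
            }
            c = '\n';
        }
        if (c == '\n') {
            line++;
            column = 0;
        } else if (c != -1) {
            column++;
        }
        return c;
    }

    int line() { return line; }

    int column() { return column; }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Reading a byte at a time through an extra layer like this is exactly the kind of thing that costs throughput, so wrapping the source in a BufferedInputStream first would be the obvious thing to try.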

Cas

princec · July 26, 2017, 9:12am

Latest version now committed to Github. I’ve included a /lib/thjson.xml Notepad++ syntax definition file that colours THJSON files reasonably well.

This version now uses the results of @function call directives and feeds them back into the parser - though the default listener interface simply returns the original @directive as a JSON string so if you don’t want to do anything special then nothing special happens. If necessary you can enclose the entire call in quotes.

Example:


{ person: @"Sausage McGinty" }

The string is passed to the function() method in your listener. You can parse it any way you like. You could implement it as follows:


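// Listener callback: receives the text after the @ and returns thjson to be parsed in its place.
public String function(String text) {
	// Note: text is spliced in unescaped, so a " inside it would break the returned thjson.
	return "{ name: \""+text+"\", age: 44 }";
}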

This would then be seen by the parser as:


{ person: { name: "Sausage McGinty", age: 44 } }

It’ll allow a stack depth down to 16 levels if you get rather fancy and start having the returned thjson itself containing @calls…

So there you have it… macros.
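
The depth-limited expansion can be sketched roughly like this. Note that this is a deliberately simplified stand-in: it does the substitution on raw text with a regular expression instead of feeding the result back into a real parser, so it would also trip over an @ inside an ordinary quoted string. MacroExpander and its members are hypothetical names:

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical text-level macro expander; the real parser works on tokens, not regexes. */
final class MacroExpander {
    static final int MAX_DEPTH = 16;

    // Matches @name or @"quoted text".
    private static final Pattern CALL =
            Pattern.compile("@(\"[^\"]*\"|[A-Za-z_][A-Za-z0-9_]*)");

    static String expand(String text, UnaryOperator<String> function) {
        return expand(text, function, 0);
    }

    private static String expand(String text, UnaryOperator<String> function, int depth) {
        Matcher m = CALL.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (depth >= MAX_DEPTH) {
                throw new IllegalStateException("@calls nested deeper than " + MAX_DEPTH);
            }
            String arg = m.group(1);
            if (arg.startsWith("\"")) {
                arg = arg.substring(1, arg.length() - 1); // pass the text without its quotes
            }
            out.append(text, last, m.start());
            // The returned text may itself contain @calls, so expand it one level deeper.
            out.append(expand(function.apply(arg), function, depth + 1));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

With the function() shown above, expanding { person: @"Sausage McGinty" } gives { person: { name: "Sausage McGinty", age: 44 } }, and a function that keeps returning another @call hits the limit instead of recursing forever.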

Stuff returned from root-level @directives is not parsed - instead that’s for you to control the listener in arbitrary ways. The best uses I can think of for it are some sort of “#include” functionality and some way to map class name tags to actual Java class names in this sort of manner:


@alias monster net.puppygames.treasuretomb.Monster
@alias bullet net.puppygames.treasuretomb.Bullet
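
A listener could keep such a mapping in a simple table. The sketch below assumes the directive arrives as the text after the @, split on whitespace; the class AliasDirectives and its methods are hypothetical and not part of the repository:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical handler for "@alias tag fully.qualified.ClassName" directives. */
final class AliasDirectives {
    private final Map<String, String> aliases = new HashMap<>();

    /** Receives the directive text after the @, e.g. "alias monster net.puppygames.treasuretomb.Monster". */
    void directive(String text) {
        String[] parts = text.trim().split("\\s+");
        if (parts.length == 3 && parts[0].equals("alias")) {
            aliases.put(parts[1], parts[2]);
        }
        // Anything else is ignored here; a real listener might handle @include and friends.
    }

    /** Maps a class tag to its Java class name, or returns the tag unchanged if no alias exists. */
    String resolve(String tag) {
        return aliases.getOrDefault(tag, tag);
    }
}
```

From there, something like Class.forName(resolve(tag)) would get you from a tag in the document to an actual class.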

I do hope people find a use for this code because it’s what I’ve been sorta looking for, for the last 15 years.

Cas

princec · September 5, 2017, 9:05am

I’ve now added binary markup to thjson: https://github.com/Puppygames/thjson/blob/master/README.md

You can mark up binary in standard Base64 encoding either on a single line with ` quotes:


	base641:`YW55IGNhcm5hbCBwbGVhcw`
	base642:`YW55IGNhcm5hbCBwbGVhc3U`
	base643:`YW55IGNhcm5hbCBwbGVhc3Vy`
	base644:`YW55IGNhcm5hbCBwbGVhcw==`
	base645:`YW55IGNhcm5hbCBwbGVhc3U=`
	base646:`YW55IGNhcm5hbCBwbGVhc3Vy`

or as multiline with <<< and >>> tags:


	multilineBase64test:
	  <<<
	  abcdefghijklmnopqrstuvwxyz
	  ABCDEFGHIJKLMNOPQRSTUVWXYZ
	  0123456789+/
	  >>>

The data are returned to the listener as byte[]. I think that’s pretty much the last thing needed for this markup. Base64 is obviously not the most massively efficient binary format in the universe but it does have the advantage of being guaranteed human-readable and there’s a lot of APIs out there that use Base64.
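
For reference, turning either form into a byte[] needs very little code with the standard java.util.Base64 decoder, which treats the trailing = padding as optional. This is only a sketch of the decoding step, not the parser's own code, and BinaryMarkup is a made-up name:

```java
import java.util.Base64;

/** Hypothetical decoder for the body of a ` quoted or <<< >>> Base64 block. */
final class BinaryMarkup {

    /** Strips line breaks and indentation, then decodes standard Base64 (padding optional). */
    static byte[] decode(String body) {
        String compact = body.replaceAll("\\s+", "");
        return Base64.getDecoder().decode(compact);
    }
}
```

Both YW55IGNhcm5hbCBwbGVhcw and YW55IGNhcm5hbCBwbGVhcw== decode to the same 16 bytes, the ASCII text "any carnal pleas".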

Cas

Icecore · September 5, 2017, 10:24am

We need even Moarrrrrr custom syntaxes and languages :)
(It’s not sarcasm, we really need them,
but not stupid copy-paste jobs or changing 2-3 lines of something that already exists, like some kids do.
“You can do it alone, it’s not so hard… really not so hard.”)