Once again! fast MappedObjects implementation

Pretty sure it’s just a classic case of miscommunication/misunderstanding between the dream team. They’re cool enough to sort it out and get back to kicking some performance butt. :slight_smile:


http://gabrielhummel.com/wp-content/uploads/2011/06/relationship-graph.png

Awww, but imagine the fight as these two massive brains attack each other - I’d pay per view for sure!

Kev

Celebrity deathmatch!

Cas :slight_smile:

Yeah men, f**ck the PMs, do it right here :stuck_out_tongue:

For the apparent jerry, jerry screamers: there are no PMs. There is nothing.

Somebody steals my work with the self-righteous justification ‘because I don’t like the way you communicate’.

I feel betrayed and it genuinely surprises me how depressed I am. I’ll leave it at that. Don’t want to sound too pathetic.

I thought you donated the code to the LWJGL project? (Of which Spasi is one of our main contributors/fiddlers)

Cas :slight_smile:

There is this little thing called context.

I was working on the codebase for him, because he was requesting features. Meanwhile he deliberately didn’t inform me that he was rewriting the code from scratch (see the SVN repository), taking my code and the ideas I had shared on how to implement certain features, and making them his own, all behind my back.

How is this hard to understand? Seriously.

I think it’s not a matter of permission, but of style…

Sounds like a subtle communication confusion to me. Best that neither of you gets wound up about it, though; stick to coding. Spasi’s entitled to do as he pleases with anything in the LWJGL SVN repo, and Riven is assuredly entitled to do whatever he wishes with his code. It’s doubtless better that they don’t fork and end up duplicating everything, but at the end of the day Riven’s code is a “foreign addition” to LWJGL and quite likely to undergo style changes to get it to fit in with the existing code, and Spasi is not known to do things by halves… but still: it’d be nice if you settled your differences over it and carried on :slight_smile:

Cas :slight_smile:

It seems I can’t get my point across. I don’t want this thread to end up in whining and whatnot, so I’ll just leave it as is.

I would have preferred to have a discussion with Riven about this in private, but it sounds like he’s too upset to contact me. Since this is all public now, I guess I have to explain my version of the “context”.

First of all, this is indeed a project he started, the code committed to LWJGL was all his and I told him from the first day that the way he used bytecode transformation to implement it was brilliant. All this time I had been doing lots and lots of testing, on existing functionality, on performance and sometimes trying new features that I implemented locally, using his code. Because of my feedback the library changed and improved and again Riven was doing the final coding. Then it got into LWJGL and I started contributing to the actual code; documentation, runtime checks, minor features.

Then we discovered the performance issue with iteration, for which Riven provided a quick/hacky solution using @MappedView. Performance-wise it was great, but the API was bad and the implementation not robust. We then talked about possible alternatives and finally Riven thought that using array access was doable and that we should go that way (I liked it). A few days passed because he was busy and at some point I talked to him about current progress.

That’s when things got aggressive for reasons I don’t understand. I simply wanted to understand the technical difficulties involved and the only answers I got were “if you don’t believe me, try to do it yourself” and “it’s too hard”, without any effort to help me understand. I got the impression that he didn’t want to talk to me about it, or even that it’d be a waste of time to explain it to me. Now, there could have been a hundred reasons as to why he talked to me like that; bad mood, his attention might have been focused elsewhere, didn’t have time to talk, etc. But still, that’s no way to talk to someone who has been so helpful to you; at least offer an explanation as to why you can’t talk to them. Even after that discussion ended abruptly, he still hasn’t communicated at all since then. He says he feels depressed from my behavior (and rightly so, I feel bad about it too), but how do you think his behavior made me feel?

As for the code itself. Well, it’s still his code, his project. It still says @author Riven on top. I don’t see how I “took ideas he shared on how to implement certain features and made them my own”, since the only new thing in the current version is the .asArray() implementation, which I knew nothing about; when I asked, I got shit for replies. Of course, I went ahead and tried to implement it locally (again with the intention to throw it away like my other tests). I used his codebase as a starting point, burned almost a full day on it and made a lot of progress, but eventually I hit a dead end, which I guess was what he didn’t want to explain to me in the first place (note: I didn’t understand most of the code before this point). Anyway, even with the lack of robustness, by that time I could do some performance tests, which was my main interest. The results were good, and since I hadn’t heard from Riven at all, I decided to refactor the code and get this thing done properly.

So, yes, I’m sorry for dropping this bomb without notice, but I was upset. When I treat someone with respect, I expect the same respect in return. Also LWJGL isn’t mine or yours or anyone’s, it’s an open source project, you knew that when you donated the code. The rewritten code isn’t mine either. You can check it out and continue working on it or even replace it with a better/alternative implementation if you want to help and think it will benefit the project.

Again, feel free to PM me if you want to discuss this further.

[quote=“Spasi,post:171,topic:31992”]
If you drop a bomb on me in public, don’t expect it to be settled in private.

You asked me three times what the problem was with implementing a ‘stable’ version of the array implementation. I explained to you two times that I’d have to do a stack/localvar analysis after each instruction. Then I said I didn’t want to spend weeks/months writing such code, and focused on supporting a subset of functionality that was guaranteed to be stable. After you said that ‘no, it would be easy’ for the third time, I decided not to repeat myself. If that’s ‘without any effort to help me understand’ then I indeed must be lacking communication skills.

It was easy after all though, wasn’t it?

If that’s your reply to my explanation that I did explain everything to you, after you accused me of being ‘aggressive’ and ‘without any effort to help me understand’, then I’m completely through with you.

That was not a reply to your explanation, because your explanation was funny. What do you want me to say after reading “After you said that ‘no, it would be easy’ for the third time, I decided not to repeat myself.”? It describes your behavior exactly, which you obviously think was appropriate.

I was trying to suggest a way to do it without analysis and what I got in reply basically was “it’s stupid and I cba to explain why, leave me alone”. And I actually got it to work before switching to asm-analysis, even with nested array access etc.

IMHO an easier and better way to attack the problem would be to directly process an AST. Of course, by easier I mean once the large amount of work building the framework was done. And by better I mean that a much wider set of program transforms could be performed. Ya know… just to muddy the waters some more. Have a nice day.

The next nightly will have the following changes:

  • Removed sizeof from @MappedType, it’s now calculated automatically. There’s an optional padding parameter now, so the final SIZEOF will be calculated as: max(field.offset + field.length) + padding.
  • @MappedType is now optional. Extending MappedObject and registering with MappedObjectTransformer is enough.
  • Added support for the volatile keyword.
  • Added support for @Pointer long fields. These will be treated as pointer values and will be 4 or 8 bytes at runtime, depending on the architecture.
  • The sizeof and align fields in MappedObject have been converted to methods (getSizeof and getAlign).
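The new SIZEOF rule from the first bullet can be illustrated with a tiny stand-alone calculation (a hypothetical field layout of my own, not the actual transformer code):

```java
public class SizeofCalc {
    public static void main(String[] args) {
        // Hypothetical mapped type: float x @0, float y @4, long ptr @8 (8 bytes on x64)
        int[][] fields = { {0, 4}, {4, 4}, {8, 8} }; // {offset, length} per field
        int padding = 4; // the optional padding parameter

        // SIZEOF = max(field.offset + field.length) + padding
        int max = 0;
        for (int[] f : fields)
            max = Math.max(max, f[0] + f[1]);
        int sizeof = max + padding;

        System.out.println(sizeof); // 16 + 4 = 20
    }
}
```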

I spent the day investigating the performance characteristics of Unsafe memory access and I’d like to share my findings. There is a serious issue with Unsafe and even though it doesn’t invalidate many uses for this library, it’s really important that users keep it in mind when deciding where and when to use mapped objects. Please note that the problem isn’t specific to mapped objects or this implementation; it’s a general issue with Unsafe and applies just the same to direct NIO buffers, which use it internally.

First of all, the JVM (especially the server VM) is really, really good at optimizing memory-access calculations. Basically, in the native code that the JIT produces, there’s no real difference between addressing javaObject.x, javaArray[ x ], nioBuffer.get(x) and mappedObject.x. It’s also crazy good at removing bounds/null checks, or simply doing them once before a loop. So, in general, the generated native code has more or less the same number of instructions. The important difference is: whenever you use Unsafe, you’re forcing the JVM to do the memory access exactly when you tell it and in the exact same order. This is a serious disadvantage compared to POJO field access, as it limits the optimizations possible and usually leads to higher CPU register usage and limited instruction-level parallelism.
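To make the comparison concrete, here is a minimal stand-alone sketch (my own illustration, not code from the library) of the four access forms being compared: a POJO field, an array element, a direct FloatBuffer, and a raw Unsafe read, all fetching the same logical value. The JIT may freely reorder or eliminate the first three; the Unsafe access happens exactly where and when written.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import sun.misc.Unsafe;

public class AccessForms {
    static class Pojo { float x; }

    public static void main(String[] args) throws Exception {
        float value = 3.5f;

        // 1. POJO field: the JIT is free to reorder/eliminate these accesses
        Pojo p = new Pojo();
        p.x = value;

        // 2. Array element: bounds check usually hoisted or removed
        float[] arr = new float[4];
        arr[0] = value;

        // 3. Direct FloatBuffer: internally backed by Unsafe
        FloatBuffer fb = ByteBuffer.allocateDirect(16)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
        fb.put(0, value);

        // 4. Raw Unsafe: the access is pinned to this exact point and order
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe u = (Unsafe) f.get(null);
        long addr = u.allocateMemory(16);
        u.putFloat(addr, value);
        float fromUnsafe = u.getFloat(addr);
        u.freeMemory(addr);

        // Same logical result, very different freedom for the JIT
        System.out.println(p.x == value && arr[0] == value
                && fb.get(0) == value && fromUnsafe == value);
    }
}
```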

Some examples first:

SpriteShootout animation loop
Java (array source, FloatBuffer target): 100%
Mapped (mapped source, mapped target): 164% naive, 100% optimized

Matrix4f multiplication
Java (POJO fields): 100%
Array (float array): 116% naive, 106% optimized
NIO (FloatBuffer): 158% naive, 106% optimized
Mapped: 133% naive, 104% optimized

Click on the links to see and compare the code between the different implementations. Percentages are relative to the fastest implementation. A note on the matrix multiplication: The temporary results are required because one of left/right could be the target matrix. This means that certain memory reads need to strictly happen before certain memory writes. It turns out that the JVM does an extremely good job with the POJO implementation, in fact the native code doesn’t look anything like the original Java code and ofc still doesn’t break the strict order required. On the other hand, for the implementations using Unsafe the native code does indeed look like the Java version, with very limited instruction reordering and much higher register usage.
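The aliasing constraint above can be shown with a small sketch (plain float[] matrices here, not the actual Matrix4f class): because the target may be the same object as left or right, every product must be read into temporaries before anything is written back, which is exactly the read-before-write order the JIT has to preserve.

```java
public class MatMul {
    // Multiplies two column-major 4x4 matrices: dst = a * b.
    // dst may alias a or b, so all results go into temporaries first.
    static void mul(float[] dst, float[] a, float[] b) {
        float[] t = new float[16];
        for (int c = 0; c < 4; c++)
            for (int r = 0; r < 4; r++) {
                float s = 0f;
                for (int k = 0; k < 4; k++)
                    s += a[k * 4 + r] * b[c * 4 + k];
                t[c * 4 + r] = s;
            }
        System.arraycopy(t, 0, dst, 0, 16); // writes strictly after all reads
    }

    public static void main(String[] args) {
        float[] id = new float[16];
        for (int i = 0; i < 4; i++) id[i * 4 + i] = 1f;
        float[] m = { 1,2,3,4, 5,6,7,8, 9,10,11,12, 13,14,15,16 };
        float[] copy = m.clone();
        mul(m, m, id); // in-place: m = m * I must leave m unchanged
        System.out.println(java.util.Arrays.equals(m, copy));
    }
}
```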

A few observations on the above results:

  • The naive implementations are a direct translation of the corresponding base POJO implementation.
  • It can get really bad. I’ve been fighting with the SpriteShootout for days before I realized it had nothing to do with the loop iteration (which is full speed now with .asArray). 58% or 64% slower is a lot for hot code.
  • It affects array access as well, but only marginally.
  • It is “fixable”. As you can see from the optimized implementations, a simple instruction reordering at the Java level results in huge gains in performance. Basically I did manually what the JVM does for POJOs, except for the very low-level optimizations (like overlapping unrelated memory access and calculations for instruction-level parallelism, which are CPU dependent really).
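The “fix” described in the last bullet, doing the JIT’s reordering by hand at the Java level, amounts to hoisting all reads into locals before issuing any write. A hypothetical before/after sketch (plain fields shown here for brevity; with Unsafe-backed fields this reordering is what actually matters):

```java
public class Reorder {
    static class Vec { float x, y, z; }

    // Naive: reads and writes interleaved. With Unsafe-backed fields the
    // JVM must perform each access in exactly this order.
    static void addNaive(Vec dst, Vec a, Vec b) {
        dst.x = a.x + b.x;
        dst.y = a.y + b.y;
        dst.z = a.z + b.z;
    }

    // "Optimized": all reads hoisted into locals, then all writes. This is
    // the reordering the JIT does automatically for POJOs but not for Unsafe.
    static void addHoisted(Vec dst, Vec a, Vec b) {
        float ax = a.x, ay = a.y, az = a.z;
        float bx = b.x, by = b.y, bz = b.z;
        dst.x = ax + bx;
        dst.y = ay + by;
        dst.z = az + bz;
    }

    public static void main(String[] args) {
        Vec a = new Vec(); a.x = 1; a.y = 2; a.z = 3;
        Vec b = new Vec(); b.x = 4; b.y = 5; b.z = 6;
        Vec d1 = new Vec(), d2 = new Vec();
        addNaive(d1, a, b);
        addHoisted(d2, a, b);
        System.out.println(d1.x == d2.x && d1.y == d2.y && d1.z == d2.z);
    }
}
```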

Well, honestly, this sucks. I don’t know if this could be fixed in a future JVM, but right now it’s a real problem. Java programmers can’t possibly be asked to perform such optimizations by hand; it’s both annoying and takes time to get right. One possible solution would be bytecode-level optimization. I tried Proguard and got only marginal gains (over the naive implementations, obviously); it wouldn’t do any serious instruction reordering. I got better results with Soot, this was its output when decompiled, it was almost as fast as the hand-optimized version. Unfortunately it’s quite slow and, from a quick look at its API, it’d be a pain to integrate into the transformation process.

Does anyone know of an alternative solution? Ideally it should be fast and work on bytecode directly. No need for whole-program optimizations or anything; something that works at the method level would do.

Of course, this problem would go away with official JVM support for mapped objects…

Which is the only proper place where you can do such low-level stuff efficiently. You can bend the language/VM a lot, but some tasks are beyond such approaches. Just accept that Java isn’t a 100% performance solution and do performance-critical stuff in native code; as a side effect, you don’t have to use such hacks as these mapped objects and you can be happy again :slight_smile:

I’ve done a few such experiments with enhancing Java by bytecode/AST manipulation. Either I didn’t release them in the end (in one case it was a project similar to Project Lombok) because I realized what a horrible idea it was, or I abandoned/minimized their usage after realizing they do more harm than good in practice…

It is just possible you might be trying to solve a problem we didn’t really have here.

The problem was: it is a monster pain in the arse interacting with “native” data, which invariably comes packed into big byte buffers and has many forms. Often the data type is entirely homogeneous (eg. vertex data), and in other applications it is not at all (network data). The root inefficiency was the need to copy the data out of the byte buffer into a Java class in order to encapsulate its behaviour nicely, and then the need to write the data back to the buffer. The secondary inefficiency was that the bounds-checking code for random-access reads and writes to ByteBuffers left a little to be desired back in 1.4.2, but it would appear it’s gotten a whole lot better since then.

The first issue is the only one then that really needed solving with what amounted to an efficient flyweight pattern (ISTR) that allowed us to write POJOs that accessed their fields directly through the underlying bytebuffer. 99% of reasonable use cases for this facility are therefore solved: in that it was about interacting with primitive data held in “legacy native” structures. For this you really literally only needed the ability to specify a ByteBuffer, a position within that bytebuffer, and … well, that is all that is strictly needed, as there are no such things as Objects in “native” data, though I suppose if you wanted to be smartypants you’d maybe have some way of mapping arrays in there as well. Personally I’d envisaged that any non-primitive data in such a mapped object would simply be completely ignored.
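A minimal sketch of that flyweight idea (a hypothetical Vertex class of my own, not the actual LWJGL code): a POJO-like accessor whose fields are read and written directly through an underlying ByteBuffer at a movable base position, with no copy-out/copy-in.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VertexFlyweight {
    // Hypothetical flyweight over packed x/y/z vertex data: 12 bytes per record.
    static final class Vertex {
        static final int SIZEOF = 12;
        private final ByteBuffer buf;
        private int base; // byte offset of the current record

        Vertex(ByteBuffer buf) { this.buf = buf; }

        Vertex moveTo(int index) { base = index * SIZEOF; return this; }

        float x()        { return buf.getFloat(base); }
        float y()        { return buf.getFloat(base + 4); }
        float z()        { return buf.getFloat(base + 8); }
        void  x(float v) { buf.putFloat(base, v); }
        void  y(float v) { buf.putFloat(base + 4, v); }
        void  z(float v) { buf.putFloat(base + 8, v); }
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocateDirect(Vertex.SIZEOF * 3)
                .order(ByteOrder.nativeOrder());
        Vertex v = new Vertex(data);
        for (int i = 0; i < 3; i++)
            v.moveTo(i).x(i); // writes go straight into the buffer
        System.out.println(data.getFloat(2 * Vertex.SIZEOF)); // x of record 2
    }
}
```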

With this facility, then, you’ve solved the problem that 99% of us have, which is writing vertex data fast and parsing & writing network data fast. All the other stuff makes a really simple concept quite opaque, and is prone to the sort of wandering off-track that you’re doing now, worrying about performance percentages relative to something that doesn’t actually concern us here. All we need is syntactically simple ways of easily manipulating native structs in a bytebuffer. Trying to be clever and bunging quadtrees and other complex Java structures into bytebuffers goes beyond what we need here. Save that for version 2.

Cas :slight_smile: