Just changed the code generator to use unconditional jumps with DynASM dynamic labels (I just have to say, DynASM is an AMAZING tool!) and now get a speedup of over 600% for the same batch of 100 operations executed 10,000 times!!!
(edit: if the JVM has a bad day, it’s even over 760% sometimes!)
Exactly the same resulting matrix, just 600% faster.
The code generator does not emit linear code anymore; instead it emits each used matrix function only once and uses an unconditional jmp to a generated label/address to get back to the next statement.
I am sure there is much more that can be improved (like what @Roquen rightly pointed out), and I may well end up at even a 1000% speedup at some point.
EDIT: A bit on how it’s done:
I used a slightly different method than what this awesome tutorial (edited!, was the wrong link) did. They emit an immediate jump to a dynamic label address, because they emit code for every loop separately (they use that technique for nested loops).
This I could not do, sadly.
I needed a way to do what call/ret does, namely storing the address of a label in a register and then unconditionally jumping to it.
However, a Stack Overflow thread pointed out that a push eax followed by ret is bad for the CPU's return-address branch prediction, since it unbalances the return stack buffer.
Therefore I used rdx to store the address/label of the next operation in the batch, as that was the next free register (rcx holds the “arguments” ByteBuffer), and then let DynASM generate a new dynamic label.
Each generated matrix operation function (emitted exactly once) simply ends with “jmp rdx”.
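For the curious, the per-statement emission looks roughly like this (a hand-wavy DynASM x64 sketch, not my actual generator; the pc-label bookkeeping via dasm_growpc and the name mul_func are just for illustration):

```
// For each operation in the batch the generator emits:
int back = npc++;            // allocate a fresh dynamic (pc) label
dasm_growpc(&state, npc);
| lea rdx, [=>back]          // rdx := address of the continuation
| jmp ->mul_func             // unconditional jump to the shared routine
|=>back:                     // ...execution resumes here

// The shared routine is emitted exactly once:
|->mul_func:
| // load operands relative to rcx (the "arguments" ByteBuffer),
| // do the 4x4 multiply, store the result...
| jmp rdx                    // "return" through rdx instead of ret
```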
EDIT2: @Spasi, I do not deserve that star: the code generator actually had a bug, and now it’s back to “just” 260% faster, but at least correct.
EDIT3: However, since the code size does not grow as fast as before, a batch of 1,000 matrix multiplications is now 360% faster, and so is a batch of 10,000.