I am now thinking of analyzing the operation sequence in Java and inserting “hints” into the final “instructions” buffer that goes to the native JIT, as to whether a jump should be performed or unrolled, and whether it is safe to keep the matrices in registers for subsequent operations.
Preferably, for ease of implementation and portability, I would do most of the work in Java and submit only the “IDs” of the raw assembly instructions (possibly parameterized, i.e. templates) to the JIT, which it can then stitch together.
That’ll be fun!
And down the rabbit hole we go :point:
See ya in a few months!
Update: keep in mind you are creating a side-project that only a handful of people will understand, and eventually nobody will use. I’ve been there with LibStruct. I think it’s cool tech, but even though I’d deem it simple to use, it’s too complex for ‘the 99%’, and the remaining people can’t be bothered jumping through hoops, as the benefits of the library seem not to outweigh the complexity cost.
Hehe. More like in a few “years.”
…wait, why not just have an embedded LLVM and emit LLVM bitcode and not tinker with those lowlevel optimizations anymore…
Update on your update: I agree. The complexity for the end-user is far greater than the gains. Especially given that even with “classic” JOML, all the code I have seen so far used it against its design, leading to worse performance.
So, it’s back to glTranslatef() and gluLookAt(), I guess.
LibStruct was a bit of an extreme case though. Having classes being converted to structs is a gigantic minefield as you don’t have any compile-time checks, and a huge number of things that Java programmers are used to doing are simply not supported. That in combination with the sometimes confusing ways it failed made it hard for me to justify the overhead, especially since the gains were minimal on my own CPU (but substantial on AMD CPUs).
The complexity of the system I proposed is nowhere near as high or intrusive. If you know how to use the standard Matrix4f class, it’s not a big leap to understand the Sequence object and that all operations are queued in it. It’s also not very confusing or full of pitfalls if you use builder methods that take care of argument filling etc. The only thing that can cause VM crashes is feeding invalid op codes or arguments to the native processor, and that can be hidden by abstraction.
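Roughly, I imagine the usage reading something like this (just a sketch with made-up method names, not a final API):

Sequence seq = new Sequence();
NativeMatrix4f m = new NativeMatrix4f(seq);
// nothing is executed here; each call only queues an op code plus its arguments
m.identity()
 .rotateY(0.5f)
 .translate(0.0f, 0.0f, -5.0f)
 .scale(2.0f);
// only when the sequence is terminated does the queued batch go to the native JIT
seq.terminate();

Since the builder methods are the only thing writing into the instruction buffer, feeding invalid op codes or arguments is not something a normal user can even do by accident.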
So I’ve decided to move to using GCC intrinsics from now on: I write the code with intrinsics, let GCC compile it to plain asm, and then use that with DynASM to emit it at runtime.
Also most of the code out there is using GCC intrinsics. No one seems to be using plain assembly anymore.
Using intrinsics you can define your own variable names, easily initialize vectors with constant values without jumping through hoops, and have GCC do the register allocation for you. This is just sooo much easier and at times also gives surprising results. So now I just write the matrix operations as C functions with SSE intrinsics, look at what GCC emits, and use that as a template for DynASM.
Should’ve chosen this path earlier.
Because so far it’s been such a pain keeping track of all free/used registers myself all the time. That really gives me a headache.
The next big thing would be making use of AVX when bulk-multiplying eight matrices at once, as described here: https://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices
But for this, JOML somewhat needs an operation scheduler that can detect and interleave independent operations, so that such parallel bulk operations can be used even when the client does not explicitly anticipate them.
But before I start anything new, I will research a bit more what the actual possibilities are and what reasonably bulky yet simple-to-concatenate “atomic” operations there could be for vector math.
Based on that I can then decide which code generation architecture to choose.
EDIT: Okay, so far GCC’s translation of the intrinsics version of this matrix multiplication into assembly was more than disappointing.
At -O3 the code looks like my amateurishly handcrafted code but is actually consistently 50% slower than mine… GCC’s code has 32 memory reads in it, whereas I used 20, and I made use of the full width of all 8 xmm registers, whereas GCC only uses 4 (which would be fine in itself, since the other four could then hold a matrix permanently)…
The -O2 code did not even unroll the loops and had two nested jumps in it. So even worse.
Maybe clang/llvm does a better job of it.
So, I went deeper into that rabbit hole again.
I decided to make JOML-native require at least SSE3, which is available on all x64 Intel Pentium 4 “Prescott family” processors, around since 2004.
AMD has also supported SSE3 since 2005.
So it should work for anyone whose processor is at most about 10 years old.
But why do I want to require SSE3?
Simply because on these x64 processors a whole bunch more SIMD registers are available than with 32-bit SSE2: XMM0-XMM15 instead of just XMM0-XMM7.
Most native operations, such as matrix multiplication, make use of XMM0-XMM7 to avoid main memory accesses as much as possible. With the 8 additional registers (XMM8-XMM15) we can hold two matrices permanently in registers, which is great.
I measured memory accesses and those are really slow, like @Riven said, even L1 hits.
So, I am using the XMM8-XMM11 registers to hold the 4 columns of the primary/first matrix, which corresponds to Java’s ‘this’ when invoking methods on a Matrix4f.
The other 4 registers XMM12-XMM15 I use for either a second matrix operand (such as for matrix multiplication) or to hold a ‘dest’ matrix when ‘dest’ is different from ‘this’.
Now, the native code actually does less than it did before and all instruction scheduling is moved to the Java “Sequence” class. It decides when to load/store registers from/to main memory based on the actual arguments of the issued matrix operations.
If you use the NativeMatrix4f class optimally by only issuing matrix operations that store their results back into the same ‘this’ matrix, then only one single load from main memory is issued for the whole sequence of operations!
When the Sequence is terminated, it is checked whether the registers are in sync with main memory (which is known from the operation semantics), and if they are not, they are finally spilled back to main memory.
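Internally, the bookkeeping the Sequence needs for this is not much more than tracking which matrix currently lives in xmm8-xmm11 and whether those registers have been modified since the last store. A rough sketch of the idea (all names made up; InstructionBuffer and its emit methods are placeholders):

class RegisterState {
    long residentMatrix = 0L; // address of the matrix currently held in xmm8-xmm11
    boolean dirty = false;    // registers modified but not yet written back

    // called before emitting an op that needs 'matrixAddr' in xmm8-xmm11
    void ensureLoaded(long matrixAddr, InstructionBuffer buf) {
        if (residentMatrix == matrixAddr)
            return;                        // already resident: no memory access needed
        if (dirty)
            buf.emitStore(residentMatrix); // spill the previous matrix first
        buf.emitLoad(matrixAddr);          // 4x movaps into xmm8-xmm11
        residentMatrix = matrixAddr;
        dirty = false;
    }

    // called when the Sequence is terminated
    void flush(InstructionBuffer buf) {
        if (dirty)
            buf.emitStore(residentMatrix);
    }
}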
I am also emitting the instructions linearly again to avoid jumps.
Also, having learnt a bit more about x64 calling conventions, the generated native function now actually has a correct function prologue and epilogue; getting these wrong could have crashed the process by not saving non-volatile registers.
I will do some performance testing once the main matrix methods are implemented and I validated their correctness.
Okay, here are the real numbers for Matrix4f.mul(Matrix4f).
The test fixture gives HotSpot enough opportunity to generate optimal code by first warming it up with 100,000,000 invocations of the classic Matrix4f.mul(Matrix4f) call, always on the same objects that are also used in the timed runs later.
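Stripped down, the fixture for the classic path boils down to something like the following (a simplified sketch; the native path is timed the same way through its NativeMatrix4f sequence):

Matrix4f a = new Matrix4f();
Matrix4f b = new Matrix4f();
int n = 1000; // number of timed invocations, varied per row below

// warm-up: let HotSpot fully compile Matrix4f.mul on these exact objects
for (int i = 0; i < 100_000_000; i++)
    a.mul(b);

// timed run on the same objects
long t0 = System.nanoTime();
for (int i = 0; i < n; i++)
    a.mul(b);
long t1 = System.nanoTime();
System.out.println((t1 - t0) + " ns for " + n + " invocations");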
So here are the timings (average of 10 runs).
num. of inv.    native / classic        % faster
   50x mul:       1520 / 5702 ns     -> 275% faster
  100x mul:       2281 / 8742 ns     -> 283% faster
  250x mul:      4.561 / 15.964 µs   -> 250% faster
  500x mul:      9.123 / 29.648 µs   -> 225% faster
 1000x mul:     17.105 / 53.215 µs   -> 211% faster
 2000x mul:     31.929 / 133.418 µs  -> 317% faster
 3000x mul:     47.134 / 187.393 µs  -> 297% faster
 5000x mul:     76.782 / 296.865 µs  -> 287% faster
 8000x mul:    123.155 / 442.446 µs  -> 259% faster
10000x mul:    155.664 / 545.835 µs  -> 259% faster
                                avg. -> 266% faster!
EDIT:
Just for fun, here are some numbers without warming HotSpot:
   50x:      1,521 / 48,654 ns    -> 2,459%
  100x:      2,281 / 87,045 ns    -> 3,716%
  500x:      8.743 / 432.563 µs   -> 4,847%
 1000x:     16.724 / 778.841 µs   -> 4,557%
 2000x:     33.450 / 1260.819 µs  -> 3,669%
 5000x:     76.782 / 1752.679 µs  -> 2,182%
 8000x:    122.015 / 1589.992 µs  -> 1,203%
10000x:    153.944 / 1701.744 µs  -> 1,005%
20000x:    489.960 / 2282.170 µs  ->   365%
So, if you are only ever planning on doing around 20,000 invocations, or you need the utmost performance right from the start, then JOML-native is definitely a win over HotSpot.
Since I am really fed up with writing assembly and doing register allocation all in my head and with pencil on paper, even for simple functions I want to use SSE for, I am planning to add a small register allocator with simple iterative liveness dataflow analysis.
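The liveness computation itself is not much code. A minimal sketch of the iterative backward dataflow (all types made up; “virtual registers” are just integers here):

import java.util.*;

class Insn {
    Set<Integer> use = new HashSet<>();      // virtual registers read by this instruction
    Set<Integer> def = new HashSet<>();      // virtual registers written by it
    Set<Integer> liveIn = new HashSet<>();
    Set<Integer> liveOut = new HashSet<>();
    List<Insn> successors = new ArrayList<>();
}

class Liveness {
    static void compute(List<Insn> insns) {
        boolean changed = true;
        while (changed) {                    // iterate until a fixpoint is reached
            changed = false;
            for (int i = insns.size() - 1; i >= 0; i--) {
                Insn n = insns.get(i);
                Set<Integer> out = new HashSet<>();
                for (Insn s : n.successors)  // liveOut = union of the successors' liveIn
                    out.addAll(s.liveIn);
                Set<Integer> in = new HashSet<>(out);
                in.removeAll(n.def);         // liveIn = use ∪ (liveOut \ def)
                in.addAll(n.use);
                if (!out.equals(n.liveOut) || !in.equals(n.liveIn)) {
                    n.liveOut = out;
                    n.liveIn = in;
                    changed = true;
                }
            }
        }
    }
}

Two virtual registers whose live ranges never overlap can then share the same xmm register, and for the handful of intrinsics a matrix method consists of, a simple linear scan over those ranges should be more than enough.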
Before that, I tried using GCC to convert simple functions using SSE intrinsics into assembly, but GCC provides no way to specify that particular registers should be used for particular values. It can “fix” some registers so that GCC’s register allocator will not touch them (i.e. it assumes they do not exist), but that is not what I want.
There is also the possibility to “annotate” a variable (that has the “register” storage qualifier) with the name of a certain register using the “asm(register)” syntax after the variable declaration. That however did not work and was ignored by both GCC 4.9 and the recent LLVM/clang 3.6.2.
I wanted to be able to allocate a particular variable to a particular register, since the interface of the assembly template snippets I use relies on xmm8-xmm11 to store the matrix.
So, what I really want now is something like DynASM for Java: being able to dynamically JIT native code at runtime with pure Java syntax! There are two possibilities for how to go about this. Both ways would provide an API for the user to make use of the SSE functions via intrinsics named after those from GCC and msvc. So, there would be a class “Intrinsics” with static methods such as _mm_add_ps or _mm_set_ps.
Provided this, there are now two ways:
- Do static analysis of Java bytecode and literally translate each method marked as “should be jitted” into native code by translating the JVM bytecode instructions to x86 instructions. Of course, no invocations of pure-Java methods would be allowed in those methods.
- Provide a dynamic runtime where the provided intrinsic methods are actually implemented and write a representation of themselves into an “instructions” list once invoked. This is how the NativeMatrix4f builder concept currently works, just that the operations would no longer be complex matrix methods but more low-level SSE intrinsics.
Both methods have advantages and disadvantages:
Method 1 has the advantage that it can account for dynamic control flow in the execution of the jitted function, as it would also translate Java control-flow bytecode instructions such as “goto” into x86 “jmp”, etc.
It does however have the disadvantage that the code it generates is fixed and static, because Java bytecode is static and (normally) does not change at runtime. We therefore lose the ability to dynamically alter the generated code at runtime based on runtime conditions.
Method 2 has the advantage of allowing this. But it has the disadvantage that we lose control flow in the generated jitted code, since we have no way of expressing it without making the Java code that represents it very ugly, by introducing Java methods for control-flow instructions.
DynASM actually uses this Method 2 but since it uses a preprocessor with special non-C syntax, it does not have the mentioned disadvantage of uglifying the source language too much.
So, which method to choose? I tend towards Method 2. It gives the client the ability to use it more like a template/dynamic JIT, generating instructions based on conditions at runtime. This method also does not need classpath scanning to find classes/methods that should be jitted, and it simplifies the whole process a lot by not having to translate the whole JVM bytecode instruction set. We just need to implement the provided API to record the called instructions and then spit out a jitted native function at the end, just like the current NativeMatrix4f interface does.
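The recording part of Method 2 is almost trivial. A rough sketch of the idea (all names hypothetical, including the thread-local JIT.currentContext()):

public final class Intrinsics {
    public static __m128 _mm_add_ps(__m128 a, __m128 b) {
        JITContext ctx = JIT.currentContext(); // assumed thread-local "current" context
        __m128 result = ctx.newVirtualRegister();
        ctx.record("addps", result, a, b);     // remember the instruction, don't execute it
        return result;                         // handle that later intrinsic calls can consume
    }
    // _mm_mul_ps, _mm_set_ps, _mm_shuffle_ps, ... would all follow the same pattern
}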
In the end, this will allow you to make use of SSE/AVX and whatnot with pure Java syntax for any usecase that requires it. That’s the goal.
EDIT:
Ideally, it would look something like this:
import org.joml.jit.JIT;
import org.joml.jit.JITContext;
import org.joml.jit.NativeCode;
import org.joml.jit.__m128;
import static org.joml.jit.Intrinsics.*;
import static org.joml.jit.Registers.*;

public class JitTest {
    public NativeCode generateMatrixIdent() {
        JITContext jit = JIT.newContext();
        // Generate the identity matrix in xmm8-xmm11
        __m128 one = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);
        __m128 xmm8 = new __m128(XMM8);
        __m128 xmm9 = new __m128(XMM9);
        __m128 xmm10 = new __m128(XMM10);
        __m128 xmm11 = new __m128(XMM11);
        _mm_move_aps(xmm8, one);
        // rotate the 1.0 one lane further for each following column
        one = _mm_shuffle_ps(one, one, _MM_SHUFFLE(2, 1, 0, 3));
        _mm_move_aps(xmm9, one);
        one = _mm_shuffle_ps(one, one, _MM_SHUFFLE(2, 1, 0, 3));
        _mm_move_aps(xmm10, one);
        one = _mm_shuffle_ps(one, one, _MM_SHUFFLE(2, 1, 0, 3));
        _mm_move_aps(xmm11, one);
        return jit.generate();
    }
}
Wait, so I could literally convert any function to use SSE instructions? Would this cause some kind of JNI overhead? I am not very intimate with these kinds of low-level instructions, so I’m sorry if I’m misunderstanding something.
I only vaguely understand what you’re doing, but it sounds like it would be a great tool and a commendable engineering feat! 8)
Being able to batch up operations for SSE would benefit lots of areas where long calculations are done.
An advantage of the tool would be that noobs like me can use SSE without figuring out the vagaries of SSE on different architectures.
But would a disadvantage for the experts be that your tool would sacrifice some control, so that the most efficient SSE code could not be made using Java code? Perhaps a solution like embedding pure SSE code in the Java source file directly would be better. I saw a language that did that somewhere by preceding the line with a pipe | symbol. But even doing it in a String would suffice. Then, I assume, finer control could be achieved.
Some other thoughts:
- I wonder how fast SSE is compared to using OpenCL, as @Spasi hinted at above. Since most performance-intensive programs will use the graphics card, OpenCL would be available to them and might even be faster than SSE, given the much larger number of cores on graphics cards and the possibility of keeping the OpenCL calculation results on the graphics card for drawing in OpenGL.
- I assume that the tool would only be available for Java programs and not for Android or JavaScript code. It blows against the wind a little since Java as a client-side technology is dying. The best ways to distribute our Java programs these days are as WebGL, HTML5 and JavaScript or as an Android program.
@Kai, assuming what you describe above is doable, you could then wrap the low-level interface in a user-friendly API similar to SIMD.js.
[quote=“theagentd,post:149,topic:53459”]
It doesn’t change anything performance-wise, you’d still have to batch enough work to trivialize the native call overhead.
[quote=“CommanderKeith,post:150,topic:53459”]
What Kai is describing is direct access to SIMD instructions from Java. So full control and full performance. It doesn’t make it any easier for people not familiar with SSE. But of course you could hide the nastiness in JOML or something like SIMD.js.
[quote=“CommanderKeith,post:150,topic:53459”]
The benefit of OpenCL is transparent use of SSE->AVX2 and multithreading at the same time. The disadvantages are a) you’d have to batch up more work, since going through OpenCL will have more overhead than what Kai is doing, and b) OpenCL availability; SSE is everywhere, but OpenCL requires a runtime to be available on the user’s machine.
Also note that using the GPU with OpenCL is not an option in this case. I’m assuming that whatever you’re doing with JOML needs to be CPU-accessible, for culling, collision-detection, etc. Otherwise, you’d be using OpenGL shaders directly.
I have been using SIMD ops since MMX was only available on prototype hardware. Something that has been true ever since: if you want real performance gains, the data has to be laid out in a friendly manner.
just a side note : if you want to see the power of SIMD - try the game “from dust” (https://en.wikipedia.org/wiki/From_Dust) - it shows crazy large scale deformable volumetric meshes computed entirely on the cpu. side note side note : it’s made by eric chahi the creator of “another world” if you are old enough to remember that
on topic : i would love to use a “general purpose” SIMD native api!
It would allow you to use the SSE instructions, preferably in the form of the GCC/msvc intrinsics, to get three-operand instructions and to not have to care about register allocation. So, anyone wanting to use this would first have to familiarize themselves with SSE, which is not very hard.
The plan however is not to allow “any” currently scalar operations to be automagically converted to use SSE instructions. That would be “auto-vectorization” and that is a non-trivial task. Currently, even state-of-the-art C/C++ compilers support it in very limited form, such as “loop-vectorization.”
So, to use this you would come up with algorithms (or search for one) that already make use of SSE instructions and the plan is then to make those algorithms easily portable/accessible to Java.
Yepp. But there are not many different architectures, since SSE is x86-only. So there is just x86 and x86-64, the latter with more registers and wider basic datatypes. And AFAIK the only differences between Windows and non-Windows are the calling conventions and register usage.
But that would be hidden by the library, yes.
I plan to support all SSE instructions. The only difference is in the expressiveness of control flow: I am not planning to support emitting control-flow instructions. Any control flow the client wants has to happen during the generation of the native code, by branching in Java when calling the SSE (builder) methods.
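To make that concrete, here is a rough sketch (hypothetical API, made-up names): both the if and the for-loop run at generation time, so the recorded native code ends up branch-free and fully unrolled, with only one of the two load/store variants ever emitted.

void scaleColumns(long matrixAddr, __m128 factor, boolean unaligned) {
    for (int i = 0; i < 4; i++) {
        long colAddr = matrixAddr + i * 16;    // one 4-float column per iteration
        // the if is evaluated now, so only one load/store variant gets recorded
        __m128 col = unaligned ? _mm_loadu_ps(colAddr) : _mm_load_ps(colAddr);
        col = _mm_mul_ps(col, factor);         // recorded, not executed here
        if (unaligned) _mm_storeu_ps(colAddr, col); else _mm_store_ps(colAddr, col);
    }
}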
Yes, you are talking about DynASM here. And since DynASM provides the whole x86 instruction set (and others), it allows control flow in the generated code. But DynASM basically “only” provides encoding of the low-level instructions and relocation of labels (so it literally IS an assembler). This is too low-level in my opinion, but of course it is the most expressive and flexible way, if one really wants to code in assembly.
What I would like is a library one level above that, which provides simple register allocation so you don’t have to do that anymore. To achieve this, we would need to lift the abstraction level of the SSE instructions one level higher and make them real source-level functions/methods that no longer operate on registers but on “variables” (i.e. Java “locals”).
This is what the native compilers do with their “intrinsics” functions. They make the SSE/AVX instructions available via source-level functions. This is a very elegant way. And it seems that when people are talking about SIMD algorithms they do so in terms of those source-level intrinsics. So as stated above, the real motivation for me to create a library as described is to be able to translate those GCC/msvc intrinsics directly to Java.
EDIT: After a correction from @Spasi (see the post below), for which I am thankful, the following is wrong and therefore struck through:
I see OpenCL a different field now which needs special treatment. The real motivating driving force for the library for me currently is easy porting of SSE intrinsics from GCC/msvc to Java.
A disadvantage of OpenCL in addition to what @Spasi already mentioned is for me: complexity in the writing of algorithms for it.
Remember that OpenCL conceptually (and with modern GPUs also physically) only has scalar operations.
Now to actually do SIMD, you would need to think about multithreading in a clever way to actually use 4 threads to simultaneously operate on a 4-component vector (or bigger).
And to have the best possible performance, those threads will likely want to be in the same wavefront and the data they access would need to be consecutive.
Also, the operational semantics and the concept of OpenCL are quite different from x86 SSE, though both provide SIMD.
If you want to think of SSE in terms of OpenCL, it would be that SSE only has 4 threads. Those 4 threads are synchronized “implicitly” after each SSE instruction. Each such instruction has lock semantics (meaning changes to memory are visible immediately) within the same thread group (those 4 threads).
In SSE you express your algorithm in SIMD form, whereas in OpenCL you express it in scalar form and hope that the runtime will execute multiple such instructions in parallel.
All in all this makes SSE a lot easier to use than OpenCL, but of course also limits you to 4 threads.
But if you want to use more than 4 “conceptual” threads you can of course always use operating system threads to parallelize your SSE algorithm even further on more data, so having “more” SIMD.
Android support is currently not planned, since I have absolutely no idea on how to do dynamic native code on Android.
Even if used on a x86 architecture, I don’t know whether one can mmap and mprotect on Android or if it is at all possible for user applications to generate native code at runtime. There seems to be very little resources about that online.
About JavaScript: if you are talking about libGDX with its GWT backend, that GWT backend would need to be augmented to recognize the SSE instructions of JOML and convert them to SIMD.js, for example. This would allow Java-to-JavaScript-converted code to use SSE. At least on Firefox Nightly.
Good points, I understand the distinction better now. Thanks for the explanations.
@KaiHH Maybe I’m missing something, but I can see from the JitTest NativeCode sample you posted that you will give access to the registers as static variables of the class org.joml.jit.Registers. This makes sense for a single-threaded application, but for multi-threaded use of an SSE setup like you hinted at, shouldn’t these be local variables?
Do you think it would be worth making a version that operates on 2 doubles rather than 4 floats? I notice that SIMD SSE2 allows 2 doubles as well as 4 floats: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers
Obviously there would be a 50% slowdown doing 2 doubles instead of 4 floats, but it would be nice to have. I only ask because I’ve sometimes found that doing intermediate calculations with floats can cause rounding problems compared to using doubles all the way through.
This is a very exciting project.
[quote=“KaiHH,post:154,topic:53459”]
Everything you said here is wrong, you’re missing the point of OpenCL entirely. Multithreading happens at the workgroup level. Everything within a workgroup runs on the same thread with appropriate SIMD instructions for the kernel workload and target CPU. Read this for details. The Intel OpenCL compiler can even output 512-bit SIMD on Xeon Phi.
You really cannot beat OpenCL in terms of usability either:
float4 a = ...;
float4 b = ...;
float4 r = a + b; // it doesn't get more friendly than this
float16 a = ...;
float16 b = ...;
float16 r = a + b; // this will be 1 instruction on Phi or 4 "unrolled" instructions with plain SSE
GPUs are indeed scalar these days. For example AMD cards had vector registers up to the Radeon HD 6000 series (VLIW5 originally, then VLIW4), but switched to a scalar architecture with GCN. A vec4 operation will indeed be translated to 4 scalar instructions, but that’s an implementation detail. In both OpenGL shaders and OpenCL kernels, you’re using vector types for the semantics; it is the compiler’s job to translate that to the optimal hardware instructions. That means SIMD on Intel CPUs.
Your Java code will not have access to the registers. The native code that the Java method represents, and that will be built by the executed Java “builder” methods, will.
Those __m128 Java locals/variables in the example only “represent” variables, which will eventually be allocated to registers when generating native code.
But those registers will not be accessible from Java and also do not link with those Java variables/locals.
Accessing registers from Java is not the goal of JOML, as that is partly not possible and also would require each register access to go through a JNI interface function, which would be dramatically slow.
We could do this of course, if it would be useful for someone.
@Spasi: Ahh…thank you for your clarifications! I was somehow under the impression that OpenCL has no vector primitive types when I wrote the post. Should’ve looked that up, though.
EDIT:
Just actively searched for existing projects that promise to do what I want. So far I have only found jssembly. However, it is in a very early pre-alpha state with not much actually working. And it is also too low-level for me, as it seems to plan on having only an instruction encoder and a simple mapper from instruction text mnemonics to operations using an ANTLR grammar, so there is no relocation and linking to make it a full assembler.
It would be very useful if every class had the [icode]get(FloatBuffer buffer)[/icode] method.
Currently only Matrix4f has it (correct me if I’m wrong).
That is correct. Currently, only the matrix classes have those methods.
The rationale behind this is that it is considered a low burden for a client to do:
Vector3f v = ...;
FloatBuffer fb = ...;
// put vector into fb
fb.put(v.x).put(v.y).put(v.z);
whereas with the 4x4 matrix it can be difficult and cumbersome because of the 16 values and having to take care of correct column-major layout.
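For comparison, doing it by hand for a Matrix4f means writing out all 16 fields in the correct column-major order, roughly like this (assuming the public m00..m33 fields):

Matrix4f m = ...;
FloatBuffer fb = ...;
// column-major: each column of 4 floats goes into the buffer consecutively
fb.put(m.m00).put(m.m01).put(m.m02).put(m.m03)
  .put(m.m10).put(m.m11).put(m.m12).put(m.m13)
  .put(m.m20).put(m.m21).put(m.m22).put(m.m23)
  .put(m.m30).put(m.m31).put(m.m32).put(m.m33);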
But if you really need it, it can of course be added.
I had already fixed it that way. I just thought it would be useful if every class had the method, since it is used quite often.