C++/Java Engine without GC in graphics

But what part of it couldn’t be done in Java exactly? I’ve not generally experienced vastly slower performance in Java than C++ in straightforward compute tasks.

Cas :slight_smile:

Speed: you cannot get the matrix multiplications to perform fast enough.

So you’re wasting electricity and you cannot scale gameplay.

Look at Zelda BotW, for example: there are never more than 10 animated characters on screen at any single point in the game. That's because they're only pulling 5 W and they're not using my Park engine! :wink:
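To make the claim concrete (my sketch, not @bullen's code): the hot loop in skeletal animation is essentially a 4×4 matrix multiply per joint per character, and the allocation-free way to write it in plain Java looks like this:

```java
// Hypothetical allocation-free 4x4 multiply: column-major float[16] matrices,
// dst = a * b. No objects are created, so there is nothing for the GC to collect.
public class Mat4 {
    static void mul4x4(float[] a, float[] b, float[] dst) {
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                dst[col * 4 + row] = sum;
            }
        }
    }
}
```

Whether HotSpot auto-vectorises a loop like this is hit and miss, which is the SIMD complaint that comes up further down the thread.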

The main problem with Java IMHO is that when handling composite data you're mostly bound to using objects, which carry a memory and compute overhead. I would love to have something like C structs in Java.

Just found this here, it explains the problem briefly and offers a solution: https://tehleo.github.io/junion/

Haven't tried it yet, but it looks interesting. Why isn't there already something like this in HotSpot? Shouldn't it be able to recognize when you're using an object as a dumb data container?
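To make the overhead concrete, here is a hypothetical Particle example of mine (not from the JUnion docs):

```java
// Object layout vs. struct-like packing.
public class ParticleLayout {
    // Array of objects: every element is a separate heap object with its own
    // object header, and the array itself only stores references to them.
    static class Particle { float x, y, z, life; }

    public static void main(String[] args) {
        Particle[] aos = new Particle[100_000];
        for (int i = 0; i < aos.length; i++) aos[i] = new Particle();

        // Struct-like packing: one contiguous float[], 4 floats per particle,
        // no per-element headers and no pointer chasing. This is roughly what
        // JUnion emulates and what Valhalla value types aim to make automatic.
        float[] packed = new float[100_000 * 4];
        int i = 123;
        float x = packed[i * 4];       // "field" access becomes index arithmetic
        packed[i * 4 + 3] = 1.0f;      // particle i's "life" field
        System.out.println(x);
    }
}
```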

Such an optimization has existed for ages: the JIT will eliminate object allocations and decompose an object into its fields whenever the object does not escape the "inline scope" that the JVM uses to optimize a method call stack and perform escape analysis on.
If you have relatively tight loops, not-too-deep call stacks and no allocated object escapes, then it is very likely that no object allocation actually happens. JITWatch provides more insight into this.
However, since this relies on inlining and escape analysis working it out automatically, and on your object not escaping, it is prone to fail in some circumstances - simply because your method size or call-stack depth exceeds some arbitrary limit and the JIT then just says "nope".
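A contrived example (mine) of where this kicks in, assuming the calls get inlined and nothing escapes:

```java
// If add() is inlined and the temporary Vec objects do not escape accumulate(),
// C2's escape analysis can scalar-replace them: the fields live in registers and
// no heap allocation happens.
public class ScalarReplacement {
    static final class Vec {
        final double x, y;
        Vec(double x, double y) { this.x = x; this.y = y; }
        Vec add(Vec o) { return new Vec(x + o.x, y + o.y); }
    }

    static double accumulate(int n) {
        Vec acc = new Vec(0, 0);
        for (int i = 0; i < n; i++) {
            acc = acc.add(new Vec(i, -i));   // looks like two allocations per iteration
        }
        return acc.x + acc.y;
    }

    public static void main(String[] args) {
        System.out.println(accumulate(10_000_000));
    }
}
```

Run it with -XX:-DoEscapeAnalysis to compare; control-flow merges like the loop-carried `acc` are exactly the kind of spot where the optimization sometimes gives up, which is the unpredictability mentioned above.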
Project Valhalla will solve this once and for all with Value Types, requiring the programmer to hint to the JVM that he/she does not care about object identity.
I do agree with @bullen that the lack of SIMD and a few other missing CPU intrinsics is a big factor in performance, and together with the arbitrary inlining thresholds in the JVM (which can be configured, though) it gives you fairly unpredictable runtime behaviour.

[quote]Such an optimization has existed for ages
[/quote]
Yes, but that is definitely not what I described. The optimization you mention only affects local throwaway objects, not large persistent object arrays, if I understand you correctly. I'll have to wait for Project Valhalla, I guess.

Ah, yes, I see. I misunderstood you there, I guess.

Valhalla is the game-changer in high-performance Java computing. It’s a shame it’s taking so long.

Cas :slight_smile:

Looks like Valhalla is pretty active commit-wise in the JDK:

http://hg.openjdk.java.net/valhalla/valhalla

[quote=“bullen,post:23,topic:58398”]
You are too easy on promises, I fear :slight_smile:

https://www.youtube.com/watch?v=AgxddJtSVx0

This runs on my notebook with a GTX 1060 (IIRC on battery, which makes a difference, at least on my machine) in my at-that-time pretty …hacky… own engine, implemented in Java/Kotlin. It was mostly limited by the GPU in this case, because I am pretty generous with resources when I want to get a result quickly. Nonetheless, this thing is not instanced; it has unique materials and unique animations per mesh - it's just that the mesh is all the same in this case.

We know that, but I also fear that @bullen has fallen for the Dark Side, and forever will it dominate their destiny :wink:

Cas :slight_smile:

Thanks for keeping my promise (2500 < 3000)! :wink:

That said, neat stuff - is it open source?

The video description says: “2500 separately animated hellknights in Java with GPU skinning, instanced and indirect rendering”.

I would have liked to feel analogue input (mouse) on the rendering, because it seems the latency is high here (watching the mouse pointer while you rotate the camera). Low latency is more important than high bandwidth!

It's so easy to get high FPS if you delay frames like RDR 2 or TLoU (especially the 60 FPS version that uses fibers and pushes the frames back by 2-3!).

How many FPS were you getting in your engine when this video was recorded? Also, are you culling?

Here is a new test where I try to limit the CPU and still render 500 animated characters. The FPS drops to 30 when I start the animated .gif recorder, because of reasons!? I sleep 15 ms per frame and the rendering takes ~6 ms.
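For clarity (my sketch, in Java to match the rest of the thread even though the engine here is C), the throttling being described boils down to a loop like this:

```java
// Frame loop that throttles the CPU by sleeping a fixed time per frame:
// ~6 ms of render work plus a 15 ms sleep gives ~21 ms per frame, i.e. just
// under 50 FPS before the .gif recorder steals its share. Thread.sleep()
// granularity (especially on Windows) adds noise to the measured timings.
public class FrameLoop {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long start = System.nanoTime();
            renderFrame();                 // ~6 ms of actual work
            Thread.sleep(15);              // deliberate throttle to keep the CPU cool
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("frame: " + ms + " ms");
        }
    }

    static void renderFrame() { /* draw calls would go here */ }
}
```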

Counting cycles on Windows/Intel is voodoo magic; sometimes it includes the cycles counted during Sleep?!

Included mouse shake and animation swapping for realism! :wink:

But my engine needs to be felt; the latency is the lowest I have ever played with!

C(++) is not fun, but the performance (and portability) makes it worth it.

That said I will still use Java on the backend, so I’m not completely corrupted by power yet.

Edit: did another recording with mouse pointer enabled:

The leftmost core is the .gif recorder and the engine runs on the rightmost core.

Haha, you're right, I'm going to invest some time and look at what the current state of my engine is capable of :slight_smile:

Yes, it's open source, but I'm a bit ashamed of it because I mostly hack on this beast in my spare time while sitting on the bus and the train… so there are lots of hacky things in there, and it's probably hard to read for anyone other than me. Maybe I can just write about how I did things? I would be happy to talk about some optimizations, especially regarding things you have experience optimizing in C++.

Regarding instancing: I'm doing regular instancing and I was generous with the instanced data. Nearly everything I use as object properties is also an instanced property - for example transformations, materials, animations (currently only 4 possibly active ones per object), bounding boxes. So I can get away with two draw calls for the main rendering. With indirect rendering, instancing or not instancing is practically the same for me, although performance would differ when you have a lot of different meshes, I guess, because of bandwidth. The GPU is always the limit for me here.
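For readers following along, the per-instance attribute setup being described looks roughly like this in raw LWJGL (the mat4-per-instance layout and the names are my assumption for the sketch; the indirect path would swap the final call for an indirect draw such as glMultiDrawElementsIndirect):

```java
import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL20.*;
import static org.lwjgl.opengl.GL31.*;
import static org.lwjgl.opengl.GL33.*;

// One VBO holds per-instance data (here: a 4x4 transform as four vec4 attributes).
// glVertexAttribDivisor(loc, 1) makes an attribute advance once per *instance*
// instead of once per vertex, so a single draw call renders every character.
public final class InstancedDraw {

    public static void drawInstanced(int instanceVbo, int baseAttrib,
                                     int vertexCount, int instanceCount) {
        glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
        int stride = 16 * Float.BYTES;                     // one mat4 per instance
        for (int i = 0; i < 4; i++) {                      // a mat4 occupies 4 attribute slots
            int loc = baseAttrib + i;
            glEnableVertexAttribArray(loc);
            glVertexAttribPointer(loc, 4, GL_FLOAT, false, stride, (long) i * 4 * Float.BYTES);
            glVertexAttribDivisor(loc, 1);                 // per-instance, not per-vertex
        }
        // Material/animation indices, bounding data etc. would be further instanced
        // attributes (or live in a buffer indexed by gl_InstanceID in the shader).
        glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, instanceCount);
    }
}
```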

Regarding latency: I have a triple-buffered renderstate construct. The rendering/GPU stuff runs on its own thread with a Kotlin coroutine dispatcher and always goes flat out, or is limited by vsync. Then I have an update thread that's also a coroutine dispatcher. At the beginning of the update frame, I have a stage where one can schedule single-threaded execution. That means I don't have to synchronize data structures, but can use scheduling to synchronize things easily. After this, all system updates are executed … on a coroutine context with max core count, in general maxing out the CPU or being limited by the GPU. After that, the triple-buffer extraction is done, single-threaded. Since I wrote my own struct library, the extraction window is pretty small here because it's close to a memcpy, and so far it has never limited me :slight_smile: This is probably one of the things C++ can do better, because my struct library reduces extraction time but introduces a little runtime overhead during the update in general.
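The handover being described boils down to something like the sketch below (plain Java, names are mine; the real engine sits behind Kotlin coroutines, and the extraction/memcpy step is what fills the write buffer):

```java
// Minimal triple-buffer sketch (synchronized for simplicity; the handover is just a
// couple of reference swaps, so contention is negligible). The update thread fills
// "write", publishes it, and immediately gets a free buffer back; the render thread
// always picks up the most recently published state and never waits for an update.
public final class TripleBuffer<T> {
    private T write, pending, read;
    private boolean fresh;

    public TripleBuffer(T a, T b, T c) { write = a; pending = b; read = c; }

    /** Update thread: buffer to fill with the next frame's state. */
    public synchronized T writeBuffer() { return write; }

    /** Update thread: publish the filled buffer, take back a free one. */
    public synchronized void publish() {
        T t = pending; pending = write; write = t;
        fresh = true;
    }

    /** Render thread: newest published state; unchanged if nothing new was published. */
    public synchronized T acquireForRender() {
        if (fresh) {
            T t = read; read = pending; pending = t;
            fresh = false;
        }
        return read;
    }
}
```

The render thread never blocks on the update thread; it just picks up the newest completed state each frame, which is also why the extraction step needs to be as cheap as a memcpy.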

Regarding latency, I would need to take a look at where I update my inputs, but I think it's just in the update loop, which runs as fast as possible, sometimes at 10k iterations per second, and the triple buffer is updated whenever the GPU finishes a frame, so I don't know how to do that any faster, to be honest :smiley:

EDIT: Regarding culling: the given video doesn't use any culling. I have a culling system that implements two-phase occlusion culling completely on the GPU, so the GPU feeds itself. It's capable of processing clusters (which are my kind of "batch") as well as instances… but… I don't know if it pays off, to be honest; I don't have nice numbers yet.

I’m starting to wonder, is it time for me to build my own forum:

If so I’m going to have to add SMTP to it so it can have mail too! :wink:

Ok, last update since this is no longer going to be a Java project (I’m going to try and “script” with C maybe if I can hot-load the machine code).

I managed to get the engine working on the Raspberry Pi 4 during Christmas: http://talk.binarytask.com/task?id=4064110776042269443

47 guys at 60 FPS, bottlenecked by the GPU, CPU at 30% without any optimizations, so I can make the game single-threaded if I want!

OpenGL (ES) 3 is the last GL version for me; it works really fast on both ARM and x86 with almost no platform-specific modifications! Only since this summer have we had a workable GPU on the Raspberry for this.

I'm not going to port to Vulkan. The performance gain does not pay for the development/maintenance time!

I also managed to get my old Ouya Bluetooth controller working flawlessly with the engine on the Pi:

This stuff (VC6, OpenGL ES 3, developing on Linux: apt-get install dep, /dev/input/js0 etc.) is amazing!

x86 is dead; vanilla Linux on ARM is the future for the desktop/console/portable! Android and the Nintendo Switch are getting some hardcore competition!

Countdown to a Raspberry Pi 4 Compute Module powered Switch killer starts now!

Edit: at some point I will try to make a Java JNI port of the engine, though. Since I think I will manage to make the engine hot-deploy with .so/.dll, that won't happen for a long while - but eventually.

@princec I cannot PM you here it seems? I’m curious about https://www.patreon.com/posts/33715502 Why did you switch to Unity?!?

Don’t know what’s wrong with DMs, they’ve been working fine for 20 years (!).

I didn’t switch to Unity, Chaz and Alli did, because of shiny shiny and toys (while I got on with the dull business of earning money with Java in my day job).
Unity makes a lot of things very easy for both artists and programmers. But it is also an incredible honey trap… fall into its sticky golden embrace and you will emerge many years later with a project so overblown in scope and polish it will never make a profit…

Cas :slight_smile:

I saw that too. That was a pretty good article.

Hey, I’m kinda addicted to this community so I’m going to keep posting about the C(++) client since it uses a Java backend and all, hope you don’t mind!

Here is the .dll hot-deployment API taking shape: http://edit.rupy.se/?host=move.rupy.se&path=/file/game.cpp&space

Replacing machine code on the fly is very underrated!

If you think the code looks strange, it's because it's designed to avoid cache misses: https://en.wikipedia.org/wiki/AoS_and_SoA
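For anyone who skips the link, the idea in a nutshell - shown as a hypothetical position/velocity update of mine, in Java for consistency with the thread, while game.cpp does the equivalent in C:

```java
// Structure-of-Arrays: one tight array per field. The update loop streams through
// contiguous memory, which is what keeps cache misses down compared to an
// Array-of-Structures layout (one object per entity, scattered across the heap).
public class SoAExample {
    static final int N = 10_000;
    static final float[] x = new float[N], y = new float[N];
    static final float[] vx = new float[N], vy = new float[N];

    static void integrate(float dt) {
        for (int i = 0; i < N; i++) {
            x[i] += vx[i] * dt;
            y[i] += vy[i] * dt;
        }
    }

    public static void main(String[] args) {
        integrate(1f / 60f);
        System.out.println(x[0]);
    }
}
```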

So I lost my account in the move, but here is the first runnable prototype of this engine:

I'll upload the engine to GitHub with user-editable stuff, hopefully around Christmas!


Ok, so I’m back at it with an idea that feels interesting but potentially hard: Use my own J2ME KVM to script the game!!!

So I downloaded the source for CLDC 1.1 back when Oracle still had it on their site. It's really tiny and is nothing more than a barebones JVM (or KVM as they call it - the "K" stands for kilobytes, because it was designed to run in a couple of hundred kilobytes of memory).

It really fits the needs of my engine pretty well, and since the rendering will be on another thread, GC pauses and the occasional inefficiency are acceptable!

The reason I'm going down this route is that I also want to have Java on the Raspberry Pi Pico, AND I want to replace NIO and the standard concurrency with userspace networking and my own concurrency for the backend… three birds with one stone!

Last but not least, I will of course keep the C API for more hardcore devs, but Java is better because the compiler is easily distributed on Windows without having to download VS, which is like 3 GB now!!!

So anyhow an update on this forever project!
