C++/Java Engine without GC in graphics

How many triangles/bones does your character have?

Mine has 2500/38. I don't think I'm choked on my GPU (a 1050 Ti; I haven't actually checked, how does one even check that?), but it runs pretty hot.

Are your bots instanced? If not, they probably should be.

Yes, they’re instanced (the terrain, which makes up the vast bulk of the rendering, of course is not). Can’t remember how many polys there are, but they’re properly optimised. I’ve only got a 960.

Cas :slight_smile:

Well, then you are comparing apples and oranges.

My characters are non-instanced, so they can look different, have different animations and be controlled by a separate player in a physical world. Try to make that in Java; I can promise you won’t be able to run 3000 at 60 FPS.

Instanced stuff is very hard to make interesting for gameplay, since the positions are stored in textures on the GPU and interaction with CPU code is hard. It becomes a visual treat whose effect doesn’t last very long, just like VR.

Though I guess we’ll finally see tomorrow when Valve “unveils” (whatever that means, release in 2022?) their flagship VR title.

Disclaimer: I have DK1, DK2 and Vive; so I really really hope I’m wrong.

But what part of it couldn’t be done in Java exactly? I’ve not generally experienced vastly slower performance in Java than C++ in straightforward compute tasks.

Cas :slight_smile:

Speed: you cannot get the matrix multiplications to perform fast enough.

So you’re wasting electricity and you cannot scale gameplay.

Look at Zelda BotW, for example: there are never more than 10 animated characters on screen at any single point in the game. That’s because they are only pulling 5 W, and they are not using my Park engine! :wink:
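
For concreteness, this is roughly the kind of per-frame work being argued about, using the triangle/bone counts mentioned earlier in the thread. It's only an illustrative sketch (plain `float[16]` column-major matrices, made-up names), not code from either engine:

```java
// Hypothetical illustration of the hot loop under discussion: multiplying a
// parent bone matrix by a local bone matrix for every bone of every character,
// every frame. Plain float[16] column-major matrices, no objects allocated.
public final class SkinningLoop {

    // dst = a * b; all arrays are 4x4 column-major matrices of length 16
    static void mul4x4(float[] a, float[] b, float[] dst) {
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                dst[col * 4 + row] = sum;
            }
        }
    }

    public static void main(String[] args) {
        int characters = 3000, bones = 38;
        float[][] local = new float[characters * bones][16];
        float[][] world = new float[characters * bones][16];
        float[] parent = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
        for (float[] m : local) m[0] = m[5] = m[10] = m[15] = 1f; // identity

        long t0 = System.nanoTime();
        for (int i = 0; i < local.length; i++) {
            mul4x4(parent, local[i], world[i]); // ~114k matrix multiplies per frame
        }
        System.out.printf("%d bone matrices in %.2f ms%n",
                local.length, (System.nanoTime() - t0) / 1e6);
    }
}
```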

The main problem with Java, IMHO, is that when handling composite data you’re mostly bound to using objects, which carries a memory and compute overhead. I would love to have something like C structs in Java.

Just found this here, it explains the problem briefly and offers a solution: https://tehleo.github.io/junion/

Haven’t tried it yet, but it looks interesting. Why isn’t there already something like this in HotSpot? Shouldn’t it be able to recognize when you’re using an object as a dumb data container?
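
To illustrate the layout problem (this isn't junion's API, just the manual version of the same idea): an array of small objects scatters data across the heap behind references, while the "struct-like" alternative is a flat primitive array. A minimal sketch, not a proper benchmark:

```java
// Hypothetical comparison of the two layouts being discussed. An array of
// small objects means one heap object (header + reference) per element,
// while a flat float[] keeps the same data contiguous and cache friendly.
public final class LayoutSketch {

    // "Composite data" as objects: convenient, but indirected and header-heavy.
    static final class Particle {
        float x, y, z, life;
    }

    public static void main(String[] args) {
        int n = 1_000_000;

        Particle[] objects = new Particle[n];
        for (int i = 0; i < n; i++) objects[i] = new Particle();

        // Struct-style: same data, packed into one array (x, y, z, life interleaved).
        float[] flat = new float[n * 4];

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) objects[i].x += 0.1f;
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) flat[i * 4] += 0.1f;
        long t2 = System.nanoTime();

        // Naive timing, only meant to make the layout difference visible.
        System.out.printf("objects: %.2f ms, flat: %.2f ms%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}
```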

Such an optimization has existed for ages: the JIT will eliminate object allocations and decompose an object into its fields whenever the object does not escape the “inline scope” that the JVM builds while optimizing a call stack and running escape analysis over it.
If you have relatively tight loops, not-too-deep call stacks and no allocated object escaping, then it is very likely that no object allocation actually happens. JITWatch provides more insight into this.
However, since this relies on inlining and escape analysis working it out automatically, and on your object not escaping, it is prone to fail in some circumstances - simply because your method size or call-stack depth exceeds some arbitrary limit and the JIT then just says “nope”.
Project Valhalla will solve this once and for all with value types, by having the programmer tell the JVM that they do not care about object identity.
I do agree with @bullen that the lack of SIMD and a few other missing CPU intrinsics is a big factor in performance. Together with the arbitrary inlining thresholds in the JVM (which can be configured, though), this gives you fairly unpredictable runtime behaviour.
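
A minimal sketch of the pattern that usually qualifies for this optimization (assuming the constructor gets inlined; the class and numbers are made up for illustration, the flags are standard HotSpot ones):

```java
// The temporary Vec3 never leaves lengthSum(), so once the constructor is
// inlined, HotSpot's escape analysis can mark it NoEscape and scalar-replace
// it - the loop then works on three floats instead of allocating per iteration.
public final class EscapeDemo {

    static final class Vec3 {
        final float x, y, z;
        Vec3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    static float lengthSum(float[] xs, float[] ys, float[] zs) {
        float sum = 0f;
        for (int i = 0; i < xs.length; i++) {
            Vec3 v = new Vec3(xs[i], ys[i], zs[i]); // candidate for scalar replacement
            sum += (float) Math.sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = new float[1_000_000];
        java.util.Arrays.fill(a, 1f);
        for (int warm = 0; warm < 100; warm++) {
            lengthSum(a, a, a); // let the JIT compile and inline the hot path
        }
        System.out.println(lengthSum(a, a, a));
        // Compare allocation pressure with -XX:-DoEscapeAnalysis, and inspect
        // inlining decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining.
    }
}
```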

[quote]Such an optimization has existed for ages
[/quote]
Yes, but that is definitely not what I described; the optimization you mentioned only affects local throwaway objects, not large persistent object arrays, if I understand you correctly. I’ll have to wait for Project Valhalla, I guess.

Ah, yes, I see. I misunderstood you there, I guess.

Valhalla is the game-changer in high-performance Java computing. It’s a shame it’s taking so long.

Cas :slight_smile:

Looks like Valhalla is pretty active commit-wise in the JDK:

http://hg.openjdk.java.net/valhalla/valhalla

[quote=“bullen,post:23,topic:58398”]
You are too easy on promises, I fear :slight_smile:

https://www.youtube.com/watch?v=AgxddJtSVx0

This runs on my notebook with a GTX 1060 (iirc on battery, which makes a difference, at least on my machine) in my own, at that time pretty hacky, engine, implemented in Java/Kotlin. It was mostly limited by the GPU in this case, because I am pretty generous with resources when I want to get a result quickly. Nonetheless, this thing is not instanced; it has unique materials and unique animations per mesh - it’s just that the mesh is all the same in this case.

We know that, but I also fear that @bullen has fallen for the Dark Side, and forever will it dominate their destiny :wink:

Cas :slight_smile:

Thanks for keeping my promise (2500 < 3000)! :wink:

That said, neat stuff! Is it open source?

The video description says: “2500 separately animated hellknights in Java with GPU skinning, instanced and indirect rendering”.

I would have liked to feel analogue input (mouse) on the rendering, because the latency seems high here (watching the mouse pointer while you rotate the camera). Low latency is more important than high bandwidth!

It’s so easy to get high FPS if you delay frames like RDR 2 or TLoU (especially the 60 FPS version that uses fibers and pushes the frames back by 2-3!).

How many FPS were you getting in your engine when this video was recorded? Also, are you culling?

Here is a new test where I try to limit the CPU and still render 500 animated characters. The FPS drops to 30 when I start the animated .gif recorder, for reasons!? I sleep 15 ms per frame and the rendering takes ~6 ms.

Counting cycles on Windows/Intel is voodoo magic; sometimes it seems to include the cycles spent during Sleep?!
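
One way around the cycle-counter confusion is to time the work and the sleep separately with System.nanoTime(). A minimal sketch of that kind of frame loop (the numbers and the placeholder method are illustrative, not the engine's actual code):

```java
// Illustrative frame loop that measures the render/update work separately
// from the sleep, so the ~6 ms of work never gets mixed up with the 15 ms
// sleep the way a raw Windows cycle counter might.
public final class FramePacing {

    public static void main(String[] args) throws InterruptedException {
        final long targetFrameNanos = 16_666_667L; // ~60 FPS

        for (int frame = 0; frame < 600; frame++) {
            long start = System.nanoTime();

            simulateAndRender(); // the actual work: update + draw submission

            long workNanos = System.nanoTime() - start;
            long sleepNanos = targetFrameNanos - workNanos;
            if (sleepNanos > 0) {
                Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
            }

            if (frame % 60 == 0) {
                System.out.printf("work %.2f ms, slept %.2f ms%n",
                        workNanos / 1e6, Math.max(sleepNanos, 0) / 1e6);
            }
        }
    }

    private static void simulateAndRender() {
        // placeholder standing in for animation + rendering
        try { Thread.sleep(6); } catch (InterruptedException ignored) { }
    }
}
```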

Included mouse shake and animation swapping for realism! :wink:

But my engine needs to be felt; the latency is the lowest I have ever played with!

C(++) is not fun, but the performance (and portability) makes it worth it.

That said I will still use Java on the backend, so I’m not completely corrupted by power yet.

Edit: did another recording with mouse pointer enabled:

The leftmost core is the .gif recorder and the engine runs on the rightmost core.

Haha, you’re right; I’m going to invest some time and see what the current state of my engine is capable of :slight_smile:

Yes, it’s open source, but I’m a bit ashamed of it because I mostly hack on this beast in my spare time while sitting on the bus and the train… so there are lots of hacky things in there, and it’s probably hard to read for anyone other than me. Maybe I can just write about how I did things? I’d be happy to talk about some optimizations, especially things you have experience optimizing in C++.

Regarding instancing: I’m doing regular instancing and I was generous with the instanced data. Nearly everything I use as object properties is also available as instanced properties - for example transformations, materials, animations (currently only 4 possibly active ones per object) and bounding boxes. So I can go with two draw calls for the main rendering. With indirect rendering, instancing or not instancing is practically the same for me, although I guess performance would differ when you have a lot of different meshes, because of bandwidth. The GPU is always the limit for me here.
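
For readers who haven't done this: a minimal sketch of what "regular instancing with generous per-instance data" can look like, assuming LWJGL-style GL bindings, a current GL 3.3 context, a bound VAO/index buffer, and made-up attribute locations 4..7 (none of this is the engine's actual code):

```java
import static org.lwjgl.opengl.GL33.*;

import java.nio.FloatBuffer;
import org.lwjgl.BufferUtils;

// One extra VBO holds a 4x4 transform per instance and is fed to the shader
// through four vec4 attributes with a divisor of 1, so a single draw call
// covers every instance of the mesh.
public final class InstancedDraw {

    public static void drawInstanced(float[] instanceTransforms, int indexCount, int instanceCount) {
        FloatBuffer data = BufferUtils.createFloatBuffer(instanceTransforms.length);
        data.put(instanceTransforms).flip();

        int instanceVbo = glGenBuffers();
        glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
        glBufferData(GL_ARRAY_BUFFER, data, GL_DYNAMIC_DRAW);

        // A mat4 attribute occupies four consecutive vec4 slots; divisor 1 makes
        // the attribute advance once per instance instead of once per vertex.
        for (int i = 0; i < 4; i++) {
            int location = 4 + i;
            glEnableVertexAttribArray(location);
            glVertexAttribPointer(location, 4, GL_FLOAT, false, 16 * Float.BYTES, i * 4 * Float.BYTES);
            glVertexAttribDivisor(location, 1);
        }

        // One draw call for all instances of this mesh.
        glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0L, instanceCount);
    }
}
```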

Regarding latency: I have a triple-buffered renderstate construct. The rendering/GPU stuff runs on its own thread with a Kotlin coroutine dispatcher and always goes flat out, or is limited by vsync. Then I have an update thread that is also a coroutine dispatcher. At the beginning of the update frame, there is a stage where one can schedule single-threaded execution. That means I don’t have to synchronize data structures, but can use scheduling to synchronize things easily. After this, all system updates are executed on a coroutine context with max core count, in general maxing out the CPU or being limited by the GPU. After that, the triple-buffer extraction is done, single-threaded. Since I wrote my own struct library, the extraction window is pretty small here because it’s close to a memcopy, and it has never limited me so far :slight_smile: This is probably one of the things that C++ can do better, because my struct library reduces extraction time but introduces a little bit of runtime overhead in general during the update.
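
For anyone unfamiliar with the triple-buffer handoff idea, here is a minimal, generic sketch of it (again not the engine's actual code): the update thread always owns a buffer to write into, the render thread always owns a buffer to read from, and the third slot holds the most recently published frame, so neither side ever blocks the other.

```java
import java.util.concurrent.atomic.AtomicReference;

// Triple-buffered state handoff between one update thread and one render
// thread. T stands in for the extracted per-frame render state.
public final class TripleBuffer<T> {

    private T writeSlot;                       // owned by the update thread
    private T readSlot;                        // owned by the render thread
    private final AtomicReference<T> pending;  // most recently published frame

    public TripleBuffer(T a, T b, T c) {
        this.writeSlot = a;
        this.readSlot = b;
        this.pending = new AtomicReference<>(c);
    }

    /** Update thread: the buffer to fill for the next frame. */
    public T writeBuffer() {
        return writeSlot;
    }

    /** Update thread: publish the filled buffer and take back a free one. */
    public void publish() {
        writeSlot = pending.getAndSet(writeSlot);
    }

    /** Render thread: grab the latest published frame (may be the previous one again). */
    public T latest() {
        readSlot = pending.getAndSet(readSlot);
        return readSlot;
    }
}
```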

Regarding input latency, I would need to take a look at where I update my inputs, but I think it’s just in the update loop that runs as fast as possible, sometimes at 10k iterations per second, and the triple buffer is updated whenever the GPU finishes a frame, so I don’t know how to do that any faster, to be honest :smiley:

EDIT: Regarding culling: the given video doesn’t use any culling. I have a culling system that implements two-phase occlusion culling completely on the GPU, so the GPU feeds itself. It’s capable of processing clusters (which are my kind of “batch”) and instances… but I don’t know if it pays off, to be honest; I don’t have nice numbers yet.

I’m starting to wonder, is it time for me to build my own forum:

If so I’m going to have to add SMTP to it so it can have mail too! :wink:

OK, last update, since this is no longer going to be a Java project (I’m going to try to “script” with C, maybe, if I can hot-load the machine code).

I managed to get the engine working on the Raspberry Pi 4 over Christmas: http://talk.binarytask.com/task?id=4064110776042269443

47 guys at 60 FPS, bottlenecked by the GPU, with the CPU at 30% without any optimizations, so I can make the game single-threaded if I want!

OpenGL (ES) 3 is the last GL version for me; it works really fast on both ARM and x86 with almost no platform-specific modifications! Only since this summer have we had a workable GPU on the Raspberry for this.

I’m not going to port to Vulkan. The performance gain does not pay for the development/maintenance time!

I also managed to get my old Ouya bluetooth controller working flawlessly with the engine on the Pi:

This stuff (VC6, OpenGL ES 3, developing on Linux; apt-get install dep, /dev/input/js0 etc.) is amazing!
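
The engine is C now, but the same /dev/input/js0 stream can be read from Java just as easily. A minimal sketch of the classic Linux joystick interface (each event is 8 bytes, little-endian: u32 time in ms, s16 value, u8 type, u8 number):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Reads raw js_event records from the classic joystick device and prints
// button and axis changes; type 0x01 = button, 0x02 = axis.
public final class JoystickReader {

    public static void main(String[] args) throws IOException {
        try (FileInputStream js = new FileInputStream("/dev/input/js0")) {
            byte[] raw = new byte[8];
            while (js.read(raw) == 8) {
                ByteBuffer event = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
                long time = event.getInt() & 0xFFFFFFFFL;
                short value = event.getShort();
                int type = event.get() & 0xFF;
                int number = event.get() & 0xFF;

                if ((type & 0x01) != 0) {
                    System.out.printf("[%d ms] button %d -> %s%n", time, number, value != 0 ? "down" : "up");
                } else if ((type & 0x02) != 0) {
                    System.out.printf("[%d ms] axis %d -> %d%n", time, number, value);
                }
            }
        }
    }
}
```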

x86 is dead; vanilla Linux on ARM is the future for the desktop/console/portable! Android and the Nintendo Switch are getting some hardcore competition!

The countdown to a Raspberry Pi 4 Compute Module-powered Switch killer starts now!

Edit: at some point I will try to make a Java JNI port of the engine, though. I think I will manage to make the engine hot-deploy with .so/.dll, so that won’t happen for a long while, but eventually.

@princec I cannot PM you here, it seems? I’m curious about https://www.patreon.com/posts/33715502 - why did you switch to Unity?!?

Don’t know what’s wrong with DMs, they’ve been working fine for 20 years (!).

I didn’t switch to Unity, Chaz and Alli did, because of shiny shiny and toys (while I got on with the dull business of earning money with Java in my day job).
Unity makes a lot of things very easy for both artists and programmers. But it is also an incredible honey trap… fall into its sticky golden embrace and you will emerge many years later with a project so overblown in scope and polish it will never make a profit…

Cas :slight_smile:

I saw that too. That was a pretty good article.