Java isn’t the one that’s “boss” in all this, it’s GLSL, or in the case of that demo, HLSL (basically the same thing). Java is just the glue layer here, and in that role it should perform about as well as anything else. It’s more a question of DX vs GL than of anything vs Java.
That demo does use a compute shader, which only has a direct equivalent in OpenGL 4.3, but is otherwise morally equivalent to OpenCL (or perhaps a subset of it). The author does link to the source (I’ll link it here too) so it would be interesting to see how much of it is directly portable.
My desktop doesn’t have an OGL4 graphics card, so I can’t test it on my computer. There’s a good chance that compute shaders are faster, but they’re still not as flexible as transform feedback. Sure, you might be able to cram out a few more particles, but it’s extremely inefficient when you only have a few. He’s probably getting better performance because his particles contain less information, most likely just 24 bytes vs my 36 bytes per particle. Besides, I just need a geometry shader to expand the points into quads, which is exactly what I did for that sprite engine. =S
I wouldn’t say that OpenCL = compute shaders; compute shaders are much easier to use correctly (especially when it comes to handling memory). I should really make one using OGL 4.3…
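For anyone curious, the core of a GL 4.3 compute-shader particle updater is pretty small. Here’s a minimal sketch (LWJGL-style Java with the GLSL embedded as a string; the particle layout and the names `program`, `particleSSBO` and `particleCount` are placeholders of mine, not taken from the demo):

```java
import static org.lwjgl.opengl.GL20.glUseProgram;
import static org.lwjgl.opengl.GL30.glBindBufferBase;
import static org.lwjgl.opengl.GL42.glMemoryBarrier;
import static org.lwjgl.opengl.GL43.*;

// Minimal compute shader that integrates particles in-place in an SSBO.
String computeSrc =
    "#version 430\n" +
    "layout(local_size_x = 256) in;\n" +
    "struct Particle { vec2 pos; vec2 vel; };\n" +
    "layout(std430, binding = 0) buffer Particles { Particle p[]; };\n" +
    "uniform float dt;\n" +
    "void main() {\n" +
    "    uint i = gl_GlobalInvocationID.x;\n" +
    "    p[i].vel.y -= 9.81 * dt;      // gravity\n" +
    "    p[i].pos   += p[i].vel * dt;  // integrate position\n" +
    "}\n";

// After compiling and linking computeSrc into `program` as usual:
glUseProgram(program);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, particleSSBO);
glDispatchCompute(particleCount / 256, 1, 1);   // one invocation per particle
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); // make writes visible before drawing
```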
I’m trying to learn OpenCL (I’m currently reading Heterogeneous Computing with OpenCL), and I was amazed by the lack of tutorials on the internet.
Is OpenCL really that unpopular?
It is viable, but from what I understand it’s hard to integrate into a game.
The big problem with things such as particle systems/physics is that they can be very computationally intensive. With particle systems, just having 100k particles means that if everything is done on the CPU, it has to calculate the position 100k times per frame, calculate everything else about each particle 100k times, and still send the updated particles to the GPU.
With OpenCL you can have the GPU do all the calculations, but you still need to send things to the GPU, which is where my particle system dies. theagentd’s suggestion is very nice, as you get a prebuilt VBO that you can simply throw at the GPU, meaning the CPU does next to nothing. Also, transform feedback is usable in OpenGL 3.0, which is very nice. I’m now wondering if I should make a sprite batcher using this, since geometry shaders are core in 3.2 anyway.
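To make that concrete, here’s roughly what a transform feedback update pass looks like (a hedged LWJGL-style sketch; `updateProgram`, `srcVBO`, `dstVBO` and `particleCount` are placeholder names, and the attribute setup is elided):

```java
import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL20.glUseProgram;
import static org.lwjgl.opengl.GL30.*;

// Ping-pong update: read particle state from srcVBO, let a vertex shader
// integrate it, and capture the outputs straight into dstVBO on the GPU.
glEnable(GL_RASTERIZER_DISCARD);                  // update pass only, draw nothing
glUseProgram(updateProgram);
glBindBuffer(GL_ARRAY_BUFFER, srcVBO);
// ... glVertexAttribPointer calls for position, velocity, color, life ...
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, dstVBO);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, particleCount);        // one point per particle
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
// Swap srcVBO and dstVBO, then render dstVBO like any other VBO.
```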
CPU version: 12 bytes {2D float position, RGBA byte color}, multithreaded, RAM-bandwidth limited (I have some cheap 1600 MHz DDR3 RAM), 2.0 million particles.
OpenGL version: 23 bytes (padded to 24) {2D float position, 2D float velocity, RGB byte color, short maxLife, short lifeLeft} stored in 3 textures, OGL 3 only, 3.8 million particles (one GPU)
OpenCL version: 23 bytes (padded to 24) {2D float position, 2D float velocity, RGB byte color, short maxLife, short lifeLeft} stored in 3 VBOs, updated with OpenCL, 1.1 million particles (one GPU), see below.
Transform feedback version: 36 bytes {2D float position, 2D float velocity, RGB FLOAT color, float maxLife, float lifeLeft} stored interleaved in a VBO, updated with transform feedback, 3.0 million particles (one GPU), 6.0 million particles (two GPUs).
It’s only possible to output 4-byte floats and ints with transform feedback, so instead of compressing stuff I just converted everything to floats, hence the inflated particle size.
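For reference, the captured outputs are declared at link time, and every component comes back 4 bytes wide, which is where the inflation comes from. A sketch (the varying names are my guesses at the layout above, not the actual code):

```java
import static org.lwjgl.opengl.GL20.glLinkProgram;
import static org.lwjgl.opengl.GL30.GL_INTERLEAVED_ATTRIBS;
import static org.lwjgl.opengl.GL30.glTransformFeedbackVaryings;

// Must be called BEFORE glLinkProgram. Every captured component is a
// 4-byte float/int, so the byte color and short life fields become floats.
glTransformFeedbackVaryings(updateProgram,
        new CharSequence[] { "outPosition", "outVelocity", "outColor",
                             "outMaxLife", "outLifeLeft" },
        GL_INTERLEAVED_ATTRIBS); // capture everything into one interleaved VBO
glLinkProgram(updateProgram);
```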
WTF IS UP WITH OPENCL AGAIN?! I’m getting really tired of how sensitive OpenCL seems to be. On my laptop’s GTX 460M, the exact same code performs on par with the OpenGL version (2.2 million particles). I think it’s because the 400 series had extensive hardware changes compared to the 200 series, but I’m really not thrilled to start delving into that stuff again…
Doing something 100k times isn’t really that much when you have 4 cores each doing 3 billion clock cycles per second. The problem is actually the insane amount of memory bandwidth needed. Just getting two RAM sticks and running them in dual channel gave me a 60% speed boost on a dual-core laptop compared to single channel. Most particles only need some basic math.
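To put rough numbers on it (using the 2-million-particle CPU figure from above): 2,000,000 particles × 12 bytes, each read and written once per frame at 60 FPS, is already 2 × 2,000,000 × 12 × 60 ≈ 2.9 GB/s, and that’s before streaming the updated vertices to the GPU every frame. A single channel of DDR3-1600 peaks at about 12.8 GB/s in theory, and you get far less than that in practice, so bandwidth disappears fast.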
I wouldn’t recommend a sprite batcher on the GPU. You’d need to run your whole game on the GPU for it to know HOW to move your sprites around.
Ah, I couldn’t come up with a better name for maxLife. It’s just how much life (= how many frames) the particle is supposed to last in total, while lifeLeft is the amount of life remaining (= how many more frames it should last). I use them to calculate the alpha: alpha = lifeLeft / maxLife. I do this on the CPU for the CPU version, hence the 4-byte RGBA color there. With the life values available on the GPU I only needed 3 color bytes, but I padded them to 4 anyway to gain some performance.
And yeah, a game almost fully run on the GPU would be awesome.
Probably it doesn’t count, but what about all those Game of Life simulators?
They’re not really games, but there are many implementations that run fully on the GPU.
Some time ago I saw a game posted on glslsandbox which was quite simple, but ran entirely in a single shader. The game state was saved in the same texture that was displayed.
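The classic way to do that is ping-ponging between two textures with a fragment shader that reads the neighbourhood and writes the next state. A hedged sketch of such a shader (GLSL embedded in a Java string; the uniform names are my placeholders, not the glslsandbox code):

```java
// One-shader Game of Life: the state lives in a texture, you render the
// next generation into a second texture and swap them each frame, so the
// displayed texture IS the game state.
String lifeShader =
    "#version 130\n" +
    "uniform sampler2D state;   // last frame = current game state\n" +
    "uniform vec2 texel;        // 1.0 / resolution\n" +
    "in vec2 uv;\n" +
    "out vec4 fragColor;\n" +
    "void main() {\n" +
    "    int n = 0;\n" +
    "    for (int x = -1; x <= 1; x++)\n" +
    "        for (int y = -1; y <= 1; y++)\n" +
    "            if (x != 0 || y != 0)\n" +
    "                n += int(texture(state, uv + vec2(x, y) * texel).r > 0.5);\n" +
    "    bool alive = texture(state, uv).r > 0.5;\n" +
    "    bool next = alive ? (n == 2 || n == 3) : (n == 3); // Conway's rules\n" +
    "    fragColor = vec4(vec3(next ? 1.0 : 0.0), 1.0);\n" +
    "}\n";
```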
I’ve been thinking about porting the Flash game Creeper World to run almost completely on the GPU.
IN OTHER NEWS
I got rid of the 12 additional bytes in the transform feedback version, performance is now 3.55 million particles with transform feedback, up from 3.00 million. Haven’t ported the SLI version yet, but it’d most likely hit 7 million particles at 60 FPS.
GPUs perform very poorly on heavily branching code. If you could get any kind of general-purpose VM running on a GPU at all, I imagine it would crawl on any code that wasn’t already suited for GPU execution, i.e. heavily vectorized algorithms.
Instead of creating a new thread, I decided to just necro this one.
To celebrate my exams being over and the start of winter break (well, okay, I do have a basic Java exam left :P), I decided to create a new particle engine thingy. Been working all day, but I finally got it done! It’s a collection of old things I’ve posted here plus a few new features!
It’s now completely in 3D with particles bouncing around inside a huge box.
Like before, updating is done using OpenGL transform feedback. Nothing new here.
The particles are rendered as billboarded sprites using a geometry shader. The 4 vertices are generated in eye space and then sent through the projection matrix.
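For reference, the point-to-quad expansion looks something like this (a sketch with the GLSL embedded in a Java string; the uniform names and sizing are mine, not the actual engine code):

```java
// Geometry shader: expands each point (already in eye space) into a
// camera-facing quad, then projects the 4 corners.
String billboardGS =
    "#version 150\n" +
    "layout(points) in;\n" +
    "layout(triangle_strip, max_vertices = 4) out;\n" +
    "uniform mat4 projection;\n" +
    "uniform float size;\n" +
    "out vec2 texCoord;\n" +
    "void main() {\n" +
    "    vec4 eyePos = gl_in[0].gl_Position; // eye-space position from the VS\n" +
    "    for (int i = 0; i < 4; i++) {\n" +
    "        vec2 corner = vec2(i & 1, i >> 1) * 2.0 - 1.0; // the 4 quad corners\n" +
    "        texCoord = corner * 0.5 + 0.5;\n" +
    "        gl_Position = projection * (eyePos + vec4(corner * size, 0.0, 0.0));\n" +
    "        EmitVertex();\n" +
    "    }\n" +
    "    EndPrimitive();\n" +
    "}\n";
```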
Nothing really huge here, pretty much a mix of old stuff. The new feature is particle SORTING! I implemented a (very inefficient) radix sort using transform feedback to sort particles based on their depth. My algorithm needs 2 passes per bit of depth precision, meaning that for 24-bit integer depth I need 48 passes over the particles! Shit!
To combat this I decided to also do frustum culling when calculating the depth of each particle, so only particles that are actually on screen get sorted. All of them are still updated, of course. This gave me a big performance boost when only a small number of particles are visible, but that’s kind of cheating… =S
Anyway, I’m getting 200k sorted particles at 63 FPS (one GPU) at the moment. The particle culling is extremely effective, improving FPS to 600 when no particles are visible (they’re all still being updated). Using OGL4 I could reduce the number of sorting passes by a factor of 4 or even 8 at the cost of a small amount of video memory, but for now I’m stuck on OGL3. If anyone knows a more efficient sorting algorithm available in OpenCL or something like that, I’d love to hear about it!
It’s still too unoptimized. I need to find an algorithm that doesn’t require so many passes… Right now it seems to be a lot slower than Arrays.sort() on the CPU in raw sorting performance. If I disable culling, rendering and updating, I can sort 450,000 particles with 18-bit accuracy at 60 FPS (the algorithm scales linearly with the number of objects), so around 27 million sorted keys per second. Compare that to real OpenCL sorting libraries, which claim performance closer to a billion 32-bit keys sorted per second. I just have no idea how to implement this with OpenCL…
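For comparison, this is essentially what my sort boils down to if you write it on the CPU: a stable partition per bit, two passes per bit, over an index array keyed by depth (a plain Java sketch of the idea, not the shader code):

```java
// LSD radix sort on integer depth keys, one bit per step with two passes
// per bit (count, then stable scatter), mirroring the 2-passes-per-bit
// cost of the transform feedback version. O(bits * n).
static void radixSortIndices(int[] depth, int[] idx, int bits) {
    int n = idx.length;
    int[] tmp = new int[n];
    for (int b = 0; b < bits; b++) {
        int zeros = 0;
        for (int i = 0; i < n; i++)                  // pass 1: count the 0-bits
            if ((depth[idx[i]] >>> b & 1) == 0) zeros++;
        int z = 0, o = zeros;
        for (int i = 0; i < n; i++)                  // pass 2: stable scatter
            if ((depth[idx[i]] >>> b & 1) == 0) tmp[z++] = idx[i];
            else                                     tmp[o++] = idx[i];
        System.arraycopy(tmp, 0, idx, 0, n);
    }
}
```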
The algorithm also doesn’t sort the particles themselves; it sorts their indices by distance (8-byte keys). Since I have so many particles I have to use 32-bit indices. It turns out that randomly indexing into the particle array is a lot slower than just drawing them all sequentially. Sorting indices means less data gets copied around during the sort, but it might be a good idea to actually reorder the particle buffer too. Since the order changes very slowly, that would keep the indices essentially sequential, so few particles would end up far enough from their old position to cause a cache miss.
It doesn’t look that impressive in still images either. The coolest part is when I move the camera through that smoke cloud and can literally only see a few meters ahead. Without sorting, the cube of smoke looks hollow, since the incorrect blending of the particles gives the illusion that you can see inside it. Sorting will also handle correct blending of things like fire and smoke.
[x] if they are aligned to the camera orientation, then rotating the camera changes the sort order drastically, as the correct render order is determined by the (infinite plane)-point distance, not the real distance to the camera.
[x] if they are facing the camera (perpendicular to it), you indeed have a relatively stable order, but then the sprites will intersect each other.