What I did today

theagentd · February 9, 2016, 10:38pm

A few months ago I saw these slides: http://www.slideshare.net/DevCentralAMD/holy-smoke-faster-particle-rendering-using-direct-compute-by-gareth-thomas

Apparently they found that forgoing rasterization of particles and instead going with a tiled compute shader was actually faster than hardware blending. In essence they divided the screen into tiles, binned all particles to those tiles and then had a compute shader “rasterize” the particles, blending into a vec4 completely inside the shader (no read-modify-write to VRAM). They also implemented sorting of particles in the compute shader.
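To make the binning step concrete, here is a minimal CPU-side sketch in Java. It is not taken from the slides: the class name, the {x, y, radius} particle layout and the bounding-square test are all assumptions made for illustration, and a real implementation would do this on the GPU.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of screen-space tile binning; not the slides' actual code.
class TileBinning {
    /**
     * particles[i] = {x, y, radius} in pixels (assumed layout).
     * Returns one list of particle indices per tile, tiles stored row by row.
     */
    static List<List<Integer>> binParticles(float[][] particles, int screenW, int screenH, int tileSize) {
        int tilesX = (screenW + tileSize - 1) / tileSize; // round up so edge pixels get a tile
        int tilesY = (screenH + tileSize - 1) / tileSize;
        List<List<Integer>> bins = new ArrayList<>();
        for (int i = 0; i < tilesX * tilesY; i++) {
            bins.add(new ArrayList<>());
        }

        for (int p = 0; p < particles.length; p++) {
            float x = particles[p][0], y = particles[p][1], r = particles[p][2];
            // Range of tiles touched by the particle's bounding square, clamped to the screen.
            int minTx = Math.max(0, (int) Math.floor((x - r) / tileSize));
            int maxTx = Math.min(tilesX - 1, (int) Math.floor((x + r) / tileSize));
            int minTy = Math.max(0, (int) Math.floor((y - r) / tileSize));
            int maxTy = Math.min(tilesY - 1, (int) Math.floor((y + r) / tileSize));
            for (int ty = minTy; ty <= maxTy; ty++) {
                for (int tx = minTx; tx <= maxTx; tx++) {
                    bins.get(ty * tilesX + tx).add(p);
                }
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        float[][] particles = { {8f, 8f, 4f}, {16f, 16f, 2f} };
        List<List<Integer>> bins = binParticles(particles, 32, 32, 16);
        for (int t = 0; t < bins.size(); t++) {
            System.out.println("tile " + t + ": " + bins.get(t));
        }
    }
}
```

Each tile's list is what a per-tile shader invocation would then loop over, so a pixel only ever looks at particles that can actually touch its tile.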

I took a slightly different approach. I’ve been experimenting with quite a few order-independent transparency (OIT) algorithms over the past few months to a year, and I’ve got stochastic transparency, adaptive OIT (AOIT), Fourier-mapped OIT (hopefully I’ll get around to posting my bachelor thesis on this soon) and a simple reference renderer that sorts on the CPU. So as a first test, I tried merging all 3 passes of stochastic transparency into a single compute shader. Instead of writing to 8xRGBA16F render targets in the first pass, reading all those textures in the second pass and doing the weighted-average resolve in a final fullscreen pass, I simply use 8 vec4s in the compute shader, immediately do the second pass, again writing to local variables, and finally do the resolve and output the final RGBA of all particles blended together correctly, all in one (not so) massive shader.

I currently lack the tile binning, and a lot of calculations are still done on the CPU and need a lot of optimization, but the GPU performance looks very promising. In some cases with a large number of particles covering the entire screen, the compute shader achieves almost twice the framerate of my old algorithm and, even more impressively, reduces memory controller load from 80% to a meager 2% at the same time. The next step would be to port adaptive OIT to a compute shader. This would be even more interesting as it would eliminate the need for a linked list, since I can just compute the visibility curve as I process the particles. In theory this would allow AOIT to work on OpenGL 3.3 hardware if I just emulate a compute shader with a fullscreen fragment shader pass.

The biggest problem with this approach is that I would need to have every single piece of transparent geometry available in the compute shader, and I wouldn’t be able to have different shaders for different particles. However, it would be possible to use the tiled compute shader only to construct the visibility curve for AOIT, output the curve to textures and then proceed with the second pass as usual. That would allow me to have fairly complex shaders for the particles (as long as they don’t modify alpha in a complex way) and still have the flexibility of my old system.

I don’t really have any good pictures of all this that I’m proud enough of to show off, but hopefully I’ll get some nice screenshots in the end. >___<

chrislo27 · February 10, 2016, 3:31am

After working in 2D for the past five years, I finally decided to dabble in 3D. I followed xoppa’s libgdx cube thingy and rendered a cube! In the end I actually chose the orthographic camera to render it, and I like it more than the perspective camera, probably because I’m so used to 2D flat projections.

Archive · February 10, 2016, 3:32am

3D is so fun haha keep workin on it

ra4king · February 10, 2016, 7:27am

Several questions about your rendering:

What did you need 8 render targets for in the first pass?
How does your new render system work with the compute shader? Do you still output to the render targets, with each instance of the compute shader then processing the 8 vec4s across the 8 RTs?
How is memory controller load reduced from 80% to 2%? Isn’t the same data being processed by the compute shaders instead of fragment shaders now?

I’ve been getting very interested in how OIT works!

Roquen · February 10, 2016, 8:35am

On OIT: There’s a new paper by Morgan McGuire & Michael Mara: http://graphics.cs.williams.edu/papers/TransparencyI3D16/

The reference version isn’t up yet though…probably after the conference.

theagentd · February 10, 2016, 10:14am

I’ll go with the stochastic OIT I implemented to explain it. Stochastic OIT basically means that you have a number of samples per pixel, and each sample has a chance to pass based on the alpha of what covers it. On average, this produces the correct result. The old algorithm looked like this:

In the first pass I write to 8 RGBA16 render targets with GL_MIN blending. I’m basically emulating 32 separate depth buffers this way (using MSAA and a coverage mask requires OGL4+ and limits samples to 8), and the shader outputs either its depth for a sample that randomly passes, or 1.0 when it doesn’t. This part is purely bandwidth bound. In the second pass, I basically do weighted blended order-independent transparency (WBOIT) using the 8 textures from before as the weighting function. This means writing to 1xRGBA16F and 1xR16F render targets, reading the right pixel from all 8 render targets, and then doing a step comparison between the depth of the current particle and all stochastic samples. This is also heavily bandwidth bound as the texture data doesn’t fit in the texture cache. Finally, there’s a fullscreen pass to resolve the weighted sum, writing to the final output RGBA16F render target.

For a compute shader, none of the textures are needed. I can replace the 8xRGBA16 with a vec4[8] and write to those instead. Since they’re just registers in the compute shader, they never go out to main memory. Similarly, reading those samples is again just a register read, so neither the texture hardware nor main memory is ever involved. I can even do the final weighted sum resolve in the compute shader as well, and finally just do an image-store write to the output RGBA16F render target. Only this final write actually touches main memory.
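As a rough illustration of the merged version, here is what one pixel's work could look like, written as CPU-side Java rather than GLSL. This is a sketch under assumptions, not the actual shader: the {r, g, b, alpha, depth} particle layout, the class and method names, and the exact form of the resolve (weighted average scaled by total coverage, in the style of WBOIT) are guesses at the approach described above.

```java
import java.util.Arrays;
import java.util.Random;

// CPU-side illustration only; the real thing would be a GLSL compute shader.
class StochasticPixel {
    static final int SAMPLES = 32; // 8 vec4s worth of stochastic depth samples

    /**
     * particles[i] = {r, g, b, alpha, depth} with depth in [0, 1), in any order (assumed layout).
     * Returns premultiplied {r, g, b, coverage} for one pixel.
     */
    static float[] shade(float[][] particles, Random rng) {
        // "Pass 1": stochastic depth samples kept in local variables, combined with min().
        float[] sampleDepth = new float[SAMPLES];
        Arrays.fill(sampleDepth, 1.0f);
        for (float[] p : particles) {
            for (int s = 0; s < SAMPLES; s++) {
                if (rng.nextFloat() < p[3]) { // each sample passes with probability alpha
                    sampleDepth[s] = Math.min(sampleDepth[s], p[4]);
                }
            }
        }

        // "Pass 2": weighted sums; the weight is the fraction of samples not occluded.
        float sumR = 0f, sumG = 0f, sumB = 0f, sumW = 0f, transmittance = 1f;
        for (float[] p : particles) {
            int visible = 0;
            for (int s = 0; s < SAMPLES; s++) {
                if (p[4] <= sampleDepth[s]) visible++; // step comparison against each sample
            }
            float w = p[3] * visible / SAMPLES;
            sumR += p[0] * w;
            sumG += p[1] * w;
            sumB += p[2] * w;
            sumW += w;
            transmittance *= 1f - p[3];
        }

        // "Pass 3": resolve. Normalise the weighted average, then scale by total coverage.
        float coverage = 1f - transmittance;
        if (sumW <= 0f) return new float[] {0f, 0f, 0f, 0f};
        float k = coverage / sumW;
        return new float[] {sumR * k, sumG * k, sumB * k, coverage};
    }

    public static void main(String[] args) {
        // Red at depth 0.2 in front of blue at depth 0.6, both 50% opaque.
        float[][] particles = { {1f, 0f, 0f, 0.5f, 0.2f}, {0f, 0f, 1f, 0.5f, 0.6f} };
        System.out.println(Arrays.toString(shade(particles, new Random(1))));
    }
}
```

Everything between the particle loop and the returned colour lives in local variables, which is the point: only the final result would need to be written out to memory.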

Yeah, I haven’t entirely figured out what they’re doing that’s actually new in that one. Still working on it.

Roquen · February 10, 2016, 11:55am

I’ve only skimmed the paper. I’m going to wait for the reference version…the conference is in a couple of weeks.

matt_p · February 10, 2016, 1:41pm

Everything I’ve done today and will do (and the last days too): refactoring!
Currently I’m quite happy that it made a huge difference this time: less code, better structured and easier to understand (I think that covers pretty much all the reasons to refactor in the first place).
But I hope I will come back to actual implementation soon(ish)

theagentd · February 10, 2016, 3:09pm

If that works as they say it does, using less than 16 bytes per pixel with only a single pass and without any need for OGL4+ features, it’s completely revolutionary. The result looks correct for the glass, but the lack of tests with smoke particles and similar stuff is a bit suspicious.

Apo · February 10, 2016, 6:39pm

Started a new game. It is a mix of my two favorite games: chess and soccer. It works really well, and now I’m trying to create a nice AI.

LostWarrior · February 10, 2016, 9:30pm

This looks quite interesting Apo. Could you elaborate a bit more on how you play the game?

Apo · February 10, 2016, 10:50pm

There are some simple rules. When it’s your turn, you can move one piece according to the rules of chess and pass/shoot the ball up to three times (if possible). To pass the ball, one of your pieces has to be standing near the ball. As you can see in the screenshot, a white knight and the queen are near, so now the ball can move like a knight or like a queen. When only a bishop is near, the ball can move like a bishop. The aim is to score three times. You score by putting the ball in the goal or by checkmating your opponent.
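The passing rule could be sketched like this in Java. This is only an illustration, not the game's code: the names are made up, and it assumes that "near" means standing on one of the eight squares around the ball, which the rules above don't spell out.

```java
import java.util.Arrays;
import java.util.EnumSet;

// Hypothetical sketch of the "ball moves like the nearby pieces" rule.
class BallMoves {
    enum Piece { KING, QUEEN, ROOK, BISHOP, KNIGHT, PAWN }

    static final class Placed {
        final Piece type;
        final int x, y;

        Placed(Piece type, int x, int y) {
            this.type = type;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the piece types whose movement pattern the ball may use this turn. */
    static EnumSet<Piece> movePatternsForBall(int ballX, int ballY, Iterable<Placed> ownPieces) {
        EnumSet<Piece> result = EnumSet.noneOf(Piece.class);
        for (Placed p : ownPieces) {
            int dx = Math.abs(p.x - ballX);
            int dy = Math.abs(p.y - ballY);
            // Assumption: "near" means one of the 8 squares surrounding the ball.
            if (Math.max(dx, dy) == 1) {
                result.add(p.type);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Iterable<Placed> white = Arrays.asList(
                new Placed(Piece.KNIGHT, 3, 4),
                new Placed(Piece.QUEEN, 5, 5),
                new Placed(Piece.BISHOP, 0, 0));
        System.out.println(movePatternsForBall(4, 4, white));
    }
}
```

With a knight and a queen next to the ball and a bishop far away, only the knight and queen patterns come back, matching the situation described in the post.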

jonjava · February 11, 2016, 12:57am

You’re a fucking beast, man.

Mac70 · February 11, 2016, 1:31am

Had some fun with Java 8/JavaFX properties:

https://dl.dropboxusercontent.com/u/67758055/Wurm/Screens/properties.gif

The entire code that does this looks exactly like this:

inputsStage.titleProperty().bind(Bindings.concat("Inputs of ", block.getData().titleProperty()));

Coldstream24 · February 11, 2016, 6:20am

my engine now has scripting support. currently actors can be told to follow paths, speak sentences, enter vehicles, and look around.

here’s an example of defining paths, and being able to show them on the map you’re editing

Mac70 · February 12, 2016, 2:53am

Another day, another GIF… The map generator is taking shape, and it is finally time to start adding actual generation on top of this framework, which is basically the UE4 Blueprint system implemented in Java, with some additional features to make it friendlier for a typical user.

https://dl.dropboxusercontent.com/u/67758055/Wurm/MapPlanner/parameters.gif

elect · February 12, 2016, 1:32pm

Very interesting theagentd…

I have already played around with some oit, such as (dual)depth peeling, weighted sum and average…

This is on my todo list at the moment

What can you tell me about how depth peeling compares with stochastic, adaptive and Fourier-mapped OIT in terms of quality, memory and speed?

Archive · February 13, 2016, 12:41am

Added a church to the god in Shard Master, Lazzar, complete with pews, stained glass, and a stone altar.

also started working on necromancer swamp

ShadedVertex · February 13, 2016, 2:51am

I see lots of potential in the engine, especially since there aren’t that many Java game engines with visual workflows.

SHC · February 13, 2016, 4:25am

Got GWT-AL to play audio in the browser through OpenAL API. Here’s a screenshot:

And of course some code here.


// Create the context and make it current
ALContext context = ALContext.create();
AL.setCurrentContext(context);

// Create the source
int alSource = AL10.alGenSources();

// You need the AudioDecoder to decode audio using the browser's WebAudioAPI
AudioDecoder.decodeAudio
(
    // The 'ArrayBuffer' containing the data (just binary, along with the container and header, not samples)
    data,

    // Upon success, we are returned an OpenAL buffer ID, and the decoded data is uploaded
    alBufferID ->
    {
        // Attach the buffer to the source and play it
        AL10.alSourcei(alSource, AL10.AL_BUFFER, alBufferID);
        AL10.alSourcePlay(alSource);
    },

    // Upon error, you are returned a reason, which is a String explaining what failed.
    reason -> GWT.log("decodeAudio error: " + reason)
);

There is also the original alBufferData function, but it is only usable if you have PCM samples and not a binary file. The example will be hosted online very soon.