4 or more threads in render loop

What do you mean by availability for FlushMappedBufferRange?

Cas will understand.

He means I’ve not checked the GL caps to ensure that function is available (it throws an NPE in LWJGL when you forget to check and it’s not present).
Right now the engine’s mandated to be OpenGL 3.0+, so I’m not sure it’ll help in my specific case, but it’s a good idea in library code (such as this is).
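For reference, the kind of guard being talked about, assuming LWJGL 2 (the offset/length arguments and the fallback branch are just placeholders):

[code]
import org.lwjgl.opengl.ContextCapabilities;
import org.lwjgl.opengl.GL15;
import org.lwjgl.opengl.GL30;
import org.lwjgl.opengl.GLContext;

// Illustrative only: guard the GL 3.0 entry point before calling it so LWJGL
// doesn't NPE on a context where the function pointer is missing.
final class FlushHelper {
    static void flushRange(long offset, long length) {
        ContextCapabilities caps = GLContext.getCapabilities();
        if (caps.OpenGL30) {
            GL30.glFlushMappedBufferRange(GL15.GL_ARRAY_BUFFER, offset, length);
        } else {
            // No GL 3.0: skip the flush or take some other fallback path here.
        }
    }
}
[/code]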

Cas :slight_smile:

So when did you stop supporting Java 1.4? :slight_smile: OK… it’s been a while since I’ve read anything about your backward compatibility decisions. I think if you’d said “I’m using ForkJoinPool and invoke” that would have got around a lot of the chat above.

Yeah. Hard to say how much faster you can get with that particular direction, i.e. the invoke-and-wait strategy. The only thing I can think of offhand is not storing things in a Sprite class, and instead keeping a bare array of render data to be more cache friendly (of course there are limitations in that direction: less readable code, etc.).

I was under the impression that you were filling buffers concurrently to a separate rendering thread. Figuring out a strategy like that would make things faster, as you fill separately and render immediately with the latest filled data. Even a separate single thread filling buffers could be rather quick. I’m not saying implementing this is easy; it sounds like you’re plenty happy with performance presently, so you know… keep making that game. Can’t wait to see it.
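For anyone who hasn’t used it, the invoke-and-wait shape with ForkJoinPool looks roughly like this (Sprite and animate() are made-up stand-ins, not the real engine code):

[code]
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

interface Sprite { void animate(); } // stand-in for whatever the real sprite type is

// Illustrative only: split a batch of sprites across the pool, animate them in
// parallel, and block in invoke() until every subtask has finished.
public class AnimatePass extends RecursiveAction {
    private static final int THRESHOLD = 10_000; // chunk size, tune to taste

    private final Sprite[] sprites;
    private final int from, to;

    public AnimatePass(Sprite[] sprites, int from, int to) {
        this.sprites = sprites;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                sprites[i].animate(); // hypothetical per-sprite update
            }
        } else {
            int mid = (from + to) >>> 1;
            invokeAll(new AnimatePass(sprites, from, mid),
                      new AnimatePass(sprites, mid, to));
        }
    }

    // Per frame: invoke() submits the task and waits for the whole tree to finish.
    public static void runFrame(ForkJoinPool pool, Sprite[] sprites) {
        pool.invoke(new AnimatePass(sprites, 0, sprites.length));
    }
}
[/code]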

I’m quite glad Doug Lea et al. put out most of the java.util.concurrent code as public domain:
http://g.oswego.edu/dl/concurrency-interest/

Been stuck in Java 6 land w/ Android, but thinking about integrating the JSR-166 code eventually into TyphonRT.

There are various smaller optimisations in the sprite engine to try and help a bit. For example, I can mark a sprite as “frozen”, which means that after calculating all its rendering data once, I stash it in a ByteBuffer and simply copy it in verbatim to the rendering VBO the next frame. And I can change the sorting algorithm on individual sprite layers (eg. “no sort”, handy for terrain), and also specify that an entire layer is “frozen”. Stuff like that. The real win, though, was multithreading the animation and multithreading the sorting/gathering/writing.
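In sketch form the “frozen” idea is more or less this (names made up, none of the real detail):

[code]
import java.nio.ByteBuffer;

// Illustrative only: once a sprite is frozen, stash the vertex bytes it writes
// and blit them straight back into the mapped VBO on later frames.
final class CachingSprite {
    private boolean frozen;     // set by the game when the sprite stops changing
    private ByteBuffer cached;  // vertex bytes stashed after the first frozen build

    void freeze()   { frozen = true; }
    void unfreeze() { frozen = false; cached = null; }

    void write(ByteBuffer mappedVBO) {
        if (frozen && cached != null) {
            mappedVBO.put(cached.duplicate()); // fast path: copy the stashed bytes verbatim
            return;
        }
        int start = mappedVBO.position();
        computeAndWriteVertices(mappedVBO);    // hypothetical: transforms, colours, UVs...
        if (frozen) {
            // Stash what we just wrote so next frame is a plain copy.
            ByteBuffer view = mappedVBO.duplicate();
            view.position(start).limit(mappedVBO.position());
            cached = ByteBuffer.allocateDirect(view.remaining());
            cached.put(view).flip();
        }
    }

    private void computeAndWriteVertices(ByteBuffer out) { /* ... */ }
}
[/code]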

In case anyone’s not fallen asleep yet, the gist of that code is this:

  1. Use a thread per layer to sort sprites (if necessary) and gather up a count of all the visible sprites in that layer. Wait.
  2. Now that I know how much sprite data there is, back in the main thread I ensure that my VBOs are big enough and map them.
  3. Then I chop the sprites up across a number of threads and calculate and write the sprite data out to the VBOs concurrently (here lies the biggest performance gain). So if there are 100k sprites and 4 cores, I’m writing 25k sprites out in each thread. The calculations of the sprite vertex data can be quite complex (sometimes requiring multiple matrix transforms, as sprites can be defined hierarchically). At the same time I build up the rendering command lists, one per thread. Wait.
  4. Back in the main thread I then iterate through the rendering command lists in order.

So I’m basically hopping in and out of a thread pool and waiting on all the jobs finishing each time, which is my “synchronisation”. There’s probably a bit of time spent idle but it’s nice and easy.
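A bare-bones sketch of that per-frame shape, with made-up names and using a plain ExecutorService rather than whatever the real code uses:

[code]
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Structure only: fan the sprite work out to the pool, then block until every
// job has finished before touching the results back on the main thread.
final class FramePasses {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    void writePass(int spriteCount, int threadCount) throws InterruptedException {
        List<Callable<Void>> jobs = new ArrayList<>();
        int chunk = spriteCount / threadCount;
        for (int t = 0; t < threadCount; t++) {
            final int from = t * chunk;
            final int to = (t == threadCount - 1) ? spriteCount : from + chunk;
            jobs.add(() -> {
                writeSprites(from, to); // hypothetical: transform + write into this thread's VBO slice
                return null;
            });
        }
        pool.invokeAll(jobs); // the "wait": returns only when every job has completed
        // ...back on the calling thread: issue the gathered render command lists in order.
    }

    private void writeSprites(int from, int to) { /* ... */ }
}
[/code]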

Cas :slight_smile:

sounds great! just wondering, “writing concurrently to VBOs” …

do you map the buffer in the main thread and pass the ByteBuffer to the other threads… or do you create a GL context per thread and map there?

The former. Multiple contexts with shared VBOs is just asking for bugs.
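Concretely, assuming LWJGL 2 and placeholder names and sizes, map-once-and-slice looks roughly like this (using the glMapBufferRange overload that takes an optional buffer to reuse, null here):

[code]
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.lwjgl.opengl.GL15;
import org.lwjgl.opengl.GL30;

// Illustrative only: map the VBO once on the GL thread, then hand each worker
// its own non-overlapping slice so the workers never touch GL at all.
final class VboSlicer {
    static ByteBuffer[] mapAndSlice(int vboId, int threadCount, int bytesPerThread) {
        GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vboId);
        ByteBuffer mapped = GL30.glMapBufferRange(
                GL15.GL_ARRAY_BUFFER, 0, (long) threadCount * bytesPerThread,
                GL30.GL_MAP_WRITE_BIT | GL30.GL_MAP_INVALIDATE_BUFFER_BIT, null);

        ByteBuffer[] slices = new ByteBuffer[threadCount];
        for (int t = 0; t < threadCount; t++) {
            ByteBuffer slice = mapped.duplicate().order(ByteOrder.nativeOrder()); // duplicate() can reset byte order
            slice.position(t * bytesPerThread).limit((t + 1) * bytesPerThread);
            slices[t] = slice; // worker t writes only into [position, limit)
        }
        return slices;
    }

    // After every worker has finished (the "wait"), back on the GL thread:
    static void unmap() {
        GL15.glUnmapBuffer(GL15.GL_ARRAY_BUFFER);
    }
}
[/code]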

Cas :slight_smile:

thought so :wink:

Huh, I was wondering about separating out the sprite transformation and the filling of the VBOs into separate threads using a thread pool; is this fast enough to be done every frame?

Well, what do you think?

Cas :slight_smile:

Well, it seemed to me the synchronization would be a big overhead, but if that’s what your engine does every frame then I need to re-evaluate my assumptions and play around with threads in my GL code more often!

aye. i think the [icode]synchronized[/icode] keyword is slow too …

tho’ using http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantLock.html with conditions or even a http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CyclicBarrier.html seems to be very speedy.
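For example, a minimal CyclicBarrier setup for long-lived per-frame worker threads might look like this (purely illustrative):

[code]
import java.util.concurrent.CyclicBarrier;

// Illustrative only: N long-lived workers rendezvous with the main thread twice
// per frame, instead of jobs being re-submitted to a pool every frame.
public class BarrierLoop {
    static final int WORKERS = 4;
    static final CyclicBarrier frameStart = new CyclicBarrier(WORKERS + 1); // +1 = main thread
    static final CyclicBarrier frameEnd   = new CyclicBarrier(WORKERS + 1);

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < WORKERS; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        frameStart.await();  // wait until the main thread releases the frame
                        fillSlice(id);       // hypothetical per-thread chunk of work
                        frameEnd.await();    // report completion
                    }
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            }, "worker-" + id);
            t.setDaemon(true);
            t.start();
        }

        while (true) {              // the "render loop"
            frameStart.await();     // release the workers
            frameEnd.await();       // block until all of them are done
            // ...map buffers, issue draw calls, swap, etc.
        }
    }

    static void fillSlice(int id) { /* ... */ }
}
[/code]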

My proposal :smiley:


|111111111111111|22222222222222222222|333333333333333|444444444444444|     (Divide one iteration of the loop into 4 sections)
In real time: 
Thread 1: |111111111111111|W2|P2|111111111111111|W2----------|P2|111111111111111|W2----------|P2|111111111111111|W2----------|P2|111111111111111|......
Thread 2: |S--------------------|22222222222222222222|W3|P3|S---|22222222222222222222|W3|P3|S---|22222222222222222222|W3|P3|S---|22222222222222222222|......
Thread 3: |S-----------------------------------------------|333333333333333|W4|P4|S--------|333333333333333|W4|P4|S--------|333333333333333|W4|P4|......
Thread 4: |S---------------------------------------------------------------------|444444444444444|S--------------|444444444444444|S--------------|444444444444444|......

Assume each thread has its own copy of the shared resources, such as the framebuffer, zBuffer, etc.

'W' : waiting for a thread to become idle
'P' : swap shared resources with another thread, then wake (notify) that thread
'S' : become idle
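For the 'P' swap between two adjacent stages, java.util.concurrent.Exchanger does pretty much that hand-off; a minimal sketch (the buffer type and the stage bodies are placeholders):

[code]
import java.util.concurrent.Exchanger;

// Illustrative only: two pipeline stages swap their working buffers at the end
// of each iteration. exchange() blocks until the partner arrives, which covers
// both the 'W' (wait) and the wake-up in one call.
public class SwapPipeline {
    public static void main(String[] args) {
        Exchanger<int[]> exchanger = new Exchanger<>();

        Runnable stageA = () -> {
            int[] buffer = new int[1024]; // placeholder for framebuffer/zBuffer/etc.
            try {
                while (true) {
                    // ...do stage A's section of the loop on 'buffer'...
                    buffer = exchanger.exchange(buffer); // P: hand ours over, take theirs
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        Runnable stageB = () -> {
            int[] buffer = new int[1024];
            try {
                while (true) {
                    // ...do stage B's section of the loop on 'buffer'...
                    buffer = exchanger.exchange(buffer);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        new Thread(stageA, "stage-A").start();
        new Thread(stageB, "stage-B").start();
    }
}
[/code]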