I implemented forking in my threading system, which I plan to use to optimize culling and other things.
My threading system is based on a task hierarchy where some tasks depend on other tasks and can only run after the tasks they depend on have finished. If there is no dependency between two tasks, they are allowed to run in parallel on multiple threads. In addition, each task can be split up into a number of subtasks that can execute in parallel. This is what allows the physics and graphics of WSW to be fully threaded and utilize any number of CPU cores.
Although this system works well when you know up front what work you want to do (update X physics bodies, calculate Y skeletons, etc), there are times when it is a bad fit for the problem. In the case of frustum culling objects, I use either quadtrees or octrees, and traversing an octree in one thread is pretty slow. At the same time, we have no idea how many objects we will actually find (that's why we're culling!), so how many threads we should use depends on the data in question.
A good solution is to use "forking", which means that when a thread realizes it has a lot of work to do, it splits the work up and hands some of it over to other threads. However, my threading system did not support that. The easiest solution would've been to create a task with a certain number of subtasks and have them wait for the main task to generate work for them. This is inefficient though, as it locks up the threads (they're waiting for work) when they could be working on other tasks.
Instead I added support for proper forking, where a task can dynamically generate additional subtasks of itself while it's running. Once a subtask realizes it has a lot of work left, it can fork itself to generate new subtasks and hand part of the work over to them; the worker threads automatically pick these up, execute them and finish them, freeing themselves up to work on other things again.
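The mechanics can be sketched with plain JDK classes. This is not my engine's actual API; the ExecutorService pool, the threshold and the counters below are stand-ins for illustration. The key idea is that a running task that discovers a large range of work splits it and submits the other half back to the pool instead of blocking:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class ForkSketch {
    static final int THRESHOLD = 100_000;
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicLong result = new AtomicLong();         // double sum stored as raw bits
    static final AtomicInteger pending = new AtomicInteger(1); // 1 = the root task
    static final CountDownLatch done = new CountDownLatch(1);

    // Sum Math.sin(i) over [from, to). If the range is large, fork: hand the
    // upper half to the pool and keep working on the lower half ourselves.
    static void process(int from, int to) {
        while (to - from > THRESHOLD) {
            int mid = (from + to) >>> 1;
            final int f = mid, t = to;
            pending.incrementAndGet();
            pool.execute(() -> process(f, t)); // picked up by any idle worker
            to = mid;
        }
        double sum = 0;
        for (int i = from; i < to; i++) sum += Math.sin(i);
        add(sum);
        if (pending.decrementAndGet() == 0) done.countDown();
    }

    // Atomic double addition via compare-and-set on the raw bits.
    static void add(double v) {
        long prev, next;
        do {
            prev = result.get();
            next = Double.doubleToLongBits(Double.longBitsToDouble(prev) + v);
        } while (!result.compareAndSet(prev, next));
    }

    public static void main(String[] args) throws InterruptedException {
        pool.execute(() -> process(0, 1_000_000));
        done.await();
        System.out.println("sum = " + Double.longBitsToDouble(result.get()));
        pool.shutdown();
    }
}
```

Note that nothing here ever waits for work: a thread either computes or goes back to the pool for other tasks, which is exactly what the wait-for-work approach above cannot do.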
I wrote a small test program to try it out. It has a task that forks itself 8 times; each fork sums up 5 000 000 Math.sin() calculations and then prints out the result. Here's the output of running it with a single-threaded executor:
[quote]Executing task tree.
Main task reporting. Forking 8 times.
Forking complete.
Fork 0 reporting.
1.6005898759895323
Fork 1 reporting.
1.6005898759895323
Fork 2 reporting.
1.6005898759895323
Fork 3 reporting.
1.6005898759895323
Fork 4 reporting.
1.6005898759895323
Fork 5 reporting.
1.6005898759895323
Fork 6 reporting.
1.6005898759895323
Fork 7 reporting.
1.6005898759895323
Task tree execution finished in 8.014 seconds.
[/quote]
Note how the main task forks itself 8 times; after the main task has finished, the single thread picks up the forked subtasks and processes them one by one. The power here lies in the ability to dynamically generate any number of forks.
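The JDK's fork/join framework expresses the same pattern, so a rough analogue of the test program (not the engine's actual code; my own executor plays the role of ForkJoinPool there) might look like this:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(8);
        long start = System.nanoTime();
        pool.invoke(new RecursiveAction() {
            @Override protected void compute() {
                System.out.println("Main task reporting. Forking 8 times.");
                RecursiveAction[] forks = new RecursiveAction[8];
                for (int f = 0; f < 8; f++) {
                    final int id = f;
                    forks[f] = new RecursiveAction() {
                        @Override protected void compute() {
                            System.out.println("Fork " + id + " reporting.");
                            double sum = 0;
                            for (int i = 0; i < 5_000_000; i++) sum += Math.sin(i);
                            System.out.println(sum);
                        }
                    };
                    forks[f].fork(); // starts running on an idle worker immediately
                }
                System.out.println("Forking complete.");
                for (RecursiveAction fork : forks) fork.join();
            }
        });
        System.out.printf("Task tree execution finished in %.3f seconds.%n",
                (System.nanoTime() - start) / 1e9);
    }
}
```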
Here's the same program running with a multithreaded executor with 8 threads (I have a Hyper-Threaded quad-core CPU, so 8 logical cores):
[quote]Executing task tree.
Main task reporting. Forking 8 times.
Fork 0 reporting.
Fork 2 reporting.
Fork 1 reporting.
Fork 6 reporting.
Forking complete.
Fork 5 reporting.
Fork 4 reporting.
Fork 3 reporting.
Fork 7 reporting.
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
Task tree execution finished in 1.565 seconds.
[/quote]
There are some interesting points here. Note that the forks start running as soon as they're created; 4 forks even started executing before all forks had been issued. The forks then all execute in parallel on multiple threads, in contrast to the serial execution of the single-threaded executor. Since they're calculated in parallel on multiple cores, the task tree finishes in roughly a fifth of the time (1.565 seconds instead of 8.014).
I plan to use this for frustum culling. Currently the frustum culling is threaded per "view", meaning each camera and shadow map is culled in parallel, but this is a very uneven workload. The camera view is usually far more expensive to cull than the others, leading to one subtask taking much longer to finish than the rest. With this forking system, big views can be split up into multiple smaller jobs, while small views that are cheap to cull can avoid the overhead of forking and synchronization.
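As a sketch of how that could look, assuming a hypothetical tree and a stand-in "frustum test" (a simple id predicate here; the real engine tests bounding volumes against frustum planes), a culling task keeps a work stack of nodes and forks off half of the stack whenever it grows past a threshold:

```java
import java.util.ArrayDeque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CullSketch {
    static class Node {
        final int id; final Node[] children;
        Node(int id, Node... children) { this.id = id; this.children = children; }
    }

    static final int FORK_THRESHOLD = 8; // tune: only fork when enough work is pending
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicInteger pending = new AtomicInteger();
    static final AtomicInteger visible = new AtomicInteger();

    // Depth-first "culling" over a work stack of nodes. When the stack grows
    // past the threshold, half of it is handed to another worker as a fork.
    static void cull(ArrayDeque<Node> stack) {
        while (!stack.isEmpty()) {
            if (stack.size() > FORK_THRESHOLD) {
                ArrayDeque<Node> half = new ArrayDeque<>();
                for (int i = stack.size() / 2; i > 0; i--) half.push(stack.pollLast());
                pending.incrementAndGet();
                pool.execute(() -> { cull(half); pending.decrementAndGet(); });
            }
            Node n = stack.pop();
            if (n.id % 7 == 0) continue; // stand-in frustum test: cull this subtree
            visible.incrementAndGet();
            for (Node c : n.children) stack.push(c);
        }
    }

    // Complete binary tree with preorder ids, standing in for an octree.
    static Node build(int depth, int[] counter) {
        int id = counter[0]++;
        if (depth == 0) return new Node(id);
        return new Node(id, build(depth - 1, counter), build(depth - 1, counter));
    }

    public static void main(String[] args) throws InterruptedException {
        Node root = build(12, new int[]{1}); // 8191 nodes
        ArrayDeque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        pending.incrementAndGet();
        cull(stack);
        pending.decrementAndGet();
        while (pending.get() > 0) Thread.sleep(1); // wait for all forks
        System.out.println("visible nodes: " + visible.get());
        pool.shutdown();
    }
}
```

A cheap view whose stack never exceeds the threshold runs entirely on one thread with zero synchronization beyond the final counter, while an expensive view automatically spreads across the pool.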