I implemented forking in my threading system, which I plan to use to optimize culling and other things.
My threading system is based on a task hierarchy where some tasks depend on other tasks and can only run after the tasks they depend on have finished. If there is no dependency between two tasks, they are allowed to run in parallel on multiple threads. In addition, each task can be split up into a number of subtasks that can execute in parallel. This is what allows the physics and graphics of WSW to be fully threaded and utilize any number of CPU cores.
Although this system works well when you know up front what work you want to do (update X physics bodies, calculate Y skeletons, etc), there are times when it is a bad fit for the problem. In the case of frustum culling objects, I use either quadtrees or octrees, and traversing an octree in one thread is pretty slow. At the same time, we have no idea how many objects we will actually find (that's why we're culling!), so how many threads we should use depends on the data in question.
A good solution is to use "forking", which means that when a thread realizes it has a lot of work to do, it splits the work up and hands some of it over to other threads. However, my threading system did not support that. The easiest solution would've been to create a task with a certain number of subtasks and have them wait for the main task to generate work for them. This is inefficient though, as it locks up the threads (they're waiting for work) when they could be working on other tasks.
Instead I added support for proper forking, where a task can dynamically generate additional subtasks of itself while it's running. Once a subtask realizes it has a lot of work left, it can fork itself to generate new subtasks and hand part of the work over to them; the worker threads automatically pick these up, execute them and finish them, freeing themselves up to work on other things again.
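The mechanics can be sketched with plain JDK classes. This is not my engine's actual API; the ExecutorService pool, the threshold and the counters below are stand-ins for illustration. The key idea is that a running task that discovers a large range of work splits it and submits the other half back to the pool instead of blocking:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class ForkSketch {
    static final int THRESHOLD = 100_000;
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicLong result = new AtomicLong();         // double sum stored as raw bits
    static final AtomicInteger pending = new AtomicInteger(1); // 1 = the root task
    static final CountDownLatch done = new CountDownLatch(1);

    // Sum Math.sin(i) over [from, to). If the range is large, fork: hand the
    // upper half to the pool and keep working on the lower half ourselves.
    static void process(int from, int to) {
        while (to - from > THRESHOLD) {
            int mid = (from + to) >>> 1;
            final int f = mid, t = to;
            pending.incrementAndGet();
            pool.execute(() -> process(f, t)); // picked up by any idle worker
            to = mid;
        }
        double sum = 0;
        for (int i = from; i < to; i++) sum += Math.sin(i);
        add(sum);
        if (pending.decrementAndGet() == 0) done.countDown();
    }

    // Atomic double addition via compare-and-set on the raw bits.
    static void add(double v) {
        long prev, next;
        do {
            prev = result.get();
            next = Double.doubleToLongBits(Double.longBitsToDouble(prev) + v);
        } while (!result.compareAndSet(prev, next));
    }

    public static void main(String[] args) throws InterruptedException {
        pool.execute(() -> process(0, 1_000_000));
        done.await();
        System.out.println("sum = " + Double.longBitsToDouble(result.get()));
        pool.shutdown();
    }
}
```

Note that nothing here ever waits for work: a thread either computes or goes back to the pool for other tasks, which is exactly what the wait-for-work approach above cannot do.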
I wrote a small test program to try it out. It has a task that forks itself 8 times; each fork sums up 5 000 000 Math.sin() calculations and then prints out the result. Here's the output of running it with a single-threaded executor:
[quote]Executing task tree.
Main task reporting. Forking 8 times.
Forking complete.
Fork 0 reporting.
1.6005898759895323
Fork 1 reporting.
1.6005898759895323
Fork 2 reporting.
1.6005898759895323
Fork 3 reporting.
1.6005898759895323
Fork 4 reporting.
1.6005898759895323
Fork 5 reporting.
1.6005898759895323
Fork 6 reporting.
1.6005898759895323
Fork 7 reporting.
1.6005898759895323
Task tree execution finished in 8.014 seconds.
[/quote]
Note how the main task forks itself 8 times; after the main task has finished, the single thread picks up the forked subtasks and processes them one by one. The power here lies in the ability to dynamically generate any number of forks.
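The JDK's fork/join framework expresses the same pattern, so a rough analogue of the test program (not the engine's actual code; my own executor plays the role of ForkJoinPool there) might look like this:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(8);
        long start = System.nanoTime();
        pool.invoke(new RecursiveAction() {
            @Override protected void compute() {
                System.out.println("Main task reporting. Forking 8 times.");
                RecursiveAction[] forks = new RecursiveAction[8];
                for (int f = 0; f < 8; f++) {
                    final int id = f;
                    forks[f] = new RecursiveAction() {
                        @Override protected void compute() {
                            System.out.println("Fork " + id + " reporting.");
                            double sum = 0;
                            for (int i = 0; i < 5_000_000; i++) sum += Math.sin(i);
                            System.out.println(sum);
                        }
                    };
                    forks[f].fork(); // starts running on an idle worker immediately
                }
                System.out.println("Forking complete.");
                for (RecursiveAction fork : forks) fork.join();
            }
        });
        System.out.printf("Task tree execution finished in %.3f seconds.%n",
                (System.nanoTime() - start) / 1e9);
    }
}
```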
Here's the same program running with a multithreaded executor with 8 threads (I have a Hyper-Threaded quad-core CPU, so 8 logical cores):
[quote]Executing task tree.
Main task reporting. Forking 8 times.
Fork 0 reporting.
Fork 2 reporting.
Fork 1 reporting.
Fork 6 reporting.
Forking complete.
Fork 5 reporting.
Fork 4 reporting.
Fork 3 reporting.
Fork 7 reporting.
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
1.6005898759895323
Task tree execution finished in 1.565 seconds.
[/quote]
There are some interesting points here. Note that the forks start running as soon as they're created; 4 forks even started executing before all forks had been issued. The forks then all execute in parallel on multiple threads, in contrast to the serial execution of the single-threaded executor. Since they're calculated in parallel on multiple cores, the task tree finishes in roughly a fifth of the time (1.565 seconds instead of 8.014).
I plan to use this for frustum culling. Currently the frustum culling is threaded per "view", meaning each camera and shadow map is culled in parallel, but this is a very uneven workload. The camera view is usually far more expensive to cull than the others, leading to one subtask taking much longer to finish than the rest. With this forking system, big views can be split up into multiple smaller jobs, while small views that are cheap to cull can avoid the overhead of forking and synchronization.
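As a sketch of how that could look, assuming a hypothetical tree and a stand-in "frustum test" (a simple id predicate here; the real engine tests bounding volumes against frustum planes), a culling task keeps a work stack of nodes and forks off half of the stack whenever it grows past a threshold:

```java
import java.util.ArrayDeque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CullSketch {
    static class Node {
        final int id; final Node[] children;
        Node(int id, Node... children) { this.id = id; this.children = children; }
    }

    static final int FORK_THRESHOLD = 8; // tune: only fork when enough work is pending
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicInteger pending = new AtomicInteger();
    static final AtomicInteger visible = new AtomicInteger();

    // Depth-first "culling" over a work stack of nodes. When the stack grows
    // past the threshold, half of it is handed to another worker as a fork.
    static void cull(ArrayDeque<Node> stack) {
        while (!stack.isEmpty()) {
            if (stack.size() > FORK_THRESHOLD) {
                ArrayDeque<Node> half = new ArrayDeque<>();
                for (int i = stack.size() / 2; i > 0; i--) half.push(stack.pollLast());
                pending.incrementAndGet();
                pool.execute(() -> { cull(half); pending.decrementAndGet(); });
            }
            Node n = stack.pop();
            if (n.id % 7 == 0) continue; // stand-in frustum test: cull this subtree
            visible.incrementAndGet();
            for (Node c : n.children) stack.push(c);
        }
    }

    // Complete binary tree with preorder ids, standing in for an octree.
    static Node build(int depth, int[] counter) {
        int id = counter[0]++;
        if (depth == 0) return new Node(id);
        return new Node(id, build(depth - 1, counter), build(depth - 1, counter));
    }

    public static void main(String[] args) throws InterruptedException {
        Node root = build(12, new int[]{1}); // 8191 nodes
        ArrayDeque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        pending.incrementAndGet();
        cull(stack);
        pending.decrementAndGet();
        while (pending.get() > 0) Thread.sleep(1); // wait for all forks
        System.out.println("visible nodes: " + visible.get());
        pool.shutdown();
    }
}
```

A cheap view whose stack never exceeds the threshold runs entirely on one thread with zero synchronization beyond the final counter, while an expensive view automatically spreads across the pool.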