Fully utilising multi-core / multi-CPU machines?

Some people have tested the real-time ray tracer I am making on computers with different numbers of CPUs and cores per CPU, and it is apparent that the more CPUs/cores a machine has, the more cycles are wasted (on an 8-core machine, an average of only 50% utilisation).

I am quite inexperienced with multi-CPU/core programming and am wondering whether there is something fundamental I am doing incorrectly which would account for this waste, and whether there are steps I can take to reduce it?

At a high level the ray tracer logic flows in the following steps (a rough sketch of this loop follows the list):

  1. obtain the number of CPUs/cores

  2. create a render thread for each core

  3. nominate one thread to be the master thread

  4. split the screen into tiles and give each thread an equal portion of the tiles

  5. all render threads apart from the master thread wait by calling await() on the “state” CyclicBarrier

  6. the master thread updates the world state and calls await() on the “state” CyclicBarrier, waking all threads.

  7. all threads render their tiles and then wait by calling await() on the “render” CyclicBarrier.

  8. once all threads have called await() on the “render” CyclicBarrier, all threads wake and the non-master threads go to step 5.

  9. the master thread displays the rendered image and then goes to step 6.
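For illustration, here is a rough sketch of that two-barrier loop. updateWorld(), renderMyTiles() and displayImage() are placeholder names for my actual ray tracer code; only the synchronisation shape is meant to match the steps above.

import java.util.concurrent.CyclicBarrier;

// Sketch of the two-barrier frame loop described in steps 1-9.
class BarrierLoopSketch {
    static void start(final int cores) {
        final CyclicBarrier stateBarrier  = new CyclicBarrier(cores);
        final CyclicBarrier renderBarrier = new CyclicBarrier(cores);

        for (int i = 0; i < cores; i++) {                 // steps 1-2
            final boolean master = (i == 0);              // step 3
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            if (master) {
                                updateWorld();            // step 6
                            }
                            stateBarrier.await();         // steps 5-6
                            renderMyTiles();              // step 7
                            renderBarrier.await();        // steps 7-8
                            if (master) {
                                displayImage();           // step 9
                            }
                        }
                    } catch (Exception e) {
                        // interrupted or barrier broken: let the thread die
                    }
                }
            }).start();
        }
    }

    static void updateWorld()   { /* placeholder */ }
    static void renderMyTiles() { /* placeholder */ }
    static void displayImage()  { /* placeholder */ }
}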

An applet of a recent version of the raytracer can be found here:

http://javaunlimited.net/hosted/moogie/testApplet.html

The above version has some banding on the specular reflection, which is fixed in a later version. To see the FPS, an executable JAR version can be found here:

http://javaunlimited.net/hosted/moogie/jrtrt_specular_fixed.jar

Multi-threading and Hyper-Threading shouldn’t be confused: the latter is a trademark for Intel’s particular implementation, though it is fair to describe Intel’s HT as a form of multi-threading.
But in fact, software is a kind of logic system that shouldn’t be confused with hardware logic systems like a CPU, a MIPS core, a GPU, etc.
While we have the opportunity to run multiple threads simultaneously, that has no direct impact on how the processor receives the data it computes; an HT CPU and a Celeron (multiple vs. single pipelines, say) have an equal chance of producing the results at an equal clock frequency. Of course, multi-threaded software can run on either multi-core or single-core (or single-pipeline) hardware without changing the source code, but only with some specific compiler support would we observe a significant (“full”) difference between the two types of processor.
By the way, it might be a good idea to have a list of all the specific compiler optimisations Java or Sun offer their developers. :-[
What you’re explaining is a bit tricky and may only be understood if one can figure out how a CPU actually works… :smiley:

When I say multi-core/CPU I mean it in a general sense, i.e. a CPU which is able to perform multiple “threads” of execution simultaneously.

I have been discussing my particular problem with a work colleague, and it may be beneficial to change the paradigm from each thread owning a set of tiles to render to having a shared stack of tiles, with each thread popping a tile from the stack whenever it has finished rendering one. This should reduce the idle time experienced in the current implementation. However, it remains to be seen whether the added cost of using a synchronised method to pop from the stack will be less than the cost of the idle cycles.
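As a point of reference, here is a minimal sketch of that idea using a lock-free queue from java.util.concurrent instead of a synchronised pop. Tile and TilePool are made-up names; the real tile descriptor would come from the tracer itself.

import java.util.concurrent.ConcurrentLinkedQueue;

// "Tile" is just a placeholder for whatever screen-region descriptor the tracer uses.
class Tile { int x, y, width, height; }

// Shared tile pool: idle threads poll the next tile, so the load balances itself.
class TilePool {
    private final ConcurrentLinkedQueue<Tile> tiles = new ConcurrentLinkedQueue<Tile>();

    void add(Tile t) { tiles.add(t); }

    // returns null once this frame's tiles are exhausted
    Tile next() { return tiles.poll(); }
}

// In each render thread's loop:
//   Tile t;
//   while ((t = pool.next()) != null) {
//       render(t);
//   }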

The more processors you have the less likely you’ll max them all out. That’s inevitable.
You might want to try having more threads than processors, say two or three times as many, and see what the result is. Generally, having more threads will increase processor utilisation and benefit performance until the overhead from synchronisation and thread creation counterbalances that.
Experiment.

D.

Having a single point of tile-distribution can be (but in this case certainly won’t be) a bottleneck.

In this case the processing of a single tile is most likely so much work that the sync time will be roughly 0.00001% of the CPU time.

In extreme cases (like each ray is a ‘runnable’ and popped off a stack) you might want the thread to pop a bunch of tasks, with a local runnable-queue for each thread.

When some thread runs out of runnables, and another seems to be grinding to a halt on its local queue, the idle thread can go hijacking runnables from other threads’ local queues. You need syncing here too, but when you have a thread pool of N, you have N locks (pick a random thread and hijack some of its tasks), thus reducing the sync bottleneck by a factor of N. Building this correctly is extremely hard for 99.93% of all people.

AFAIK this is the most efficient way to distribute tasks from a single source.
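Purely for illustration, a very rough sketch of that shape is below. Task, StealingWorker and the termination check are all made up and heavily simplified; a real work-stealing scheduler is far more careful about contention, memory visibility and completion detection.

import java.util.Random;
import java.util.concurrent.LinkedBlockingDeque;

// Each worker owns a deque: it takes work from one end, and workers that
// run dry steal from the other end of a random victim's deque.
class StealingWorker implements Runnable {
    interface Task { void doWork(); }

    private final LinkedBlockingDeque<Task> local;
    private final LinkedBlockingDeque<Task>[] all;   // every worker's deque
    private final Random random = new Random();

    StealingWorker(LinkedBlockingDeque<Task> local, LinkedBlockingDeque<Task>[] all) {
        this.local = local;
        this.all = all;
    }

    public void run() {
        while (true) {
            Task t = local.pollLast();                           // prefer our own work
            if (t == null) {
                t = all[random.nextInt(all.length)].pollFirst(); // hijack from a victim
            }
            if (t == null) {
                break;   // grossly simplified: assume everything is finished
            }
            t.doWork();
        }
    }
}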

Anyway, syncing can’t be your bottleneck; the tasks are simply too big. You’ll probably notice CPU usage goes from 50% to 99.9%. With the design described above, you might get 99.99%.

Also… never block on a barrier with all your worker threads; just start on the next frame. Obviously, for this to work, you need to make a copy of your world for each frame.
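One possible shape for that, assuming two pre-allocated World instances that are swapped rather than deep-copied each frame (World and the method names are placeholders, and you still need a short per-frame handover for the swap):

// Double-buffered world state: the update logic writes into one instance while
// the render threads read the other; the references are swapped once per frame.
class World { /* scene data */ }

class WorldBuffers {
    private World front = new World();   // read by the render threads this frame
    private World back  = new World();   // written by the update logic for the next frame

    World renderView() { return front; }
    World updateView() { return back; }

    // called once per frame, after both rendering and updating have finished
    void swap() {
        World tmp = front;
        front = back;
        back  = tmp;
    }
}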

P.S.
The numbers in this post are completely fictional, yet pretty good guesstimates.

That is my thought also (sync time is negligible compared to task processing time), so all that remains is to implement the stack-based approach :slight_smile:

I do not believe that I will need to explicitly balance the load amongst the worker threads, as I would have thought that by the nature of the stack approach the balance is handled implicitly, i.e. once a thread has finished a task it will pop the next task from the stack, which means some threads will be able to process several tasks while another thread is processing a much larger one.

As you have suggested, I have thought about “double” buffering the world state so that each thread can render while the world is being updated; however, the world will eventually consist of potentially millions of objects, and I would imagine there would be a significant cost in both memory requirements and the time to copy the world state each frame. I have not actually attempted it, so I cannot be certain about the net loss.

NB. the stack based approach is basically done for you in the 1.5 concurrency package.

e.g.

 
ExecutorService executorService = Executors.newFixedThreadPool(1000);
executorService.invokeAll(myCollectionOfTasks);

D.

I definitely will look into the ExecutorService… I am a little apprehensive about any object creation overhead.

Hmm… the more I look into using ExecutorService, the more I think that to use it I will need to make significant changes… Currently a “task” needs to know which thread is processing it so it can use the variables defined in the thread’s scope. With an executor service it looks like there is no way to achieve this without massive replication of data and the potential performance cost involved.

Thread.currentThread() gives you a handle to the current thread. Not sure why you would want that, though.

That might work… depending on whether I can pass the Executor a class derived from Thread to use.
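For what it’s worth, the executor factory methods accept a ThreadFactory, so something along these lines should be possible. RenderThread is a placeholder for the Thread subclass that would hold the per-thread data.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Placeholder Thread subclass; it just passes the Runnable through to super(r).
class RenderThread extends Thread {
    RenderThread(Runnable r) { super(r); }
}

class PoolSetup {
    // The fixed-thread-pool factory accepts a ThreadFactory, so the pool's
    // worker threads can be instances of your own Thread subclass.
    static ExecutorService create(int threads) {
        return Executors.newFixedThreadPool(threads, new ThreadFactory() {
            public Thread newThread(Runnable r) {
                return new RenderThread(r);
            }
        });
    }
}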

To help understand why I need to know which thread is currently processing a task, I have set up a hypothetical scenario:

I have three threads:

T1,T2,T3

each thread has a pool of objects. This pool is represented by a linked list.

in the main logic loop of each thread:

A thread grabs a task from the task queue.
The same thread calls a “dowork” method on the task.
The task requests an object from the thread’s pool and uses it internally.

Thus, if I do not know which thread is currently processing the task, I will have to either have one super-large pool with synchronised access in a scope common to the tasks, or have a pool of objects for each task. The former introduces performance overhead and the latter increases memory usage profoundly.

Perhaps my current implementation is not particularly efficient? Can someone else offer a better solution? Basically I have shifted all object creation to the initialisation phase and simply use these pre-allocated objects during the main runtime of the program. It has given a massive speed increase!

I doubt you can gain much performance from here. If all CPUs are running >99% on your tasks, is it worth the effort to squeeze out the last percent?

It’s often better to work on the bottleneck, like… optimizing the tasks themselves.

At the moment on multi-core CPUs I am not achieving 99% utilisation. I believe this is due to the current implementation, with each thread being assigned a static list of tasks and all threads waiting until all tasks are finished. I will attempt a stack-based implementation to see whether I can achieve higher utilisation.

I agree; once I achieve 90%+ CPU utilisation I need to heavily optimise the hot spots.

[quote]each thread has a pool of objects. This pool is represented by a linked list.

in the main logic loop of each thread:

A thread grabs a task from the task queue.
The same thread calls a “dowork” method on the task.
The task requests an object from the thread’s pool and uses it internally.
[/quote]
I’m envisaging a different algorithm.

For each task in the queue, create a class that implements Callable, takes a task and calls dowork in its call method.
Create an ExecutorService with the number of threads you want (e.g. Executors.newFixedThreadPool() or newCachedThreadPool())
Submit all the Callables to the service via invokeAll

Done!

The executor service handles the rest for you, monitoring the queue of tasks, handing them out to idle threads, and retasking threads when they have finished.
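Roughly, assuming a made-up Task interface with a dowork() method standing in for your existing tasks, the whole thing could look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of wrapping existing tasks in Callables and letting an
// ExecutorService hand them out to idle threads.
class FrameRenderer {
    interface Task { void dowork(); }   // placeholder for the tracer's own task type

    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    void renderFrame(List<Task> tasks) throws InterruptedException {
        List<Callable<Void>> jobs = new ArrayList<Callable<Void>>(tasks.size());
        for (final Task t : tasks) {
            jobs.add(new Callable<Void>() {
                public Void call() {
                    t.dowork();
                    return null;
                }
            });
        }
        // invokeAll blocks until every tile has been rendered for this frame
        pool.invokeAll(jobs);
    }
}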

thanks, D.

I think you want ThreadLocal instead - have a look at the javadocs to see what it does.
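For example, a per-thread object pool could look roughly like this; ObjectPool is a made-up placeholder for whatever the tasks actually borrow.

// Placeholder pool; in the real tracer this would hand out pre-allocated
// vectors, intersection records, etc.
class ObjectPool { }

class Pools {
    // Each thread that calls POOL.get() lazily gets its own ObjectPool,
    // so tasks can borrow objects without synchronisation regardless of
    // which worker thread happens to run them.
    static final ThreadLocal<ObjectPool> POOL = new ThreadLocal<ObjectPool>() {
        @Override protected ObjectPool initialValue() {
            return new ObjectPool();
        }
    };
}

// Inside a task's dowork():
//   ObjectPool pool = Pools.POOL.get();   // always the current thread's pool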

Thanks for the heads up :slight_smile:

I will look into both suggestions.

This project will be going on hiatus for two weeks… it is my wedding this weekend and then I am going to Singapore and Hong Kong for a honeymoon :slight_smile:

When I get back I will see which suggestion best suits my needs.

Cheers

That’s a pretty good excuse. Congratulations! :slight_smile:

I agree that java.util.concurrent is one of the best ways to go for single applications benefiting from multi-core CPUs. I would strongly recommend the book “Java Concurrency in Practice” by Brian Goetz. As the title suggests, it is not just a bunch of hard-to-understand theory. It will make things a lot easier than just reading the Javadocs, since this type of application can fail in ways that are very difficult to understand and reproduce unless a disciplined approach is followed.

The garbage collector needs to stop all the threads across CPUs/cores to do its work. Even if you use a parallel collector, it needs to stop everything for some tasks. Generating and discarding a lot of objects is never a good idea, but you probably have to be even more careful as the number of parallel threads running on separate cores increases.

It would seem that, because the JVM has a stop-the-world point in garbage collection, it is pretty important to ensure that the GC doesn’t run too often when you are running on multiple cores.

Just a thought. :slight_smile:

First time I hear this. Any pointers for reference?

That only happens in special circumstances (e.g. if the VM runs out of memory). You can track this easily if you debug the GC: every time you read “full collection”, it was a stop-the-world collection… You should always tune your real-time app for minor collections; they run concurrently.