What is this OpenCL sorcery?

Hah! Okay, so I’ve finally got a decent handle on writing multithreaded game loops. I actually just finished a processor class similar to the concurrent package’s ExecutorService. In my case, I can see a performance boost on tasks that take at least 1 millisecond to complete on a single core, but I digress. I consider this relevant because graphics cards tend to have thousands of cores compared to my ‘pitiful’ oct core CPU. Do they call them oct cores? Meh…

What I’m confused about is how hardware that was created for the sole purpose of accelerating graphics can accelerate general CPU work. I’ve seen, for instance, that OpenCL can be used to process elements in an array as long as the processing order is arbitrary. It’s like a for loop with i as your index, but you don’t really know what the value of i is. You just know that i will hit every index at some point, and you’re free to do your processing utilizing this black box sorcery that we call OpenCL.
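
To make that “for loop where you don’t pick i” concrete, here’s a rough sketch of how I picture it. The kernel below is plain OpenCL C stuffed into a Java string (the class and kernel names are just made up for the example), and you’d still need a binding to actually compile and run it on a device:

[code]
public class KernelSketch {
    // The "loop body" lives in the kernel. get_global_id(0) is the mysterious "i":
    // the OpenCL runtime decides which work-item gets which index, and in what order.
    static final String KERNEL_SOURCE =
        "__kernel void scale(__global float* data, const float factor) {\n" +
        "    int i = get_global_id(0);\n" +
        "    data[i] = data[i] * factor;\n" +
        "}";

    public static void main(String[] args) {
        // The equivalent plain Java loop, where we *do* control i and its order.
        float[] data = new float[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i] * 2.0f;
        }
        System.out.println("Kernel that would do the same thing on the GPU:\n" + KERNEL_SOURCE);
    }
}
[/code]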

Now, I can’t argue with results! I think it would be interesting if Java were to include an OpenCL binding in its distributions for zealous performance freaks like myself. However! I always thought that the GPU was made for a more particular task (rendering), and that’s why it has always been faster at rendering, and only rendering. That’s the only thing that perplexes me.

Edit: I have seen a couple of OpenCL bindings. Even LWJGL has one included, which I find interesting. It’s not ideal seeing as it requires JNI, but it’s still a perk. I needed to make this relevant to Java somehow, so that’s why I mentioned this xD

GPUs have had a unified shader architecture for a long time now (since the DX10 generation, i.e. the GeForce 8000 series), which means that there are only a few special-purpose calculation units left on these chips. And it was with DX10 cards that this architecture first became accessible to the public.

That GPUs are better equipped than CPUs for graphics calculations comes from a fundamentally different architectural concept. It’s not like the old days, when you had specialized hardware for each single purpose.

GPUs don’t have thousands of cores, but a few hundred, each of which can handle a lot of lightweight threads (this is where the thousands come into play). Another difference is the fast memory access of a GPU. The gigabytes of VRAM are several times faster for the GPU to access than RAM is for the CPU. Combine this with multiple levels of intelligent caches and you have a beast of a machine which can crunch through gigabytes of data extremely fast.

PS: there are OpenCL bindings for Java, just as there are for OpenGL. There is also a quite handy lib from AMD called Aparapi, which can convert normal Java bytecode to OpenCL kernels on the fly, with a fallback to a normal fork-join pool.
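
For the curious, a minimal sketch of what an Aparapi kernel looks like, written from memory, so take the details with a grain of salt (AMD’s releases used the com.amd.aparapi package; newer community builds use com.aparapi):

[code]
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class SquareExample {
    public static void main(String[] args) {
        final float[] input = new float[4096];
        final float[] output = new float[4096];
        for (int i = 0; i < input.length; i++) input[i] = i;

        // Aparapi translates the bytecode of run() into an OpenCL kernel at runtime;
        // if no usable OpenCL device is found, it falls back to a Java thread pool.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int i = getGlobalId();            // the index the runtime hands us
                output[i] = input[i] * input[i];
            }
        };
        kernel.execute(Range.create(input.length));
        kernel.dispose();

        System.out.println(output[64]);           // 4096.0
    }
}
[/code]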

GPUs are good at math.
CPUs are good at branching.

Ah! I have heard of Aparapi. I would still like to see OpenCL code run on the JVM, though. Then maybe the “Java is slow” myth would end? Hah! Just kidding…

GPUs are becoming faster in a way that consumer CPUs can’t, at least until we get proper threading in our day-to-day programs. CPU performance isn’t improving the way it did 10 years ago. Back then we were still following Moore’s “law”, where the number of transistors per area unit (and usually also performance) doubled every two years, leading to an exponential increase in performance. Then we hit somewhere around 2.5-3.0 GHz and the performance increase pretty much stopped. Sure, we’re still getting small bumps in clock speed and better architectures with sophisticated branch prediction etc., but all in all we’re not seeing the same rate of increase nowadays. Instead, we’ve switched to having more cores. The reason lies in heat: double the clock rate and increase the voltage to keep the CPU stable, and you get between 4x and 8x as much heat. That’s where multicore solutions come in.
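
If anyone wonders where that 4x-8x figure comes from, here’s the rough back-of-the-envelope using the standard dynamic power approximation (a simplification, not an exact model):

[code]
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f
% alpha = activity factor, C = switched capacitance
% double f, keep V:            P scales by 2 (if the chip even stays stable)
% double f, raise V by ~40%:   P scales by 2 \cdot 1.4^{2} \approx 4
% double f, scale V with f:    P scales by 2 \cdot 2^{2} = 8
[/code]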

With this, it’s obvious that instead of having one fast core, it’d be much more efficient to have 4-8 cores running concurrently at half the clock speed. However, most programs and programming languages are only able to utilize a single core unless you manually split up the work between them. Games and other consumer programs have traditionally been pretty bad at utilizing more than one or two cores with good scaling, so the CPU makers have been forced to try to cram out as much performance as possible from each core, despite the fact that it’s inefficient to do so.
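
Just to illustrate what “manually split up the work” means in Java terms, here’s a quick sketch (nothing fancy, and the class name and per-element math are made up): chunk an array over a fixed thread pool.

[code]
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedWork {
    public static void main(String[] args) throws InterruptedException {
        final float[] data = new float[1_000_000];
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        int chunk = (data.length + cores - 1) / cores;
        for (int c = 0; c < cores; c++) {
            final int start = c * chunk;
            final int end = Math.min(start + chunk, data.length);
            pool.submit(() -> {
                // Each task owns its own slice, so no synchronization is needed.
                for (int i = start; i < end; i++) {
                    data[i] = data[i] * 0.5f + 1.0f;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
[/code]

Each task gets its own contiguous slice, so there’s nothing to synchronize, which is basically what a GPU runtime does for you automatically, just with far more (and far lighter) slices.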

The first “graphics cards” were simply specialized additional single-core CPUs you could plug into your motherboard, which the (main) CPU could offload 2D graphics tasks to. They weren’t even that fast, usually using less power than the main CPU, but they could give a solid performance boost since the CPU was free to do other things. As graphics got more advanced and started to venture into proper 3D, the manufacturers realized that rasterizing and shading were extremely easy things to parallelize: two pixels can be independently calculated on two different cores. (This was before graphics cards also handled transforming vertices.) Then the rendered resolution started to increase to the point where the number of pixels was so large that using multiple cores became more viable. 15 years ago Nvidia coined the term “GPU” when it released the GeForce 256, which featured a grand total of 4 pixel pipelines. It also featured hardware support for vertex transformations, but this hardware was slower than a decent CPU. After this, the number of pixel shaders and vertex shaders gradually increased, and for Nvidia eventually reached 24 and 8 respectively in the 7900 GTX in 2006 (a slightly faster version of the PS3’s GPU). At this point, GPUs were becoming so powerful, with lots of memory, that new rendering techniques started appearing.

The flexibility of programmable shaders led to a new lighting technique called “deferred shading”. Deferred shading splits lighting into two passes. In the first pass, the geometry pass, you store the data you’ll need for lighting (diffuse color, normals, shininess…) for each pixel into a huge buffer. In the second pass, the lighting is done by rendering the volume of each light, reading the lighting data for each pixel the volume intersects and computing the lighting. The key here is the unbalanced workload. In the first pass, the number of triangles processed was huge, but the pixel shader was essentially just a copy which depended on bandwidth, not number crunching. In the second pass, the workload flipped: now we had very few triangles, but millions of pixels to light instead. In essence, only half the GPU was working at any given time. The first pass put heavy load on the vertex shaders and the second one put heavy load on the pixel shaders. This led Nvidia and AMD to move to a unified architecture where GPUs had only one type of shader core, able to handle both vertex and pixel shading, instead of two. That allowed the GPU to load-balance between vertex and pixel processing and adapt to the uneven load. Now, I’m not saying that deferred shading was the only reason they made this move, but for games it was probably the biggest reason.

And that’s pretty much where we are today. GPUs still contain a lot of fixed functionality hardware, like rasterizers that “fill” pixels that are covered by triangles, and raster output units which handle blending and the conversion and writing of the resulting pixel color, but the thousands of cores that GPUs have are now so flexible that they can be used for almost anything that can be parallelized. People are running physics engines, ray tracing, etc on GPUs nowadays. A simple for-loop where each object is processed independently can easily be run on multiple cores. It’s what’s called an embarrassingly parallel problem, which is basically a problem that can be split up into a large number of independent tasks that can be processed in parallel. Like pixel shading.
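
And for the “embarrassingly parallel for-loop” part, the CPU-side equivalent in modern Java (8+) is about as short as it gets. A toy sketch, with made-up per-pixel work, where every “pixel” is processed independently:

[code]
import java.util.stream.IntStream;

public class ParallelPixels {
    public static void main(String[] args) {
        int width = 1920, height = 1080;
        int[] pixels = new int[width * height];

        // Every index is independent, so the runtime is free to spread the
        // iterations over however many cores it likes, in whatever order it likes.
        IntStream.range(0, pixels.length).parallel().forEach(i -> {
            int x = i % width;
            int y = i / width;
            pixels[i] = (x ^ y) & 0xFF;   // some cheap per-"pixel" work
        });
    }
}
[/code]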

Well, I hope someone found that interesting. I just enjoy writing this stuff, I guess.

Yikes! You wrote more than I did! That’s some dedication. To be honest with you, I don’t know too much about the history of hardware in general. I’m just a college student. I DID know about the slowing down of Moore’s law, though. I’ve heard that a modern transistor gate is… 20 atoms across, unless I’m wrong.

On an almost unrelated note, this Flash submission has a transistor gate in there somewhere… among other mind-blowing things…
http://htwins.net/scale2/

I couldn’t resist.

That myth is still around, but it was already very wrong even before 2004. Moreover, JogAmp has a nice OpenCL binding (JOCL) supporting both desktop and mobile environments, including Android :slight_smile:

OpenCL on Android?!

This post shows how to use OpenCL in an Android app.

Yes, but it will become more and more interesting as time goes by and mobile GPUs get more capable. The main contributor of JogAmp and the maintainer of JOCL did a very nice job ;D

Dat read. Worth a read for anyone here.

Thanks for that Gibbo, somehow I missed that post.

Agent, do you have a blog somewhere? 'Cause I for one would read the heck out of it.

[quote]Well, I hope someone found that interesting. I just enjoy writing this stuff, I guess.
[/quote]
Keep doin’ what you do. It’s good stuff.

[quote]Keep doin’ what you do. It’s good stuff.
[/quote]
Sadly I don’t really do blogs. Maybe I should? xd

Oh please do! :smiley:
Your post was really well written and interesting :smiley:

Let me note again. Do you “really” want OpenCL and/or automagic moving of general-purpose code to the GPU? For a game runtime? Most likely you don’t. If you’re attempting to push the limit, you’re going to need to juice out every GPU cycle to render your scenes. Chances are you’re going to want to do the opposite: take things that are easy to compute on the GPU and perform some of them on the CPU instead (like software occlusion culling).