OpenCL being slower than OpenGL

After cannibalizing SpriteShootoutCL and the sum example I finally have something that is starting to resemble my particle engine again. I’ve got my CL commands sorted out, and rendering is working fine. The code is obviously a lot cleaner and shorter than my earlier GPU particles test, which used textures to store particle data and then draw them. The next logical step was to use OpenCL instead of a fragment shader to update my particles, which also let me store them in a VBO and process the data with OpenCL directly. Sadly, my OpenCL knowledge is a little bit… well, terrible, so I must be doing something wrong, because I get terrible performance.

In my OpenGL particle simulation I keep the particle data in 3 textures: one RGBA 32-bit float texture for position and velocity, one RGBA8 texture for color and one RG16 texture for life. This runs at 65 FPS with 2 million particles, with several thousand particles being created every second. My OpenCL version, however, only manages 62 FPS just moving particles around on the screen, and that is without any data for color or life. Just keeping and copying that data drops performance to 54 FPS, even without any commands to create new particles!

This is the OpenCL program source:

typedef struct Particle {
    float2 pos;
    float2 vel;
    ushort4 colorLife;
} Particle;

kernel void update(const int width, const int height, const float gravity, const float airResistance, global const struct Particle* input, global struct Particle* output){
    unsigned int i = get_global_id(0);
    
    Particle p = input[i];
    
    p.pos += p.vel;
    
    if(p.pos.x < 0 && p.vel.x < 0){
        p.vel.x = -p.vel.x;
    }
   
    if(p.pos.y < 0 && p.vel.y < 0){
        p.vel.y = -p.vel.y;
    }
   
    if(p.pos.x >= width && p.vel.x > 0){
        p.vel.x = -p.vel.x;
    }
   
    if(p.pos.y >= height && p.vel.y > 0){
        p.vel.y = -p.vel.y;
    }
    
    output[i] = p;
}

The amount of data copied (24 bytes per particle, ~46 MB for 2 million particles) seems to be the bottleneck, since the actual logic can be commented away without any increase in performance:

kernel void update(const int width, const int height, const float gravity, const float airResistance, global const struct Particle* input, global struct Particle* output){
    unsigned int i = get_global_id(0);
    
    Particle p = input[i];
    output[i] = p;
}

For reference here is the update fragment shader I use in the OpenGL version:

#version 330

uniform sampler2D posVelSampler;
uniform sampler2D lifeSampler;

uniform float gravity;
uniform float airResistance;
uniform vec2 screenSize;

in vec2 texPos;

#define POSVEL 0 //RGBA 32-bit float texture
#define LIFE 1   //RG16 texture

layout(location = POSVEL) out vec4 posVelOut;
layout(location = LIFE) out vec2 lifeOut;

void main()
{
    lifeOut = texture(lifeSampler, texPos).xy;
    lifeOut.x -= 1.0/65535;
    //if(lifeOut.x == 0){
        //discard;
        //return;
    //}

    posVelOut = texture(posVelSampler, texPos);
    posVelOut.w += gravity;
    posVelOut.zw *= airResistance;
    posVelOut.xy += posVelOut.zw;
    
    if(posVelOut.x < 0 && posVelOut.z < 0){
        posVelOut.z = -posVelOut.z;
    }
    
    if(posVelOut.y < 0 && posVelOut.w < 0){
        posVelOut.w = -posVelOut.w;
    }
    
    if(posVelOut.x > screenSize.x - 1 && posVelOut.z > 0){
        posVelOut.z = -posVelOut.z;
    }
    
    if(posVelOut.y > screenSize.y - 1 && posVelOut.w > 0){
        posVelOut.w = -posVelOut.w;
    }
}

It doesn’t make any sense for OpenGL to be faster at computing than OpenCL, so I assume I am doing something wrong.

TL;DR:
Updating particles:

  • OpenGL: 65 FPS
  • OpenCL: 54 FPS
    WTF?

You are not comparing apples to apples. The fragment shader version uses the GPU’s texture sampling hardware, which goes through a cache, while the kernel version uses raw memory. The data structures are different too. A couple of things you could try:

  • Split the Particle struct into a pos/vel buffer (float4*) and a color/life buffer (ushort4*).
  • Use image objects and samplers instead of plain buffers.

But no single value is accessed twice! Why would a cache help in this case? And how in the world do I use samplers?

The cache won’t make it faster per se, but texture sampling causes the fragment shader to be scheduled and executed differently from the kernel. The memory access pattern is different too: in the fragment shader case the GPU is most likely reading and writing in batches of 4x4 fragments. The sampling hardware is designed for such a pattern and will probably do a single read/write for each such fragment block.

Check out clCreateSampler and clCreateImage in the OpenCL spec, and also the sampler_t data type. You’ll need to use the read_imageX and write_imageX functions in the kernel.

This doesn’t make any sense at all. Why should I need to treat my data as an “image” in OpenCL when I’m updating particle motion? Isn’t there some other way to make it access the data faster? There’s a problem with handling it like that too: the data is in an OpenGL VBO, and I’d prefer to keep it there since that simplifies rendering a lot.

Sorry, but I really am a n00b at this, so please bear with me! You’re the author of the sprite shootout example, right? Thanks for making that! =S

Your problem is that you access your global memory wrong. What one should do in most cases is use a Struct of Arrays (SoA) instead of an Array of Structs (AoS), as you do now.

So use this


struct Particles {
    float2 * pos;//64bit
    float2 * vel;//64bit
    ushort4 * colorLife;//64bit?    
};

kernel void update(const int width, const int height, const float gravity, const float airResistance, global const struct Particles input, global struct Particles output){
    unsigned int i = get_global_id(0);
    
    output.pos[i] = input.pos[i];
    output.vel[i] = input.vel[i];
    output.colorLife[i] = input.colorLife[i];
}


explanation following…

Beforehand, sorry if I mess up the terminology; I use CUDA at the moment and only know that the OpenCL memory spaces are named a little differently.
So global := device memory, local (CUDA “shared”) := memory per multiprocessor, shared by one processing block.
Also, I only know the best practices for NVIDIA cards, so see for yourself whether these things work for AMD cards as well.

So let’s get our hands dirty^^
Threads are arranged in little sets called warps, which always have a size of 32. There is a lot this affects in how the card works, but here I’m only interested in how warps affect global memory reads.

To make it short, there is something called coalesced reads: when you have some data (float a[]) you want to make sure that all threads in a warp read exactly in their order, so t0 reads a[0], t1 reads a[1] … t31 reads a[31]. This works for 32-bit reads and should on most hardware also work for 64-bit ones.
Also make sure to use the predefined data types float2, float3 and so on, not self-built ones.

I hope you can now understand why SoA is better than AoS. If you have any other questions, just ask.

I have 24 bytes of data per particle: position (2 floats), velocity (2 floats), color (4 bytes) and life (2 shorts). For some reason it was faster to combine color and life into a ushort4 instead of having an int and a ushort2, so that’s why they are combined in the kernel I posted.

I can’t define my kernel to take a Particles struct. It either has to be a pointer (Particles*) or not be declared global. >_> Anyway, what you mean is that I should separate the data into 4 different buffers, one per parameter, so the easiest way to get it right is to get rid of the struct entirely and keep a float2* each for position and velocity, a char4* or an int* for color and a ushort2* for life. The reason I wanted the data in the same buffer is that it’s easier and faster to draw with OpenGL later…

Sorry, but I don’t really understand why it’s better to keep structs of arrays instead of arrays of structs…

OpenGL also supports AoS and SoA, so rendering should be the same (just other strides and offsets).

When you do coalesced global memory reads/writes (which you get with SoA but not with AoS), the hardware needs to shift around a lot less memory than when your reads/writes aren’t coalesced.

Think of it like this: when you ask for a float, the hardware won’t read exactly one float from global memory; instead it always reads a whole chunk of memory that is aligned somehow. On old hardware this chunk was half the size of a warp, on newer hardware exactly the size of a warp.

Yeah, it’s not a big deal; I think it’s slightly faster to keep the data in the same buffer, but that doesn’t matter if OpenCL is faster with separate buffers.

It seems to be difficult to have a struct inside the kernel which has a variable size (at least for me >_<), so would I get the same effect by just keeping 4 different inputs and outputs?

EDIT:

kernel void update(
		const int width, const int height, const float gravity, const float airResistance, 
		global const float2* inPos, global const float2* inVel, global const uchar4* inCol, global const ushort2* inLife, 
		global float2* outPos, global float2* outVel, global uchar4* outCol, global ushort2* outLife
){
    unsigned int i = get_global_id(0);
    
    float2 pos = inPos[i];
    float2 vel = inVel[i];
    uchar4 col = inCol[i];
    ushort2 life = inLife[i];
    
    pos += vel;
    
    if(pos.x < 0 && vel.x < 0){
        vel.x = -vel.x;
    }
   
    if(pos.y < 0 && vel.y < 0){
        vel.y = -vel.y;
    }
   
    if(pos.x >= width && vel.x > 0){
        vel.x = -vel.x;
    }
   
    if(pos.y >= height && vel.y > 0){
        vel.y = -vel.y;
    }
    
    outPos[i] = pos;
    outVel[i] = vel;
    outCol[i] = col;
    outLife[i] = life;
}

62 FPS. It’s almost as fast as the OpenGL version now… >_>

Take a look at section 3.2.1 of the NVIDIA OpenCL Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/OpenCL_Best_Practices_Guide.pdf

It should be fine to use a single float4 buffer for position and velocity.

Sorry, but I still feel really stupid here. Using structs I was accessing memory in 24-byte chunks from one large buffer in global memory. Now I am accessing the same total amount of data from 4 different buffers and it’s suddenly noticeably faster? Please correct me if I’m wrong, but the problem was that loading a 24-byte struct caused the data transaction to not be aligned to 32 bytes? By separating the data into different buffers, the individual parts of the old struct could be aligned to 32 bytes, so each one resulted in a single transaction?


Keeping the position and velocity separate lets me get rid of the stride when rendering the particles with OpenGL later, but working with OpenCL has made me question everything I knew about memory and bandwidth… I dunno…

I recently found a particle system test that runs with OpenGL and CUDA. Compared to the dissected SpriteShootoutCL example I have (which just renders points without textures), the performance was identical ± 1 FPS at around 60 FPS with the same particle count. Now the really weird thing is that the performance of my OpenGL GPU particle test is almost exactly the same. That’s insanely weird!

  • SpriteShootoutCL uses a float4 per particle. That’s 16 bytes per particle.
  • I assume the CUDA implementation uses the same amount of data per particle since it does not feature individually colored particles and the particles don’t have a lifetime (they are permanent just like in SSCL).
  • My OpenGL version uses 20 bytes per particle when updating, plus another 4 bytes for color (which are not copied during updating). The updating shader therefore uses more memory bandwidth but still matches the performance of the other two. HOW?! Why is OpenGL, with the overhead of rendering each particle with a geometry shader, with more data per particle, and while keeping this data in textures, faster than OpenCL or CUDA?

This isn’t really an apples to apples comparison though, since the OpenGL version also creates new particles at a constant rate, so the OpenCL and CUDA versions actually have less work to do. Right now I’m working on creating an identical OpenCL version of my OpenGL particle simulation (with “living” particles). We’ll see how it goes…

Some more thoughts:
OpenGL is faster ‘because’ it is using textures and your OpenCL implementation is slower ‘because’ of AoS.

Remember that you can read from global memory in a coalesced (C) and in an uncoalesced (UC) way. I’ll try to explain both a little better in just a sec, but first compare them with texture fetches.

From fastest to slowest (the numbers aren’t based on facts, just assumptions):

  • coalesced global memory reads ~100%
  • texture memory reads ~80%
  • uncoalesced global memory reads ~5%

So why are texture reads faster than UC but slower than C reads? First of all, your GPU has an extra cache for texture data, so you can randomly sample your texture with only a very small performance hit. Also, in some cases the Z-order memory layout of texture memory can give you some speedups because of relative memory locality.

C reads are the fastest way to read from global memory just because you have zero overhead.

When you look at your code, think about it like this: every line gets executed simultaneously by all threads. Then look at which memory addresses you are accessing. Assuming the 24-byte AoS layout, for just your first 4 threads:
thread 1 accesses addresses 0-23
thread 2: 24-47
thread 3: 48-71
thread 4: 72-95

When your hardware has 32-byte memory banks, these threads access the following banks:
1: 0
2: 0, 1
3: 1, 2
4: 2
With this approach each thread issues only one read, but together they access 3 different memory banks, and because the accesses overlap the bank boundaries the hardware has to do 5 reads in total.

Now take an SoA approach with three 8-byte arrays:
1: 0-7, 2: 8-15, 3: 16-23, 4: 24-31 (Buffer1)
1: 0-7, 2: 8-15, 3: 16-23, 4: 24-31 (Buffer2)
1: 0-7, 2: 8-15, 3: 16-23, 4: 24-31 (Buffer3)

As you can see, we need 3 reads, but each one touches only a single memory bank, so we have a total of 3 memory reads.

As you can see, an AoS approach needs far more memory transactions, and this applies to both reads and writes.

And another thing: AoS uses one big memory access, SoA several smaller ones. This affects the thread scheduling of the GPU. Having more, smaller tasks allows the GPU to manage the threads better: when one warp is blocked waiting for memory, the GPU can throw another warp into the processing stage. If every warp is waiting for memory because the memory transactions are so big, there is nothing left for the GPU to do but wait^^

Awesome explanation! That completely cleared it up for me, though it doesn’t explain the performance I’m actually getting. :clue:

EDIT: Thanks for all the help guys!!! I just got my OpenCL particle test working, and it very very slightly outperformed the OpenGL version (as in 54 vs 55 FPS at 3 072 000 particles). It works fine, but I was hoping for a bigger speed increase by getting rid of the overhead of textures and everything, but I guess this just proves that OpenGL is insanely good at utilizing the available GPU resources. I believe I’ve crammed out all the speed I can from my dear (not) little laptop’s GPU now, so I’ll just get back to actually being productive. Thanks again for all the help with memory handling! We’ll see if I get tempted to implement something in OpenCL later. Hmmm, bone interpolation perhaps? =D

OpenGL is a specification, not something that has performance metrics. The driver you have installed is very good at utilizing GPU resources, which I don’t find too amazing; it’s one of a driver’s primary functions nowadays, if you believe the changelogs of both NVIDIA and AMD drivers.