After cannibalizing SpriteShootoutCL and the sum example, I finally have something that is starting to resemble my particle engine again. My CL commands are sorted out and rendering works fine. The code is a lot cleaner and shorter than my earlier GPU particle test, which stored particle data in textures and drew from them. The next logical step was to update the particles with OpenCL instead of a fragment shader, which also lets me keep them in a VBO and process that data directly with OpenCL. Sadly, my OpenCL knowledge is a little bit… well, terrible, so I must be doing something wrong, because I get worse performance than with OpenGL.
In my OpenGL particle simulation I keep the particle data in three textures: one RGBA 32-bit float texture for position and velocity, one RGBA8 texture for color, and one RG16 texture for life. This runs at 65 FPS with 2 million particles, with several thousand particles being created every second. My OpenCL version manages 62 FPS when it only moves particles around the screen, without any color or life data. Just storing and copying that extra data drops performance to 54 FPS, and that is without any commands to create new particles!
This is the OpenCL program source:
typedef struct Particle {
    float2 pos;
    float2 vel;
    ushort4 colorLife;
} Particle;

kernel void update(const int width, const int height,
                   const float gravity, const float airResistance,
                   global const struct Particle* input,
                   global struct Particle* output){
    unsigned int i = get_global_id(0);
    Particle p = input[i];
    p.pos += p.vel;
    if(p.pos.x < 0 && p.vel.x < 0){
        p.vel.x = -p.vel.x;
    }
    if(p.pos.y < 0 && p.vel.y < 0){
        p.vel.y = -p.vel.y;
    }
    if(p.pos.x >= width && p.vel.x > 0){
        p.vel.x = -p.vel.x;
    }
    if(p.pos.y >= height && p.vel.y > 0){
        p.vel.y = -p.vel.y;
    }
    output[i] = p;
}
The amount of data copied (24 bytes per particle, ~46 MB in total) seems to be the bottleneck, since the actual logic can be commented out without any increase in performance:
kernel void update(const int width, const int height,
                   const float gravity, const float airResistance,
                   global const struct Particle* input,
                   global struct Particle* output){
    unsigned int i = get_global_id(0);
    Particle p = input[i];
    output[i] = p;
}
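To put a number on that copy, here is a rough back-of-the-envelope sketch in plain C. It assumes the host compiler lays out the struct the same way as the kernel (24 bytes, no padding) and that the copy kernel reads and writes each particle exactly once. `HostParticle`, `cl_bytes_per_frame` and `cl_bytes_per_second` are made-up names for illustration, not part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side mirror of the kernel's Particle struct (assumed layout). */
typedef struct {
    float    pos[2];        /* float2  */
    float    vel[2];        /* float2  */
    uint16_t colorLife[4];  /* ushort4 */
} HostParticle;

/* Compile-time check of the 24 bytes per particle quoted in the post. */
static_assert(sizeof(HostParticle) == 24, "expected 24 bytes per particle");

/* Bytes the copy kernel moves per frame: one read plus one write per particle. */
double cl_bytes_per_frame(double particles) {
    return particles * (double)sizeof(HostParticle) * 2.0;
}

/* Sustained memory traffic in bytes per second at a given frame rate. */
double cl_bytes_per_second(double particles, double fps) {
    return cl_bytes_per_frame(particles) * fps;
}
```

Assuming the same 2 million particles as in the OpenGL test, `cl_bytes_per_frame(2e6)` gives 96 MB per frame (48 MB read plus 48 MB written, where 48 MB is roughly the ~46 figure above when counted in MiB), and `cl_bytes_per_second(2e6, 54.0)` gives about 5.2 GB/s. This counts only the kernel's own loads and stores, not rendering or any buffer acquire/release overhead.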
For reference here is the update fragment shader I use in the OpenGL version:
#version 330

uniform sampler2D posVelSampler;
uniform sampler2D lifeSampler;
uniform float gravity;
uniform float airResistance;
uniform vec2 screenSize;

in vec2 texPos;

#define POSVEL 0 //RGBA 32-bit float texture
#define LIFE 1 //RG16 texture

layout(location = POSVEL) out vec4 posVelOut;
layout(location = LIFE) out vec2 lifeOut;

void main()
{
    lifeOut = texture2D(lifeSampler, texPos, 0).xy;
    lifeOut.x -= 1.0/65535;
    //if(lifeOut.x == 0){
    //discard;
    //return;
    //}
    posVelOut = texture2D(posVelSampler, texPos, 0);
    posVelOut.w += gravity;
    posVelOut.zw *= airResistance;
    posVelOut.xy += posVelOut.zw;
    if(posVelOut.x < 0 && posVelOut.z < 0){
        posVelOut.z = -posVelOut.z;
    }
    if(posVelOut.y < 0 && posVelOut.w < 0){
        posVelOut.w = -posVelOut.w;
    }
    if(posVelOut.x > screenSize.x - 1 && posVelOut.z > 0){
        posVelOut.z = -posVelOut.z;
    }
    if(posVelOut.y > screenSize.y - 1 && posVelOut.w > 0){
        posVelOut.w = -posVelOut.w;
    }
}
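For comparison, the same kind of rough count can be done for this shader. The sketch below is plain C and only tallies texel sizes from the formats named in the post. It assumes one texel per particle and ignores texture caching and any other passes (for example rendering, which presumably reads the color texture). The enum and function names are hypothetical.

```c
#include <stddef.h>

/* Bytes per texel for the three particle textures described in the post. */
enum {
    POSVEL_BYTES = 4 * 4, /* RGBA, 32-bit float per channel */
    COLOR_BYTES  = 4 * 1, /* RGBA8, 8 bits per channel */
    LIFE_BYTES   = 2 * 2  /* RG16, 16 bits per channel */
};

/* Bytes stored per particle across all three textures. */
size_t gl_bytes_stored_per_particle(void) {
    return POSVEL_BYTES + COLOR_BYTES + LIFE_BYTES;
}

/* Bytes the update shader touches per particle: it samples and writes
   only the posVel and life textures, never the color texture. */
size_t gl_bytes_touched_per_particle(void) {
    return 2 * (POSVEL_BYTES + LIFE_BYTES);
}
```

Under these assumptions the textures also hold 24 bytes per particle, but the update pass reads and writes only 20 of them (40 bytes of traffic per particle), while a kernel that copies the whole 24-byte struct moves 48. Whether that difference accounts for the gap in frame rate is not something this count can show.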
It doesn’t make sense for OpenGL to be faster at general computation than OpenCL, so I assume I am doing something wrong.
TL;DR:
Updating particles:
- OpenGL: 65 FPS
- OpenCL: 54 FPS
WTF?