Using JOGL for GPGPU but it runs slow :(

So I am using JOGL to try to run some Java code faster on the GPU, but the code actually takes longer on the GPU than it does on the CPU. I arranged my code so that a lot of the setup happens only once; when I get new data, all I have to do is update the textures and then render. Anyhow, without posting too much code, here is one of the shaders I’m using. If you want to see more code I can post the GLEventListener I’m using, but for now I’ll start with this…

As you can guess, this function performs a dot product on some data…


public void calcDotProd(int index1, int index2)
{
    // GLSL fragment shader: per texel, dot the corresponding texels of the two input textures
    String[] shaderSource = {
        "uniform sampler2D instance1;" +
        "uniform sampler2D instance2;" +
        "uniform sampler2D numValues;" +
        "void main(void) {" +
        "    vec4 values1 = texture2D(instance1, gl_TexCoord[0].st);" +
        "    vec4 values2 = texture2D(instance2, gl_TexCoord[0].st);" +
        "    gl_FragColor[0] = dot(values1, values2);" +
        "}" };

    // Create the lists that hold the uniform names and the data buffers
    ArrayList<String> shaderVars = new ArrayList<String>(4);
    ArrayList<SerialFloatBuffer> data = new ArrayList<SerialFloatBuffer>(4);

    // Wrap the element count in a one-element buffer for the numValues uniform
    SerialFloatBuffer tempB = new SerialFloatBuffer(this.internalData.get(index1).capacity());
    float[] tempF = new float[1];
    tempF[0] = internalData.get(index1).capacity();
    tempB.setBuffer(FloatBuffer.wrap(tempF));

    data.add(0, internalData.get(index1));
    data.add(1, internalData.get(index2));
    data.add(2, tempB);

    shaderVars.add(0, "");
    shaderVars.add(1, "instance1");
    shaderVars.add(2, "instance2");
    shaderVars.add(3, "numValues");

    // Create the core the first time through; otherwise just swap in the new data
    if ((currentCore == null) || (currentCore.coreRenderer == null)
            || (currentCore.coreRenderer.getRenderer().firstTime))
    {
        currentCore = new Core(shaderSource, data, shaderVars, true);
        dotProdResult.setBuffer(currentCore.execute().getBuffer());
        currentCore.coreRenderer.getRenderer().firstTime = false;
    }
    else
    {
        currentCore.addData(data, true);
        dotProdResult.setBuffer(currentCore.execute().getBuffer());
    }
}

1.) You might be running into a bottleneck in the texture upload stage. I’d make sure you are using the texSubImage calls, since they don’t incur the same setup and initialization cost that the standard texImage calls do.
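Roughly what that split looks like in JOGL (just a sketch assuming JOGL 2’s GL2 interface and a float texture; textureId, texWidth, texHeight and dataBuffer are placeholder names, not anything from your Core class):

import java.nio.FloatBuffer;
import javax.media.opengl.GL;
import javax.media.opengl.GL2;

public class TextureUpload
{
    // Call once: glTexImage2D allocates the texture storage (the expensive part).
    public static void allocate(GL2 gl, int textureId, int texWidth, int texHeight)
    {
        gl.glBindTexture(GL.GL_TEXTURE_2D, textureId);
        gl.glTexImage2D(GL.GL_TEXTURE_2D, 0, GL2.GL_RGBA32F, texWidth, texHeight,
                        0, GL.GL_RGBA, GL.GL_FLOAT, null);
    }

    // Call whenever new data arrives: glTexSubImage2D only overwrites the existing storage.
    public static void update(GL2 gl, int textureId, int texWidth, int texHeight,
                              FloatBuffer dataBuffer)
    {
        gl.glBindTexture(GL.GL_TEXTURE_2D, textureId);
        gl.glTexSubImage2D(GL.GL_TEXTURE_2D, 0, 0, 0, texWidth, texHeight,
                           GL.GL_RGBA, GL.GL_FLOAT, dataBuffer);
    }
}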

2.) After that I’d be concerned with the part where you are getting the results back. If you are reading the result out to a Java buffer and then feeding it back into the next function as a texture, that will tank performance right there. You might want to look into using FBOs to move results between functions as textures, without any intermediate copying.

3.) Your shader isn’t terribly complex. CPUs have very little trouble with two unfiltered texture reads (i.e. 2D array lookups), and a SIMD dot product is quite fast; even a scalar dot product, which is what Java will likely produce, is still decent. Yes, GPUs have more bandwidth for those texture reads, but they really excel over CPUs when you are doing filtering operations or can’t avoid the read latency. Similarly, GPUs are faster than CPUs at dot products, but that’s a fairly narrow gap when your main bottleneck is somewhere else.

Also in the same vein, your GPU isn’t likely to see very high utilization with that shader. Even GPUs like the Radeon 9700 (R300) had approximately a 1:1 tex:ALU ratio, while newer GPUs like the Radeon 2900 (R600) are far, far worse in that respect, and the vast majority of the GPU will be sitting idle.
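For a sense of what the GPU is competing against, the CPU equivalent of that shader is just a tight loop (a rough sketch for comparison only; instance1/instance2 here are plain float arrays standing in for the data behind your two textures):

// One dot product per group of four floats, mirroring what the shader does per texel.
public static float[] dotProducts(float[] instance1, float[] instance2)
{
    float[] result = new float[instance1.length / 4];
    for (int i = 0; i < result.length; i++)
    {
        int base = i * 4;
        result[i] = instance1[base]     * instance2[base]
                  + instance1[base + 1] * instance2[base + 1]
                  + instance1[base + 2] * instance2[base + 2]
                  + instance1[base + 3] * instance2[base + 3];
    }
    return result;
}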

4.) Finally, what size are your textures/results? Too small and you’re not leveraging the GPU’s greatest strength, parallelism. Combine that with #s 1 and 2 and you’ll be spending most of your time simply moving data around instead of doing computation.

Thanks for your input, Chris!

I don’t use texSubImage yet; I’ll check it out.

Yeah, I WAS doing that… hehe. I would read the result into a FloatBuffer, then reload it into a texture later on down the road. So what I gather from you is: simply keep the result loaded in some texture that I can reference later? I’ll look into the FBO idea you mentioned.

Thanks for your input. :)

~Bolt

Basically the idea is to keep the data on the GPU for as long as possible. FBOs are nice for doing just that…

You’ll need two FBOs: FBO1 and FBO2

Bind FBO1 as the render target
Perform function 1

Bind FBO2 as the render target
Bind FBO1 as a texture
Perform function 2

Bind FBO1 as the render target
Bind FBO2 as a texture
Perform function 3

… and you can just keep flip-flopping between them for as long as you want or need. Of course, you’ll need more FBOs if you want to retain results for longer than a single function, for example if you wanted the previous two results to be the inputs to the dot product function.
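In JOGL that ping-pong loop comes out to something roughly like this (a sketch assuming JOGL 2, two FBOs each with a float texture attached at GL_COLOR_ATTACHMENT0, and a helper of your own that binds a program and draws a full-screen quad; none of these names come from your Core class):

import javax.media.opengl.GL;
import javax.media.opengl.GL2;

public class PingPong
{
    // fbo1/fbo2: FBO ids with float textures tex1/tex2 as their color attachments.
    // prog1..prog3: the shader programs for the three functions.
    public static void runThreePasses(GL2 gl, int fbo1, int fbo2,
                                      int tex1, int tex2,
                                      int prog1, int prog2, int prog3)
    {
        // Pass 1: render function 1 into FBO1.
        gl.glBindFramebuffer(GL.GL_FRAMEBUFFER, fbo1);
        runShader(gl, prog1);

        // Pass 2: FBO1's texture is the input, FBO2 is the render target.
        gl.glBindFramebuffer(GL.GL_FRAMEBUFFER, fbo2);
        gl.glBindTexture(GL.GL_TEXTURE_2D, tex1);
        runShader(gl, prog2);

        // Pass 3: FBO2's texture is the input, FBO1 is the render target again.
        gl.glBindFramebuffer(GL.GL_FRAMEBUFFER, fbo1);
        gl.glBindTexture(GL.GL_TEXTURE_2D, tex2);
        runShader(gl, prog3);

        // Only the final result ever needs to be read back into a Java buffer.
    }

    // Placeholder helper: bind the program and draw a full-screen quad so the
    // fragment shader runs once per texel of the attached texture.
    private static void runShader(GL2 gl, int program)
    {
        gl.glUseProgram(program);
        // ...draw a full-screen quad here...
    }
}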