gDebugger with JOGL App?

OK, I give up. Can gDebugger be used with a JOGL app?

The only way I can see to do it is to create an exe (I used JexePack), but even though gDebugger will run the exe, it doesn’t seem to be attaching to the OpenGL calls. All that’s working is the CPU usage graph.

Am I missing something here?

Is there an alternative that will work with a Java app?

Kevin

You might want to post this on the JOGL forum instead.

What facilities of gDebugger are you interested in? We should be able to very easily provide these facilities with the composable pipeline model in JOGL; see DebugGL and TraceGL for examples.

Oops! I thought I did post to the JOGL forum, but my query is performance related in a way…

I’m specifically interested in the performance analysis aspect of gDebugger - especially the shaders. I would like to monitor GPU utilization, memory consumption, etc. - pretty much what NVPerfKit provides. I haven’t tried this yet, but I can see it probably won’t work with JOGL either.

The problem I’m having is that after moving my skeletal skinning out of Java and into the vertex shader (Cg) my frame rate has actually dropped slightly. Ok, it’s only by 2 or 3 frames a second slower, but I expected to see at least some improvement.

I seem to have found some sort of plateau in the GPU performance whereby if I reduce the workload (e.g. by not calculating the normals) I see no change in speed at all. On the other hand, only after tripling the workload (i.e. putting all the vertex shader code in a for loop x3) do I start to see a significant degradation in performance.

My frame rate is still within an acceptable range, but I’m concerned about what impact adding further work to the GPU will have, and I need to explain to the boss how a week’s work resulted in poorer performance. I need to get under the hood and understand what’s happening on the GPU.

I am running on a 3.4 GHz Pentium 4, Windows XP SP2 Home (which may be part of the problem), with a GeForce 6600GT. I am drawing 28 skeletal models, each with about 35 bones. My total output is about 60-70k vertices per frame.

Kevin

OK. Going down the route of trying to use gDEBugger, it isn’t clear to me how they interpose on the OpenGL calls, i.e., whether they rely on rewriting the .exe or whether they do things dynamically. I would think it should be possible one way or another to get it to work; do you have a support contract with Gremedy? If you can find out whether they work with .dlls which link against OpenGL then we can figure out how to hook it into JOGL.

I would guess that the NVPerfKit would provide even lower-level information, and given that it’s all performance counter based I don’t see why it shouldn’t work with JOGL.

I have just downloaded NVPerfkit and it does indeed work with JOGL. I will take a closer look at the stats and make some before/after comparisons in the morning. Very interesting…

Hmmm. I have used NVperfKit to compare my app before and after moving the skinning to Cg, and it still doesn’t make any sense.

My OGL frame rate as reported by NVPerfKit has dropped from 59.8 to 33.6 (vertical sync off), which is a lot more significant than I originally thought. The GPU idle dropped from 19.7% to 3.6% (OK, so the Java side is running slightly faster now that it has less to do). The % OGL driver waiting went from 0% to 0.13%, which is negligible I guess. The % vertex shader busy went from 13.3% to 18.6% - seems a bit high, but maybe that’s to be expected. I suppose it would be time to worry if it was close to 100%.

Using half instead of float in the shader code made very little difference. As far as I can tell there is no evidence of a bottleneck in the GPU pipeline. Nothing here seems to be waiting on anything else.

Interestingly, NVPerfKit is reporting around 50% more vertices per frame than I am actually drawing. Before implementing the skinning on the vertex shader, I was generating about 65,000 vertices and NVPerfKit was recording a similar number. Once I use the vertex shader to do the skinning, NVPerfKit reports just over 100,000 vertices per frame.

I am seeing the same thing in the nVidia runtime_ogl_vertex_fragment Cg example (spinning textured sphere). Here the sphere is clearly made up of 900 vertices, but NVPerfkit is reporting 1786 per frame. I will be contacting NVidia about this discrepancy shortly.

Maybe I’m missing something obvious (it won’t be the first time), but I can’t see how I could be better off doing the skinning on the CPU in Java and then sending all those transformed vertices and normals down the wire. What’s more, in the Java version, each vertex was weighted to up to 8 bones, meaning there were up to 8 matrix multiplications per vertex. I reduced this to a maximum of 4 in the Cg version.

The Cg skinning code is almost identical to the advanced skinning example in the nVidia Cg User Manual. I have essentially only removed the normalization of the normals, because my weights all add up to one and my normals are already normalized. I have tried without normals, lighting or texturing, and that made no difference to the frame rate, meaning there is no problem with the fragment shader.

It just doesn’t make sense. The vertex shader is supposed to be more efficient at this kind of thing, I’m sending about 80% less data over the bus and the Cg skinning is actually doing fewer transforms per vertex on average than the original CPU code.
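To put numbers on “less data over the bus”, here is a rough back-of-the-envelope sketch (my assumptions: ~65k vertices, 28 models, 35 bones each, float3x4 bone matrices; the class and method names are hypothetical):

```java
public class BusTraffic {
    // CPU skinning: transformed positions (3 floats) + normals (3 floats)
    // must be re-uploaded every frame.
    static long cpuBytesPerFrame(int vertices) {
        return (long) vertices * (3 + 3) * 4;
    }

    // GPU skinning: the vertex data is static; only the bone matrices change
    // per frame (models * bones * float3x4 = 12 floats * 4 bytes each).
    static long gpuBytesPerFrame(int models, int bones) {
        return (long) models * bones * 12 * 4;
    }

    public static void main(String[] args) {
        System.out.println("CPU path: " + cpuBytesPerFrame(65_000) + " bytes/frame"); // 1560000
        System.out.println("GPU path: " + gpuBytesPerFrame(28, 35) + " bytes/frame"); // 47040
    }
}
```

So the per-frame dynamic upload should have shrunk from roughly 1.5 MB to under 50 KB, which only makes the slowdown more puzzling.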

Anybody seen this too? I’m still a vertex programming newbie and any clues would be appreciated!

Kevin ???

[quote=“keving,post:6,topic:26956”]
The half type only makes a difference with fragment shaders. Don’t use it in a vertex shader, unless you’re using NV_half_float for vertex data.

[quote=“keving,post:6,topic:26956”]
You’re obviously doing something wrong. You should’ve seen a big improvement with GPU skinning. In Marathon I can skin 500-600 characters, ~1300 triangles each, per-pixel shaded, at ~30fps on a 6800GT.

[quote=“keving,post:6,topic:26956”]
If you mean the “Improved Skinning” shader, please note that it uses dynamic branching (numBones per-vertex attribute). This may be affecting performance, but it still doesn’t explain the ~30 fps for 28 models.

[quote=“keving,post:6,topic:26956”]
Some thoughts:

  • How do you send the vertex data? You’re using static VBOs, right?
  • Maybe you’re not using an optimal vertex format? How are your vertices aligned in memory? What data type is used for indices (and for “numBones” if used)?
  • Have you measured the cost of uploading the bone matrices for each model?
  • Try posting the vertex shader if possible.

Hi Spasi

Here is my Cg code for the vertex skinning. I have marked my changes from the original:


struct inputs
{
	// *float4 in the original. float4 works even though data is 3 aligned. 
	float3 position:       POSITION;                      // Position of this vertex

	float3 normal:         NORMAL;                        // Normal of this vertex

	// *added
	float2 texUV :           TEXCOORD0;                // Texture UV Coords

	float4 weights :        BLENDWEIGHT;           // Weighting to each bone indexed
	float4 matrixIndices : TESSFACTOR;      // Index of each bone this vertex is weighted to
	float4 numBones:      SPECULAR;        // Count of bones affecting this vertex
};

struct outputs
{
	float4 hPosition  :            POSITION;             // To Rasterizer

	// *next 5 replace "color" in the original
	float3 pEye :                      TEXCOORD0;      // To Fragment Shader
	float3 nEye:                       TEXCOORD1;      // To Fragment Shader
	float2 uv :                           TEXCOORD2;      // To Fragment Shader
	float3 kDiffuse :                COLOR0;              // To Fragment Shader
	float3 kSpecular :             COLOR1;              // To Fragment Shader
};


outputs main(inputs IN,

		 uniform float4x4 modelViewProj,	 
		 uniform float4x4 modelView,               // *added
		 uniform float4x4 modelViewIT,            // *added

		 uniform float3x4 boneMatrices[40],   // *was 30 in the original

		 uniform float3 diffuseColor,                // *added
		 uniform float3 specularColor             // *added
		 ) {
			 
	outputs OUT;

	float4 index = IN.matrixIndices;
	float4 weight = IN.weights;
	float4 position = float4(0, 0, 0, 0);   // *initialised to zero (uninitialised in the original)
	float3 normal = float3(0, 0, 0);

	for (float i = 0; i < IN.numBones.x; i += 1) {
		// transform the offset by bone i
		position = position + 
                       weight.x * float4(mul(boneMatrices[index.x], float4(IN.position, 1)).xyz, 1.0);

                        // *was no casting to float4 in the original:
                        // weight.x * float4(mul(boneMatrices[index.x], IN.position).xyz, 1.0);

		// transform normal by bone i
		normal = normal + weight.x *
                        mul((float3x3)boneMatrices[index.x], IN.normal.xyz).xyz;

		// shift over the index/weight variables; this moves
		// the index and weight for the current bone into
		// the .x component of the index and weight variables
		index = index.yzwx;
		weight = weight.yzwx;
	}

	// *not needed if weights add up to 1
	// normal = normalize(normal);

	OUT.hPosition = mul(modelViewProj, position);

	// *added. Slight improvement if these are removed
	OUT.pEye = mul(modelView, position).xyz;		
	OUT.nEye = mul(modelViewIT, float4(normal, 0)).xyz;		

	// *pass through unchanged
	OUT.uv	   = IN.texUV;	
	OUT.kDiffuse   = diffuseColor;	
	OUT.kSpecular  = specularColor;	
	
	return OUT;
}
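For sanity-checking the shader on a handful of vertices, the weighted blend in the loop above can be mirrored on the CPU. This is a hypothetical plain-Java reference (row-major float3x4 matrices stored as float[12]), not part of my app:

```java
public class SkinRef {
    // Transform a point by a 3x4 bone matrix stored row-major as 12 floats.
    static float[] mulPoint(float[] m, float[] p) {
        return new float[] {
            m[0] * p[0] + m[1] * p[1] + m[2]  * p[2] + m[3],
            m[4] * p[0] + m[5] * p[1] + m[6]  * p[2] + m[7],
            m[8] * p[0] + m[9] * p[1] + m[10] * p[2] + m[11]
        };
    }

    // Weighted blend over up to 4 bones, mirroring the shader's for loop.
    static float[] skin(float[][] bones, int[] idx, float[] w,
                        int numBones, float[] pos) {
        float[] out = new float[3];
        for (int i = 0; i < numBones; i++) {
            float[] p = mulPoint(bones[idx[i]], pos);
            for (int c = 0; c < 3; c++) out[c] += w[i] * p[c];
        }
        return out;
    }
}
```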


Now that you mention it, my vertex positions are 3 floats each (12 bytes), like the normals. I will try padding to 4 floats asap.

Everything is in VBOs.

NumBones is in a FloatBuffer with groups of 4 floats per vertex; only the first float of each group of 4 has data. This is how the original code worked, and I assume this was done for a reason?

Similarly, the matrix indices and weights are FloatBuffers with 4 floats per vertex.

The un-transformed vertex positions, normals and UVs are FloatBuffers with 3, 3 and 2 floats per vertex respectively. Their structure remains unchanged from the original which used glVertexPointer( ) etc.

I have grouped models with the same geometry together, and output the above data once for each group using cgGLEnableClientState and cgGLSetParameterPointer. Then for each model in the group I send the bone matrices with cgGLSetMatrixParameterArrayfr, and the material colors (these may vary for each model). The vertex program is bound once before each group and unbound at the end of each group. I had no such grouping in the original.

How do you suggest I measure the cost of uploading the bone matrices? What could I compare that cost to?

JProfiler showed that about 90% of my CPU time is spent on physics and only 5% on the original skinning (I have hair colliding with parts of these models), so I don’t expect much of an improvement taking the skinning off the CPU - my main aim was to reduce the amount of data sent down to the card, because in one environment at least I have reason to believe this has become a bottleneck. I will be happy with the same frame rate as before on my machine here.

Kevin

The shader looks fine, so it must be something in the vertex format or in your code. The only thing that I find weird is the numBones attribute, which is 4 floats instead of 1 per vertex. Try to properly align/pad or even interleave the vertex data.

For reference, I’m using the following vertex format (interleaved in a single VBO array + an index VBO):

position - 3 floats
normal - 3 floats
tangent - 3 floats
bone indices - 2 signed shorts (unsigned is not hardware accelerated)
bone weights - 2 floats
tex coords - 2 floats

[quote=“keving,post:8,topic:26956”]
Sounds fine. Note however that I’m not familiar with the Cg API, I’m using GLSL here.

[quote=“keving,post:8,topic:26956”]
Try uploading the matrices for the first model only and see if it makes a difference (all models will have the same animation).

[quote=“keving,post:8,topic:26956”]
The performance degradation you’re seeing is incomprehensible. At the very least, you should’ve been able to spot the bottleneck with JProfiler or NVPerfKit. My guess is that most of the time is spent in the driver (see the cost of glDraw(Range)Elements in JProfiler).

I just now tried uploading the matrices for the first model only, and the frame rate has gone up to 40.

I also previously tried weighting all the vertices 100% to one bone (the root) and sending just that one matrix for each model - that made no difference at all. I haven’t looked at what effect this has in NVPerfKit, but I guess all that happened is the vertex shader busy % dropped.

I have implemented two timers that accumulate the time spent in the game logic/physics (the “prepare” timer) and the time spent in the display method (state changes, setting VBO pointers, textures, etc., and finally the glDrawElements call - the “render” timer). The sum of the two approximates the total frame time (I have a separate whole-loop timer to check this). These show me how the time to generate each frame is spent.
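A minimal sketch of such accumulating timers (hypothetical class and method names, using System.nanoTime; my real code differs in details):

```java
public class FrameTimers {
    private long prepareNanos, renderNanos;
    private int frames;
    private long mark;

    void begin()      { mark = System.nanoTime(); }                        // frame start
    void endPrepare() { prepareNanos += System.nanoTime() - mark;          // after logic/physics
                        mark = System.nanoTime(); }
    void endRender()  { renderNanos += System.nanoTime() - mark;           // after display()
                        frames++; }

    double prepareMsPerFrame() { return prepareNanos / 1e6 / frames; }
    double renderMsPerFrame()  { return renderNanos  / 1e6 / frames; }
}
```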

Before moving the skinning to the vertex shaders the times were:

 prepare: 9 to 10ms per frame
 render: 6ms per frame
 total: 16ms per frame, which is approximately 62fps, which is close enough to the NVPerfkit value (59.8)

With the skinning in the vertex shader I get

prepare: 6ms per frame
render: 22ms per frame
total: 28ms per frame, or approx 36 fps, which is also close to the NVPerfkit value (33.6)
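As a quick check that the timer totals agree with NVPerfKit, converting ms/frame to fps:

```java
public class FpsCheck {
    static double fps(double msPerFrame) { return 1000.0 / msPerFrame; }

    public static void main(String[] args) {
        System.out.printf("16 ms/frame -> %.1f fps%n", fps(16)); // ~62.5, vs NVPerfKit's 59.8
        System.out.printf("28 ms/frame -> %.1f fps%n", fps(28)); // ~35.7, vs NVPerfKit's 33.6
    }
}
```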

With the skinning in the vertex shader, all that’s happening in the display method is the setting of the texture handle, the material colors, the bone matrices, and then the DrawElements call (which could be either GL_TRIANGLES or GL_TRIANGLE_STRIP depending on the model, with a ShortBuffer of GL_UNSIGNED_SHORTs).

The positions, normals and UVs etc. are sent once for each group of models with the same mesh, and there are currently 14 models in each group. The time taken to do this is included in the render time.

This is very much starting to look like an issue with how the data is packed. I will try interleaving the data and a single short or float for NumBones in the morning.

Just a thought - my original skinning was running under JOGL 1.1.1, and I had to migrate to JSR231 Beta 3 to get the Cg stuff working. Could this be the problem?

OK, I have migrated my original version (with skinning on the CPU) to JSR-231 Beta 3, and I am now seeing a slight drop in the frame rate (from 60fps to 55fps). There are probably reasons for this, but it definitely doesn’t account for the drop to 36fps moving the skinning to Cg.

I had a go at interleaving the data, and it looks to me like this can’t be done with Cg - there are no base index parameters to the cgGLSetParameterPointer method. How much of a difference could this make anyway? The original version was not interleaved, and I’m not sending a really large amount of vertex data down.

I figure the reason for the use of float4 for the vertex position, normal and numBones in the original Cg Toolkit Improved Skinning example is to keep the parameters aligned to 128 bits, which may be more efficient?

[quote=“keving,post:12,topic:26956”]
What do you mean? cgGLSetParameterPointer takes a pointer like any other function that specifies vertex data. There should also be a version for VBOs, right?

[quote=“keving,post:12,topic:26956”]
Probably.

Anyway, if you don’t think it’s a data related issue, then it’s either the for loop in the shader (see what the generated assembly looks like, try a version without the loop), or something Cg specific. Btw, what’s your triangle count per model?

Maybe I’m having an RTFM moment here but, yes, cgGLSetParameterPointer takes a (pointer to a) FloatBuffer which can, I suppose, be offset into the buffer in C, but not in Java. I would need the equivalent of glInterleavedArrays() to do this in Cg, no?
I’m looking at the nVidia Cg Toolkit User Manual v1.4.1, and I don’t see such a method.

Here is the generated assembly - I have deleted some of the bone parameters for clarity

!!ARBvp1.0
OPTION NV_vertex_program3;
# cgc version 1.4.0001, build date Mar  9 2006 20:52:26
# command line args: -q -profile vp40 -entry main
#vendor NVIDIA Corporation
#version 1.0.02
#profile vp40
#program main
#semantic main.modelViewProj
#semantic main.modelView
#semantic main.modelViewIT
#semantic main.boneMatrices
#semantic main.diffuseColor
#semantic main.specularColor
#var float3 IN.position : $vin.POSITION : ATTR0 : 0 : 1
#var float3 IN.normal : $vin.NORMAL : ATTR2 : 0 : 1
#var float2 IN.texUV : $vin.TEXCOORD0 : ATTR8 : 0 : 1
#var float4 IN.weights : $vin.BLENDWEIGHT : ATTR1 : 0 : 1
#var float4 IN.matrixIndices : $vin.TESSFACTOR : ATTR5 : 0 : 1
#var float4 IN.numBones : $vin.SPECULAR : ATTR4 : 0 : 1
#var float4x4 modelViewProj :  : c[0], 4 : 1 : 1
#var float4x4 modelView :  : c[4], 4 : 2 : 1
#var float4x4 modelViewIT :  : c[8], 4 : 3 : 1
#var float3x4 boneMatrices[0] :  : c[12], 3 : 4 : 1
#var float3x4 boneMatrices[1] :  : c[15], 3 : 4 : 1
#var float3x4 boneMatrices[2] :  : c[18], 3 : 4 : 1

 < repeated for each of the 40 matrices >

#var float3x4 boneMatrices[38] :  : c[126], 3 : 4 : 1
#var float3x4 boneMatrices[39] :  : c[129], 3 : 4 : 1
#var float3 diffuseColor :  : c[133] : 6 : 1
#var float3 specularColor :  : c[134] : 7 : 1
#var float4 main.hPosition : $vout.POSITION : HPOS : -1 : 1
#var float3 main.pEye : $vout.TEXCOORD0 : TEX0 : -1 : 1
#var float3 main.nEye : $vout.TEXCOORD1 : TEX1 : -1 : 1
#var float2 main.uv : $vout.TEXCOORD2 : TEX2 : -1 : 1
#var float3 main.kDiffuse : $vout.COLOR0 : COL0 : -1 : 1
#var float3 main.kSpecular : $vout.COLOR1 : COL1 : -1 : 1
#const c[132] = 0 3 1
PARAM c[135] = { program.local[0..131],
		{ 0, 3, 1 },
		program.local[133..134] };
TEMP R0;
TEMP R1;
TEMP R2;
TEMP R3;
TEMP R4;
TEMP R5;
TEMP CC;
ADDRESS A0;
BB1:
MOV   R1, vertex.attrib[5];
MOV   R2, vertex.attrib[1];
MOV   R5.w, c[132].x;
BB2:
SLTC  CC.x, R5.w, vertex.attrib[4];
BRA   BB4 (EQ.x);
BB3:
MOV   R3, R1;
FLR   R1.x, R3;
MUL   R1.x, R1, c[132].y;
ARL   A0.x, R1;
MOV   R2, R2;
MOV   R4.w, c[132].z;
MOV   R4.xyz, vertex.attrib[0];
DP4   R1.x, c[A0.x + 12], R4;
DP4   R1.y, c[A0.x + 13], R4;
MOV   R1.w, c[132].z;
DP4   R1.z, c[A0.x + 14], R4;
MAD   R0, R2.x, R1, R0;
DP3   R1.z, c[A0.x + 14], vertex.attrib[2];
DP3   R1.x, c[A0.x + 12], vertex.attrib[2];
DP3   R1.y, c[A0.x + 13], vertex.attrib[2];
MAD   R5.xyz, R2.x, R1, R5;
MOV   R1, R3.yzwx;
MOV   R2, R2.yzwx;
ADD   R5.w, R5, c[132].z;
BRA   BB2;
BB4:
DP4   result.position.w, R0, c[3];
DP4   result.position.z, R0, c[2];
DP4   result.position.y, R0, c[1];
DP4   result.position.x, R0, c[0];
DP4   result.texcoord[0].z, R0, c[6];
DP4   result.texcoord[0].y, R0, c[5];
DP4   result.texcoord[0].x, R0, c[4];
MOV   R0.w, c[132].x;
MOV   R0.xyz, R5;
DP4   result.texcoord[1].z, R0, c[10];
DP4   result.texcoord[1].y, R0, c[9];
DP4   result.texcoord[1].x, R0, c[8];
MOV   result.color.xyz, c[133];
MOV   result.color.secondary.xyz, c[134];
MOV   result.texcoord[2].xy, vertex.attrib[8];
END
# 40 instructions, 6 R-regs

I have previously tried it with all the vertices weighted to one bone, while keeping the for loop. I tried it just now without the for loop. Both times I get no change whatsoever in the frame rate, though I did see changes in the % vertex shader busy and the % GPU idle:

for loop over 4 bones max, sending down the entire skeleton
32.0 fps
Vert shader busy 20.7%
GPU idle: 5.6%

for loop over 1 bone, sending down entire skeleton
32.0 fps
Vert shader busy 14.0%
GPU idle: 5.6%

1 bone, no for loop, sending down entire skeleton
32.0 fps
Vert shader busy 9.1%
GPU idle: 6.1%

1 bone, no for loop, sending down one bone matrix only, declared uniform float3x4 boneMatrices[1]
40.0 fps
Vert shader busy 14.4%
GPU idle: 2.2%

So what does that mean? Apparently, sending down the boneMatrices is having quite an impact, and consequently the GPU is less idle and the vertex shader is able to be kept busier.

But each set of matrices is only 35 x 12 x 4 = 1680 bytes x 28 models = 47k per frame, which is surely a comparatively insignificant amount of data?

I’m now wondering if it would be better to not use cgGLSetMatrixParameterArrayfr, send the matrices down as a single float array or FloatBuffer and somehow unpack them myself in the shader? Surely not?

Would using a FloatBuffer and not a float[] make a significant difference? (AFAIK there is no corresponding method that takes a FloatBuffer in the JOGL Cg interface.)

Does GLSL also split the matrix array into individual vars in the assembly code?

Maybe I should look at converting to GLSL?

Kevin

Oh, and it’s a bit hard to tell how many triangles are involved, but before stripification I have about 3700 individual triangles per skeletal model on average. All my timings are at maximum level of detail.

[quote=“keving,post:14,topic:26956”]
No, glInterleavedArrays is not necessary (and in fact deprecated by the hardware vendors). You can manually interleave the data by exploiting the Buffer position. I’m not sure if/how it’s possible with JOGL, but LWJGL takes the position into account when you supply a Buffer to a function. So, the driver receives the correct pointer address. The buffer contents will be like this:

TC_0, BW_0, BI_0, TAN_0, NORM_0, POS_0, TC_1, BW_1, BI_1, TAN_1, NORM_1, POS_1, … TC_n, BW_n, BI_n, TAN_n, NORM_n, POS_n

The same goes for VBOs (easier, you just change the offset).
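A sketch of building that interleaved layout with a direct FloatBuffer (all attributes as floats here for simplicity - in my real format the bone indices are shorts; class and helper names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class InterleavedVB {
    // Per-vertex layout: tc(2) bw(2) bi(2) tan(3) norm(3) pos(3) = 15 floats.
    static final int STRIDE_FLOATS = 15;

    // Offset (in floats) of each attribute within one vertex.
    static final int TC = 0, BW = 2, BI = 4, TAN = 6, NORM = 9, POS = 12;

    static FloatBuffer allocate(int vertexCount) {
        return ByteBuffer.allocateDirect(vertexCount * STRIDE_FLOATS * 4)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }

    // Write the position of vertex v using absolute puts.
    static void putPosition(FloatBuffer vb, int v, float x, float y, float z) {
        int base = v * STRIDE_FLOATS + POS;
        vb.put(base, x).put(base + 1, y).put(base + 2, z);
    }
}
```

The vertex-attribute pointers then all share the same buffer and stride, differing only in their starting offset.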

[quote=“keving,post:14,topic:26956”]
Exactly, the vertex shader is not the bottleneck, it’s the Cg API/driver that’s causing the problem. The amount of data for the matrices is small (it has never been a problem for us - we’re using ~24 bones only though), but don’t forget that each time you’re uploading matrices the shader data has to change too. I don’t know what the true cost is, but I’ve even heard that, in some cases, changing a uniform value causes shader recompilation. I’m certain I don’t have this problem, but I can’t be sure what goes on with Cg.

However, you’ve already tried reducing the number of matrices and uploading the skeleton for a single model only. So, the above might not be the answer.

I’m not sure. LWJGL supports FloatBuffer only and I’m using that. Also, I declare the bone matrices in the GLSL shader as a vec4 (float4) array, with a size [boneCount * 3]. The current version of GLSL does not support 3x4 matrices.

[quote=“keving,post:14,topic:26956”]
Yes. The NV driver uses the Cg compiler internally for GLSL, so the generated assembly is very similar (if not the same). Splitting the matrix array is not a problem, it’s the way the low level XXX_vertex_program extensions work.

[quote=“keving,post:14,topic:26956”]
That would be my last choice.

FYI, there is a problem in JOGL’s exposure of cgGLSetParameterPointer; it should only accept direct Buffers. I’ll fix this shortly. I’m not sure which JOGL version you’re running, but you should definitely upgrade to the latest nightly build. It should have only one variant taking a Buffer as argument. You can pass in a direct FloatBuffer and change the position() of that FloatBuffer to affect which pointer is passed down to the C code.
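For example, a sketch of the position() trick (the cgGLSetParameterPointer calls are shown only as comments, since the exact JOGL binding signature may differ):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class PointerOffsets {
    // Set the buffer to a given attribute's first element; the binding
    // passes (base address + position) down to the C side.
    static FloatBuffer atAttribute(FloatBuffer vb, int offsetFloats) {
        vb.position(offsetFloats);
        return vb;
    }

    public static void main(String[] args) {
        // Interleaved data: pos(3) then normal(3) per vertex, 2 vertices.
        FloatBuffer vb = ByteBuffer.allocateDirect(2 * 6 * 4)
                                   .order(ByteOrder.nativeOrder())
                                   .asFloatBuffer();
        vb.put(new float[] {1, 2, 3,  0, 0, 1,   4, 5, 6,  0, 1, 0});

        // Stride is 6 floats (24 bytes) for both attributes:
        // cgGLSetParameterPointer(posParam,  3, GL_FLOAT, 24, atAttribute(vb, 0));
        // cgGLSetParameterPointer(normParam, 3, GL_FLOAT, 24, atAttribute(vb, 3));
    }
}
```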

The bug in the glue code for cgGLSetParameterPointer has been fixed in the nightly builds dated 4/19 or later and also in JSR-231 beta 4.

Thanks for that tip, Ken - I didn’t realize I could change the position() in the buffer to get the starting points in the interleaved data. Interleaving the vertex data will be my last resort, though.

I am using JSR231 beta 3. I will get beta 4 shortly.

I want to first try sending the matrix data as a float buffer (cgGLSetParameterPointer) rather than a float array (cgGLsetMatrixParameterArrayfr). Here’s an example: http://www.gamasutra.com/features/20030325/fernando_pfv.htm

Interleaving all the vertex data = very slight improvement, one could even call it no change at all.

Sending matrix data as a FloatBuffer of float4s = made it worse (33fps to 30fps), even though the assembly code is almost identical. The original had an extra MUL instruction for some reason.

JSR-231 Nightly Build (20 April) = no difference for me whatsoever.

I’m going to abandon doing the skinning on the GPU for now - I have at least gained some experience with Cg and shader programs, maybe I will have better luck with improved shadowing and other effects I will eventually need to do.

Thanks for your help, Spasi and Ken