Maybe I’m having an RTFM moment here, but yes: cgGLSetParameterPointer takes a (pointer to a) FloatBuffer, which in C could, I suppose, point at an offset into the buffer, but not in Java. I would need the equivalent of glInterleavedArrays() to do this in Cg, no?
I’m looking at the NVIDIA Cg Toolkit User Manual v1.4.1 and I don’t see such a method.
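For what it’s worth, plain java.nio can produce a view that starts part-way into a buffer, which is the Java counterpart of pointer arithmetic in C. The C version of cgGLSetParameterPointer also takes a stride argument, so an interleaved layout would be described by an offset view plus a stride. Below is a minimal sketch of the view part only. The vertex layout (3 position, 3 normal, 2 uv floats) and the class and method names are hypothetical, and whether the JOGL Cg binding honours the view’s starting offset is an assumption I have not verified.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class InterleavedViews {
    // Hypothetical interleaved layout per vertex: position(3), normal(3), uv(2).
    static final int POSITION_OFFSET = 0;
    static final int NORMAL_OFFSET = 3;
    static final int UV_OFFSET = 6;
    static final int FLOATS_PER_VERTEX = 8;

    /** Allocates a direct, native-order buffer, the kind GL bindings usually expect. */
    static FloatBuffer newDirectFloatBuffer(int floats) {
        return ByteBuffer.allocateDirect(floats * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Returns a view of the same memory whose element 0 sits at floatOffset. */
    static FloatBuffer viewAt(FloatBuffer interleaved, int floatOffset) {
        FloatBuffer dup = interleaved.duplicate(); // shares data, has its own position
        dup.position(floatOffset);
        return dup.slice(); // starts at the offset, leaves the original untouched
    }

    public static void main(String[] args) {
        FloatBuffer vertices = newDirectFloatBuffer(2 * FLOATS_PER_VERTEX);
        for (int i = 0; i < vertices.capacity(); i++) {
            vertices.put(i, i); // fill with 0, 1, 2, ... so offsets are easy to see
        }
        FloatBuffer normals = viewAt(vertices, NORMAL_OFFSET);
        int strideBytes = FLOATS_PER_VERTEX * Float.BYTES; // distance between vertices
        System.out.println("first normal component = " + normals.get(0));
        System.out.println("stride in bytes = " + strideBytes);
    }
}
```

The view and the stride would then be passed together to whatever pointer-setting call the binding exposes; that last step is not shown because I am not sure of the exact JOGL signature.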
Here is the generated assembly. I have deleted some of the bone parameters for clarity:
!!ARBvp1.0
OPTION NV_vertex_program3;
# cgc version 1.4.0001, build date Mar 9 2006 20:52:26
# command line args: -q -profile vp40 -entry main
#vendor NVIDIA Corporation
#version 1.0.02
#profile vp40
#program main
#semantic main.modelViewProj
#semantic main.modelView
#semantic main.modelViewIT
#semantic main.boneMatrices
#semantic main.diffuseColor
#semantic main.specularColor
#var float3 IN.position : $vin.POSITION : ATTR0 : 0 : 1
#var float3 IN.normal : $vin.NORMAL : ATTR2 : 0 : 1
#var float2 IN.texUV : $vin.TEXCOORD0 : ATTR8 : 0 : 1
#var float4 IN.weights : $vin.BLENDWEIGHT : ATTR1 : 0 : 1
#var float4 IN.matrixIndices : $vin.TESSFACTOR : ATTR5 : 0 : 1
#var float4 IN.numBones : $vin.SPECULAR : ATTR4 : 0 : 1
#var float4x4 modelViewProj : : c[0], 4 : 1 : 1
#var float4x4 modelView : : c[4], 4 : 2 : 1
#var float4x4 modelViewIT : : c[8], 4 : 3 : 1
#var float3x4 boneMatrices[0] : : c[12], 3 : 4 : 1
#var float3x4 boneMatrices[1] : : c[15], 3 : 4 : 1
#var float3x4 boneMatrices[2] : : c[18], 3 : 4 : 1
< repeated for each of the 40 matrices >
#var float3x4 boneMatrices[38] : : c[126], 3 : 4 : 1
#var float3x4 boneMatrices[39] : : c[129], 3 : 4 : 1
#var float3 diffuseColor : : c[133] : 6 : 1
#var float3 specularColor : : c[134] : 7 : 1
#var float4 main.hPosition : $vout.POSITION : HPOS : -1 : 1
#var float3 main.pEye : $vout.TEXCOORD0 : TEX0 : -1 : 1
#var float3 main.nEye : $vout.TEXCOORD1 : TEX1 : -1 : 1
#var float2 main.uv : $vout.TEXCOORD2 : TEX2 : -1 : 1
#var float3 main.kDiffuse : $vout.COLOR0 : COL0 : -1 : 1
#var float3 main.kSpecular : $vout.COLOR1 : COL1 : -1 : 1
#const c[132] = 0 3 1
PARAM c[135] = { program.local[0..131],
{ 0, 3, 1 },
program.local[133..134] };
TEMP R0;
TEMP R1;
TEMP R2;
TEMP R3;
TEMP R4;
TEMP R5;
TEMP CC;
ADDRESS A0;
BB1:
MOV R1, vertex.attrib[5];
MOV R2, vertex.attrib[1];
MOV R5.w, c[132].x;
BB2:
SLTC CC.x, R5.w, vertex.attrib[4];
BRA BB4 (EQ.x);
BB3:
MOV R3, R1;
FLR R1.x, R3;
MUL R1.x, R1, c[132].y;
ARL A0.x, R1;
MOV R2, R2;
MOV R4.w, c[132].z;
MOV R4.xyz, vertex.attrib[0];
DP4 R1.x, c[A0.x + 12], R4;
DP4 R1.y, c[A0.x + 13], R4;
MOV R1.w, c[132].z;
DP4 R1.z, c[A0.x + 14], R4;
MAD R0, R2.x, R1, R0;
DP3 R1.z, c[A0.x + 14], vertex.attrib[2];
DP3 R1.x, c[A0.x + 12], vertex.attrib[2];
DP3 R1.y, c[A0.x + 13], vertex.attrib[2];
MAD R5.xyz, R2.x, R1, R5;
MOV R1, R3.yzwx;
MOV R2, R2.yzwx;
ADD R5.w, R5, c[132].z;
BRA BB2;
BB4:
DP4 result.position.w, R0, c[3];
DP4 result.position.z, R0, c[2];
DP4 result.position.y, R0, c[1];
DP4 result.position.x, R0, c[0];
DP4 result.texcoord[0].z, R0, c[6];
DP4 result.texcoord[0].y, R0, c[5];
DP4 result.texcoord[0].x, R0, c[4];
MOV R0.w, c[132].x;
MOV R0.xyz, R5;
DP4 result.texcoord[1].z, R0, c[10];
DP4 result.texcoord[1].y, R0, c[9];
DP4 result.texcoord[1].x, R0, c[8];
MOV result.color.xyz, c[133];
MOV result.color.secondary.xyz, c[134];
MOV result.texcoord[2].xy, vertex.attrib[8];
END
# 40 instructions, 6 R-regs
I have tried it previously with all the vertices weighted to one bone, while keeping the for loop. I tried it just now without the for loop. Both times I get no change whatsoever in the frame rate, though I did see changes in the % vertex shader busy and the % gpu idle:
for loop over 4 bones max, sending down the entire skeleton
32.0 fps
Vert shader busy 20.7%
GPU idle: 5.6%
for loop over 1 bone, sending down entire skeleton
32.0 fps
Vert shader busy 14.0%
GPU idle: 5.6%
1 bone, no for loop, sending down entire skeleton
32.0 fps
Vert shader busy 9.1%
GPU idle: 6.1%
1 bone, no for loop, sending down one bone matrix only, declared uniform float3x4 boneMatrices[1]
40.0 fps
Vert shader busy 14.4%
GPU idle: 2.2%
So what does that mean? Apparently, sending down the boneMatrices is having quite an impact: when only one matrix is sent, the GPU is less idle and the vertex shader can be kept busier.
But each set of matrices is only 35 x 12 x 4 = 1680 bytes, and 1680 x 28 models is about 47k per frame, which is surely a comparatively insignificant amount of data?
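To make that arithmetic easy to re-run with other bone or model counts, here is a small sketch. The 35 bones, 12 floats per float3x4, 28 models and 32 fps are the figures from this post; the class and method names are made up for illustration.

```java
final class BoneUploadBudget {
    static final int FLOATS_PER_BONE = 12; // one float3x4 = 3 rows x 4 columns

    /** Bytes of bone matrices uploaded for a single model. */
    static long bytesPerModel(int bones) {
        return (long) bones * FLOATS_PER_BONE * Float.BYTES;
    }

    /** Bytes of bone matrices uploaded for all models in one frame. */
    static long bytesPerFrame(int bones, int models) {
        return bytesPerModel(bones) * models;
    }

    /** Bytes of bone matrices uploaded per second at the given frame rate. */
    static double bytesPerSecond(int bones, int models, double fps) {
        return bytesPerFrame(bones, models) * fps;
    }

    public static void main(String[] args) {
        System.out.println("per model:  " + bytesPerModel(35) + " bytes");
        System.out.println("per frame:  " + bytesPerFrame(35, 28) + " bytes");
        System.out.println("per second: " + bytesPerSecond(35, 28, 32.0) + " bytes");
    }
}
```

That comes to 47,040 bytes per frame, or roughly 1.5 MB per second at 32 fps, so the raw volume alone looks too small to explain the difference; the cost is more likely per call or per parameter than per byte, though that is a guess.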
I’m now wondering if it would be better not to use cgGLSetMatrixParameterArrayfr, and instead send the matrices down as a single float array or FloatBuffer and somehow unpack them myself in the shader? Surely not?
Would using a FloatBuffer rather than a float[] make a significant difference? (AFAIK there is no corresponding method that takes a FloatBuffer in the JOGL Cg interface)
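For reference, the Java side of the "single float array" idea could look like the sketch below: each float3x4 is flattened row by row, so bone b, row r lands in the same slot the assembly above addresses as c[12 + 3*b + r]. The class name, the float[][][] input shape and the helper names are hypothetical, and the shader-side unpacking is not shown.

```java
final class BonePacker {
    static final int ROWS = 3;          // float3x4: three rows...
    static final int COLS = 4;          // ...of four floats each
    static final int BASE_REGISTER = 12; // boneMatrices[0] starts at c[12] in the listing

    /** Flattens bones[b][row][col] into one row-major float array. */
    static float[] pack(float[][][] bones) {
        float[] out = new float[bones.length * ROWS * COLS];
        for (int b = 0; b < bones.length; b++) {
            for (int r = 0; r < ROWS; r++) {
                // Copy one row (one constant register's worth) at a time.
                System.arraycopy(bones[b][r], 0, out, (b * ROWS + r) * COLS, COLS);
            }
        }
        return out;
    }

    /** Constant register holding the given row of the given bone, per the listing. */
    static int registerFor(int bone, int row) {
        return BASE_REGISTER + bone * ROWS + row;
    }

    public static void main(String[] args) {
        float[][][] bones = new float[40][ROWS][COLS];
        bones[39][2][3] = 7.0f; // mark the last element of the last bone
        float[] packed = pack(bones);
        System.out.println("floats packed: " + packed.length);
        System.out.println("last float: " + packed[packed.length - 1]);
        System.out.println("bone 39 starts at c[" + registerFor(39, 0) + "]");
    }
}
```

Since the compiled program already lays the array out as consecutive constants, packing by hand would mainly change how many calls are made from Java, not what the shader sees.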
Does GLSL also split the matrix array into individual vars in the assembly code?
Maybe I should look at converting to GLSL?
Kevin