Instancing with skeletal animation

I did it! I managed to get my GPU skinning working!!! =D My small test featuring Bob the dwarf is now running very nicely!


http://img718.imageshack.us/img718/7356/dwarvesv.png

Features:

  • Individually animated Bobs! They can be at different animation frames and even entirely different animations too, but I only have one animation to use at the moment…
  • Instancing! Each part of Bob (head, helmet, lamp, etc.) is drawn with a single OpenGL command no matter how many instances I have.
  • Bone interpolation is done on the CPU and uploaded per instance into a VBO. In my vertex shader this VBO is then accessed through a Texture Buffer Object (TBO).
  • Instance positions / model matrices are uploaded to a VBO and are stepped through once per instance using GL33.glVertexAttribDivisor(index, 1).

Sadly this program is still CPU-bottlenecked, with my GPU being able to process around 2.5x the instances my CPU can interpolate bones for. The above screenshot runs with 600 instances of Bob at a smooth 60-61 FPS, with 16xQ CSAA (= 8x MSAA + 8 coverage samples) enabled, since that does not affect performance due to the CPU bottleneck. With threading (and less anti-aliasing ;)) this could be improved to twice the FPS, which would enable me to have over 1000 instances of Bob at the same time! I believe the ultimate solution though is OpenCL. That way I can just upload all the animation frame data on startup and interpolate bones for each instance on the GPU. This would offload everything to the GPU and I estimate that it would run at around 120-150 FPS with no CPU load at all. 8)

I wonder what the performance would be on my GTX 580 hmmmmmm… :wink:

Couldn’t you somehow update it with a vertex output stream, so you won’t need OpenCL?

Probably over 9000.

You make it sound like OpenCL is something bad. xD

Brilliant work! Well done!

Are we going to see this turn into a cool game at some point?

I’ve been working on the same game for half a year now. My answer is “Yes, it will”. It’s an RTS, but I won’t show any screenshots or anything since I’m not 100% sure. I won’t announce any specific information about it since I don’t like the pressure of having said “I’m gonna release this game in x months”…

I can’t think of a good reason for using OpenCL vs. a shader in this instance. I only see downsides.

Nah, OpenCL is probably great, but you are already very familiar with shaders and the OpenGL pipeline.
Why learn something new when you can do it with something you already have? You probably won’t have the infrastructure for OpenCL ready in your codebase either.

PS: I don’t want to say anything against learning new things, of course^^. I was just thinking of it from a development process point of view.

I think OpenCL is better than OpenGL for this, if only because it makes a lot more sense to read frame data from a buffer to fill another buffer with the per instance data instead of emulating the whole process with shaders, texture objects and transform feedback. OpenCL is meant for general purpose computing (= bone interpolation in my book), OpenGL is meant for graphics (= skinning).

Hehe, I just switched to a better slerp function which uses a threshold to avoid expensive trigonometric functions when the angle between the two quaternions is very small, and got a 3-4x speedup in CPU performance. xD Now the CPU and GPU are almost equally busy, but now it’s almost impossible to not be fragment limited. Bone interpolation (CPU) and skinning (GPU) performance is at 2 000 instances at 60 FPS, but if they are going to actually cover more than a pixel or so per instance (or if I want MSAA) I’ll have to reduce the number of instances to around 1 500. Anyway, the point is that I’ve pretty much maxed out the performance gain from instancing. I’m pushing 2 million triangles per frame with skinning and I haven’t even done any heavy optimizations yet. Well, I guess I won’t be needing OpenCL for a while then… Off to actually being able to load other 3D models than Bob! xD

SLERP that uses trig functions should only be used if the end points are changing each frame (if then). What method are you using? Bisection is very fast and has little error.

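	// Above this dot product the two orientations are nearly identical,
	// so plain linear interpolation plus normalisation is used instead.
	private static final float DOT_THRESHOLD = 0.99975f;

	private static void slerp(Quaternion q0, Quaternion q1,
			Quaternion resultOrientation, float t) {
		// Cosine of the angle between the two unit quaternions.
		float dot = Quaternion.dot(q0, q1);

		// Default weights: linear interpolation (small-angle case).
		float scale0 = 1 - t;
		float scale1 = t;
		if (dot < DOT_THRESHOLD) {
			// Angle is large enough to need the real slerp weights.
			double theta = Math.acos(dot);
			double invSinTheta = 1.0 / Math.sin(theta);
			scale0 = (float) (Math.sin((1 - t) * theta) * invSinTheta);
			scale1 = (float) (Math.sin(t * theta) * invSinTheta);
		}

		// Weighted sum of the components, then renormalise to unit length.
		float x = (scale0 * q0.x) + (scale1 * q1.x);
		float y = (scale0 * q0.y) + (scale1 * q1.y);
		float z = (scale0 * q0.z) + (scale1 * q1.z);
		float w = (scale0 * q0.w) + (scale1 * q1.w);
		resultOrientation.set(x, y, z, w);
		resultOrientation.normalise();
	}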

I can’t say I understand exactly how quaternions work, but I do understand the theory of slerp and how it interpolates along the surface of a sphere… Like I said, this is lightning fast, so I don’t see any need to optimize this further at the moment… xD
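One case the posted function does not cover is a negative dot product. Since q and -q describe the same rotation, the usual fix is to negate one endpoint so the interpolation follows the shorter arc. The following is only a sketch of that idea, not the poster's code: it uses plain float arrays in x, y, z, w order instead of the Quaternion class, and the class name SlerpSketch is made up for illustration.

```java
final class SlerpSketch {
    // Same threshold as in the post above.
    private static final float DOT_THRESHOLD = 0.99975f;

    /** Interpolates two unit quaternions (x, y, z, w order) along the shorter arc. */
    static float[] slerp(float[] a, float[] b, float t) {
        float dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];

        // q and -q are the same rotation: flip b's sign if the dot is negative.
        float sign = 1f;
        if (dot < 0f) {
            dot = -dot;
            sign = -1f;
        }

        // Linear weights by default, true slerp weights for larger angles.
        float s0 = 1f - t;
        float s1 = t;
        if (dot < DOT_THRESHOLD) {
            double theta = Math.acos(dot);
            double invSinTheta = 1.0 / Math.sin(theta);
            s0 = (float) (Math.sin((1f - t) * theta) * invSinTheta);
            s1 = (float) (Math.sin(t * theta) * invSinTheta);
        }
        s1 *= sign;

        float x = s0 * a[0] + s1 * b[0];
        float y = s0 * a[1] + s1 * b[1];
        float z = s0 * a[2] + s1 * b[2];
        float w = s0 * a[3] + s1 * b[3];

        // Renormalise to guard against drift in the linear branch.
        float invLen = (float) (1.0 / Math.sqrt(x * x + y * y + z * z + w * w));
        return new float[] { x * invLen, y * invLen, z * invLen, w * invLen };
    }
}
```

With this change, interpolating towards b or towards -b gives the same orientation, which matters when keyframes are exported with inconsistent signs.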

Ouch. If that’s the fast version, I’d hate to see what the slow version looks like. :wink:

Let’s just say that the calculation of theta and invSinTheta was outside the if-statement. >_> So what’s so bad about this one then? Do you know an even faster one?

If you want to make that faster you could try using SIN/COS lookup tables. Riven has done great work with those. Accuracy should be enough.

Well, that’s a little bit too low level at the moment. I’ll add it if I need it later.

About the SLERP, I once stumbled upon this article: Understanding Slerp, Then Not Using It

Also here is a post explaining its (non-)usage for skeleton animations.

The trig and inverse trig functions aren’t strictly needed. There are tons of possible implementations. The problem, if you will, with the fastest versions is that they require pre-computation (so multiple usages of starting & end points + auxiliary data) and/or some added constraints (like max angle between end points, only forward moving ‘t’ and/or fixed step ‘t’). I’m assuming that you don’t want to bother with any of that. I do have a really old untested version without any constraints that I could pull out and test.

WRT: trig look-up table… the problem is that the relative error is huge for small angles, and we’re mostly interested in small angles. (Well, not really, but that’s the way most animation data works out in practice.)
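To make the small-angle point concrete, here is a minimal truncating sine table. It is only an illustrative sketch, not Riven's or libgdx's implementation; the class name SinTable and the table size of 4096 entries are assumptions.

```java
final class SinTable {
    private static final int BITS = 12;                 // assumed table size: 4096 entries
    private static final int SIZE = 1 << BITS;
    private static final int MASK = SIZE - 1;
    private static final float RAD_TO_INDEX = (float) (SIZE / (2.0 * Math.PI));
    private static final float[] TABLE = new float[SIZE];

    static {
        // Precompute sin for SIZE evenly spaced angles in [0, 2*pi).
        for (int i = 0; i < SIZE; i++) {
            TABLE[i] = (float) Math.sin(i * 2.0 * Math.PI / SIZE);
        }
    }

    /** Table lookup with truncation; the mask wraps the index into the table. */
    static float sin(float radians) {
        return TABLE[(int) (radians * RAD_TO_INDEX) & MASK];
    }
}
```

The absolute error of such a table stays within about one table step (roughly 0.0015 here), which sounds fine. But any angle smaller than one step lands on entry 0: SinTable.sin(0.001f) returns 0.0 while the true value is about 0.001, a relative error of 100%. Dividing by a sine like that, as slerp does, is where it hurts.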

Man, why do people insist on making easy stuff hard? SLERP (as a primitive) is freaking awesome.

One easy thing you could do is lose the normalization. The resultant quaternion’s magnitude will be very near one, so that can be replaced by a single step of some renormalization method (like Newton-Raphson). So you’ll trade a sqrt & divide for a couple of multiplies.
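A single Newton-Raphson step for 1/sqrt(s), started from the guess 1, works out to 0.5 * (3 - s), where s is the squared length. A minimal sketch of that renormalization follows, assuming the input is already close to unit length; the class name FastRenormalize and the float-array return type are made up for illustration.

```java
final class FastRenormalize {
    /** Scales (x, y, z, w) towards unit length with one Newton-Raphson step. */
    static float[] renormalize(float x, float y, float z, float w) {
        float lenSq = x * x + y * y + z * z + w * w;
        // Approximates 1 / sqrt(lenSq) for lenSq near 1, with no sqrt or divide.
        float scale = 0.5f * (3.0f - lenSq);
        return new float[] { x * scale, y * scale, z * scale, w * scale };
    }
}
```

The approximation error grows roughly with the square of how far lenSq is from 1, so this is only a reasonable swap when the interpolated quaternion is known to be nearly normalized already.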

http://code.google.com/p/libgdx/source/browse/trunk/gdx/src/com/badlogic/gdx/math/MathUtils.java

With this you can use a LUT just like you use normal sin and cos.

PS: Just in case the other methods fail.