Enhanced Crowd Rendering
A fundamental challenge when designing a game centered around hordes of zombies is how to render a large number of animated characters. A number of solutions exist, but none of them was quite suitable for Dead Shift. In this post I will introduce a new(?) technique I’ve developed for skinned mesh instancing (a.k.a. crowd rendering) that addresses many of the issues with the one proposed in GPU Gems 3 (and adapted for XNA), making it better suited for crowds of important characters who may require animation techniques such as motion blending.
To start, here is a quick overview of both techniques.
Original Technique:
- Bones are stored as absolute transforms, so no blending is possible.
- Animations must be pre-baked at 30fps or 60fps, resulting in large memory/storage requirements.
- Not suitable for rendering main characters, requiring a completely separate animation system for them!
My Enhanced Technique:
- Bones are stored as relative (parent-space) transforms, along with the skeletal hierarchy and bind-pose data needed to blend multiple frames/animations together at run-time.
- Animations can be pre-baked at lower framerates (15fps works nicely), which drastically lowers memory/storage requirements.
- Animation Processing is decoupled from Rendering, allowing arbitrarily complex blending/tweening routines on the GPU and re-use of processed frames between passes (e.g. a shadow-depth pass).
- Suitable for all characters in a game, unifying all animation into a single system.
The process for my technique is as follows:
- Pre-bake animations as usual with the GPU Gems 3 technique; output Relative Matrices, NOT Absolute Matrices. We’ll refer to this as the “AnimationTexture”.
- Store the Inverse BindPose as a frame in AnimationTexture; you’ll need this at run-time. I find it easiest to place this at Frame 0.
- Store the Skeletal Hierarchy in a Texture for run-time access… This can be a separate texture, or simply packed into another dummy animation frame. The important thing here is that we’ll need our shaders to be able to look up the Parent BoneIndex for any given BoneIndex…
- Write an “Animation Processor” shader that reads frames from AnimationTexture for each character instance and generates a new texture, which we’ll call “FinalAnimationTexture”, in the format of the original GPU Gems 3 technique. Each character will need to be assigned a RowID within this new FinalAnimationTexture, from which its processed frame is read. Hint: If you use Multiple Render Targets, you can write out all 3 pixels of a bone matrix at once, which eliminates the need to process each bone 3x.
- Render Characters as usual, the only difference being that characters reference RowID in FinalAnimationTexture, instead of FrameID in AnimationTexture…
- If bone data is required on the CPU, wait a frame and read it back. I’ve found it far more efficient to first process all bone data we want to read back into a single texture to avoid multiple GPU to CPU transfers.
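To make the per-bone work of the “Animation Processor” concrete, here is a minimal CPU-side sketch in C++ of what one output bone might compute. It is not the actual shader: the names `Skeleton`, `ProcessBone` and `kNoParent` are made up for illustration, matrices use the row-vector convention (v' = v * M) that XNA uses, and blending is a plain component-wise lerp of matrices, which is only a crude stand-in for a proper blend.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major 4x4 matrix, row-vector convention (v' = v * M), as XNA uses.
using Mat4 = std::array<float, 16>;

Mat4 Identity() {
    return {1.0f, 0.0f, 0.0f, 0.0f,  0.0f, 1.0f, 0.0f, 0.0f,
            0.0f, 0.0f, 1.0f, 0.0f,  0.0f, 0.0f, 0.0f, 1.0f};
}

Mat4 Translation(float x, float y, float z) {
    Mat4 m = Identity();
    m[12] = x; m[13] = y; m[14] = z;  // translation lives in the last row
    return m;
}

Mat4 Multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};  // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
    return r;
}

// Component-wise lerp: a simplification, not a proper rotation blend.
Mat4 Lerp(const Mat4& a, const Mat4& b, float t) {
    Mat4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] + (b[i] - a[i]) * t;
    return r;
}

constexpr int kNoParent = -1;  // hypothetical marker for the root bone

struct Skeleton {
    std::vector<int>  parent;       // parent bone index per bone
    std::vector<Mat4> inverseBind;  // inverse bind pose per bone
};

// frameA/frameB hold RELATIVE (parent-space) bone matrices for two frames.
// Returns the absolute skinning matrix for one bone, i.e. the value that
// would be written to that bone's slot in FinalAnimationTexture.
Mat4 ProcessBone(const Skeleton& s, const std::vector<Mat4>& frameA,
                 const std::vector<Mat4>& frameB, float t, int bone) {
    assert(bone >= 0 && static_cast<std::size_t>(bone) < s.parent.size());
    Mat4 m = s.inverseBind[bone];
    // Walk CurrentBone -> Parent -> ... -> root, applying each blended local.
    for (int b = bone; b != kNoParent; b = s.parent[b])
        m = Multiply(m, Lerp(frameA[b], frameB[b], t));
    return m;
}
```

Because every bone walks its own parent chain, each output bone is independent of the others, which is what lets a pixel shader process them in parallel.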
The beauty of this technique lies in the “Animation Processor” shader. Because we don’t need a World Matrix at this stage, we can easily process over 200 character animations in a single draw call. Also, nothing says you have to use a single “Animation Processor” shader; you could write any number of shaders for different types of animation, such as Inverse Kinematics. All animation frame data simply needs to be written to FinalAnimationTexture, so perform this step however you please.
Now, I’m far too lazy to draw diagrams, write sample code, and go into extreme detail… Sorry!… Feel free to ask questions if you’re interested, and I’ll do my best to answer! I just wanted to put this idea out there for anyone struggling with a similar problem.
Some useful nuggets of information:
- In order to perfectly sync bones between the CPU and GPU without horribly stalling the GPU, you may have to use the processed animation data from 2-3 frames back (depending on your architecture). This means double/triple/quadruple buffering FinalAnimationTexture and using outdated animation info to ensure, for example, that a weapon stays perfectly in a character’s hand.
- If you store your original AnimationTexture as Dual Quaternions (2 pixels per bone), it’s very easy to blend between frames/animations, and you use less memory. The trick to interpolating dual quaternions is to compare them with a dot product and conditionally negate the second DualQuat to ensure blending moves in the right direction… Interestingly, this step can be pre-computed at build time in the AnimationTexture so that any frame can be blindly lerp()’d with the following frame. *devious*
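The build-time sign fix from the last bullet can be sketched roughly as below, in C++ on the CPU. The type and function names (`DualQuat`, `AlignConsecutiveFrames`, `BlindLerp`) are hypothetical, and the sketch assumes the frames of one bone are stored consecutively and that the first quaternion of each pair carries the rotation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Quat { float x, y, z, w; };
struct DualQuat { Quat real; Quat dual; };  // rotation part, translation part

float Dot(const Quat& a, const Quat& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

Quat Scale(const Quat& q, float s) {
    return {q.x * s, q.y * s, q.z * s, q.w * s};
}

Quat Lerp(const Quat& a, const Quat& b, float t) {
    return {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
            a.z + (b.z - a.z) * t, a.w + (b.w - a.w) * t};
}

// Build-time pass over one bone's frames: negate a frame whenever its
// rotation part points away from the previous (already fixed) frame.
// q and -q encode the same transform, so the animation is unchanged.
void AlignConsecutiveFrames(std::vector<DualQuat>& frames) {
    for (std::size_t i = 1; i < frames.size(); ++i) {
        if (Dot(frames[i - 1].real, frames[i].real) < 0.0f) {
            frames[i].real = Scale(frames[i].real, -1.0f);
            frames[i].dual = Scale(frames[i].dual, -1.0f);
        }
    }
}

// Run-time blend of a frame with the one that follows it; no dot-product
// test is needed once AlignConsecutiveFrames has run. The result is not
// normalized.
DualQuat BlindLerp(const DualQuat& a, const DualQuat& b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    return {Lerp(a.real, b.real, t), Lerp(a.dual, b.dual, t)};
}
```

Note that this only guarantees a safe blend between a frame and its immediate successor. Blending two different animations, or wrapping from the last frame of a loop back to the first, would presumably still need the run-time dot-product check.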




@T
Thank you.
You actually have me second guessing the Dual Quaternion thing… I believe I was using them because they’re easy to lerp, but I switched to them from matrices quite a while back when I was much newer to 3d programming (not that I’m an expert now)… So perhaps their use is unfounded… They are quite expensive to convert back into a matrix, but that only happens once per processed bone…, hrmmmmm….. Now I’m going to have to rethink them.
It sounds like you understand what I’m doing, except I’m traversing the tree in the reverse order, CurrentBone->Parent->Parent->Parent, until I hit the root; this makes it much simpler.
I’m using a shader constant array and a modified “Fullscreen Quad” that’s now more of a “Partial-screen Quad”. The instance is determined by Texel.Y, and the bone by Texel.X.
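For anyone trying to picture the texel layout, a small C++ sketch of the addressing follows. It assumes one texel per bone per render target (the Multiple Render Targets variant), with bones along X and instance rows along Y; the helper names are invented for illustration and none of this is taken from the actual shader.

```cpp
#include <cassert>

struct BoneTarget { int instanceRow; int boneIndex; };
struct Float2 { float x, y; };

// Which (instance, bone) pair a given output texel is responsible for.
BoneTarget TargetForTexel(int texelX, int texelY) {
    return {texelY, texelX};
}

// Texture coordinate of the texel center, for reading a processed bone
// back out of FinalAnimationTexture when rendering a character.
Float2 TexelCenterUV(int boneIndex, int instanceRow, int texWidth, int texHeight) {
    assert(boneIndex >= 0 && boneIndex < texWidth);
    assert(instanceRow >= 0 && instanceRow < texHeight);
    return {(boneIndex + 0.5f) / texWidth, (instanceRow + 0.5f) / texHeight};
}

// Fraction of the render target the "Partial-screen Quad" has to cover so
// that only the active instances and real bones are processed.
Float2 PartialQuadExtent(int boneCount, int activeInstances, int texWidth, int texHeight) {
    assert(boneCount > 0 && boneCount <= texWidth);
    assert(activeInstances > 0 && activeInstances <= texHeight);
    return {static_cast<float>(boneCount) / texWidth,
            static_cast<float>(activeInstances) / texHeight};
}
```

With a 64x256 target, 50 bones and 200 active instances, the quad would cover roughly 78% of the target in each direction, so texels outside that region are never shaded.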
Correction: The DualQuat gets converted to a Matrix once for each recursion through to the Parent… The DualQuats are used to tween/blend animations, then the final matrix is used to adjust for the parent… Since there’s so much math, I decided to try my hand at optimizing the conversion function and managed to eliminate 23 multiplications on top of my old optimizations that cut out 11!… So the function now uses 34 fewer multiplication instructions than the example I based it on. Here’s the crazy-optimized code, maybe someone has a clever way of optimizing it further?
float4x4 DualQuatToMatrix(float2x4 dQ)
{
    float4 Qn = dQ[0]; //Non-dual part (rotation), xyzw with w as the scalar
    float4 Qd = dQ[1]; //Dual part (translation)
    float4x4 M = 0;
    //Qn Squared...
    float4 Qn2 = Qn * Qn;
    M[0][0] = Qn2.w + Qn2.x - Qn2.y - Qn2.z;
    M[1][1] = Qn2.w + Qn2.y - Qn2.x - Qn2.z;
    M[2][2] = Qn2.w + Qn2.z - Qn2.x - Qn2.y;
    //x*y, y*z, z*w, w*x...
    float4 Qn2_1 = Qn * Qn.yzwx;
    M[0][1] = Qn2_1.x + Qn2_1.z;
    M[1][0] = Qn2_1.x - Qn2_1.z;
    M[1][2] = Qn2_1.y + Qn2_1.w;
    M[2][1] = Qn2_1.y - Qn2_1.w;
    //x*z, y*w, z*x, w*y...
    float4 Qn2_2 = Qn * Qn.zwxy;
    M[0][2] = Qn2_2.x - Qn2_2.y;
    M[2][0] = Qn2_2.x + Qn2_2.y;
    //Translation row: Qd * conjugate(Qn), doubled below
    float4 Qdx_X_Qn = Qn * Qd.x;
    float4 Qdy_X_Qn = Qn * Qd.y;
    float4 Qdz_X_Qn = Qn * Qd.z;
    float4 Qdw_X_Qn = Qn * Qd.w;
    M[3][0] = Qdx_X_Qn.w - Qdy_X_Qn.z + Qdz_X_Qn.y - Qdw_X_Qn.x;
    M[3][1] = Qdx_X_Qn.z - Qdz_X_Qn.x + Qdy_X_Qn.w - Qdw_X_Qn.y;
    M[3][2] = Qdy_X_Qn.x + Qdz_X_Qn.w - Qdx_X_Qn.y - Qdw_X_Qn.z;
    //Batch all the various 2x's here to save instructions
    M[0] *= float4(1, 2, 2, 1);
    M[1] *= float4(2, 1, 2, 1);
    M[2] *= float4(2, 2, 1, 1);
    M[3] *= float4(2, 2, 2, 1);
    //Divide by the squared length so the input needn't be normalized;
    //M[3][3] is set to len2 first so that it ends up as 1
    float len2 = dot(Qn, Qn);
    M[3][3] = len2;
    M /= len2;
    return M;
}
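When optimizing a function like this, it helps to have something slow and obvious to compare against. Below is a plain scalar port to C++ that could serve as a CPU reference for readback tests. It is a sketch, not part of the original code: `DualQuatToMatrixRef`, `DualPartFromTranslation` and `NearlyEqual` are names made up here, and it assumes the same conventions as the shader (xyzw quaternions with w as the scalar, row-vector matrices with the translation in row 3, and a dual part equal to half the translation multiplied by the rotation).

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Quat { float x, y, z, w; };                 // w is the scalar part
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][col]

// Dual part for "rotate by qn, then translate by (tx, ty, tz)":
// qd = 0.5 * t * qn, with t treated as a quaternion whose w is 0.
Quat DualPartFromTranslation(const Quat& qn, float tx, float ty, float tz) {
    return {0.5f * ( tx * qn.w + ty * qn.z - tz * qn.y),
            0.5f * (-tx * qn.z + ty * qn.w + tz * qn.x),
            0.5f * ( tx * qn.y - ty * qn.x + tz * qn.w),
            0.5f * (-tx * qn.x - ty * qn.y - tz * qn.z)};
}

// Straightforward, unoptimized equivalent of the shader function.
Mat4 DualQuatToMatrixRef(const Quat& qn, const Quat& qd) {
    const float len2 = qn.x * qn.x + qn.y * qn.y + qn.z * qn.z + qn.w * qn.w;
    assert(len2 > 0.0f);
    const float s = 2.0f / len2;  // the batched 2x and the final divide
    Mat4 m{};                     // zero-initialized
    m[0][0] = (qn.w * qn.w + qn.x * qn.x - qn.y * qn.y - qn.z * qn.z) / len2;
    m[1][1] = (qn.w * qn.w + qn.y * qn.y - qn.x * qn.x - qn.z * qn.z) / len2;
    m[2][2] = (qn.w * qn.w + qn.z * qn.z - qn.x * qn.x - qn.y * qn.y) / len2;
    m[0][1] = s * (qn.x * qn.y + qn.z * qn.w);
    m[1][0] = s * (qn.x * qn.y - qn.z * qn.w);
    m[1][2] = s * (qn.y * qn.z + qn.w * qn.x);
    m[2][1] = s * (qn.y * qn.z - qn.w * qn.x);
    m[0][2] = s * (qn.x * qn.z - qn.y * qn.w);
    m[2][0] = s * (qn.x * qn.z + qn.y * qn.w);
    m[3][0] = s * (qd.x * qn.w - qd.y * qn.z + qd.z * qn.y - qd.w * qn.x);
    m[3][1] = s * (qd.x * qn.z - qd.z * qn.x + qd.y * qn.w - qd.w * qn.y);
    m[3][2] = s * (qd.y * qn.x + qd.z * qn.w - qd.x * qn.y - qd.w * qn.z);
    m[3][3] = 1.0f;
    return m;
}

// Tolerance compare, handy when checking values read back from the GPU.
bool NearlyEqual(float a, float b, float eps = 1e-5f) {
    return std::fabs(a - b) <= eps;
}
```

A deliberately non-unit rotation such as (0, 0, 2, 2), which is a 90 degree turn about Z scaled by a constant, makes a good test input because it also exercises the divide by the squared length.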