From Quaternion to Matrix and Back
February 27th 2005
J.M.P. van Waveren
© 2005, Id Software, Inc.
Abstract
Optimized routines for the conversion between quaternions and matrices are presented.
First the regular C/C++ routines presented in literature are optimized and/or
restructured to make it easier for the compiler to generate optimized assembler code.
Next the best approach to SIMD is determined and the SIMD optimizations are partially
prototyped in regular C/C++ code. Finally the Intel Streaming SIMD Extensions are
used to get the most out of every clock cycle.
1. Introduction
Quaternions are often used in skeletal animation systems for the interpolation between
general rotations. When interpolating between animation key frames quaternions
provide an efficient means to interpolate the general rotations of joints in a skeleton.
However, matrices are more efficient when many points or vertices need to be
transformed, and the joints in a skeleton typically transform many vertices of a
polygonal mesh. As such the desire arises to convert quaternions to matrices.
Sometimes it may also be desired to modify a skeleton using matrices. Therefore it may
also be useful to convert matrices to quaternions.
1.1 Previous Work
The quaternion was first introduced by Will iam Rowan Hamilton (1805 - 1865) as a
successor to complex numbers [1]. Arthur Cayley (1821 - 1895) contributed further by
describing rotations with quaternion multiplication [2]. Ken Shoemake popularized
quaternions in the world of computer graphics [6]. Quaternions have since found their
way into many different systems among which animation, inverse kinematics and
physics.
In skeletal animation systems quaternions are often used to interpolate between joint
orientations specified with key frames or animation curves [7,9,10]. On the other hand
rotation matrices are often used when many points in space need to be transformed like
the vertices of the skin of an animated model. Rotation matrices are typically more
efficient on today's hardware when many positions need to be transformed. Because
both quaternions and rotation matrices are useful and efficient for certain calculations
the desire arises to convert between these representations. These conversions were
introduced by Ken Shoemake [6,7,8] in the context of computer graphics.
1.2 Layout
Section 2 shows some properties of quaternions and rotation matrices. Section 3
describes the conversion from joint quaternions to joint matrices. The conversion from
joint matrices to joint quaternions is presented in section 4. The results of the
optimizations are presented in section 5 and several conclusions are drawn in section 6.
2. Quaternions and Rotation Matrices.
The unit quaternion sphere is equivalent to the space of general rotations. Throughout
this article quaternions will represent general rotations. The four components of a
quaternion are denoted (x, y, z, w) and the quaternion will be represented in code as
follows.
struct Quaternion {
float x, y, z, w;
};
A quaternion (x, y, z, w) which represents a general rotation can be interpreted
geometrically as follows.
x = X · sin( α / 2 )
y = Y · sin( α / 2 )
z = Z · sin( α / 2 )
w = cos( α / 2 )
Here (X, Y, Z) is the unit length axis of rotation in 3D space and α is the angle of
rotation about the axis in radians.
A general rotation can also be defined with a 3x3 orthonormal matrix. Each row and
each column of the matrix is a 3D vector of unit length. The rows of the matrix are
orthogonal to each other and the same goes for the columns.
Quaternions and rotation matrices are often used in skeletal animation systems to
describe the orientation and translation of joints in a skeleton. Joints using a
quaternion for the orientation will be represented in code as follows.
struct Vec4 {
float x, y, z, w;
};
struct JointQuat {
Quaternion q;
Vec4 t;
};
Joints using a rotation matrix for the orientation will be represented in code as follows.
struct JointMat {
float mat[3*4];
};
This is a 3x4 matrix where the first three elements of each row are from a row-major
rotation matrix and the last element of every row is the translation over one axis.
3. Quaternion to Matrix
For the quaternion (x, y, z, w) the corresponding rotation matrix M is defined as follows
[6].
1 - 2y² - 2z² 2xy + 2wz 2xz - 2wy
2xy - 2wz 1 - 2x² - 2z² 2yz + 2wx M =
2xz + 2wy 2yz - 2wx 1 - 2x² - 2y²
By grouping the common products the joint quaternion to joint matrix conversion can be
implemented as follows.
void ConvertJointQuatsToJointMats( JointMat *jointMats, const JointQuat *jointQuats, const int numJoints ) {
for ( int i = 0; i < numJoints; i++ ) {
const float *q = &jointQuats[i].q;
float *m = jointMats[i].mat;
m[0*4+3] = q[4];
m[1*4+3] = q[5];
m[2*4+3] = q[6];
float x2 = q[0] + q[0];
float y2 = q[1] + q[1];
float z2 = q[2] + q[2];
{
float xx2 = q[0] * x2;
float yy2 = q[1] * y2;
float zz2 = q[2] * z2;
m[0*4+0] = 1.0f - yy2 - zz2;
m[1*4+1] = 1.0f - xx2 - zz2;
m[2*4+2] = 1.0f - xx2 - yy2;
}
{
float yz2 = q[1] * z2;
float wx2 = q[3] * x2;
m[2*4+1] = yz2 - wx2;
m[1*4+2] = yz2 + wx2;
}
{
float xy2 = q[0] * y2;
float wz2 = q[3] * z2;
m[1*4+0] = xy2 - wz2;
m[0*4+1] = xy2 + wz2;
}
{
float xz2 = q[0] * z2;
float wy2 = q[3] * y2;
m[0*4+2] = xz2 - wy2;
m[2*4+0] = xz2 + wy2;
}
}
}
The above routine localizes variable dependencies with additional braces to make it
easier for the compiler to produce optimized FPU code.
One thing becomes immediately apparent when examining the above routine. The
number of mathematical operations is minimal compared to the number of data move
operations. Furthermore the way the quaternion components are scattered into a matrix
makes it hard to exploit parallelism through increased throughput. The required swizzle
of the quaternion components and de-swizzle of the calculated matrix elements easily
nullifies any gain from executing four operations at once for the few mathematical
operations used in the conversion.
Instead of exploiting parallelism through increased throughput, parallelism can also be
exploited with a compressed calculation. As it turns out it is not that hard to find
common operations that can be executed in parallel, but it is not trivial to arrange them
in such a way that consecutive operations in the conversion can be executed with SIMD
instructions without requiring excessive swizzling. However, the following prototype can
be constructed which has several advantageous properties.
void ConvertJointQuatsToJointMats( JointMat *jointMats, const JointQuat *jointQuats, const int numJoints ) {
for ( int i = 0; i < numJoints; i++ ) {
const float *q = &jointQuats[i].q;
float *m = jointMats[i].mat;
float x2 = q[0] + q[0];
float y2 = q[1] + q[1];
float z2 = q[2] + q[2];
float w2 = q[3] + q[3];
float yy2 = q[1] * y2;
float xy2 = q[0] * y2;
float xz2 = q[0] * z2;
float yz2 = q[1] * z2;
float zz2 = q[2] * z2;
float wz2 = q[3] * z2;
float wy2 = q[3] * y2;
float wx2 = q[3] * x2;
float xx2 = q[0] * x2;
m[0*4+0] = - yy2 - zz2 + 1.0f;
m[0*4+1] = xy2 + wz2;
m[0*4+2] = xz2 - wy2;
m[0*4+3] = q[4];
m[1*4+0] = xy2 - wz2;
m[1*4+1] = - xx2 - zz2 + 1.0f;
m[1*4+2] = yz2 + wx2;
m[1*4+3] = q[5];
m[2*4+0] = xz2 + wy2;
m[2*4+1] = yz2 - wx2;
m[2*4+2] = - xx2 - yy2 + 1.0f;
m[2*4+3] = q[6];
}
}
The above routine should not be used as a replacement for the former routine because
it is significantly slower when compiled to FPU code. However, the above routine does
provide a good starting point for an SSE optimized version.
The conversion counts 9 multiplications that can be executed with three SSE
instructions. Because of the way the multiplications are arranged in the above routine,
the first row of the matrix can be calculated directly from the first 8 products. The
second row can be calculated by replacing one of the first 8 products with the 9th
product. As such the swizzling required during the conversion is minimized. Because the
elements of the first two rows are calculated by adding and subtracting products, the
sign of some of the products is changed with the 'xorps' instruction which allows a
single 'subps' instruction to be used per row. Only the first three elements of the first
two rows are calculated from the 9 products. Because of the way the products are
arranged the 'subps' instructions used for the first two rows also calculate two elements
for the last row in the fourth elements of the SSE registers. The last diagonal element
is then calculated separately and combined with these fourth elements to form the third
row.
The complete SSE optimized code for the conversion can be found in appendix A. The
code assumes that both the list with joints and the list with matrices are at least 16
byte aligned.
The SSE2 instruction 'pshufd' is used to swizzle the quaternion components before
multiplying them. This instruction is meant to be used for double word integer data.
However, since every 32 bits floating point bit pattern represents a valid integer this
instruction can be used on floating point data without problems. The advantage of using
the 'pshufd' instruction is that the complete contents of one SSE register can be copied
and swizzled into another SSE register.
4. Matrix to Quaternion
Converting a rotation matrix to a quaternion is a bit more challenging. The quaternion
components always appear in pairs in the rotation matrix and some manipulation is
required to extract them. To avoid sign loss only one component of the quaternion is
extracted using the diagonal and divided into cross-diagonal sums. The algorithm avoids
precision loss due to near-zero divides by looking for a component of large magnitude
as divisor, first w, then x, y, or z. When the trace of the matrix (sum of diagonal
elements) is greater than zero, |w| is greater than 1/2, which is as small as the largest
component can be. Otherwise, the largest diagonal element corresponds to the largest
of |x|, |y|, or |z|, one of which must be larger than |w|, and at least 1/2. The following
routine converts JointQuats to JointMats using the quaternion to matrix conversion.
float ReciprocalSqrt( float x ) {
long i;
float y, r;
y = x * 0.5f;
i = *(long *)( &x );
i = 0x5f3759df - ( i >> 1 );
r = *(float *)( &i );
r = r * ( 1.5f - r * r * y );
return r;
}
void ConvertJointMatsToJointQuats( JointQuat *jointQuats, const JointMat *jointMats, const int numJoints ) {
for ( int i = 0; i < numJoints; i++ ) {
float *q = &jointQuats[i].q;
const float *m = jointMats[i].mat;
if ( m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] > 0.0f ) {
float t = + m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] + 1.0f;
float s = ReciprocalSqrt( t ) * 0.5f;
q[3] = s * t;
q[2] = ( m[0 * 4 + 1] - m[1 * 4 + 0] ) * s;
q[1] = ( m[2 * 4 + 0] - m[0 * 4 + 2] ) * s;
q[0] = ( m[1 * 4 + 2] - m[2 * 4 + 1] ) * s;
} else if ( m[0 * 4 + 0] > m[1 * 4 + 1] && m[0 * 4 + 0] > m[2 * 4 + 2] ) {
float t = + m[0 * 4 + 0] - m[1 * 4 + 1] - m[2 * 4 + 2] + 1.0f;
float s = ReciprocalSqrt( t ) * 0.5f;
q[0] = s * t;
q[1] = ( m[0 * 4 + 1] + m[1 * 4 + 0] ) * s;
q[2] = ( m[2 * 4 + 0] + m[0 * 4 + 2] ) * s;
q[3] = ( m[1 * 4 + 2] - m[2 * 4 + 1] ) * s;
} else if ( m[1 * 4 + 1] > m[2 * 4 + 2] ) {
float t = - m[0 * 4 + 0] + m[1 * 4 + 1] - m[2 * 4 + 2] + 1.0f;
float s = ReciprocalSqrt( t ) * 0.5f;
q[1] = s * t;
q[0] = ( m[0 * 4 + 1] + m[1 * 4 + 0] ) * s;
q[3] = ( m[2 * 4 + 0] - m[0 * 4 + 2] ) * s;
q[2] = ( m[1 * 4 + 2] + m[2 * 4 + 1] ) * s;
} else {
float t = - m[0 * 4 + 0] - m[1 * 4 + 1] + m[2 * 4 + 2] + 1.0f;
float s = ReciprocalSqrt( t ) * 0.5f;
q[2] = s * t;
q[3] = ( m[0 * 4 + 1] - m[1 * 4 + 0] ) * s;
q[0] = ( m[2 * 4 + 0] + m[0 * 4 + 2] ) * s;
q[1] = ( m[1 * 4 + 2] + m[2 * 4 + 1] ) * s;
}
q[4] = m[0 * 4 + 3];
q[5] = m[1 * 4 + 3];
q[6] = m[2 * 4 + 3];
q[7] = 0.0f;
}
}
The above routine may appear to be quite different from the commonly used
implementation as presented by Ken Shoemake [6]. However, the above routine just
unrolls the four cases for the different divisors. The routine is typically faster because it
does not use any variable indexing into arrays. The above routine also uses a fast
reciprocal square root approximation [14,15,16].
When examining the above code a key observation can be made. The code for each of
the four cases is almost the same. The only differences are a couple of signs and the
order in which the components of the quaternion are stored. To emphasize these
differences the above routine can be rewritten to the following routine.
void ConvertJointMatsToJointQuats( JointQuat *jointQuats, const JointMat *jointMats, const int numJoints ) {
for ( int i = 0; i < numJoints; i++ ) {
float s0, s1, s2;
int k0, k1, k2, k3;
float *q = &jointQuats[i].q;
const float *m = jointMats[i].mat;
if ( m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] > 0.0f ) {
k0 = 3;
k1 = 2;
k2 = 1;
k3 = 0;
s0 = 1.0f;
s1 = 1.0f;
s2 = 1.0f;
} else if ( m[0 * 4 + 0] > m[1 * 4 + 1] && m[0 * 4 + 0] > m[2 * 4 + 2] ) {
k0 = 0;
k1 = 1;
k2 = 2;
k3 = 3;
s0 = 1.0f;
s1 = -1.0f;
s2 = -1.0f;
} else if ( m[1 * 4 + 1] > m[2 * 4 + 2] ) {
k0 = 1;
k1 = 0;
k2 = 3;
k3 = 2;
s0 = -1.0f;
s1 = 1.0f;
s2 = -1.0f;
} else {
k0 = 2;
k1 = 3;
k2 = 0;
k3 = 1;
s0 = -1.0f;
s1 = -1.0f;
s2 = 1.0f;
}
float t = s0 * m[0 * 4 + 0] + s1 * m[1 * 4 + 1] + s2 * m[2 * 4 + 2] + 1.0f;
float s = ReciprocalSqrt( t ) * 0.5f;
q[k0] = s * t;
q[k1] = ( m[0 * 4 + 1] - s2 * m[1 * 4 + 0] ) * s;
q[k2] = ( m[2 * 4 + 0] - s1 * m[0 * 4 + 2] ) * s;
q[k3] = ( m[1 * 4 + 2] - s0 * m[2 * 4 + 1] ) * s;
q[4] = m[0 * 4 + 3];
q[5] = m[1 * 4 + 3];
q[6] = m[2 * 4 + 3];
q[7] = 0.0f;
}
}
In the above code each case sets 4 indices (k0, k1, k2, k3) and three sign multipliers
(s0, s1, s2). The indices are used to determine the order in which the different
quaternion components are stored and the sign multipliers are used to change the signs
in the calculation. The above routine should not be used as a replacement for the
former routine because it is significantly slower when compiled to FPU code. However,
the above routine does provide a blue print for an SSE optimized version.
The best approach to SIMD for the joint matrix to joint quaternion conversion is to
exploit parallelism through increased throughput. The routine presented here will
operate on four conversion per iteration and the scalar instructions are replaced with
functionally equivalent SSE instructions. This requires a swizzle because the matrices
are stored per joint while some of the individual elements of four matrices need to be
grouped into SSE registers. Furthermore the conditionally executed code for the four
different cases has to be replaced with a single sequence of instructions for all cases.
The initial swizzle loads the diagonal elements of four matrices into three SSE registers.
The swizzle loads one element at a time and shuffles it into one of the SSE registers.
The diagonal elements are stored in the xmm5, xmm6 and xmm7 register. Based on the
diagonal elements the three conditions are evaluated and the results are stored in the
xmm0, xmm2, and xmm4 register as follows:
movaps xmm0, xmm5
addps xmm0, xmm6
addps xmm0, xmm7
cmpnltps xmm0, SIMD_SP_zero // xmm0 = m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] > 0.0f
movaps xmm1, xmm5
movaps xmm2, xmm5
cmpnltps xmm1, xmm6
cmpnltps xmm2, xmm7
andps xmm2, xmm1 // xmm2 = m[0 * 4 + 0] > m[1 * 4 + 1] && m[0 * 4 + 0] > m[2 * 4 + 2]
movaps xmm4, xmm6
cmpnltps xmm4, xmm7 // xmm4 = m[1 * 4 + 1] > m[2 * 4 + 2]
From the three conditions four masks are calculated for the four cases. These masks are
stored in the xmm0, xmm1, xmm2 and xmm3 register. Based on the chosen divisor only
one of these registers will be fi l led with all one bits and the other registers will be all
zeros. The masks are calculated as follows.
movaps xmm1, xmm0
andnps xmm1, xmm2
orps xmm2, xmm0
movaps xmm3, xmm2
andnps xmm2, xmm4
orps xmm3, xmm2
xorps xmm3, SIMD_SP_not
The components of a quaternion are stored in a different order based on the chosen
divisor. The indices k0 through k3 in the C/C++ blue print basically specify a swizzle to
store the components of a quaternion. The correct swizzle corresponding to the chosen
divisor can be selected using the four masks calculated above. The four different
swizzles are stored as 8 bit indices in 16 byte constants as follows.
#define ALIGN4_INIT1( X, I ) __declspec(align(16)) static X[4] = { I, I, I, I }
ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle0, (3<<0)|(2<<8)|(1<<16)|(0<<24) );
ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle1, (0<<0)|(1<<8)|(2<<16)|(3<<24) );
ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle2, (1<<0)|(0<<8)|(3<<16)|(2<<24) );
ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle3, (2<<0)|(3<<8)|(0<<16)|(1<<24) );
One of the swizzles can be selected by using a binary 'and' of each of the above swizzle
constants with one of the four masks and using a binary 'or' on the results. The
following SSE code selects one of the swizzles for each of the four conversions and
stores the result in a local byte array called 'shuffle'.
ALIGN16( byte shuffle[16]; )
andps xmm0, SIMD_DW_mat2quatShuffle0
movaps xmm4, xmm1
andps xmm4, SIMD_DW_mat2quatShuffle1
orps xmm0, xmm4
movaps xmm4, xmm2
andps xmm4, SIMD_DW_mat2quatShuffle2
orps xmm0, xmm4
movaps xmm4, xmm3
andps xmm4, SIMD_DW_mat2quatShuffle3
orps xmm4, xmm0
movaps shuffle, xmm4
Next to the swizzle the three signs for each of the four cases need to be calculated as
well. The following SSE code calculates sign bits from the four masks for the four
conversions and stores them in the xmm0, xmm1 and xmm2 register.
ALIGN4_INIT1( unsigned long SIMD_SP_signBit, IEEE_S
本文档为【四元数简介】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。