四元数简介

四元数简介 From Quaternion to Matrix and Back February 27th 2005 J.M.P. van Waveren © 2005, Id Software, Inc. Abstract Optimized routines for the conversion between quaternions and matrices are presented. First the regular C/C++ routines presented in litera...

From Quaternion to Matrix and Back February 27th 2005 J.M.P. van Waveren © 2005, Id Software, Inc. Abstract Optimized routines for the conversion between quaternions and matrices are presented. First the regular C/C++ routines presented in literature are optimized and/or restructured to make it easier for the compiler to generate optimized assembler code. Next the best approach to SIMD is determined and the SIMD optimizations are partially prototyped in regular C/C++ code. Finally the Intel Streaming SIMD Extensions are used to get the most out of every clock cycle. 1. Introduction Quaternions are often used in skeletal animation systems for the interpolation between general rotations. When interpolating between animation key frames quaternions provide an efficient means to interpolate the general rotations of joints in a skeleton. However, matrices are more efficient when many points or vertices need to be transformed, and the joints in a skeleton typically transform many vertices of a polygonal mesh. As such the desire arises to convert quaternions to matrices. Sometimes it may also be desired to modify a skeleton using matrices. Therefore it may also be useful to convert matrices to quaternions. 1.1 Previous Work The quaternion was first introduced by Will iam Rowan Hamilton (1805 - 1865) as a successor to complex numbers [1]. Arthur Cayley (1821 - 1895) contributed further by describing rotations with quaternion multiplication [2]. Ken Shoemake popularized quaternions in the world of computer graphics [6]. Quaternions have since found their way into many different systems among which animation, inverse kinematics and physics. In skeletal animation systems quaternions are often used to interpolate between joint orientations specified with key frames or animation curves [7,9,10]. On the other hand rotation matrices are often used when many points in space need to be transformed like the vertices of the skin of an animated model. Rotation matrices are typically more efficient on today's hardware when many positions need to be transformed. Because both quaternions and rotation matrices are useful and efficient for certain calculations the desire arises to convert between these representations. These conversions were introduced by Ken Shoemake [6,7,8] in the context of computer graphics. 1.2 Layout Section 2 shows some properties of quaternions and rotation matrices. Section 3 describes the conversion from joint quaternions to joint matrices. The conversion from joint matrices to joint quaternions is presented in section 4. The results of the optimizations are presented in section 5 and several conclusions are drawn in section 6. 2. Quaternions and Rotation Matrices. The unit quaternion sphere is equivalent to the space of general rotations. Throughout this article quaternions will represent general rotations. The four components of a quaternion are denoted (x, y, z, w) and the quaternion will be represented in code as follows. struct Quaternion { float x, y, z, w; }; A quaternion (x, y, z, w) which represents a general rotation can be interpreted geometrically as follows. x = X · sin( α / 2 ) y = Y · sin( α / 2 ) z = Z · sin( α / 2 ) w = cos( α / 2 ) Here (X, Y, Z) is the unit length axis of rotation in 3D space and α is the angle of rotation about the axis in radians. A general rotation can also be defined with a 3x3 orthonormal matrix. Each row and each column of the matrix is a 3D vector of unit length. The rows of the matrix are orthogonal to each other and the same goes for the columns. Quaternions and rotation matrices are often used in skeletal animation systems to describe the orientation and translation of joints in a skeleton. Joints using a quaternion for the orientation will be represented in code as follows. struct Vec4 { float x, y, z, w; }; struct JointQuat { Quaternion q; Vec4 t; }; Joints using a rotation matrix for the orientation will be represented in code as follows. struct JointMat { float mat[3*4]; }; This is a 3x4 matrix where the first three elements of each row are from a row-major rotation matrix and the last element of every row is the translation over one axis. 3. Quaternion to Matrix For the quaternion (x, y, z, w) the corresponding rotation matrix M is defined as follows [6]. 1 - 2y² - 2z² 2xy + 2wz 2xz - 2wy 2xy - 2wz 1 - 2x² - 2z² 2yz + 2wx M = 2xz + 2wy 2yz - 2wx 1 - 2x² - 2y² By grouping the common products the joint quaternion to joint matrix conversion can be implemented as follows. void ConvertJointQuatsToJointMats( JointMat *jointMats, const JointQuat *jointQuats, const int numJoints ) { for ( int i = 0; i < numJoints; i++ ) { const float *q = &jointQuats[i].q; float *m = jointMats[i].mat; m[0*4+3] = q[4]; m[1*4+3] = q[5]; m[2*4+3] = q[6]; float x2 = q[0] + q[0]; float y2 = q[1] + q[1]; float z2 = q[2] + q[2]; { float xx2 = q[0] * x2; float yy2 = q[1] * y2; float zz2 = q[2] * z2; m[0*4+0] = 1.0f - yy2 - zz2; m[1*4+1] = 1.0f - xx2 - zz2; m[2*4+2] = 1.0f - xx2 - yy2; } { float yz2 = q[1] * z2; float wx2 = q[3] * x2; m[2*4+1] = yz2 - wx2; m[1*4+2] = yz2 + wx2; } { float xy2 = q[0] * y2; float wz2 = q[3] * z2; m[1*4+0] = xy2 - wz2; m[0*4+1] = xy2 + wz2; } { float xz2 = q[0] * z2; float wy2 = q[3] * y2; m[0*4+2] = xz2 - wy2; m[2*4+0] = xz2 + wy2; } } } The above routine localizes variable dependencies with additional braces to make it easier for the compiler to produce optimized FPU code. One thing becomes immediately apparent when examining the above routine. The number of mathematical operations is minimal compared to the number of data move operations. Furthermore the way the quaternion components are scattered into a matrix makes it hard to exploit parallelism through increased throughput. The required swizzle of the quaternion components and de-swizzle of the calculated matrix elements easily nullifies any gain from executing four operations at once for the few mathematical operations used in the conversion. Instead of exploiting parallelism through increased throughput, parallelism can also be exploited with a compressed calculation. As it turns out it is not that hard to find common operations that can be executed in parallel, but it is not trivial to arrange them in such a way that consecutive operations in the conversion can be executed with SIMD instructions without requiring excessive swizzling. However, the following prototype can be constructed which has several advantageous properties. void ConvertJointQuatsToJointMats( JointMat *jointMats, const JointQuat *jointQuats, const int numJoints ) { for ( int i = 0; i < numJoints; i++ ) { const float *q = &jointQuats[i].q; float *m = jointMats[i].mat; float x2 = q[0] + q[0]; float y2 = q[1] + q[1]; float z2 = q[2] + q[2]; float w2 = q[3] + q[3]; float yy2 = q[1] * y2; float xy2 = q[0] * y2; float xz2 = q[0] * z2; float yz2 = q[1] * z2; float zz2 = q[2] * z2; float wz2 = q[3] * z2; float wy2 = q[3] * y2; float wx2 = q[3] * x2; float xx2 = q[0] * x2; m[0*4+0] = - yy2 - zz2 + 1.0f; m[0*4+1] = xy2 + wz2; m[0*4+2] = xz2 - wy2; m[0*4+3] = q[4]; m[1*4+0] = xy2 - wz2; m[1*4+1] = - xx2 - zz2 + 1.0f; m[1*4+2] = yz2 + wx2; m[1*4+3] = q[5]; m[2*4+0] = xz2 + wy2; m[2*4+1] = yz2 - wx2; m[2*4+2] = - xx2 - yy2 + 1.0f; m[2*4+3] = q[6]; } } The above routine should not be used as a replacement for the former routine because it is significantly slower when compiled to FPU code. However, the above routine does provide a good starting point for an SSE optimized version. The conversion counts 9 multiplications that can be executed with three SSE instructions. Because of the way the multiplications are arranged in the above routine, the first row of the matrix can be calculated directly from the first 8 products. The second row can be calculated by replacing one of the first 8 products with the 9th product. As such the swizzling required during the conversion is minimized. Because the elements of the first two rows are calculated by adding and subtracting products, the sign of some of the products is changed with the 'xorps' instruction which allows a single 'subps' instruction to be used per row. Only the first three elements of the first two rows are calculated from the 9 products. Because of the way the products are arranged the 'subps' instructions used for the first two rows also calculate two elements for the last row in the fourth elements of the SSE registers. The last diagonal element is then calculated separately and combined with these fourth elements to form the third row. The complete SSE optimized code for the conversion can be found in appendix A. The code assumes that both the list with joints and the list with matrices are at least 16 byte aligned. The SSE2 instruction 'pshufd' is used to swizzle the quaternion components before multiplying them. This instruction is meant to be used for double word integer data. However, since every 32 bits floating point bit pattern represents a valid integer this instruction can be used on floating point data without problems. The advantage of using the 'pshufd' instruction is that the complete contents of one SSE register can be copied and swizzled into another SSE register. 4. Matrix to Quaternion Converting a rotation matrix to a quaternion is a bit more challenging. The quaternion components always appear in pairs in the rotation matrix and some manipulation is required to extract them. To avoid sign loss only one component of the quaternion is extracted using the diagonal and divided into cross-diagonal sums. The algorithm avoids precision loss due to near-zero divides by looking for a component of large magnitude as divisor, first w, then x, y, or z. When the trace of the matrix (sum of diagonal elements) is greater than zero, |w| is greater than 1/2, which is as small as the largest component can be. Otherwise, the largest diagonal element corresponds to the largest of |x|, |y|, or |z|, one of which must be larger than |w|, and at least 1/2. The following routine converts JointQuats to JointMats using the quaternion to matrix conversion. float ReciprocalSqrt( float x ) { long i; float y, r; y = x * 0.5f; i = *(long *)( &x ); i = 0x5f3759df - ( i >> 1 ); r = *(float *)( &i ); r = r * ( 1.5f - r * r * y ); return r; } void ConvertJointMatsToJointQuats( JointQuat *jointQuats, const JointMat *jointMats, const int numJoints ) { for ( int i = 0; i < numJoints; i++ ) { float *q = &jointQuats[i].q; const float *m = jointMats[i].mat; if ( m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] > 0.0f ) { float t = + m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] + 1.0f; float s = ReciprocalSqrt( t ) * 0.5f; q[3] = s * t; q[2] = ( m[0 * 4 + 1] - m[1 * 4 + 0] ) * s; q[1] = ( m[2 * 4 + 0] - m[0 * 4 + 2] ) * s; q[0] = ( m[1 * 4 + 2] - m[2 * 4 + 1] ) * s; } else if ( m[0 * 4 + 0] > m[1 * 4 + 1] && m[0 * 4 + 0] > m[2 * 4 + 2] ) { float t = + m[0 * 4 + 0] - m[1 * 4 + 1] - m[2 * 4 + 2] + 1.0f; float s = ReciprocalSqrt( t ) * 0.5f; q[0] = s * t; q[1] = ( m[0 * 4 + 1] + m[1 * 4 + 0] ) * s; q[2] = ( m[2 * 4 + 0] + m[0 * 4 + 2] ) * s; q[3] = ( m[1 * 4 + 2] - m[2 * 4 + 1] ) * s; } else if ( m[1 * 4 + 1] > m[2 * 4 + 2] ) { float t = - m[0 * 4 + 0] + m[1 * 4 + 1] - m[2 * 4 + 2] + 1.0f; float s = ReciprocalSqrt( t ) * 0.5f; q[1] = s * t; q[0] = ( m[0 * 4 + 1] + m[1 * 4 + 0] ) * s; q[3] = ( m[2 * 4 + 0] - m[0 * 4 + 2] ) * s; q[2] = ( m[1 * 4 + 2] + m[2 * 4 + 1] ) * s; } else { float t = - m[0 * 4 + 0] - m[1 * 4 + 1] + m[2 * 4 + 2] + 1.0f; float s = ReciprocalSqrt( t ) * 0.5f; q[2] = s * t; q[3] = ( m[0 * 4 + 1] - m[1 * 4 + 0] ) * s; q[0] = ( m[2 * 4 + 0] + m[0 * 4 + 2] ) * s; q[1] = ( m[1 * 4 + 2] + m[2 * 4 + 1] ) * s; } q[4] = m[0 * 4 + 3]; q[5] = m[1 * 4 + 3]; q[6] = m[2 * 4 + 3]; q[7] = 0.0f; } } The above routine may appear to be quite different from the commonly used implementation as presented by Ken Shoemake [6]. However, the above routine just unrolls the four cases for the different divisors. The routine is typically faster because it does not use any variable indexing into arrays. The above routine also uses a fast reciprocal square root approximation [14,15,16]. When examining the above code a key observation can be made. The code for each of the four cases is almost the same. The only differences are a couple of signs and the order in which the components of the quaternion are stored. To emphasize these differences the above routine can be rewritten to the following routine. void ConvertJointMatsToJointQuats( JointQuat *jointQuats, const JointMat *jointMats, const int numJoints ) { for ( int i = 0; i < numJoints; i++ ) { float s0, s1, s2; int k0, k1, k2, k3; float *q = &jointQuats[i].q; const float *m = jointMats[i].mat; if ( m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] > 0.0f ) { k0 = 3; k1 = 2; k2 = 1; k3 = 0; s0 = 1.0f; s1 = 1.0f; s2 = 1.0f; } else if ( m[0 * 4 + 0] > m[1 * 4 + 1] && m[0 * 4 + 0] > m[2 * 4 + 2] ) { k0 = 0; k1 = 1; k2 = 2; k3 = 3; s0 = 1.0f; s1 = -1.0f; s2 = -1.0f; } else if ( m[1 * 4 + 1] > m[2 * 4 + 2] ) { k0 = 1; k1 = 0; k2 = 3; k3 = 2; s0 = -1.0f; s1 = 1.0f; s2 = -1.0f; } else { k0 = 2; k1 = 3; k2 = 0; k3 = 1; s0 = -1.0f; s1 = -1.0f; s2 = 1.0f; } float t = s0 * m[0 * 4 + 0] + s1 * m[1 * 4 + 1] + s2 * m[2 * 4 + 2] + 1.0f; float s = ReciprocalSqrt( t ) * 0.5f; q[k0] = s * t; q[k1] = ( m[0 * 4 + 1] - s2 * m[1 * 4 + 0] ) * s; q[k2] = ( m[2 * 4 + 0] - s1 * m[0 * 4 + 2] ) * s; q[k3] = ( m[1 * 4 + 2] - s0 * m[2 * 4 + 1] ) * s; q[4] = m[0 * 4 + 3]; q[5] = m[1 * 4 + 3]; q[6] = m[2 * 4 + 3]; q[7] = 0.0f; } } In the above code each case sets 4 indices (k0, k1, k2, k3) and three sign multipliers (s0, s1, s2). The indices are used to determine the order in which the different quaternion components are stored and the sign multipliers are used to change the signs in the calculation. The above routine should not be used as a replacement for the former routine because it is significantly slower when compiled to FPU code. However, the above routine does provide a blue print for an SSE optimized version. The best approach to SIMD for the joint matrix to joint quaternion conversion is to exploit parallelism through increased throughput. The routine presented here will operate on four conversion per iteration and the scalar instructions are replaced with functionally equivalent SSE instructions. This requires a swizzle because the matrices are stored per joint while some of the individual elements of four matrices need to be grouped into SSE registers. Furthermore the conditionally executed code for the four different cases has to be replaced with a single sequence of instructions for all cases. The initial swizzle loads the diagonal elements of four matrices into three SSE registers. The swizzle loads one element at a time and shuffles it into one of the SSE registers. The diagonal elements are stored in the xmm5, xmm6 and xmm7 register. Based on the diagonal elements the three conditions are evaluated and the results are stored in the xmm0, xmm2, and xmm4 register as follows: movaps xmm0, xmm5 addps xmm0, xmm6 addps xmm0, xmm7 cmpnltps xmm0, SIMD_SP_zero // xmm0 = m[0 * 4 + 0] + m[1 * 4 + 1] + m[2 * 4 + 2] > 0.0f movaps xmm1, xmm5 movaps xmm2, xmm5 cmpnltps xmm1, xmm6 cmpnltps xmm2, xmm7 andps xmm2, xmm1 // xmm2 = m[0 * 4 + 0] > m[1 * 4 + 1] && m[0 * 4 + 0] > m[2 * 4 + 2] movaps xmm4, xmm6 cmpnltps xmm4, xmm7 // xmm4 = m[1 * 4 + 1] > m[2 * 4 + 2] From the three conditions four masks are calculated for the four cases. These masks are stored in the xmm0, xmm1, xmm2 and xmm3 register. Based on the chosen divisor only one of these registers will be fi l led with all one bits and the other registers will be all zeros. The masks are calculated as follows. movaps xmm1, xmm0 andnps xmm1, xmm2 orps xmm2, xmm0 movaps xmm3, xmm2 andnps xmm2, xmm4 orps xmm3, xmm2 xorps xmm3, SIMD_SP_not The components of a quaternion are stored in a different order based on the chosen divisor. The indices k0 through k3 in the C/C++ blue print basically specify a swizzle to store the components of a quaternion. The correct swizzle corresponding to the chosen divisor can be selected using the four masks calculated above. The four different swizzles are stored as 8 bit indices in 16 byte constants as follows. #define ALIGN4_INIT1( X, I ) __declspec(align(16)) static X[4] = { I, I, I, I } ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle0, (3<<0)|(2<<8)|(1<<16)|(0<<24) ); ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle1, (0<<0)|(1<<8)|(2<<16)|(3<<24) ); ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle2, (1<<0)|(0<<8)|(3<<16)|(2<<24) ); ALIGN4_INIT1( unsigned long SIMD_DW_mat2quatShuffle3, (2<<0)|(3<<8)|(0<<16)|(1<<24) ); One of the swizzles can be selected by using a binary 'and' of each of the above swizzle constants with one of the four masks and using a binary 'or' on the results. The following SSE code selects one of the swizzles for each of the four conversions and stores the result in a local byte array called 'shuffle'. ALIGN16( byte shuffle[16]; ) andps xmm0, SIMD_DW_mat2quatShuffle0 movaps xmm4, xmm1 andps xmm4, SIMD_DW_mat2quatShuffle1 orps xmm0, xmm4 movaps xmm4, xmm2 andps xmm4, SIMD_DW_mat2quatShuffle2 orps xmm0, xmm4 movaps xmm4, xmm3 andps xmm4, SIMD_DW_mat2quatShuffle3 orps xmm4, xmm0 movaps shuffle, xmm4 Next to the swizzle the three signs for each of the four cases need to be calculated as well. The following SSE code calculates sign bits from the four masks for the four conversions and stores them in the xmm0, xmm1 and xmm2 register. ALIGN4_INIT1( unsigned long SIMD_SP_signBit, IEEE_S

                    本文档为【四元数简介】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

四元数简介

你可能还喜欢