Microphone Array 123


IEEE SIGNAL PROCESSING MAGAZINE [127] NOVEMBER 2012
Digital Object Identifier 10.1109/MSP.2012.2205285
Date of publication: 15 October 2012
1053-5888/12/$31.00 ©2012 IEEE

FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION
[Kenichi Kumatani, John McDonough, and Bhiksha Raj]
[From close-talking microphones to far-field sensors]

Distant speech recognition (DSR) holds the promise of the most natural human-computer interface because it enables man-machine interactions through speech, without the necessity of donning intrusive body- or head-mounted microphones. Recognizing distant speech robustly, however, remains a challenge. This contribution provides a tutorial overview of DSR systems based on microphone arrays. In particular, we present recent work on acoustic beamforming for DSR, along with experimental results verifying the effectiveness of the various algorithms described here; beginning from a word error rate (WER) of 14.3% with a single microphone of a linear array, our state-of-the-art DSR system achieved a WER of 5.3%, which was comparable to that of 4.2% obtained with a lapel microphone. Moreover, we present an emerging technology in the area of far-field audio and speech processing based on spherical microphone arrays. Performance comparisons of spherical and linear arrays reveal that a spherical array with a diameter of 8.4 cm can provide recognition accuracy comparable to or better than that obtained with a large linear array with an aperture length of 126 cm.

INTRODUCTION

When the signals from the individual sensors of a microphone array with a known geometry are suitably combined, the array functions as a spatial filter capable of suppressing noise, reverberation, and competing speech. Such beamforming techniques have received a great deal of attention within the acoustic array processing community in the recent past [1]–[7].
Despite this effort, however, such techniques have often been ignored within the mainstream community working on DSR. As pointed out in [6] and [7], this could be due to the fact that the disparate research communities for acoustic array processing and automatic speech recognition (ASR) have failed to adopt each other’s best practices. For instance, the array processing community tends to ignore speaker adaptation techniques, which can compensate for mismatches between acoustic conditions during training and testing. Moreover, this community has largely preferred to work on controlled, synthetic recordings, obtained by convolving noise- and reverberation-free speech with measured, static room impulse responses, with subsequent artificial addition of noise, as in the recent Pattern Analysis, Statistical Modeling, and Computational Learning (PASCAL) Computational Hearing in Multisource Environments (CHiME) Speech Separation Challenge [8]–[11]. A notable exception was the PASCAL Speech Separation Challenge 2 [5], [12], which featured actual array recordings of real speakers; this task, however, has fallen out of favor, to the extent that it is currently not even mentioned on the PASCAL CHiME Challenge Web site, nor in any of the concomitant publications. This is unfortunate because improvements obtained with novel speech enhancement techniques tend to diminish, or even disappear, after speaker adaptation; similarly, techniques that work well on artificially convolved data with artificially added noise tend to fail on data captured in real acoustic environments with real human speakers. Mainstream speech recognition researchers, on the other hand, are often unaware of advanced signal and array processing techniques. They are equally unaware of the dramatic reductions in error rate that such techniques can provide in DSR tasks.
The primary goal of this contribution is to provide a tutorial in the application of acoustic array processing to DSR that is intelligible to anyone with a general signal processing background, while still maintaining the interest of experts in the field. Our secondary goal is to bridge the gaps between the current acoustic array processing and speech recognition communities. A third and overarching goal is to provide a concise report on the state of the art in DSR. Toward this end, we present two empirical studies. The first is a comparison of several beamforming algorithms for their effectiveness in a DSR task with real speakers in a real acoustic environment; this comparison is conducted with a conventional linear array. The second is a performance comparison between a conventional linear array and a much more compact spherical array. The latter is gaining importance as the emphasis in acoustic array processing moves from large static fixtures to smaller mobile devices such as robots.

OVERVIEW OF DSR

Figure 1 shows a block diagram of a DSR system with a microphone array. The microphone array module typically consists of a speaker tracker, beamformer (BF), and postfilter. The speaker tracker estimates a speaker’s position. Given that position estimate, the BF emphasizes sound waves coming from the direction of interest or “look direction.” The beamformed signal can be further enhanced with postfiltering. The final output is then fed into a speech recognizer. We note that this framework can readily incorporate other information sources such as a mouth locator based on video data [13].

FUNDAMENTAL ISSUES IN MICROPHONE ARRAY PROCESSING

As shown in Figure 2, the array processing components of a DSR system are prone to several errors. First, there are errors in speaker tracking that cause the beam to be “steered” in the wrong direction [14]; such errors can in turn cause signal cancellation.
Second, the individual microphones in the array can have different amplitude and phase responses even if they are of the same type [15, Sec. 5.5]. Finally, the placement of the sensors can deviate from their nominal positions. All of these factors degrade beamforming performance.

[FIG1] Block diagram of a typical DSR system. Blocks shown: microphone array, multichannel data, speaker tracker, beamformer, postfilter, speech recognizer.

SPEAKER TRACKING

The speaker tracking problem is generally distinguished from the speaker localization problem. Speaker localization methods estimate a speaker’s position at a single instant in time without relying on past information. On the other hand, speaker tracking algorithms consider a trajectory of instantaneous position estimates.

Speaker localization techniques can be categorized into three approaches: seeking a position that provides the maximum steered response power (SRP) of a BF [16, Sec. 8.2.1]; localizing a source based on the application of high-resolution spectral estimation techniques such as subspace algorithms [17, Sec. 9.3]; and estimating sources’ positions from time delays of arrival (TDOA) at the microphones. Due to computational efficiency as well as robustness against mismatches of signal models and microphone errors, TDOA-based speaker localization approaches are perhaps the most popular in DSR. Here, we briefly introduce speaker tracking methods based on the TDOA.

Shown in Figure 3(a) is a sound wave propagating from a point $\mathbf{x}$ to each microphone located at $\mathbf{m}_s$ for all $s = 0, 1, \ldots, S-1$, where $S$ is the total number of sensors. Assuming that the position of each microphone is specified in Cartesian coordinates, denote the distance between the point source and each microphone as $D_s \triangleq \|\mathbf{x} - \mathbf{m}_s\|$ for all $s = 0, \ldots, S-1$.
Then, the TDOA between microphones $m$ and $n$ can be expressed as

$$\tau_{m,n}(\mathbf{x}) \triangleq \frac{D_m - D_n}{c}, \tag{1}$$

where $c$ is the speed of sound. Notice that (1) implies that the wavefront, the surface formed by the locus of all points with the same phase, is spherical. When the array is located far from the speaker, the wavefront can be assumed to be planar; this is called the far-field assumption. Figure 3(b) illustrates a plane wave propagating from the far field to the microphones. Under the far-field assumption, the TDOA becomes a function of the angle $\theta$ between the direction of arrival (DOA) and the line connecting the two sensors’ positions, and (1) simplifies to

$$\tau_{m,n}(\theta) \triangleq \frac{d_{m,n}\cos\theta}{c}, \tag{2}$$

where $d_{m,n}$ is the distance between microphones $m$ and $n$.

Various techniques have been developed for estimation of the TDOAs. A comprehensive overview of those algorithms is provided by [18], and comparative studies on real data can be found in [19]. From the TDOAs between the microphone pairs, the speaker’s position can be computed using classical methods, namely, spherical intersection, spherical interpolation, or linear intersection [2, Sec. 10.1]. These methods can readily be extended to track a moving speaker by applying a Kalman filter (KF) to smooth the time series of the instantaneous estimates, as in [16, Sec. 10]. Klee et al. [20] demonstrated, however, that instead of smoothing a series of instantaneous position estimates, better tracking could be performed by simply using the TDOAs as a sequence of observations for an extended KF (EKF) and estimating the speaker’s position directly from the standard EKF state estimate update formulae. Klee’s algorithm was extended to incorporate video features in [21] and to track multiple simultaneous speakers in [22].

CONVENTIONAL BEAMFORMING TECHNIQUES

In the case of the spherical wavefront depicted in Figure 3(a), let us define the propagation delay as $\tau_s \triangleq D_s / c$.
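
As a rough numerical check of the propagation delay $\tau_s = D_s/c$ and of the TDOA expressions (1) and (2), the following Python sketch compares the spherical-wave TDOA with its far-field approximation for a distant source. The function names, the two-microphone geometry, and the speed of sound of 343 m/s are illustrative assumptions, not values taken from the article.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s (not specified in the article)


def propagation_delays(source, mics, c=C_SOUND):
    """Propagation delays tau_s = D_s / c with D_s = ||x - m_s||."""
    return np.linalg.norm(mics - source, axis=1) / c


def tdoa_near_field(source, mics, m, n, c=C_SOUND):
    """Eq. (1): TDOA between microphones m and n for a point source."""
    tau = propagation_delays(source, mics, c)
    return tau[m] - tau[n]


def tdoa_far_field(theta, d_mn, c=C_SOUND):
    """Eq. (2): far-field TDOA; theta is the angle between the DOA and the sensor axis."""
    return d_mn * np.cos(theta) / c


# Hypothetical geometry: two microphones 10 cm apart on the x axis,
# source 1 km away at 60 degrees from the axis (effectively far field).
mics = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
theta = np.pi / 3
source = 1000.0 * np.array([np.cos(theta), np.sin(theta), 0.0])

tau_spherical = tdoa_near_field(source, mics, 0, 1)
tau_planar = tdoa_far_field(theta, 0.1)
print(tau_spherical, tau_planar)
```

With the source this far away, the two delays differ by an amount on the order of 1e-8 s; moving the hypothetical source closer to the array makes the plane-wave approximation correspondingly worse.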
In the far-field case shown in Figure 3(b), let us define the wavenumber $\mathbf{k}$ as a vector perpendicular to the planar wavefront, pointing in the direction of propagation, with magnitude $|\mathbf{k}| = \omega/c = 2\pi/\lambda$. Then, the propagation delay with respect to the origin of the coordinate system for microphone $s$ is determined through $\omega\tau_s = \mathbf{k}^T \mathbf{m}_s$.

[FIG2] Representative errors in microphone array processing. Labels shown: target sound source, direction of arrival, localization error, steering error, microphone errors, phase error, amplitude error, microphone position error.

[FIG3] Propagation of (a) the spherical wave and (b) plane wave. Panel (a) shows the target sound source $\mathbf{x}$, the spherical wavefront, the microphones $\mathbf{m}_0, \ldots, \mathbf{m}_{S-1}$, and the distances $D_0, \ldots, D_{S-1}$; panel (b) shows the direction of arrival, the planar wavefront, the wavenumber $\mathbf{k}$, the angle $\theta$, the spacing $d_{s,s+1}$, and the path difference $d_{s,s+1}\cos\theta$.

The simplest model of wave propagation assumes that a signal $f(t)$, carried on a plane wave, reaches all sensors in an array, but not at the same time. Hence, let us form the vector

$$\mathbf{f}(t) = \begin{bmatrix} f(t-\tau_0) & f(t-\tau_1) & \cdots & f(t-\tau_{S-1}) \end{bmatrix}^T$$

of the time-delayed signals reaching each sensor. In the frequency domain, the comparable vector of phase-delayed signals is $\mathbf{F}(\omega) = F(\omega)\,\mathbf{v}(\mathbf{k},\omega)$, where $F(\omega)$ is the transform of $f(t)$ and

$$\mathbf{v}(\mathbf{k},\omega) \triangleq \begin{bmatrix} e^{-i\omega\tau_0} & e^{-i\omega\tau_1} & \cdots & e^{-i\omega\tau_{S-1}} \end{bmatrix}^T \tag{3}$$

is the array manifold vector and $i = \sqrt{-1}$. The latter is manifestly a vector of phase delays for a plane wave with wavenumber $\mathbf{k}$. To a first order, the array manifold vector is a complete description of the interaction of a propagating wave and an array of sensors.
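
To make the array manifold vector (3) concrete, here is a minimal Python sketch that builds $\mathbf{v}(\mathbf{k},\omega)$ from the relation $\omega\tau_s = \mathbf{k}^T\mathbf{m}_s$ and forms a frequency-domain snapshot $F(\omega)\,\mathbf{v}(\mathbf{k},\omega)$. The four-microphone geometry, the 1 kHz frequency, the speed of sound, and the function name are assumptions made for illustration only.

```python
import numpy as np


def manifold_vector(mic_positions, wavenumber):
    """Eq. (3): entry s is exp(-i * k^T m_s), using omega * tau_s = k^T m_s."""
    phases = mic_positions @ wavenumber
    return np.exp(-1j * phases)


# Hypothetical setup: 4 microphones spaced 5 cm apart along the x axis.
c = 343.0                      # assumed speed of sound in m/s
omega = 2 * np.pi * 1000.0     # angular frequency for an assumed 1 kHz component
mics = np.column_stack([0.05 * np.arange(4), np.zeros(4), np.zeros(4)])

# Plane wave propagating along +y: broadside to the array, so all delays vanish.
k_broadside = (omega / c) * np.array([0.0, 1.0, 0.0])
v_broadside = manifold_vector(mics, k_broadside)

# Plane wave propagating along +x: endfire, so the phase grows along the array.
k_endfire = (omega / c) * np.array([1.0, 0.0, 0.0])
v_endfire = manifold_vector(mics, k_endfire)

# Frequency-domain snapshot for an arbitrary source spectrum value F(omega).
F_omega = 0.7 - 0.2j
snapshot = F_omega * v_endfire
print(np.round(v_broadside, 3), np.round(np.angle(v_endfire), 3))
```

Every entry of the manifold vector has unit magnitude; only the phases carry the geometry, which is why $\mathbf{v}^H\mathbf{v} = S$ for any wavenumber.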
If $\mathbf{X}(\omega)$ denotes the vector of frequency-domain signals for all sensors, the so-called snapshot vector, and $Y(\omega)$ the frequency-domain output of the array, then the operation of a BF can be represented as

$$Y(\omega) = \mathbf{w}^H(\omega)\,\mathbf{X}(\omega), \tag{4}$$

where $\mathbf{w}(\omega)$ is a vector of frequency-dependent sensor weights. The differences between various BF designs are completely determined by the specification of the weight vector $\mathbf{w}(\omega)$. The simplest beamforming algorithm, the delay-and-sum (DS) BF, time aligns the signals for a plane wave arriving from the look direction by setting

$$\mathbf{w}_{\mathrm{DS}} \triangleq \mathbf{v}(\mathbf{k},\omega)/S. \tag{5}$$

Substituting $\mathbf{X}(\omega) = \mathbf{F}(\omega) = F(\omega)\,\mathbf{v}(\mathbf{k},\omega)$ into (4) provides

$$Y(\omega) = F(\omega)\,\mathbf{w}_{\mathrm{DS}}^H(\omega)\,\mathbf{v}(\mathbf{k},\omega) = F(\omega);$$

i.e., the output of the array is equivalent to the original signal in the absence of any interference or distortion. In general, this will be true for any weight vector achieving

$$\mathbf{w}^H(\omega)\,\mathbf{v}(\mathbf{k},\omega) = 1. \tag{6}$$

Hereafter we will say that any weight vector $\mathbf{w}(\omega)$ achieving (6) satisfies the distortionless constraint, which implies that any wave impinging from the look direction is neither amplified nor attenuated.

Figure 4(a) shows the beampattern of the DS BF, which indicates the sensitivity of the BF in decibels to plane waves impinging from various directions. The beampatterns are plotted as a function of $u = \cos\theta$, where $\theta$ is the angle between the DOA and the axis of the linear array. The beampatterns in Figure 4 were computed for a linear array of 20 uniformly spaced microphones with an intersensor spacing of $d = \lambda/2$, where $\lambda$ is the wavelength of the impinging plane waves; the look direction is $u = 0$. The lobe around the look direction is the mainlobe, while the other lobes are sidelobes. The large sidelobes indicate that the suppression of noise and interference off the look direction is poor; in the case of DS beamforming, the first sidelobe is only 13 dB below the mainlobe.
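
The DS beampattern described above can be reproduced with a few lines of Python. The sketch below assumes the same configuration as the text (20 uniformly spaced microphones, $d = \lambda/2$, look direction $u = 0$); the helper names and the grid of 2001 values of $u$ are illustrative choices, not part of the article.

```python
import numpy as np


def ula_manifold(u, num_mics, spacing_over_lambda=0.5):
    """Manifold vector of a uniform linear array as a function of u = cos(theta)."""
    s = np.arange(num_mics)
    return np.exp(-2j * np.pi * spacing_over_lambda * s * u)


def beampattern_db(w, u_grid, spacing_over_lambda=0.5):
    """Response |w^H v(u)| in dB for every u in u_grid."""
    s = np.arange(len(w))
    manifold = np.exp(-2j * np.pi * spacing_over_lambda * np.outer(s, u_grid))
    response = w.conj() @ manifold
    # Floor the magnitude so exact nulls do not produce -inf.
    return 20 * np.log10(np.maximum(np.abs(response), 1e-12))


S = 20
w_ds = ula_manifold(0.0, S) / S            # Eq. (5) with look direction u = 0
u_grid = np.linspace(-1.0, 1.0, 2001)
pattern = beampattern_db(w_ds, u_grid)

# The first nulls fall at u = 2/S = 0.1 and u = 0.2; the first sidelobe lies between them.
first_sidelobe_db = pattern[(u_grid > 0.1) & (u_grid < 0.2)].max()
print(f"look direction: {pattern[1000]:.2f} dB, first sidelobe: {first_sidelobe_db:.2f} dB")
```

The look-direction response is 0 dB, as the distortionless constraint (6) requires, and the first sidelobe comes out roughly 13 dB lower, in line with the figure quoted in the text.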
[FIG4] Beampatterns of (a) the DS BF and (b) MVDR BF as a function of $u = \cos\theta$ for the linear array; the noise covariance matrix of the MVDR BF is computed with the interference plane waves arriving from $u = \pm 1/3$. Horizontal axis: $u$ from −1 to 1; vertical axis: response function (dB) from −50 to 0. Panel (a) marks the look direction, mainlobe, and sidelobes; panel (b) marks the DOA of interference.

To improve upon the noise suppression performance provided by the DS BF, it is possible to adaptively suppress spatially correlated noise and interference $\mathbf{N}(\omega)$. This can be achieved by adjusting the weights of a BF so as to minimize the variance of the noise and interference at the output subject to the distortionless constraint (6). More concretely, we seek $\mathbf{w}(\omega)$ achieving

$$\operatorname*{argmin}_{\mathbf{w}} \; \mathbf{w}^H(\omega)\,\boldsymbol{\Sigma}_{\mathbf{N}}(\omega)\,\mathbf{w}(\omega), \tag{7}$$

subject to (6), where $\boldsymbol{\Sigma}_{\mathbf{N}} \triangleq \mathcal{E}\{\mathbf{N}(\omega)\,\mathbf{N}^H(\omega)\}$ and $\mathcal{E}\{\cdot\}$ is the expectation operator. In practice, $\boldsymbol{\Sigma}_{\mathbf{N}}$ is estimated by averaging or by recursively updating the noise covariance matrix [17, Sec. 7]. The weight vectors obtained under these conditions correspond to the minimum variance distortionless response (MVDR) BF, which has the well-known solution [2, Sec. 13.3.1]

$$\mathbf{w}_{\mathrm{MVDR}}^H(\omega) = \frac{\mathbf{v}^H(\mathbf{k},\omega)\,\boldsymbol{\Sigma}_{\mathbf{N}}^{-1}(\omega)}{\mathbf{v}^H(\mathbf{k},\omega)\,\boldsymbol{\Sigma}_{\mathbf{N}}^{-1}(\omega)\,\mathbf{v}(\mathbf{k},\omega)}. \tag{8}$$

If $\mathbf{N}(\omega)$ consists of a single plane-wave interferer with wavenumber $\mathbf{k}_I$ and spectrum $N(\omega)$, then $\mathbf{N}(\omega) = N(\omega)\,\mathbf{v}(\mathbf{k}_I)$ and $\boldsymbol{\Sigma}_{\mathbf{N}}(\omega) = \Sigma_N(\omega)\,\mathbf{v}(\mathbf{k}_I)\,\mathbf{v}^H(\mathbf{k}_I)$, where $\Sigma_N(\omega) = \mathcal{E}\{|N(\omega)|^2\}$.

Figure 4(b) shows the beampattern of the MVDR BF for the case of two plane-wave interferers arriving from directions $u = \pm 1/3$. It is apparent from the figure that such a BF can place deep nulls on the interference signals while maintaining unity gain in the look direction.
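
The MVDR solution (8) for the two-interferer case can be sketched in Python as follows. The array matches the one used for Figure 4 (20 microphones, $d = \lambda/2$, look direction $u = 0$, interferers at $u = \pm 1/3$). The interferer power of 1000 and the unit sensor-noise floor are assumptions introduced here: a covariance matrix built from two plane waves alone has rank two and cannot be inverted, so some full-rank term has to be added.

```python
import numpy as np


def ula_manifold(u, num_mics, spacing_over_lambda=0.5):
    """Manifold vector of a uniform linear array as a function of u = cos(theta)."""
    s = np.arange(num_mics)
    return np.exp(-2j * np.pi * spacing_over_lambda * s * u)


def mvdr_weights(noise_cov, v_look):
    """Eq. (8), written for w rather than w^H: Sigma^{-1} v / (v^H Sigma^{-1} v)."""
    x = np.linalg.solve(noise_cov, v_look)
    return x / np.vdot(v_look, x)


def response_db(w, u):
    """Beampattern |w^H v(u)| in dB at a single direction u."""
    return 20 * np.log10(abs(np.vdot(w, ula_manifold(u, len(w)))))


S = 20
v_look = ula_manifold(0.0, S)

interferer_power = 1000.0    # assumed power of each plane-wave interferer
sensor_noise_power = 1.0     # assumed uncorrelated floor that keeps the matrix invertible

noise_cov = sensor_noise_power * np.eye(S, dtype=complex)
for u_int in (-1 / 3, 1 / 3):
    v_int = ula_manifold(u_int, S)
    noise_cov += interferer_power * np.outer(v_int, v_int.conj())

w_mvdr = mvdr_weights(noise_cov, v_look)
look_gain_db = response_db(w_mvdr, 0.0)
null_depth_db = max(response_db(w_mvdr, -1 / 3), response_db(w_mvdr, 1 / 3))
print(f"look direction: {look_gain_db:.2f} dB, interferers: {null_depth_db:.1f} dB")
```

Under these assumptions the response stays at 0 dB in the look direction while both interferer directions are attenuated far below the DS sidelobe level; the exact null depth depends on the assumed interferer-to-noise ratio.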
In the case of $\boldsymbol{\Sigma}_{\mathbf{N}} = \mathbf{I}$, which indicates that the noise field is spatially uncorrelated, the MVDR and DS BFs are equivalent.

Depending on the acoustic environment, adapting the sensor weights $\mathbf{w}(\omega)$ to suppress discrete sources of interference can lead to excessively large sidelobes, resulting in poor system robustness. A simple technique for avoiding this is to impose a quadratic constraint $\|\mathbf{w}\|^2 \le \gamma$, for some $\gamma > 0$, in addition to the distortionless constraint (6), when estimating the sensor weights. The MVDR solution will then take the form [2, Sec. 13.3.7]

$$\mathbf{w}_{\mathrm{DL}}^H = \frac{\mathbf{v}^H \left(\boldsymbol{\Sigma}_{\mathbf{N}} + \sigma_d^2 \mathbf{I}\right)^{-1}}{\mathbf{v}^H \left(\boldsymbol{\Sigma}_{\mathbf{N}} + \sigma_d^2 \mathbf{I}\right)^{-1} \mathbf{v}}, \tag{9}$$

which is referred to as diagonal loading, where $\sigma_d^2$ is the loading level; the dependence on $\omega$ in (9) has been suppressed for convenience. While (9) is straightforward to implement, there is no direct relationship between $\gamma$ and $\sigma_d^2$; hence the latter is typically set either based on experimentation or through an iterative procedure. Increasing $\sigma_d^2$ decreases $\|\mathbf{w}_{\mathrm{DL}}\|$, which in turn increases the white noise gain (WNG) [23]; WNG is a measure of the robustness of the system to the types of errors shown in Figure 2.

A theoretical model of diffuse noise that works well in practice is the spherically isotropic field, wherein spatially separated microphones receive equal energy and random phase noise signals from all directions simultaneously [16, Sec. 4]. The MVDR BF with the diffuse noise model is called the super-directive BF (SD BF) [2, Sec. 13.3.4]. The super-directive beamforming design is obtained by replacing the noise covariance matrix $\boldsymbol{\Sigma}_{\mathbf{N}}(\omega)$ with the coherence matrix $\boldsymbol{\Gamma}(\omega)$ whose $(m,n)$th component is given by

$$\Gamma_{m,n}(\omega) = \operatorname{sinc}\!\left(\frac{\omega\, d_{m,n}}{c}\right), \tag{10}$$

where $d_{m,n}$ is the distance between the $m$th and $n$th elements of the array, and $\operatorname{sinc} x \triangleq \sin x / x$. Notice that the weight of the super-directive BF is determined solely by the distances $d_{m,n}$ between the sensors and is thus data-independent.
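
The following Python sketch ties together three statements from this passage: the MVDR BF reduces to the DS BF when the noise covariance is the identity, the super-directive weights follow from the coherence matrix (10), and diagonal loading (9) shrinks the weight norm and so raises the white noise gain. The four-microphone endfire array, the 500 Hz frequency, the loading level of 0.01, the speed of sound, and all function names are assumptions chosen for illustration. Note that `np.sinc(x)` computes $\sin(\pi x)/(\pi x)$, so its argument is divided by $\pi$ to match the definition used in (10).

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s


def diffuse_coherence(positions, omega, c=C_SOUND):
    """Eq. (10): Gamma[m, n] = sinc(omega * d_mn / c) with sinc(x) = sin(x) / x."""
    d = np.abs(positions[:, None] - positions[None, :])
    # np.sinc is the normalized sinc, sin(pi x) / (pi x); rescale the argument.
    return np.sinc(omega * d / (c * np.pi))


def loaded_mvdr_weights(noise_cov, v_look, loading=0.0):
    """Eq. (9), written for w: (Sigma + s I)^{-1} v / (v^H (Sigma + s I)^{-1} v)."""
    x = np.linalg.solve(noise_cov + loading * np.eye(len(v_look)), v_look)
    return x / np.vdot(v_look, x)


def white_noise_gain(w, v_look):
    """WNG = |w^H v|^2 / (w^H w); equals 1 / ||w||^2 under constraint (6)."""
    return abs(np.vdot(w, v_look)) ** 2 / np.vdot(w, w).real


# Hypothetical compact array: 4 microphones, 5 cm spacing, endfire look direction.
positions = 0.05 * np.arange(4)
omega = 2 * np.pi * 500.0
v_look = np.exp(-1j * omega * positions / C_SOUND)

w_ds = loaded_mvdr_weights(np.eye(4), v_look)             # identity covariance
gamma = diffuse_coherence(positions, omega)
w_sd = loaded_mvdr_weights(gamma, v_look)                 # super-directive BF
w_dl = loaded_mvdr_weights(gamma, v_look, loading=0.01)   # with diagonal loading

for name, w in (("DS", w_ds), ("SD", w_sd), ("SD + loading", w_dl)):
    print(f"{name}: WNG = {10 * np.log10(white_noise_gain(w, v_look)):.1f} dB")
```

The DS weights attain the largest possible WNG, namely $S$, while the unloaded super-directive weights trade WNG for directivity; the loaded design sits between the two.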
In the most general case, the acoustic environment will consist both of diffuse noise and of one or more sources of discrete interference, such as in

$$\boldsymbol{\Sigma}_{\mathbf{N}}(\omega) = \Sigma_N(\omega)\,\mathbf{v}(\mathbf{k}_I)\,\mathbf{v}^H(\mathbf{k}_I) + \sigma_{SI}^2\,\boldsymbol{\Gamma}(\omega), \tag{11}$$

where $\sigma_{SI}^2$ is the power spectral density of the diffuse noise.

The MVDR BF is of particular interest because it forms the preprocessing component of two other important beamforming structures. First, the MVDR BF followed by a suitable postfilter yields the maximum signal-to-noise ratio BF [17, Sec. 6.2.3]. Second, and more importantly, by placing a Wiener filter [24, Sec. 2.2] on the output of the MVDR BF, the minimum mean-square error (MMSE) BF is obtained [17, Sec. 6.2.2]. Such postfilters are important because it has been shown that they can yield significant reductions in error rate [5], [25]. Of the several postfiltering methods proposed in the literature [26], the Zelinski postfiltering technique [27] is arguably the simplest practical implementation of a Wiener filter. Wiener filters in their pure form are unrealizable because they assume that the spectrum of the desired signal is available. The Zelinski postfiltering method uses the auto- and cross-power spectra of the multichannel input signals to estimate the target signal and noise power spectra, effectively under the assumption of zero cross-correlation between the noises at different sensors. We have employed the Zelinski postfilter for the experiments described in the sections “Evaluation of Beamforming Algorithms” and “Comparison of Linear and Spherical Arrays for DSR.”

The MVDR BF can be implemented in generalized sidelobe canceller (GSC) configuration [17, Sec. 6.7.3] as shown in Figure 5. For the input snapshot vector $\mathbf{X}(t)$ at a frame $t$, the output of a GSC BF can be expressed as

$$Y(t) = \left[\mathbf{w}_q(t) - \mathbf{B}(t)\,\mathbf{w}_a(t)\right]^H \mathbf{X}(t), \tag{12}$$

where $\mathbf{w}_q$ is the quiescent weight vector, $\mathbf{B}$ is the blocking matrix, and $\mathbf{w}_a$ is the active weight vector.
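
The GSC structure in (12) can be illustrated with a small batch (non-adaptive) Python sketch. Several choices here are assumptions rather than details taken from the article: the quiescent vector is set to the DS weights, the blocking matrix is built from a QR factorization so that $\mathbf{B}^H\mathbf{v} = \mathbf{0}$, and the active weights are obtained in closed form by minimizing the output power instead of being adapted frame by frame. The eight-microphone array, the look direction, and the interferer are likewise hypothetical.

```python
import numpy as np


def ula_manifold(u, num_mics, spacing_over_lambda=0.5):
    """Manifold vector of a uniform linear array as a function of u = cos(theta)."""
    s = np.arange(num_mics)
    return np.exp(-2j * np.pi * spacing_over_lambda * s * u)


def blocking_matrix(v_look):
    """Orthonormal columns spanning the complement of v_look, so that B^H v = 0."""
    q, _ = np.linalg.qr(v_look[:, None], mode="complete")
    return q[:, 1:]


def gsc_components(cov, v_look):
    """Quiescent weights, blocking matrix, and power-minimizing active weights."""
    w_q = v_look / len(v_look)      # DS weights; satisfies (6) for unit-modulus v
    B = blocking_matrix(v_look)
    BH = B.conj().T
    # Assumed closed form: minimize (w_q - B w_a)^H cov (w_q - B w_a) over w_a.
    w_a = np.linalg.solve(BH @ cov @ B, BH @ cov @ w_q)
    return w_q, B, w_a


def gsc_output(w_q, B, w_a, snapshot):
    """Eq. (12) for a single frame: Y = (w_q - B w_a)^H X."""
    return np.vdot(w_q - B @ w_a, snapshot)


# Hypothetical scenario: 8 microphones, look direction u = 0.2, interferer at u = -0.5.
S = 8
v_look = ula_manifold(0.2, S)
v_int = ula_manifold(-0.5, S)
cov = np.eye(S, dtype=complex) + 100.0 * np.outer(v_int, v_int.conj())

w_q, B, w_a = gsc_components(cov, v_look)
w_gsc = w_q - B @ w_a

# Direct MVDR solution (8) for comparison.
x = np.linalg.solve(cov, v_look)
w_mvdr = x / np.vdot(v_look, x)
print(np.max(np.abs(w_gsc - w_mvdr)))
```

Because the columns of the blocking matrix are orthogonal to the look-direction manifold vector, any choice of active weights leaves the distortionless constraint intact; with the power-minimizing choice the combined weights coincide with the direct MVDR solution in this example.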
In keeping with the GSC formalism, $\mathbf{w}_q$ is chosen to satisfy the distortionless constraint (6)
