IEEE SIGNAL PROCESSING MAGAZINE [127] NOVEMBER 2012
Digital Object Identifier 10.1109/MSP.2012.2205285
Date of publication: 15 October 2012
Distant speech recognition (DSR) holds the promise
of the most natural human-computer interface
because it enables man-machine interactions
through speech, without the necessity of donning
intrusive body- or head-mounted microphones.
Recognizing distant speech robustly, however, remains a chal-
lenge. This contribution provides a tutorial overview of DSR
systems based on microphone arrays. In particular, we present
recent work on acoustic beamforming for DSR, along with
experimental results verifying the effectiveness of the various
algorithms described here; beginning from a word error rate
(WER) of 14.3% with a single microphone of a linear array, our
state-of-the-art DSR system achieved a WER of 5.3%, which
was comparable to that of 4.2% obtained with a lapel micro-
phone. Moreover, we present an emerging technology in the
area of far-field audio and speech processing based on spherical
microphone arrays. Performance comparisons of spherical and
linear arrays reveal that a spherical array with a diameter of
8.4 cm can provide recognition accuracy comparable to or better
than that obtained with a large linear array with an aperture
length of 126 cm.
[Kenichi Kumatani, John McDonough, and Bhiksha Raj]
[From close-talking microphones to far-field sensors]
FUNDAMENTAL TECHNOLOGIES
IN MODERN SPEECH RECOGNITION
INTRODUCTION
When the signals from the individual sensors of a microphone
array with a known geometry are suitably combined, the array
functions as a spatial filter
capable of suppressing noise,
reverberation, and competing
speech. Such beamforming
techniques have received a great
deal of attention within the
acoustic array processing com-
munity in the recent past [1]–
[7]. Despite this effort, however, such techniques have often
been ignored within the mainstream community working on
DSR. As pointed out in [6] and [7], this could be due to the
fact that the disparate research communities for acoustic
array processing and automatic speech recognition (ASR)
have failed to adopt each other’s best practices. For instance,
the array processing community tends to ignore speaker adap-
tation techniques, which can compensate for mismatches
between acoustic conditions during training and testing.
Moreover, this community has largely preferred to work on
controlled, synthetic recordings, obtained by convolving
noise- and reverberation-free speech with measured, static
room impulse responses, with subsequent artificial addition of
noise, as in the recent Pattern Analysis, Statistical Modeling,
and Computational Learning (PASCAL) Computational
Hearing in Multisource Environments (CHiME) Speech
Separation Challenge [8]–[11]. A notable exception was the
PASCAL Speech Separation Challenge 2 [5], [12] which fea-
tured actual array recordings of real speakers; this task, how-
ever, has fallen out of favor, to the extent that it is currently
not even mentioned on the PASCAL CHiME Challenge Web
site, nor in any of the concomitant publications. This is unfor-
tunate because improvements obtained with novel speech
enhancement techniques tend to diminish, or even disappear,
after speaker adaptation; similarly, techniques that work well
on artificially convolved data with artificially added noise tend
to fail on data captured in real acoustic environments with
real human speakers. Mainstream speech recognition
researchers, on the other hand, are often unaware of advanced
signal and array processing techniques. They are equally
unaware of the dramatic reductions in error rate that such
techniques can provide in DSR tasks.
The primary goal of this contribution is to provide a tuto-
rial in the application of acoustic array processing to DSR
that is intelligible to anyone with a general signal processing
background, while still maintaining the interest of experts in
the field. Our secondary goal is to bridge the gaps between
the current acoustic array processing and speech recognition
communities. A third and over-
arching goal is to provide a
concise report on the state of
the art in DSR. Toward this end,
we present two empirical stud-
ies: the first is a comparison of
several beamforming algo-
rithms for their effectiveness in
a DSR task with real speakers in a real acoustic environment.
This comparison is conducted with a conventional linear array. The
second performance comparison is between a conventional
linear array and a much more compact spherical array. The
latter is gaining importance as the emphasis in acoustic array
processing moves from large static fixtures to smaller mobile
devices such as robots.
OVERVIEW OF DSR
Figure 1 shows a block diagram of a DSR system with a micro-
phone array. The microphone array module typically consists
of a speaker tracker, beamformer (BF), and postfilter. The
speaker tracker estimates a speaker’s position. Given that posi-
tion estimate, the BF emphasizes sound waves coming from
the direction of interest or “look direction.” The beamformed
signal can be further enhanced with postfiltering. The final
output is then fed into a speech recognizer. We note that this
framework can readily incorporate other information sources
such as a mouth locator based on video data [13].
FUNDAMENTAL ISSUES IN MICROPHONE
ARRAY PROCESSING
As shown in Figure 2, the array processing components of a
DSR system are prone to several errors. First, there are errors
in speaker tracking that cause the beam to be “steered” in the
wrong direction [14]; such errors can in turn cause signal can-
cellation. Second, the individual microphones in the array can
have different amplitude and phase responses even if they are
of the same type [15, Sec. 5.5]. Finally, the placement of the
sensors can deviate from their nominal positions. All of these
factors degrade beamforming performance.
SPEAKER TRACKING
The speaker tracking problem is generally distinguished from
the speaker localization problem. Speaker localization meth-
ods estimate a speaker’s position at a single instant in time
without relying on past information. On the other hand,
speaker tracking algorithms consider a trajectory of instanta-
neous position estimates.
[FIG1] Block diagram of a typical DSR system (blocks: Microphone Array,
Multichannel Data, Speaker Tracker, Beamformer, Postfilter, Speech Recognizer).
Speaker localization techniques can
be categorized into three approaches:
seeking a position that provides the maximum
steered response power (SRP) of a
BF [16, Sec. 8.2.1], localizing a source
based on the application of high-resolution
spectral estimation techniques such
as subspace algorithms [17, Sec. 9.3], and
estimating sources’ positions from time
delays of arrival (TDOA) at the micro-
phones. Due to computational efficiency
as well as robustness against mismatches
of signal models and microphone errors,
TDOA-based speaker localization
approaches are perhaps the most popular
in DSR. Here, we briefly introduce speak-
er tracking methods based on the TDOA.
Shown in Figure 3(a) is a sound wave
propagating from a point $\mathbf{x}$ to each
microphone located at $\mathbf{m}_s$ for all
$s = 0, 1, \ldots, S-1$, where $S$ is the total
number of sensors. Assuming that the
position of each microphone is specified
in Cartesian coordinates, denote the distance
between the point source and each microphone as
$D_s \triangleq \|\mathbf{x} - \mathbf{m}_s\| \ \forall s = 0, \ldots, S-1$. Then, the TDOA between
microphones $m$ and $n$ can be
expressed as
$$\tau_{m,n}(\mathbf{x}) \triangleq (D_m - D_n)/c, \tag{1}$$
where c is the speed of sound.
Notice that (1) implies that the
wavefront—a surface comprised
of the locus of all points on the
same phase—is spherical.
In the case that the array is
located far from the speaker, the wavefront can be assumed to
be planar, which is called the far-field assumption. Figure 3(b)
illustrates a plane wave propagating from the far-field to the
microphones. Under the far-field assumption, the TDOA
becomes a function of the angle $\theta$ between the direction of
arrival (DOA) and the line connecting two sensors’ positions,
and (1) can be simplified as
$$\tau_{m,n}(\theta) \triangleq d_{m,n} \cos\theta / c, \tag{2}$$
where $d_{m,n}$ is the distance between the
microphones $m$ and $n$.
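As a rough numerical illustration of (1) and (2), the following Python sketch compares the spherical-wave TDOA with its far-field approximation for one microphone pair. The geometry, the speed of sound of 343 m/s, and the function names are assumptions made for this example only; the angle is measured at microphone m, between the line from m to n and the line from m to the source, so that the two expressions share the same sign.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def tdoa_near_field(source, mic_m, mic_n, c=SPEED_OF_SOUND):
    """TDOA of (1): difference of the source-to-microphone distances over c."""
    return (math.dist(source, mic_m) - math.dist(source, mic_n)) / c


def tdoa_far_field(spacing, theta, c=SPEED_OF_SOUND):
    """TDOA of (2): plane wave at angle theta to the line joining the sensors."""
    return spacing * math.cos(theta) / c


# Hypothetical pair of microphones 0.2 m apart on the x-axis.
mic_m = (0.0, 0.0, 0.0)
mic_n = (0.2, 0.0, 0.0)
theta = math.radians(60.0)

far = tdoa_far_field(0.2, theta)
for distance in (0.5, 5.0, 50.0):
    # Source at the given distance from microphone m, in direction theta.
    source = (distance * math.cos(theta), distance * math.sin(theta), 0.0)
    near = tdoa_near_field(source, mic_m, mic_n)
    print(f"{distance:5.1f} m: spherical {near:.3e} s, far-field {far:.3e} s")
```

For this geometry the two values differ markedly when the source is 0.5 m away and agree closely at 50 m, which is the sense in which the far-field assumption applies to distant sources.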
Various techniques have been devel-
oped for estimation of the TDOAs. A
comprehensive overview of those algo-
rithms is provided by [18] and compara-
tive studies on real data can be found in
[19]. From the TDOAs between
the microphone pairs, the speaker’s
position can be computed using classical
methods, namely, spherical intersection,
spherical interpolation, or linear inter-
section [2, Sec. 10.1]. These methods
can readily be extended to track a mov-
ing speaker by applying a Kalman filter
(KF) to smooth the time series of the
instantaneous estimates as in [16, Sec. 10]. Klee et al. [20]
demonstrated, however, that instead of smoothing a series of
instantaneous position esti-
mates, better tracking could be
performed by simply using the
TDOAs as a sequence of obser-
vations for an extended KF
(EKF) and estimating the speak-
er’s position directly from the
standard EKF state estimate
update formulae. Klee’s algo-
rithm was extended to incorpo-
rate video features in [21] and
to track multiple simultaneous speakers [22].
CONVENTIONAL BEAMFORMING TECHNIQUES
[FIG2] Representative errors in microphone array processing (labels: Target
Sound Source, Direction of Arrival, Localization Error, Steering Error,
Microphone Errors, Phase Error, Amplitude Error, Microphone Position Error).
[FIG3] Propagation of (a) the spherical wave and (b) plane wave.
In the case of the spherical wavefront depicted in Figure 3(a),
let us define the propagation delay as $\tau_s \triangleq D_s / c$. In the
far-field case shown in Figure 3(b), let us define the wavenumber
$\mathbf{k}$ as a vector perpendicular to the planar wavefront
pointing in the direction of propagation with magnitude
$\omega / c = 2\pi / \lambda$. Then, the propagation delay with respect to the
origin of the coordinate system for microphone $s$ is determined
through $\omega \tau_s = \mathbf{k}^T \mathbf{m}_s$. The simplest model of wave
propagation assumes that a signal $f(t)$, carried on a plane
wave, reaches all sensors in an array, but not at the same
time. Hence, let us form the vector
$$\mathbf{f}(t) = [\, f(t - \tau_0) \;\; f(t - \tau_1) \;\; \cdots \;\; f(t - \tau_{S-1}) \,]^T$$
of the time-delayed signals reaching each sensor. In the frequency
domain, the comparable vector of phase-delayed signals
is $\mathbf{F}(\omega) = F(\omega)\, \mathbf{v}(\mathbf{k}, \omega)$, where $F(\omega)$ is the transform of
$f(t)$ and
$$\mathbf{v}(\mathbf{k}, \omega) \triangleq [\, e^{-i\omega\tau_0} \;\; e^{-i\omega\tau_1} \;\; \cdots \;\; e^{-i\omega\tau_{S-1}} \,]^T \tag{3}$$
is the array manifold vector and $i = \sqrt{-1}$. The latter is manifestly
a vector of phase delays for a plane wave with wavenumber
$\mathbf{k}$. To a first order, the array manifold vector is a complete
description of the interaction of a propagating wave and an
array of sensors.
If $\mathbf{X}(\omega)$ denotes the vector of frequency domain signals for
all sensors, the so-called snapshot vector, and $Y(\omega)$ the
frequency domain output of the array, then the operation of a BF
can be represented as
$$Y(\omega) = \mathbf{w}^H(\omega)\, \mathbf{X}(\omega), \tag{4}$$
where $\mathbf{w}(\omega)$ is a vector of frequency-dependent sensor
weights. The differences between various BF designs are completely
determined by the specification of the weight vector
$\mathbf{w}(\omega)$. The simplest beamforming algorithm, the delay-and-sum
(DS) BF, time aligns the signals for a plane wave arriving
from the look direction by setting
$$\mathbf{w}_{\mathrm{DS}} \triangleq \mathbf{v}(\mathbf{k}, \omega) / S. \tag{5}$$
Substituting $\mathbf{X}(\omega) = \mathbf{F}(\omega) = F(\omega)\, \mathbf{v}(\mathbf{k}, \omega)$ into (4) provides
$$Y(\omega) = \mathbf{w}_{\mathrm{DS}}^H(\omega)\, \mathbf{v}(\mathbf{k}, \omega)\, F(\omega) = F(\omega);$$
i.e., the output of the array is equivalent to the original signal
in the absence of any interference or distortion. In general,
this will be true for any weight vector achieving
$$\mathbf{w}^H(\omega)\, \mathbf{v}(\mathbf{k}, \omega) = 1. \tag{6}$$
Hereafter we will say that any weight vector $\mathbf{w}(\omega)$ achieving
(6) satisfies the distortionless constraint, which implies that
any wave impinging from the look direction is neither amplified
nor attenuated.
Figure 4(a) shows the beampattern of the DS BF, which
indicates the sensitivity of the BF in decibels to plane waves
impinging from various directions. The beampatterns are plotted
as a function of $u = \cos\theta$, where $\theta$ is the angle between
the DOA and the axis of the linear array. The beampatterns in
Figure 4 were computed for a linear array of 20 uniformly
spaced microphones with an intersensor spacing of $d = \lambda/2$,
where $\lambda$ is the wavelength of the impinging plane waves; the
look direction is $u = 0$. The lobe around the look direction is
the mainlobe, while the other lobes are sidelobes. The large
sidelobes indicate that the suppression of noise and interference
off the look direction is poor; in the case of DS beamforming,
the first sidelobe is only 13 dB below the mainlobe.
[FIG4] Beampatterns of (a) the DS BF and (b) MVDR BF as a function of $u = \cos\theta$ for the linear array; the noise covariance matrix of the
MVDR BF is computed with the interference plane waves arriving from $u = \pm 1/3$.
(Both panels plot the response function in dB, from −50 to 0, against $u$ from −1 to 1. Panel (a) marks the look direction, the mainlobe, and the sidelobes; panel (b) marks the DOA of the interference.)
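The delay-and-sum design of (3) through (6) and the sidelobe level quoted for Figure 4(a) can be checked numerically with a short Python sketch. It assumes a uniform linear array whose sensor s sits at position s·d on the array axis, so that the phase delay of sensor s is 2π(d/λ)·s·u; this sign and origin convention, and all function names, are choices made for the example rather than taken from the article.

```python
import cmath
import math


def manifold_ula(num_sensors, u, spacing_in_wavelengths=0.5):
    """Array manifold vector (3) for a uniform linear array, with u = cos(theta)."""
    phase_step = 2.0 * math.pi * spacing_in_wavelengths * u
    return [cmath.exp(-1j * phase_step * s) for s in range(num_sensors)]


def beamformer_output(weights, snapshot):
    """Y = w^H X as in (4)."""
    return sum(w.conjugate() * x for w, x in zip(weights, snapshot))


def ds_weights(num_sensors, u_look):
    """Delay-and-sum weights (5): the manifold vector scaled by 1/S."""
    return [v / num_sensors for v in manifold_ula(num_sensors, u_look)]


def response_db(weights, u):
    """Beampattern in dB for a unit-amplitude plane wave arriving from u."""
    gain = abs(beamformer_output(weights, manifold_ula(len(weights), u)))
    return 20.0 * math.log10(max(gain, 1e-12))  # floor avoids log10(0) at nulls


S = 20
w_ds = ds_weights(S, u_look=0.0)

# Distortionless constraint (6): unit gain in the look direction.
print("look-direction gain:", abs(beamformer_output(w_ds, manifold_ula(S, 0.0))))

# For this array the first sidelobe lies between the nulls at u = 0.1 and u = 0.2.
first_sidelobe = max(response_db(w_ds, 0.1 + 0.0001 * k) for k in range(1001))
print(f"first sidelobe: {first_sidelobe:.2f} dB")
```

Under these assumptions the peak of the first sidelobe comes out roughly 13 dB below the mainlobe, in line with the figure quoted above.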
To improve upon the noise suppression performance provided
by the DS BF, it is possible to adaptively suppress spatially
correlated noise and interference $\mathbf{N}(\omega)$, which can be
achieved by adjusting the weights of a BF so as to minimize
the variance of the noise and interference at the output subject
to the distortionless constraint (6). More concretely, we seek
$\mathbf{w}(\omega)$ achieving
$$\operatorname*{argmin}_{\mathbf{w}} \; \mathbf{w}^H(\omega)\, \boldsymbol{\Sigma}_{\mathbf{N}}(\omega)\, \mathbf{w}(\omega), \tag{7}$$
subject to (6), where $\boldsymbol{\Sigma}_{\mathbf{N}} \triangleq \mathcal{E}\{\mathbf{N}(\omega)\, \mathbf{N}^H(\omega)\}$ and $\mathcal{E}\{\cdot\}$ is the
expectation operator. In practice, $\boldsymbol{\Sigma}_{\mathbf{N}}$ is estimated by averaging
or by recursively updating the noise covariance matrix [17,
Sec. 7]. The weight vectors obtained under these conditions
correspond to the minimum variance distortionless response
(MVDR) BF, which has the well-known solution [2, Sec. 13.3.1]
$$\mathbf{w}_{\mathrm{MVDR}}^H(\omega) = \frac{\mathbf{v}^H(\mathbf{k}, \omega)\, \boldsymbol{\Sigma}_{\mathbf{N}}^{-1}(\omega)}{\mathbf{v}^H(\mathbf{k}, \omega)\, \boldsymbol{\Sigma}_{\mathbf{N}}^{-1}(\omega)\, \mathbf{v}(\mathbf{k}, \omega)}. \tag{8}$$
If $\mathbf{N}(\omega)$ consists of a single plane interferer with wavenumber
$\mathbf{k}_I$ and spectrum $N(\omega)$, then $\mathbf{N}(\omega) = N(\omega)\, \mathbf{v}(\mathbf{k}_I)$ and
$\boldsymbol{\Sigma}_{\mathbf{N}}(\omega) = \Sigma_N(\omega)\, \mathbf{v}(\mathbf{k}_I)\, \mathbf{v}^H(\mathbf{k}_I)$, where $\Sigma_N(\omega) = \mathcal{E}\{|N(\omega)|^2\}$.
Figure 4(b) shows the beampattern of the MVDR BF for the
case of two plane wave interferers arriving from directions
$u = \pm 1/3$. It is apparent from the figure that such a BF can
place deep nulls on the interference signals while maintaining
unity gain in the look direction. In the case of $\boldsymbol{\Sigma}_{\mathbf{N}} = \mathbf{I}$, which
indicates that the noise field is spatially uncorrelated, the
MVDR and DS BFs are equivalent.
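A minimal Python sketch of the MVDR solution (8) follows, using the same 20-element half-wavelength array and interferers at u = ±1/3. Two things here are assumptions of the example and not part of the article: the interference covariance is built from known manifold vectors instead of being estimated from data, and a small white-noise floor is added to its diagonal because a sum of two rank-one terms alone is singular and cannot be inverted. The linear system is solved with a plain Gaussian elimination written for this example.

```python
import cmath
import math


def manifold_ula(num_sensors, u, spacing_in_wavelengths=0.5):
    """Array manifold vector (3) for a uniform linear array, with u = cos(theta)."""
    phase_step = 2.0 * math.pi * spacing_in_wavelengths * u
    return [cmath.exp(-1j * phase_step * s) for s in range(num_sensors)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0j] * n
    for row in reversed(range(n)):
        acc = aug[row][n] - sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / aug[row][row]
    return x


def mvdr_weights(noise_cov, v_look):
    """w = Sigma^{-1} v / (v^H Sigma^{-1} v), the column-vector form of (8)."""
    z = solve(noise_cov, v_look)
    denom = sum(v.conjugate() * zi for v, zi in zip(v_look, z))
    return [zi / denom for zi in z]


def noise_covariance(num_sensors, interferer_us, power, floor):
    """Sum of rank-one plane-wave terms plus an assumed white-noise floor."""
    cov = [[(floor if m == n else 0.0) + 0j for n in range(num_sensors)]
           for m in range(num_sensors)]
    for u in interferer_us:
        v = manifold_ula(num_sensors, u)
        for m in range(num_sensors):
            for n in range(num_sensors):
                cov[m][n] += power * v[m] * v[n].conjugate()
    return cov


def gain(weights, u):
    """Magnitude response |w^H v(u)| to a unit plane wave from u."""
    v = manifold_ula(len(weights), u)
    return abs(sum(w.conjugate() * x for w, x in zip(weights, v)))


S = 20
v_look = manifold_ula(S, 0.0)
cov = noise_covariance(S, (1.0 / 3.0, -1.0 / 3.0), power=1.0, floor=1e-3)
w_mvdr = mvdr_weights(cov, v_look)
w_ds = [v / S for v in v_look]
for u in (0.0, 1.0 / 3.0, -1.0 / 3.0):
    print(f"u = {u:+.3f}: DS {gain(w_ds, u):.3e}, MVDR {gain(w_mvdr, u):.3e}")
```

Passing an identity matrix as `noise_cov` returns the DS weights, which mirrors the equivalence noted above for a spatially uncorrelated noise field.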
Depending on the acoustic environment, adapting the sensor
weights $\mathbf{w}(\omega)$ to suppress discrete sources of interference
can lead to excessively large sidelobes, resulting in poor system
robustness. A simple technique for avoiding this is to
impose a quadratic constraint $\|\mathbf{w}\|^2 \le \gamma$, for some $\gamma > 0$, in
addition to the distortionless constraint (6), when estimating
the sensor weights. The MVDR solution will then take the
form [2, Sec. 13.3.7]
$$\mathbf{w}_{\mathrm{DL}}^H = \frac{\mathbf{v}^H \left( \boldsymbol{\Sigma}_{\mathbf{N}} + \sigma_d^2 \mathbf{I} \right)^{-1}}{\mathbf{v}^H \left( \boldsymbol{\Sigma}_{\mathbf{N}} + \sigma_d^2 \mathbf{I} \right)^{-1} \mathbf{v}}, \tag{9}$$
which is referred to as diagonal loading where $\sigma_d^2$ is the loading
level; the dependence on $\omega$ in (9) has been suppressed for
convenience. While (9) is straightforward to implement, there
is no direct relationship between $\gamma$ and $\sigma_d^2$; hence the latter is
typically set either based on experimentation or through an
iterative procedure. Increasing $\sigma_d^2$ decreases $\|\mathbf{w}_{\mathrm{DL}}\|$, which
implies that the white noise gain (WNG) also increases [23];
WNG is a measure of the robustness of the system to the types
of errors shown in Figure 2.
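The effect of the loading level in (9) on the WNG can be illustrated with a small Python sketch. It assumes a single plane-wave interferer, so that the noise covariance is rank one and the loaded inverse can be written in closed form with the matrix inversion lemma; this shortcut is specific to the example and would not apply to a general covariance matrix. The interferer direction u = 0.02, chosen inside the mainlobe to make the effect visible, and the loading levels are hypothetical values. The WNG is computed as |w^H v|² / ‖w‖², which reduces to 1/‖w‖² under the distortionless constraint (6).

```python
import cmath
import math


def manifold_ula(num_sensors, u, spacing_in_wavelengths=0.5):
    """Array manifold vector (3) for a uniform linear array, with u = cos(theta)."""
    phase_step = 2.0 * math.pi * spacing_in_wavelengths * u
    return [cmath.exp(-1j * phase_step * s) for s in range(num_sensors)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def loaded_mvdr_weights(v_look, v_int, int_power, loading):
    """Diagonally loaded MVDR weights (9) for Sigma_N = int_power * a a^H.

    (loading * I + P a a^H)^{-1} v is evaluated with the matrix inversion
    lemma, which is valid only for this rank-one Sigma_N.
    """
    scale = int_power * inner(v_int, v_look) / (
        loading + int_power * inner(v_int, v_int).real)
    z = [(v - scale * a) / loading for v, a in zip(v_look, v_int)]
    denom = inner(v_look, z)
    return [zi / denom for zi in z]


def white_noise_gain(weights, v_look):
    """WNG = |w^H v|^2 / ||w||^2."""
    return abs(inner(weights, v_look)) ** 2 / inner(weights, weights).real


S = 20
v_look = manifold_ula(S, 0.0)
v_int = manifold_ula(S, 0.02)  # hypothetical interferer inside the mainlobe
for loading in (1e-3, 1e-1, 1e1, 1e3):
    w = loaded_mvdr_weights(v_look, v_int, int_power=1.0, loading=loading)
    wng_db = 10.0 * math.log10(white_noise_gain(w, v_look))
    print(f"loading {loading:8.3f}: WNG {wng_db:6.2f} dB")
```

In this setting the WNG grows with the loading level and approaches S, the value attained by the DS BF, as the loading term comes to dominate the covariance.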
A theoretical model of diffuse noise that works well in practice
is the spherically isotropic field, wherein spatially separated
microphones receive equal energy and random phase noise
signals from all directions simultaneously [16, Sec. 4]. The
MVDR BF with the diffuse noise model is called the
super-directive BF (SD BF) [2, Sec. 13.3.4].
The super-directive beamforming design is obtained by replacing
the noise covariance matrix $\boldsymbol{\Sigma}_{\mathbf{N}}(\omega)$ with the coherence
matrix $\boldsymbol{\Gamma}(\omega)$ whose $(m, n)$th component is given by
$$\Gamma_{m,n}(\omega) = \operatorname{sinc}\!\left( \frac{\omega\, d_{m,n}}{c} \right), \tag{10}$$
where $d_{m,n}$ is the distance between the $m$th and $n$th elements
of the array, and $\operatorname{sinc} x \triangleq \sin x / x$. Notice that the weight of
the super-directive BF is determined solely based on the distance
between the sensors $d_{m,n}$ and is thus data-independent.
In the most general case, the acoustic environment will consist
both of diffuse noise as well as one or more sources of discrete
interference, such as in
$$\boldsymbol{\Sigma}_{\mathbf{N}}(\omega) = \Sigma_N(\omega)\, \mathbf{v}(\mathbf{k}_I)\, \mathbf{v}^H(\mathbf{k}_I) + \sigma_{SI}^2\, \boldsymbol{\Gamma}(\omega), \tag{11}$$
where $\sigma_{SI}^2$ is the power spectral density of the diffuse noise.
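The super-directive design can be sketched in Python by building the coherence matrix (10) and substituting it for the noise covariance in the MVDR solution. The four-element array with 5 cm spacing, the 2 kHz analysis frequency, the endfire look direction, and the speed of sound of 343 m/s are hypothetical values chosen for the example; the Gaussian elimination routine is likewise written only for this illustration. The sketch compares the gain against diffuse noise, |w^H v|² / (w^H Γ w), for the DS and SD weights.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def sinc(x):
    """Unnormalized sinc, sin(x)/x, as defined after (10)."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def diffuse_coherence(positions, omega, c=SPEED_OF_SOUND):
    """Coherence matrix (10) for sensors at the given positions (m) on a line."""
    return [[sinc(omega * abs(pm - pn) / c) for pn in positions]
            for pm in positions]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0j] * n
    for row in reversed(range(n)):
        acc = aug[row][n] - sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / aug[row][row]
    return x


def superdirective_weights(coherence, v_look):
    """MVDR weights with Sigma_N replaced by the coherence matrix Gamma."""
    z = solve(coherence, v_look)
    denom = sum(v.conjugate() * zi for v, zi in zip(v_look, z))
    return [zi / denom for zi in z]


def diffuse_gain(weights, coherence, v_look):
    """|w^H v|^2 / (w^H Gamma w): gain over one sensor in diffuse noise."""
    num = abs(sum(w.conjugate() * v for w, v in zip(weights, v_look))) ** 2
    den = sum((wm.conjugate() * g * wn).real
              for wm, row in zip(weights, coherence)
              for g, wn in zip(row, weights))
    return num / den


positions = [0.05 * s for s in range(4)]  # hypothetical 4-element array
omega = 2.0 * math.pi * 2000.0
gamma = diffuse_coherence(positions, omega)
# Endfire look direction: the delay of sensor s is its position divided by c.
v_look = [cmath.exp(-1j * omega * p / SPEED_OF_SOUND) for p in positions]
w_sd = superdirective_weights(gamma, v_look)
w_ds = [v / len(v_look) for v in v_look]
print("DS gain in diffuse noise:", diffuse_gain(w_ds, gamma, v_look))
print("SD gain in diffuse noise:", diffuse_gain(w_sd, gamma, v_look))
```

Because the SD weights minimize the diffuse-noise output power under the distortionless constraint, their gain in this measure is never lower than that of the DS weights. At low frequencies the coherence matrix becomes ill conditioned, so a practical design would typically combine it with the diagonal loading of (9).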
The MVDR BF is of particular interest because it forms the
preprocessing component of two other important beamform-
ing structures. First, the MVDR BF followed by a suitable post-
filter yields the maximum signal-to-noise ratio BF [17, Sec.
6.2.3]. Second, and more importantly, by placing a Wiener fil-
ter [24, Sec. 2.2] on the output of the MVDR BF, the minimum
mean-square error (MMSE) BF is obtained [17, Sec. 6.2.2].
Such postfilters are important because it has been shown that
they can yield significant reductions in error rate [25, 5]. Of
the several postfiltering methods proposed in the literature
[26], the Zelinski postfiltering [27] technique is arguably the
simplest practical implementation of a Wiener filter. Wiener
filters in their pure form are unrealizable because they assume
that the spectrum of the desired signal is available. The
Zelinski postfiltering method uses the auto- and cross-power
spectra of the multi-channel input signals to estimate the tar-
get signal and noise power spectra effectively under the
assumption of zero cross-correlation between the noises at dif-
ferent sensors. We have employed the Zelinski postfilter for
the experiments described in the sections “Evaluation of
Beamforming Algorithms” and “Comparison of Linear and
Spherical Arrays for DSR.”
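The idea behind the Zelinski postfilter can be sketched for a single frequency bin as follows. The sketch uses one commonly cited form of the estimator, which is an assumption here since the article does not spell out the formula: the target power is estimated by averaging the real parts of the cross-power spectra over all sensor pairs, the target-plus-noise power by averaging the auto-power spectra, and the filter gain is their ratio. It further assumes that the channels have already been time aligned toward the speaker, that the spectra are estimated by averaging over frames, and that the gain is clipped to the interval [0, 1]; the function name and the toy data are hypothetical.

```python
def zelinski_gain(frames):
    """Postfilter gain for one frequency bin.

    frames: list of snapshots; each snapshot is a list of S complex values,
    one per time-aligned channel, for the same frequency bin.
    """
    num_frames = len(frames)
    num_sensors = len(frames[0])
    num_pairs = num_sensors * (num_sensors - 1) // 2

    # Average auto-power spectrum: target power plus noise power.
    auto = sum(abs(x) ** 2 for snap in frames for x in snap)
    auto /= num_frames * num_sensors

    # Average real cross-power spectrum: target power only, if the noise
    # is uncorrelated between sensors.
    cross = 0.0
    for snap in frames:
        for m in range(num_sensors - 1):
            for n in range(m + 1, num_sensors):
                cross += (snap[m] * snap[n].conjugate()).real
    cross /= num_frames * num_pairs

    if auto <= 0.0:
        return 0.0
    return min(max(cross / auto, 0.0), 1.0)  # clipping is an assumed safeguard


# Toy data: a constant target of amplitude 2 in both channels, plus +/-1
# noise patterns that are mutually orthogonal over the four frames.
frames = [
    [2 + 1 + 0j, 2 + 1 + 0j],
    [2 - 1 + 0j, 2 + 1 + 0j],
    [2 + 1 + 0j, 2 - 1 + 0j],
    [2 - 1 + 0j, 2 - 1 + 0j],
]
print("postfilter gain:", zelinski_gain(frames))
```

In the toy data the target power is 4 and the noise power per channel is 1, so the estimated gain matches the Wiener gain 4/(4 + 1) that would be obtained if the target and noise spectra were known.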
The MVDR BF can be implemented in generalized sidelobe
canceller (GSC) configuration [17, Sec. 6.7.3] as shown in
Figure 5. For the input snapshot vector $\mathbf{X}(t)$ at a frame $t$, the
output of a GSC BF can be expressed as
$$Y(t) = \left[ \mathbf{w}_q(t) - \mathbf{B}(t)\, \mathbf{w}_a(t) \right]^H \mathbf{X}(t), \tag{12}$$
where $\mathbf{w}_q$ is the quiescent weight vector, $\mathbf{B}$ is the blocking matrix,
and $\mathbf{w}_a$ is the active weight vector. In keeping with the GSC
formalism, $\mathbf{w}_q$ is chosen to satisfy the distortionless constraint (6)