International Journal of Computer Vision 60(1), 63–86, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Scale & Affine Invariant Interest Point Detectors
KRYSTIAN MIKOLAJCZYK AND CORDELIA SCHMID
INRIA Rhône-Alpes GRAVIR-CNRS, 655 av. de l’Europe, 38330 Montbonnot, France
Krystian.Mikolajczyk@inrialpes.fr
Cordelia.Schmid@inrialpes.fr
Received January 3, 2003; Revised September 24, 2003; Accepted January 22, 2004
Abstract. In this paper we propose a novel approach for detecting interest points invariant to scale and affine
transformations. Our scale and affine invariant detectors are based on the following recent results: (1) Interest points
extracted with the Harris detector can be adapted to affine transformations and give repeatable results (geometrically
stable). (2) The characteristic scale of a local structure is indicated by a local extremum over scale of normalized
derivatives (the Laplacian). (3) The affine shape of a point neighborhood is estimated based on the second moment
matrix.
Our scale invariant detector computes a multi-scale representation for the Harris interest point detector and then
selects points at which a local measure (the Laplacian) is maximal over scales. This provides a set of distinctive
points which are invariant to scale, rotation and translation as well as robust to illumination changes and limited
changes of viewpoint. The characteristic scale determines a scale invariant region for each point. We extend the
scale invariant detector to affine invariance by estimating the affine shape of a point neighborhood. An iterative
algorithm modifies location, scale and neighborhood of each point and converges to affine invariant points. This
method can deal with significant affine transformations including large scale changes. The characteristic scale and
the affine shape of neighborhood determine an affine invariant region for each point.
We present a comparative evaluation of different detectors and show that our approach provides better results
than existing methods. The performance of our detector is also confirmed by excellent matching results; the image
is described by a set of scale/affine invariant descriptors computed on the regions associated with our points.
Keywords: interest points, local features, scale invariance, affine invariance, matching, recognition
1. Introduction
Local features have been shown to be well suited to
matching and recognition as well as to many other ap-
plications as they are robust to occlusion, background
clutter and other content changes. The difficulty is to
obtain invariance to viewing conditions. Different solu-
tions to this problem have been developed over the past
few years and are reviewed in Section 1.1. These ap-
proaches first detect features and then compute a set of
descriptors for these features. In the case of significant
transformations, feature detection has to be adapted
to the transformation, as at least a subset of the fea-
tures must be present in both images in order to allow
for correspondences. Features which have proved to
be particularly appropriate are interest points. How-
ever, the Harris interest point detector is not invari-
ant to scale and affine transformations (Schmid et al.,
2000). In this paper we give a detailed description of
a scale and an affine invariant interest point detector
introduced in Mikolajczyk and Schmid (2001, 2002).
Our approach combines the Harris detector with the
Laplacian-based scale selection. The Harris-Laplace
detector is then extended to deal with significant
affine transformations. Previous detectors partially
handle the problem of affine invariance since they
assume that the localization and scale are not affected
by an affine transformation of the local image struc-
tures. The proposed improvements result in better re-
peatability and accuracy of interest points. Moreover,
the scale invariant Harris-Laplace approach detects dif-
ferent regions than the DoG detector (Lowe, 1999). The
latter one detects mainly blobs, whereas the Harris de-
tector responds to corners and highly textured points,
hence these detectors extract complementary features
in images.
If the scale change between images is known, we
can adapt the Harris detector to the scale change
(Dufournaud et al., 2000) and we then obtain points,
for which the localization and scale perfectly reflect
the real scale change between two images. If the scale
change between images is unknown, a simple way to
deal with scale changes is to extract points at several
scales and to use all these points to represent an im-
age. The problem with a multi-scale approach is that
in general a local image structure is present in a certain
range of scales. The points are then detected at each
scale within this range. As a consequence, there are
many points which represent the same structure, but
whose location and scale are slightly different.
The unnecessarily high number of points increases
the probability of mismatches and the complexity of the
matching algorithms. In this case, efficient methods for
rejecting the false matches and for verifying the results
are necessary.
Our scale invariant approach solves this problem by
selecting the points in the multi-scale representation
which are present at characteristic scales. Local ex-
trema over scale of normalized derivatives indicate the
presence of characteristic local structures (Lindeberg,
1998). Here we use the Laplacian-of-Gaussian to se-
lect points localized at maxima in scale-space. This
detector can deal with significant scale changes, as pre-
sented in Section 2. To obtain affine invariant points,
we adapt the shape of the point neighborhood. The
affine shape is determined by the second moment ma-
trix (Lindeberg and Garding, 1997). We then obtain
a truly affine invariant image description which gives
stable/repeatable results in the presence of arbitrary
viewpoint changes. Note that a perspective transforma-
tion of a smooth surface can be locally approximated
by an affine transformation. Although smooth surfaces
are almost never planar in the large, they are always
planar in the small; that is, sufficiently small surface
patches can always be thought of as being composed
of coplanar points. Of course this does not hold if the
point is localized on a depth boundary. However, such
points are rejected during the subsequent steps, for ex-
ample during matching. An additional post-processing
method can be used to separate the foreground
from the background (Borenstein and Ullman, 2002;
Mikolajczyk and Schmid, 2003b). The affine invari-
ant detector is presented in Section 3. To measure the
accuracy of our detectors we introduce a repeatability
criterion which we use to evaluate and compare our
detectors to existing approaches. Section 4 presents
the evaluation criteria and the results of the compar-
ison, which shows that our detector performs better
than existing ones. Finally, in Section 5 we present
experimental results for matching.
1.1. Related Work
Many approaches have been proposed for extracting
scale and affine invariant features. These are reviewed
in the following.
Scale Invariant Detectors. There are a few ap-
proaches which are truly invariant to significant scale
changes. Typically, such techniques assume that the
scale change is the same in every direction, although
they exhibit some robustness to weak affine deforma-
tions. Existing methods search for local extrema in
the 3D scale-space representation of an image (x, y
and scale). This idea was introduced in the early
eighties by Crowley (1981) and Crowley and Parker
(1984). In this approach the pyramid representation
is computed using difference-of-Gaussian filters. A
feature point is detected if a local 3D extremum is
present and if its absolute value is higher than a
threshold. The existing approaches differ mainly in the
differential expression used to build the scale-space
representation.
Lindeberg (1998) searches for 3D maxima of scale
normalized differential operators. He proposes to use
the Laplacian-of-Gaussian (LoG) and several other
derivative based operators. The scale-space represen-
tation is built by successive smoothing of the high res-
olution image with Gaussian based kernels of different
size. The LoG operator is circularly symmetric and it
detects blob-like structures. The scale invariance of in-
terest point detectors with automatic scale selection has
also been explored by Bretzner and Lindeberg (1998)
in the context of tracking.
Lowe (1999) proposed an efficient algorithm for
object recognition based on local 3D extrema in
the scale-space pyramid built with difference-of-
Gaussian (DoG) filters. The input image is successively
smoothed with a Gaussian kernel and sampled. The
difference-of-Gaussian representation is obtained by
subtracting two successive smoothed images. Thus, all
the DoG levels are constructed by combined smoothing
and sub-sampling. The local 3D extrema in the pyramid
representation determine the localization and the scale
of the interest points. The DoG operator is a close ap-
proximation of the LoG function but the DoG can sig-
nificantly accelerate the computation process (Lowe,
1999). A few images per second can be processed with
this algorithm.
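The closeness of the two operators can be checked numerically. The following is a minimal sketch, not taken from Lowe's implementation: it assumes an isotropic 2D Gaussian, uses the identity ∂g/∂σ = σ∇²g (so that g(kσ) − g(σ) ≈ (k − 1)σ²∇²g for k close to 1), and the function names, the scale ratio k = 1.01 and the sampled radii are illustrative choices.

```python
import math

def gauss2d(r, sigma):
    """Isotropic 2D Gaussian evaluated at radius r."""
    return math.exp(-r * r / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)

def log2d(r, sigma):
    """Laplacian of the 2D Gaussian at radius r (not scale-normalized)."""
    return gauss2d(r, sigma) * (r * r / sigma ** 4 - 2.0 / sigma ** 2)

def dog2d(r, sigma, k):
    """Difference of two Gaussians whose scales differ by the factor k."""
    return gauss2d(r, k * sigma) - gauss2d(r, sigma)

# Assumed values for illustration only.
sigma, k = 2.0, 1.01
radii = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]

dog = [dog2d(r, sigma, k) for r in radii]
# First-order approximation: (k - 1) * sigma^2 * LoG.
approx = [(k - 1.0) * sigma ** 2 * log2d(r, sigma) for r in radii]

# Largest deviation, relative to the peak magnitude of the approximation.
peak = max(abs(a) for a in approx)
max_rel_err = max(abs(d - a) for d, a in zip(dog, approx)) / peak
print(max_rel_err)
```

The approximation is first order in (k − 1), so the deviation grows as the ratio between successive scales increases.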
The common drawback of the DoG and the LoG rep-
resentation is that local maxima can also be detected in
the neighborhood of contours or straight edges, where
the signal change is only in one direction. These max-
ima are less stable because their localization is more
sensitive to noise or small changes in neighboring tex-
ture. A more sophisticated approach, solving this prob-
lem, is to select the scale for which the trace and the
determinant of the Hessian matrix (H) simultaneously
assume a local extremum (Mikolajczyk, 2002). The
trace of the H matrix is equal to the LoG but detect-
ing simultaneously the maxima of the determinant pe-
nalizes points for which the second derivatives detect
signal changes in only one direction. A similar idea
is explored in the Harris detector, although it uses the
first derivatives. The second derivative gives a small
response exactly in the point where the signal change
is most significant. Therefore the maxima are not lo-
calized exactly at the largest signal variation, but in its
neighborhood.
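A small worked example may help to see why the determinant is useful here. The sketch below uses made-up Hessian entries (they are assumptions, not measurements): a blob-like point with equal curvature in both directions and a ridge-like point with curvature in one direction only. Both give the same trace, that is the same Laplacian response, but only the blob has a non-zero determinant. The helper names are hypothetical.

```python
import math

def trace_and_det(lxx, lxy, lyy):
    """Trace and determinant of the Hessian [[lxx, lxy], [lxy, lyy]]."""
    return lxx + lyy, lxx * lyy - lxy * lxy

def rotate_hessian(lxx, lxy, lyy, angle):
    """Hessian entries after rotating the coordinate frame by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    rxx = c * c * lxx + 2.0 * c * s * lxy + s * s * lyy
    rxy = -c * s * lxx + (c * c - s * s) * lxy + c * s * lyy
    ryy = s * s * lxx - 2.0 * c * s * lxy + c * c * lyy
    return rxx, rxy, ryy

# Illustrative second derivatives (Lxx, Lxy, Lyy).
blob = (-1.0, 0.0, -1.0)   # signal changes in both directions
ridge = (-2.0, 0.0, 0.0)   # signal changes in one direction only

trace_blob, det_blob = trace_and_det(*blob)
trace_ridge, det_ridge = trace_and_det(*ridge)

# The same ridge seen at 45 degrees: trace and determinant are unchanged.
trace_rot, det_rot = trace_and_det(*rotate_hessian(*ridge, math.pi / 4.0))

print(trace_blob, det_blob)
print(trace_ridge, det_ridge)
print(trace_rot, det_rot)
```

Requiring a simultaneous extremum of the determinant therefore suppresses the ridge regardless of its orientation, while the trace alone cannot distinguish the two cases.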
A different approach for the scale selection was pro-
posed by Kadir and Brady (2001). They explore the
idea of using local complexity as a measure of saliency.
The salient scale is selected at the entropy extremum
of the local descriptors. The selected scale is therefore
descriptor dependent. The method searches for scale lo-
calized features with high entropy, with the constraint
that the scale is isotropic.
Affine Invariant Detectors. An affine invariant de-
tector can be seen as a generalization of the scale in-
variant detector. In the case of an affine transformation
the scaling can be different in each direction. The non-
uniform scaling has an influence on the localization, the
scale and the shape of a local structure. Therefore, the
scale invariant detectors fail in the case of significant
affine transformations.
An affine invariant algorithm for corner detection
was proposed by Alvarez and Morales (1997). They
apply affine morphological multi-scale analysis to ex-
tract corners. For each extracted point they build a chain
of points detected at different scales, but associated
with the same local image structure. The final loca-
tion and orientation of the corner is computed using
the bisector line given by the chain of points. A similar
idea was previously explored by Deriche and Giraudon
(1993). The main drawback of these approaches is that
an interest point in images of natural scenes cannot
be approximated by a model of a perfect corner, as it
can take any form of a bi-directional signal change.
The real points detected at different scales do not move
along a straight bisector line as the texture around the
points significantly influences the location of the local
maxima. This approach cannot be a general solution
to the problem of affine invariance but gives good re-
sults for images where the corners and multi-junctions
are formed by straight or nearly straight step-edges.
Our approach makes no assumption on the form of a
local structure. It only requires a bi-directional signal
change.
Recently, Tuytelaars and Van Gool (1999, 2000) pro-
posed two approaches for detecting image features in
an affine invariant way. The first one starts from Harris
points and uses the nearby edges. Two nearby edges,
which are required for each point, limit the number of
potential features in an image. A parallelogram region
is bounded by these two edges and the initial Harris
point. Several intensity based functions are used to de-
termine the parallelogram. In this approach, a reliable
algorithm for extracting the edges is necessary. The sec-
ond method is purely intensity-based and starts with ex-
traction of local intensity extrema. Next, the algorithm
investigates the intensity profiles along rays going out
of the local extremum. An ellipse is fitted to the re-
gion determined by significant changes in the intensity
profiles. A similar approach based on local intensity
extrema was introduced by Matas et al. (2002). They
use the watershed algorithm to find intensity regions
and fit an ellipse to the estimated boundaries.
Lindeberg and Garding (1997) developed a method
for finding blob-like affine features with an iterative
procedure in the context of shape from texture. The
affine invariance of shape adapted fixed points was also
used for estimating surface orientation from binocular
data (shape from disparity gradients). This work pro-
vided the theory for the affine invariant detector pre-
sented in this paper. It explores the properties of the
second moment matrix and iteratively estimates the
affine transformation of local patterns. The authors pro-
pose to extract the points using the maxima of a uniform
scale-space representation and to iteratively modify the
scale and the shape of points. However, the location of
points is detected only at the initial step of the algo-
rithm, by the circularly symmetric, not affine invariant
Laplacian measure. Therefore, the spatial location of
the maximum can be slightly different if the pattern un-
dergoes a significant affine deformation. This method
was also applied to detect elliptical blobs in the con-
text of hand tracking (Laptev and Lindeberg, 2001).
The affine shape estimation was used for matching and
recognition by Baumberg (2000). He extracts interest
points at several scales using the Harris detector and
then adapts the shape of the point neighborhood to
the local image structure using the iterative procedure
proposed by Lindeberg. The affine shape is estimated
for a fixed scale and fixed location, that is, the scale
and the location of the points are not extracted in an
affine invariant way. The points as well as the associ-
ated regions are therefore not invariant in the case of
significant affine transformations (see Section 4.1 for a
quantitative comparison). Furthermore, there are many
points repeated at the neighboring scale levels (Fig. 2),
which increases the probability of false matches and
the complexity. Recently, Schaffalitzky and Zisser-
man (2002) extended the Harris-Laplace detector
(Mikolajczyk and Schmid, 2001) by affine normaliza-
tion proposed by Baumberg (2000). However, the loca-
tion and scale of points are provided by the scale invari-
ant Harris-Laplace detector (Mikolajczyk and Schmid,
2001), which is not invariant to significant affine
transformations.
2. Scale Invariant Interest Point Detector
The evaluation of interest point detectors presented in
Schmid et al. (2000) demonstrates the excellent performance
of the Harris detector compared to other existing
approaches (Cottier, 1994; Forstner, 1994; Heitger
et al., 1992; Horaud et al., 1990). However this detec-
tor is not invariant to scale changes. In this section
we propose a new interest point detector that combines
the reliable Harris detector (Harris and Stephens, 1988)
with automatic scale selection (Lindeberg, 1998) to ob-
tain a scale invariant detector. In Section 2.1 we intro-
duce the methods on which we base the approach. In
Section 2.2 we discuss in detail the scale invariant
detector and present an example of extracted points.
2.1. Feature Detection in Scale-Space
Scale Adapted Harris Detector. The Harris detector
is based on the second moment matrix. The second
moment matrix, also called the auto-correlation matrix,
is often used for feature detection or for describing local
image structures. This matrix must be adapted to scale
changes to make it independent of the image resolution.
The scale-adapted second moment matrix is defined by:
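\[
\mu(\mathbf{x}, \sigma_I, \sigma_D) =
\begin{bmatrix} \mu_{11} & \mu_{12} \\ \mu_{21} & \mu_{22} \end{bmatrix}
= \sigma_D^2 \, g(\sigma_I) *
\begin{bmatrix}
L_x^2(\mathbf{x}, \sigma_D) & L_x L_y(\mathbf{x}, \sigma_D) \\
L_x L_y(\mathbf{x}, \sigma_D) & L_y^2(\mathbf{x}, \sigma_D)
\end{bmatrix}
\qquad (1)
\]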
where σI is the integration scale, σD is the differen-
tiation scale and La is the derivative computed in the
a direction. The matrix describes the gradient distri-
bution in a local neighborhood of a point. The local
derivatives are computed with Gaussian kernels of the
size determined by the local scale σD (differentiation
scale). The derivatives are then averaged in the neigh-
borhood of the point by smoothing with a Gaussian
window of size σI (integration scale). The eigenvalues
of this matrix represent two principal signal changes
in the neighborhood of a point. This property enables
the extraction of points for which both curvatures are
significant, that is, the signal change is significant in two
orthogonal directions (e.g. corners and junctions). Such
points are stable in arbitrary lighting conditions and are
representative of an image. One of the most reliable in-
terest point detectors, the Harris detector (Harris and
Stephens, 1988), is based on this principle. The Harris
measure combines the trace and the determinant of the
second moment matrix:
\[
\text{cornerness} = \det(\mu(\mathbf{x}, \sigma_I, \sigma_D))
- \alpha \, \mathrm{trace}^2(\mu(\mathbf{x}, \sigma_I, \sigma_D))
\qquad (2)
\]
Local maxima of cornerness determine the location of
interest points.
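The two equations above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Gaussian kernels are truncated at 3σ, derivatives are approximated by central differences of the σD-smoothed image, α = 0.06 is an assumed value (the text above does not fix it), and the function names and the synthetic test image are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sampled, normalized 1D Gaussian truncated at 3 sigma."""
    radius = int(np.ceil(3.0 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x * x / (2.0 * sigma * sigma))
    return k / k.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing with edge replication at the border."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    conv = lambda v: np.convolve(v, k, mode="valid")
    rows = np.apply_along_axis(conv, 1, padded)
    return np.apply_along_axis(conv, 0, rows)

def harris_cornerness(img, sigma_i, sigma_d, alpha=0.06):
    """Cornerness of Eq. (2) from the scale-adapted matrix of Eq. (1)."""
    L = smooth(img.astype(float), sigma_d)
    # np.gradient returns derivatives along axis 0 (y) and axis 1 (x).
    Ly, Lx = np.gradient(L)
    s2 = sigma_d ** 2
    m11 = s2 * smooth(Lx * Lx, sigma_i)
    m12 = s2 * smooth(Lx * Ly, sigma_i)
    m22 = s2 * smooth(Ly * Ly, sigma_i)
    det = m11 * m22 - m12 * m12
    trace = m11 + m22
    return det - alpha * trace ** 2

# Synthetic image: a bright quadrant with one corner near (32, 32).
img = np.zeros((64, 64))
img[32:, 32:] = 1.0

R = harris_cornerness(img, sigma_i=2.0, sigma_d=1.0)
cy, cx = np.unravel_index(np.argmax(R), R.shape)
print(cy, cx)
```

On this image the cornerness is positive near the corner and negative along the straight edges, where the determinant vanishes and only the trace term remains.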
Automatic Scale Selection. Automatic scale selec-
tion and the properties of the selected scales have been
extensively studied by Lindeberg (1998). The idea is to
select the characteristic scale of a local structure, for
which a given function attains an extremum over scales.
In relation to automatic scale selection, the term char-
acteristic originally referred to the fact that the selected
scale estimates the characteristic length of the corre-
sponding image structures, in a similar manner as the
notion of characteristic length is used in physics. The
selected scale is characteristic in the quantitative sense,
since it measures the scale at which there is maximum
similarity between the feature detection operator and
the local image structures. This scale estimate will (for
a given image operator) obey perfect scale invariance
under rescaling of the image pattern.
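This invariance can be illustrated on an ideal Gaussian blob of standard deviation s, for which the scale-normalized Laplacian at the blob center has the closed form |σ²∇²L| = σ² / (π(s² + σ²)²), with its maximum at σ = s. The sketch below is a minimal example under these assumptions; the scale grid (σ0 = 1, ratio √2 between levels, 12 levels) and the function names are illustrative choices, not values taken from the text.

```python
import math

def normalized_log_at_center(blob_sigma, sigma):
    """|sigma^2 * Laplacian| at the center of a Gaussian blob smoothed at scale sigma."""
    t = blob_sigma ** 2 + sigma ** 2
    return sigma ** 2 / (math.pi * t * t)

def characteristic_scale(blob_sigma, scales):
    """Scale at which the normalized Laplacian response is maximal."""
    responses = [normalized_log_at_center(blob_sigma, s) for s in scales]
    best = max(range(len(scales)), key=responses.__getitem__)
    return scales[best]

# Assumed geometric scale grid: sigma_n = sqrt(2)^n, n = 0..11.
scales = [2.0 ** (n / 2.0) for n in range(12)]

s_small = characteristic_scale(4.0, scales)   # blob of size 4
s_large = characteristic_scale(8.0, scales)   # same blob, image rescaled by 2
print(s_small, s_large, s_large / s_small)
```

The ratio of the two selected scales reproduces the factor by which the pattern was rescaled, up to the resolution of the scale grid.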
Given a point in an image and a scale selection op-
erator we compute the operator responses for a set
of scales σn (Fig. 1). The characteristic scale corre-
sponds to the local extremum of the responses. Note
that there might be several maxima or minima, that
is, several characteristic scales corresponding to different
local structures centered on this point. The characteristic
scale is relatively independent of the image
resolution. It is related to the structure and not to the
resolution at which the structure is represented. The
ratio of the scales at which the extrema are found for
corresponding points is the actual scale factor between
the point neighborhoods. In Mikolajczyk and Schmid
(2001) we compared several differential o