1. Introduction
The feature extraction problem, in the domain of image analysis systems, poses two
main research questions. How can
distinctive areas within an image
be
detected? And, how can distinctive areas be
described
in such a way as to facilitate their identification? The main concepts to be taken
from those questions are: detection and description. Concerning the former, the
detection problem, a mainstay in vision systems are interest point or interest
region extraction algorithms. These techniques search for image pixels, or image
regions, that exhibit high signal variations with respect to a particular local
measure. Solutions have been designed based on studying intrinsic properties of
2D-signals [
1
H. P. Moravec, “Towards automatic visual obstacle
avoidance,” in IJCAI, pp.
584 (1977).
], and more recently a novel methodology has been proposed
that solves a properly framed optimization problem and automatically synthesizes
interest point operators with Genetic Programming [
2
L. Trujillo and G. Olague, “Synthesis of interest point
detectors through genetic programming, ” in
Proceedings from GECCO 2006, Mike Cattolico, eds., Vol.1, (ACM Press
2006), pp.
887–894.
,
3
L. Trujillo and G. Olague, “Using Evolution to learn how to
perform interest point detection,” in
Proceedings from ICPR 2006, X.Y. Tang, et al., eds., Vol. 1,
(IEEE
2006), pp.
211–214.
]. In response to the second question, dealing with the
concept of description, different techniques have been proposed that encode the
information within these so called interesting regions. Hence, discriminative
feature are constructed that uniquely characterize each interest regions. This in
turn allows for efficient feature matching in a wide range of imaging problems.
Currently, the SIFT [
4
D. G. Lowe, “Distinctive Image Features from
Scale-Invariant Keypoints,” Int. J.
Comput. Vision
2, 91–110
(2004). [CrossRef]
] descriptor has proven to be the most discriminative local
descriptor in machine vision literature, and shows the highest performance with
respect to the current set of benchmark tests [
5
C. Schmid and K. Mikolajczyk, “A performance evaluation of local
descriptors,” IEEE Trans. on Patt. Ana.
and Mach. Int.
27, 1615–1630
(2005). [CrossRef]
].
This paper presents a novel region descriptor based on the concept of
Hölderian regularity. By approximating the pointwise Hölder
exponent, also known as the Lipschitz exponent, using local signal oscillations
around each image point, we are able to construct discriminative feature vectors.
Our proposed descriptor is invariant to several types of changes in viewing
conditions, exhibiting high and stable performance. As such, the main contribution
of this work is that it introduces novel concepts to the field of feature extraction
algorithms, using formal mathematical tools and corroborated by high performance on
standard tests.
The remainder of this paper is organized as follows. Section 2 gives a brief overview
of related work. Section 3 presents the concept of Hölderian regularity
and how to estimate it. Section 4 introduces our local descriptor based on pointwise
Hölder exponents. Later in Section 5, experimental results are provided.
Section 6 presents conclusions and outline future work.
2. Related work
It is not our intention to give a comprehensive summary on the subject of local
descriptors, such a discussion can be found in Ref. [
5
C. Schmid and K. Mikolajczyk, “A performance evaluation of local
descriptors,” IEEE Trans. on Patt. Ana.
and Mach. Int.
27, 1615–1630
(2005). [CrossRef]
]. Hence, we will only focus on presenting the basic
strategies followed by the most common type of region detectors, distribution based
descriptors, and discuss the SIFT strategy.
Currently, most state-of-the-art local descriptors use a distribution based approach.
These techniques characterize image information using local histograms of a
particular measure related to shape or appearance. The most simple would be using
histograms of pixel intensities, while more complex representations could be based
on representing texture characteristics. The most successful descriptor currently
available in computer vision literature is SIFT, developed by David Lowe in [
4
D. G. Lowe, “Distinctive Image Features from
Scale-Invariant Keypoints,” Int. J.
Comput. Vision
2, 91–110
(2004). [CrossRef]
], which builds an histogram of gradient distributions within
an interest region. The descriptor builds a 3D histogram of gradient locations and
orientations, weighted by the gradient magnitudes. Although SIFT combines both a
scale invariant detector with the gradient distribution descriptor, only the latter
has proven to outperform other types of techniques, and it is possible to replace
the former with a more reliable region detector.
3. Hölder regularity
One of the most popular ways to measure the regularity of a signal, be it pointwise
or local, is to consider Hölder spaces. Hence, we will present the
concept of regularity expressed through the Hölder exponent.
DEFINITION 1. Let f:
ℜ→ℜ, s ∈
ℜ+*\N and
x
0 ∈ ℜ. Then, f
∈
C
s(x
0)⇔∃η
∈ ℜ+*, a polynom P of
degree < s and a constant c such that
The pointwise Hölder exponent of
f at
x
0 is α
p = sup
s
{
f ∈
Cs
(
x
0)} (see
Fig. 1).
Fig. 1. Hölderian envelope of signal f at point x0.
The concept of signal regularity, characterized by the Hölder exponent,
has been widely used in fractal analysis [
6
K. Falconer, Fractal geometry Mathematical Foundations and
Applications (Wiley,
1990).
].With regards to image analysis, the Hölder
exponent provides a great deal of information related to the local structure around
each point. Hence, it has been applied to such tasks as edge detection [
7
J. Lévy Véhel, “Fractal Approaches in Signal
Processing,” Fractals
3, 755–775
(1995). [CrossRef]
] and image interpolation [
8
P. Legrand and J. Lévy Véhel, “Local regularity-based
interpolation,” in WAVELET X, Part
of SPIE’s Symposium on Optical Science and Technology, SPIE
5207 (2003).
]. Furthermore, because most local image descriptors are
fundamentally attempting to describe local image variations and overall structure,
it is a natural conclusion to expect that Hölderian regularity will prove
to be a useful tool in this task.
3.1. Estimating the Hölder exponent with oscillations
The most natural way to estimate the Hölder exponent, because it
follows from its definition, consists in studying the oscillations around each
point. This method gives accurate results, better than those obtained using
wavelet analysis inmost cases [
9
P. Legrand, “Debruitage et interpolation par
analyse de la regularite Hölderienne. Application a la
modelisation du frottement pneumatique-chaussee,”
P.h.D. thesis, Université de Nantes (2004).
], hence it will be the technique of choice to compute our
proposed descriptor. A brief description of this technique will now be given,
for a more detailed analysis please see Ref. [
10
C. Tricot, Curves and Fractal Dimension
(Springer-Verlag
1995). [CrossRef]
]. It is pointed out that the Hölder exponent
of function
f (
t) at
t is
α
p ∈ [0,1], if a constant
c exists such that ∀
t′
in a vicinity of
t,
In terms of signal oscillations, this condition can be written as: a function
f(t) is Hölderian with exponent
αp ∈ [0,1] at t if
∃c ∀τ such that
osc
τ (t)
≤ cταp
,
with
An estimation of the regularity will be built at each point by computing the
slope of the regression between the logarithm of the oscillation and the
logarithm of the dimension of the neighborhood at which one calculates the
oscillation. From an algorithmic point of view, it is preferable not to use all
sizes of neighborhoods between two values τ
min
and τ
max
. Hence, we calculate the oscillation at point
t only on
intervals of the form [
t-τ
r :
t+τ
r], where τ
r
=
baser
. Here, we use least squares
regression, with
base = 2 and
r = 1,2,
… ,7. For a 2D signal,
t defines a point in 2D space
and τ
r a radius around
t, such that the
Euclidian distances
d(
t′,
t) and
d(
t′′,
t) are ≤τ
r.We can visualize
this process in
Fig. 2. The method is reliable under three conditions:
that α
p< 1, the regression converges, and it
converges towards a valid slope.
Fig. 2. Estimating the Hölder exponent with oscillations.
Left: the region of interest λ, and three of the
seven neighborhoods around point t, when
r = 1,2, ⋯,7. Center: the
neighborhood of radius τ5 = 32 pixels, with
base = 2. Right: computing the
supremum of the differences within radius τ5,
where d denotes the Euclidian distance.
4. Hölder descriptor
Now that we have described a method to accurately characterize the pointwise signal
regularity, we can now move on to describe how we use this information to build our
local descriptors. The process, described in
Fig. 3, is as follows. First, a set Λ of regions
of interest are extracted from an image. Second, the dominant gradient orientation
ϕ
λ is computed, this preserves rotation
invariance. Finally, our feature vector δ
λ contains
the Hölder exponent α
pof the region center and of
128 concentric points, ordered according to ϕ
λ.
Fig. 3. Descriptor building process. First, a region detector extrats a set
Λ of interesting regions. Then, ∀λ
∈ Λ we compute a decriptor
δλ . A descriptor contains the
Hölder exponent at the region center
(x
λ,
y
λ), and of 32 points on the
perimeter of four concentric rings, each ring with radii of and s
λ
respectively.
Region extraction: The first step in the process requires stable
detection of salient image regions. The type of regions to be extracted will depend
on the requirement of the higher level application with respect to invariance. For
instance, an interest point detector will suffice when the scale of the imaged scene
is not modified. In our work, we use a detector optimized for geometric stability
and global point separability, the IPGP2 detector which is the determinant of the
Hessian matrix smoothed by a 2D Gaussian kernel [
2
L. Trujillo and G. Olague, “Synthesis of interest point
detectors through genetic programming, ” in
Proceedings from GECCO 2006, Mike Cattolico, eds., Vol.1, (ACM Press
2006), pp.
887–894.
,
3
L. Trujillo and G. Olague, “Using Evolution to learn how to
perform interest point detection,” in
Proceedings from ICPR 2006, X.Y. Tang, et al., eds., Vol. 1,
(IEEE
2006), pp.
211–214.
]. All regions extracted with an interest point detector are
assigned the same scale,
w
λ = 2.5 pixels. For
images where scale is a factor, we use the Hessian-Laplace detector presented in
Ref. [
11
K. Mikolajczyk and C. Schmid, “Scale and affine invariant interest
point detectors,” Int. J. Comput. Vision
1, 63–86
(2004). [CrossRef]
], which searches for extrema in a linear scale space
generated with a Gaussian kernel. After this step we are left with a set
Λ of circular regions, the scale is set to
s
λ = 5 ∙
w
λ, and
w
λ is the scale given by the detector.
Dominant orientation: In order to preserve rotation invariance, the
dominant gradient orientation is computed and used as a reference for the subsequent
sampling process. For the scale invariant detector, all image regions are normalized
to 41×41 bit size using bicubic interpolation. An orientation histogram
is constructed using gradient orientations within the interest region, similar to
what is described in Ref. [
4
D. G. Lowe, “Distinctive Image Features from
Scale-Invariant Keypoints,” Int. J.
Comput. Vision
2, 91–110
(2004). [CrossRef]
]. The histogram peak is obtained and thus
∀λ∈ Λ a corresponding dominant
orientation
ϕ
λ is assigned. In
this way, each region is described by a set λ =
{
x
λ,
y
λ,
s
λ, ϕ
λ},
the region center, scale and orientation of the region.
Descriptor: Now that regions are appropriately detected and described
with λ, we can now continue to construct our region descriptor
δ
λ,∀λ∈
Λ. Our sampling process is simple, see
Fig. 3, the first element of
δ
λ is the Hölder exponent
α
p computed at the region center
(
x
λ,
y
λ).
Next, the Hölder exponent of points on the perimeter of four concentric
rings are sampled, with radii of
and
s
λ respectively. A
total of 32 points on each ring are sampled, starting from the position given by
ϕ
λ, uniformly spaced and ordered
counterclockwise. Hence, our feature vector δ
λ has
129 dimensions, compared to the 128 of SIFT. The choice of the parameters, such as
the size of the rings and the number of sample points, is related with the challenge
of building a discriminative descriptor while at the same time maintaining a compact
representation. A problem faced by any attempt to describe real-world information.
The final values were selected empirically, guided by experimental runs, however an
optimization process could be advantageous, i.e. evolutionary computation.
5. Experimental results
In order to effectively evaluate and compare our results, we use standard image
sequences provided by the Visual Geometry Group [
12]. From each sequence there is one reference image and a set
of test images. Due to prior knowledge of the transformation between the reference
and test images, we can quantify a matching score. Sample images and experimental
results using the following performance metrics are shown in
Fig. 4. We evaluate with threshold based matching, where two
image regions λ
1 and λ
2 are matched if
the following relation holds:
δ(δ
λ1
,
δ
λ2
) < δ.
The value of δ is varied to obtain two types of performance curves: one
plots
Recall versus
1-Precision
, characterizing the matching between one test image and the reference
image (row 3) [
5
C. Schmid and K. Mikolajczyk, “A performance evaluation of local
descriptors,” IEEE Trans. on Patt. Ana.
and Mach. Int.
27, 1615–1630
(2005). [CrossRef]
]; the other, is a double
y-
axis plot, one axis for average
Recall and the other for average
1-Precision
, that characterizes the performance of the descriptor on a complete
sequence (rows 4 & 5).
Recall/1-Precision
gives the number of correct and
false matches between two images.
Recall is the number of correctly
matched regions with respect to the number of corresponding regions between two
images of the same scene. The number of false matches relative to the total number
of matches is represented by
1-Precision. A perfect descriptor
would give a
Recall equal to 1 for any
Precision.
Recall and
1-precision are defined as in Ref. [
5
C. Schmid and K. Mikolajczyk, “A performance evaluation of local
descriptors,” IEEE Trans. on Patt. Ana.
and Mach. Int.
27, 1615–1630
(2005). [CrossRef]
]:
, and
Note that the second type of plot, includes errorbars to visualize
the stability of the descriptor. The performance of our descriptor is compared
against SIFT. To compute SIFT descriptors, the Harris and Harris-Laplace detectors
were used to extract image regions (executables obtained from Ref. [
12]).
Figure 4 exhibits the following patterns.
Rotation: the Hölder descriptor outper-forms SIFT, with
higher
Recalland better
Precision.
Illumination & JPEG : very comparable overall
performance.
Scale: both exhibit the same performance patterns with
SIFT consistently better.
Fig. 4. Columns, left to right: 1) Rotation (36 images in sequence), 2)
Illumination change (10 images), 3) JPEG compression (6 images), and 4)
Scale change (first 6 images of sequence). Rows, top to bottom:
1) Reference image, 2) Test Image, 3) Performance between test and reference
with Hölder in green and SIFT in red, plotting
Recall vs. 1-Precision, 4) & 5)
Average performance on the complete image sequence for SIFT and
Hölder respectively (y-axis with
Recallin blue and 1-Precision in green and
threshold on x-axis).
6. Conclusions and future work
Results show very promising experimental results, in general we can appreciate how
the regularity and SIFT descriptors exhibit comparable performance. For image
rotation and illumination change, our Hölder descriptor is consistently
better, with the opposite being true for JPEG compression and scale change. In the
case of scale change, the performance of our descriptor is expected to be directly
related to the method of Hölder exponent estimation. For this reason, an
appropriate modification of the oscillations method is necessary. For image
compression, it is a consequence of the intrinsic change in image regularity induced
by this transformation.