Multilayer pLSA for Multimodal Image Retrieval
∗
Rainer Lienhart
Multimedia Computing Lab
University of Augsburg
Augsburg, Germany
lienhart@informatik.uni-
augsburg.de
Stefan Romberg
Multimedia Computing Lab
University of Augsburg
Augsburg, Germany
romberg@informatik.uni-
augsburg.de
Eva Ho¨rster
Multimedia Computing Lab
University of Augsburg
Augsburg, Germany
hoerster@informatik.uni-
augsburg.de
ABSTRACT
It is current state of knowledge that our neocortex consists of
six layers [10]. We take this knowledge from neuroscience as
an inspiration to extend the standard single-layer probabilis-
tic Latent Semantic Analysis (pLSA) [13] to multiple layers.
As multiple layers should naturally handle multiple modali-
ties and a hierarchy of abstractions, we denote this new ap-
proach multilayer multimodal probabilistic Latent Semantic
Analysis (mm-pLSA). We derive the training and inference
rules for the smallest possible non-degenerated mm-pLSA
model: a model with two leaf-pLSAs (here from two different
data modalities: image tags and visual image features) and
a single top-level pLSA node merging the two leaf-pLSAs.
From this derivation it is obvious how to extend the learn-
ing and inference rules to more modalities and more layers.
We also propose a fast and strictly stepwise forward proce-
dure to initialize bottom-up the mm-pLSA model, which in
turn can then be post-optimized by the general mm-pLSA
learning algorithm. We evaluate the proposed approach ex-
perimentally in a query-by-example retrieval task using 50-
dimensional topic vectors as image models. We compare
various variants of our mm-pLSA system to systems relying
solely on visual features or tag features and analyze possi-
ble pitfalls of the mm-pLSA training. It is shown that the
best variant of the the proposed mm-pLSA system outper-
forms the unimodal systems by approximately 19% in our
query-by-example task.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
D.2.8 [Software Engineering]: Metrics—complexity mea-
sures, performance measures; I.4.8 [Scene Analysis]: Sen-
sor fusion; I.4.10 [Image Representation]: Statistical
∗Funded by Deutsche Forschungsgemeinschaft (DFG) under
contract number LI 1816/1-1
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CIVR ’09, July 8-10, 2009 Santorini, GR
Copyright 2009 ACM 978-1-60558-480-5/09/07 ...$5.00.
General Terms
Probabilistic image models, hierarchical pLSA, image re-
trieval, pLSA, SIFT, image tags
Keywords
Image retrieval, multimodal pLSA, hierarchical pLSA, SIFT,
tags
1. INTRODUCTION
Many content-based image retrieval systems either solely
rely on visual features or on text features to derive a rep-
resentation of the image content. This is especially true
for systems using topic models based on probabilistic La-
tent Semantic Analysis (pLSA) [6, 18, 13]. There are good
reasons why pLSA is applied to unimodal data: The ap-
parently straightforward application of pLSA to multimodal
data by subsuming all words of the various modes (which
are generally derived from appropriate features of the re-
spective modality) into one large word set (called vocabu-
lary) frequently does not lead to the expected improvement
in retrieval performance. Even mixing words derived from
different kinds of features within one domain such as differ-
ent kinds of visual salient point descriptors (e.g., SIFT [19],
SURF [11], Geometric blur [2], or self-similarity feature [22])
using different sampling strategies (e.g., dense versus sparse
sampling) does not work satisfactorily with this straightfor-
ward application of pLSA.
Thus, we propose a multilayer multimodal pLSA model
that can handle different modalities as well as different fea-
tures within a mode effectively and efficiently. For this we
introduce not just a single layer of topics or aspects, but a
hierarchy of topics. We explain the overall approach by us-
ing the smallest possible non-degenerated mm-pLSA model:
a model with two separate sets of topics for data from two
different modes and a set of top-level topics that merges
the knowledge of the two leaf-topic sets. This resembles
somewhat two leaf-pLSAs from two different data modali-
ties that in turn are merged by a single top-level pLSA node,
and lends the proposed approach its name: mm-pLSA. From
this derivation it is obvious how to extend the learning and
inference rules to more modalities and more layers. We also
propose a fast and strictly stepwise forward procedure to ini-
tialize bottom-up the mm-pLSA model that leads to much
better learning results of the mm-pLSA learning algorithm
compared to random initialization.
The paper is organized as follows. In Section 2 we first
describe the model of the standard pLSA algorithm (Sub-
section 2.1) as well as how to learn a pLSA model in gen-
eral (Subsection 2.2) and specifically from the visual features
(Subsection 2.3) and tag features (Subsection 2.4). Classi-
fication of a new image or text document is also addressed.
Then, Section 3 presents the core novelty of our work in de-
tail: the multilayer multimodal probabilistic Latent Semantic
Analysis model (mm-pLSA). It starts in Subsection 3.1 with
a motivation and a detailed explanation of the model, before
we derive the training and inference steps in Subsection 3.2.
A heuristic for fast and good initialization of the multilayer
multimodal pLSA model is presented in Subsection 3.3 and
carefully evaluated in Section 4 on a large scale database
consisting of 246,347 images downloaded from Flickr. Our
proposed mm-pLSA based image retrieval system is com-
pared to systems relying solely on visual features [18] or
tag features as well as to a pLSA-based system with the
combined vocabulary set from the visual and tag domain.
Section 5 details related work, before Section 6 concludes
the paper.
2. STANDARD PLSA
2.1 Motivation and Model
The pLSA was originally devised by T. Hofmann in the
context of text document retrieval, where words constitute
the elementary parts of documents [13]. Applied to images,
each image represents a single visual document. pLSA can
be applied directly to image tags, as image tags consist of
words. However, for our visual features we need comparable
elementary parts called visual words. For the moment we
assume that all features we computed in a given mode are
somehow mapped to words in that mode. Details of the
mapping from the visual features to the mode-specific words
are given in Subsection 2.3. For now we just assume that
we have words.
The key concept of the pLSA model is to map the high-
dimensional word distribution vector of a document to a
lower dimensional topic vector (also called aspect vector).
Therefore pLSA introduces a latent, i.e. unobservable, topic
layer between the documents (i.e., images here) and the
words. It is assumed that each document consists of a mix-
ture of multiple topics and that the occurrences of words
(i.e., visual words in images or tags of images, respectively)
is a result of the topic mixture. This generative model is
expressed by the following probabilistic model:
P (di, wj) = P (di)
X
K
P (zk|di)P (wj |zk) (1)
where P (di) denotes the probability of a document di of the
database to be picked, P (zk|di) the probability of a topic zk
given the current document, and P (wj |zk) the probability
of a visual word wj given a topic. The model is graphically
depicted in Fig. 1. Ni denotes the number of words of which
each of the M documents consists. It is important not to
confuse Ni, the number of words in document di, with N ,
the number of words in the vocabulary.
Once a topic mixture P (zk|di) is derived for each docu-
ment di, a high-level representation based on the respective
mode the words belong to has been found. At the same time
this representation is of low dimensionality as we commonly
choose the number of concepts in our model to be much
smaller than the number of words. The K-dimensional topic
vector can be used directly in a query-by-example retrieval
d
M
z w
Ni
Figure 1: Standard pLSA-Model.
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18
0 0 1 41 0 0 3 5 0 12 0 18 document vector
term vector
images
d1
dN
v1 vW
visual words
d i
v j
di vjn( ),
Figure 2: Term-document matrix.
task, if we measure document similarity by computing the
L1 or cosine distance between topic vectors of different doc-
uments.
2.2 Training and Inference
Computing the term-document matrix of the training cor-
pus is a prerequisite for deriving the pLSAmodel (see Fig. 2).
Each row i in the term-document matrix [n(di, wj)]i,j de-
scribes the absolute count with which word wj (also called
a term) occurs in document di. The terms are taken from
a predefined dictionary consisting of N terms. The number
of documents is M . Thus n(di, wj) specifies the number of
times the word wj occurred in document di. Note that by
normalizing each document vector to 1 using the L1-norm,
the document vector (n(di, w1), ..., n(di, wN )) of di becomes
the estimated mass probability distribution P (wj |di).
We learn the unobservable probability distribution P (zk|di)
and P (wj |zk) from the data using the Expectation-Maximi-
zation-Algorithm (EM-Algorithm) [7, 13]:
E-Step:
P (zk|di, wj) =
P (wj |zk)P (zk|di)PK
l=1 P (wj |zl)P (zl|di)
(2)
M-Step:
P (wj |zk) =
PM
i=1 n(di, wj)P (zk|di, wj)PN
j=1
PM
i=1 n(di, wj)P (zk|di, wj)
(3)
P (zk|di) =
PN
j=1 n(di, wj)P (zk|di, wj)
n(di)
(4)
Given a new test image dtest, we estimate the topic proba-
bilities from the observed words. The sole difference between
inference and learning is that theK learned conditional word
distributions P (wj |zk) are never updated during inference.
Thus, only eq. (2) and eq. (4) are iteratively updated during
inference.
2.3 Visual pLSA-Model
The first step in building a bag-of-words representation
for the visual content of images is to extract visual features
from each image. In our case we apply dense sampling with a
vertical and horizontal step size of 10 pixels across the image
pyramid created with a scale factor of 1.2 to extract local
image features at regular grid keypoints. SIFT descriptors
[19] computed over a region of 41 × 41 pixels are used to
describe the grayscale image region around each keypoint in
a scale and orientation invariant fashion. Although we use
SIFT features in this work, any other feature could be used
instead.
Next the 128-dimensional real-valued local image features
have to be quantized into discrete visual words to derive a
finite vocabulary. Quantization of the features into visual
words is performed by using a vocabulary tree [21] in order
to support large vocabulary sizes. The vocabulary tree is
computed by repeated k-means clusterings that hierarchi-
cally partition the feature space. This hierarchical approach
overcomes two major problems of the traditional direct k-
means clustering in cases where k is large. Firstly clustering
is more efficient during visual word learning and secondly
the mapping of visual features to discrete words is way faster
than using a plain list of visual words.
Once the visual vocabulary of size Nv is determined we
map each feature vector of an image to its closest visual
word. Therefore we query the vocabulary tree for each ex-
tracted feature, and the best matching visual word ID is
returned. This ID in turn is used to compute the document
vector that holds the counts of the occurrences of visual
words in the corresponding image by incrementing the asso-
ciated word count (see Fig. 3).
Note that this image content description does not preserve
any spatial relationship between the occurrences of the vi-
sual words. The co-occurrence vectors of all training images
are used to train the pLSA model. Once the pLSA model is
learned it can be applied to all images in the database, hence
deriving a vector representation for each image, where the
vector elements denote the degree to which an image depicts
a certain visual topic.
2.4 Tag-based pLSA-Model
The second modality we considered are tags – the free-text
annotations provided by the image author. We assume that
all of the images in our database have been tagged by their
authors. Besides a single word, a tag can also be a phrase
or a sentence. However, in this work we treat each word of
the image annotations separately. Thus, in the following the
term tag denotes a single word and is used interchangeably
with ”word” and ”term”.
As we use Flickr images to evaluate our multilevel multi-
modal pLSA model, it is important to note that these tags
reflect the photographer/author’s personal view with respect
to the uploaded image. Thus, in contrast to carefully anno-
tated image databases traditionally used for learning com-
bined image and tag models [1], these image tags from Flickr
are in many cases subjective, ambiguous, and do not neces-
sarily describe the image content shown [17, 18]. This makes
it difficult to use the tags directly for retrieval purposes and
thus some preprocessing is required.
breakfast eat food meal
house construction home object structure
love emotion state
Table 1: Examples for hypernyms (right) found in
Wordnet for the words left.
document
vector
query
for
hypernyms
0 1 0 0 0 1 0 1, , , , , , ,
count++
Wordnet
hypernyms
tag specified
by
photographer
count++
count++
Figure 4: Building a document vector from tags and
their hypernyms by assuming that both tag and hy-
pernyms are present in the vocabulary.
To apply a pLSA model to tags we need to define a finite
vocabulary first. Building the vocabulary starts with listing
all tags that have been used more than N thresh1 times and
by at least N thresh2 different users. This heuristics enforces
that all rarely used tags are neglected. Note that a tag is
also rarely used, if only a few users have used it independent
of the actually count. We further filter the list by discarding
all tags that contain numbers and by splitting tags at under-
scores into separate words. In a last filtering step all words
within the vocabulary are checked whether they are known
by Wordnet [8]. Wordnet is a lexical database of English.
Only words that exist according to Wordnet are assimilated
into the final vocabulary.
Once the tag vocabulary of sizeN t is defined, a co-occurrence
table is built by counting the tag occurrences for each image.
However, before forming the document vector we expand the
list of tags associated with an image by using Wordnet to
enrich the annotations. For each image we query Wordnet
for the semantic parents of the tags specified by the author.
This is done to emphasize semantic features rather than just
the simple word count features.
Semantic parents for a word can easily be extracted from
Wordnet, as each word in Wordnet is associated with hyper-
nyms1. A word’s hypernyms denote its parents that express
the specific concept of the tag more generally. Table 1 shows
hypernyms (right) for some example tags (left). As these hy-
pernyms build a hierarchy and form a tree structure, we add
the hypernyms up to three levels above in the hierarchy of
the tag itself into the tag list of the corresponding image.
Thus, while counting tag occurrences for each image in or-
der to build the document vector, these parents are included
in our model by counting them as if they were present in the
list of tags. This procedure is visualized in Fig. 4. In case
the vocabulary does not contain a tag or hypernyms used
for annotation, the word is simply ignored.
In our experiments we set the parameters for the tag vo-
cabulary to N thresh1 = 18 and N
thresh
2 = 10 resulting in a
vocabulary size of 2421 words. Most images in our database
have between 5 and 15 tags plus hypernyms as can be de-
rived from Fig. 5. The number of tags for some images
is however unreasonably large as users labeled images with
whole sentences or phrases.
1Y is a hypernym of X if every X is a (kind of) Y .
Vocabulary Tree
0 1 2 3 4 5 6 7
visual
word
SIFT
feature
0 1 2 3 4 5 6 7
0 0 0 0 0 1 0 0, , , , , , ,document
vector:
5
count++
assign visual word
Figure 3: Quantization of features into discrete visual words.
Figure 5: Histogram of the total number of tags plus
hypernyms for each image within our database.
3. MULTILAYER MULTIMODAL PLSA
3.1 Motivation and Model
In recent years pLSA has been applied successfully to
unimodal data such as text [13], image tags [20], or visual
words [16]. However, combining two modes such as visual
words and image tags is challenging. The obvious approach
of simply concatenating the two associated term-document
matrices NM×Nv and NM×Nt into NM×(Nv+Nt) and then
applying standard pLSA usually does not lead to the desired
retrieval improvements. One reason is the difference in the
order of magnitude with which words occur in the respective
mode. For instance, a few thousand to ten thousand features
per image are usually computed from images that are resized
to having roughly the same number of dense samples while
preserving the image’s aspect ratio. In contrast, most im-
ages are annotated with fewer than 20 tags. Compensating
between the differences in the order of the magnitude by
some kind of normalization is possible, but will require a
lot of testing to determine an appropriate weighting factor
between the different modes since the actual importance of
each mode must also be taken into account. Another reason
may be the difference in the size of the respective vocabu-
laries. In contrast, a well-founded mathematical approach
with top-level topics will solve this issue effectively and ef-
ficiently. Some empirical evidence for these claims will be
given in section 4.
Our basic idea is to apply pLSA in a first step to each
mode separately, and in a second step concatenate the de-
rived topic vectors of each mode to learn another pLSA on
top of that (see Fig. 8). While we describe this layering of
multiple pLSAs only for two leaf-pLSAs and a node pLSA, it
is obvious that the proposed pLSA layering can be extended
to more than two layers and applied to more than just two
leaf-pLSAs.
The smallest possible multilayer multimodal pLSA model
(mm-pLSA) considering two modes with their respective ob-
servable word occurrences and hidden topics as well as a sin-
gle top-level of hidden aspects is graphically depict in Fig. 6.
Every word of mode x (here: x ∈ {v, t} with v standing for
visual and t for text) occurring in document di is generated
by an unobservable model document:
• Pick a document di with prior probability P (di)
• For each visual word in the document:
– Select a latent top-level concept ztopl with proba-
bility P (ztopl |di)
– Select a visual topic zvk with probability P (z
v
k |z
top
l )
– Generate a visual word wvm with probability P (w
v
m|z
v
k)
• For each tag associated with the document:
– Select a latent top-level concept ztopl with proba-
bility P (ztopl |di)
– Select an image tag topic ztp with probabilityP (z
t
p|z
top
l )
– Generate an image tag wtn with probability P (w
t
n|z
t
p)
Thus the probability of observing a visual word wvm or a tag
wtn in document di is
P (di, w
v
m) =
LX
l=1
KX
k=1
P (di)P (z
top
l
|di)P
本文档为【Multilayer pLSA】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。