Multilayer pLSA

Multilayer pLSA Multilayer pLSA for Multimodal Image Retrieval ∗ Rainer Lienhart Multimedia Computing Lab University of Augsburg Augsburg, Germany lienhart@informatik.uni- augsburg.de Stefan Romberg Multimedia Computing Lab University of Augsburg Augsburg, Germany romb...

Multilayer pLSA for Multimodal Image Retrieval ∗ Rainer Lienhart Multimedia Computing Lab University of Augsburg Augsburg, Germany lienhart@informatik.uni- augsburg.de Stefan Romberg Multimedia Computing Lab University of Augsburg Augsburg, Germany romberg@informatik.uni- augsburg.de Eva Ho¨rster Multimedia Computing Lab University of Augsburg Augsburg, Germany hoerster@informatik.uni- augsburg.de ABSTRACT It is current state of knowledge that our neocortex consists of six layers [10]. We take this knowledge from neuroscience as an inspiration to extend the standard single-layer probabilis- tic Latent Semantic Analysis (pLSA) [13] to multiple layers. As multiple layers should naturally handle multiple modali- ties and a hierarchy of abstractions, we denote this new ap- proach multilayer multimodal probabilistic Latent Semantic Analysis (mm-pLSA). We derive the training and inference rules for the smallest possible non-degenerated mm-pLSA model: a model with two leaf-pLSAs (here from two different data modalities: image tags and visual image features) and a single top-level pLSA node merging the two leaf-pLSAs. From this derivation it is obvious how to extend the learn- ing and inference rules to more modalities and more layers. We also propose a fast and strictly stepwise forward proce- dure to initialize bottom-up the mm-pLSA model, which in turn can then be post-optimized by the general mm-pLSA learning algorithm. We evaluate the proposed approach ex- perimentally in a query-by-example retrieval task using 50- dimensional topic vectors as image models. We compare various variants of our mm-pLSA system to systems relying solely on visual features or tag features and analyze possi- ble pitfalls of the mm-pLSA training. It is shown that the best variant of the the proposed mm-pLSA system outper- forms the unimodal systems by approximately 19% in our query-by-example task. Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity mea- sures, performance measures; I.4.8 [Scene Analysis]: Sen- sor fusion; I.4.10 [Image Representation]: Statistical ∗Funded by Deutsche Forschungsgemeinschaft (DFG) under contract number LI 1816/1-1 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIVR ’09, July 8-10, 2009 Santorini, GR Copyright 2009 ACM 978-1-60558-480-5/09/07 ...$5.00. General Terms Probabilistic image models, hierarchical pLSA, image re- trieval, pLSA, SIFT, image tags Keywords Image retrieval, multimodal pLSA, hierarchical pLSA, SIFT, tags 1. INTRODUCTION Many content-based image retrieval systems either solely rely on visual features or on text features to derive a rep- resentation of the image content. This is especially true for systems using topic models based on probabilistic La- tent Semantic Analysis (pLSA) [6, 18, 13]. There are good reasons why pLSA is applied to unimodal data: The ap- parently straightforward application of pLSA to multimodal data by subsuming all words of the various modes (which are generally derived from appropriate features of the re- spective modality) into one large word set (called vocabu- lary) frequently does not lead to the expected improvement in retrieval performance. Even mixing words derived from different kinds of features within one domain such as differ- ent kinds of visual salient point descriptors (e.g., SIFT [19], SURF [11], Geometric blur [2], or self-similarity feature [22]) using different sampling strategies (e.g., dense versus sparse sampling) does not work satisfactorily with this straightfor- ward application of pLSA. Thus, we propose a multilayer multimodal pLSA model that can handle different modalities as well as different fea- tures within a mode effectively and efficiently. For this we introduce not just a single layer of topics or aspects, but a hierarchy of topics. We explain the overall approach by us- ing the smallest possible non-degenerated mm-pLSA model: a model with two separate sets of topics for data from two different modes and a set of top-level topics that merges the knowledge of the two leaf-topic sets. This resembles somewhat two leaf-pLSAs from two different data modali- ties that in turn are merged by a single top-level pLSA node, and lends the proposed approach its name: mm-pLSA. From this derivation it is obvious how to extend the learning and inference rules to more modalities and more layers. We also propose a fast and strictly stepwise forward procedure to ini- tialize bottom-up the mm-pLSA model that leads to much better learning results of the mm-pLSA learning algorithm compared to random initialization. The paper is organized as follows. In Section 2 we first describe the model of the standard pLSA algorithm (Sub- section 2.1) as well as how to learn a pLSA model in gen- eral (Subsection 2.2) and specifically from the visual features (Subsection 2.3) and tag features (Subsection 2.4). Classi- fication of a new image or text document is also addressed. Then, Section 3 presents the core novelty of our work in de- tail: the multilayer multimodal probabilistic Latent Semantic Analysis model (mm-pLSA). It starts in Subsection 3.1 with a motivation and a detailed explanation of the model, before we derive the training and inference steps in Subsection 3.2. A heuristic for fast and good initialization of the multilayer multimodal pLSA model is presented in Subsection 3.3 and carefully evaluated in Section 4 on a large scale database consisting of 246,347 images downloaded from Flickr. Our proposed mm-pLSA based image retrieval system is com- pared to systems relying solely on visual features [18] or tag features as well as to a pLSA-based system with the combined vocabulary set from the visual and tag domain. Section 5 details related work, before Section 6 concludes the paper. 2. STANDARD PLSA 2.1 Motivation and Model The pLSA was originally devised by T. Hofmann in the context of text document retrieval, where words constitute the elementary parts of documents [13]. Applied to images, each image represents a single visual document. pLSA can be applied directly to image tags, as image tags consist of words. However, for our visual features we need comparable elementary parts called visual words. For the moment we assume that all features we computed in a given mode are somehow mapped to words in that mode. Details of the mapping from the visual features to the mode-specific words are given in Subsection 2.3. For now we just assume that we have words. The key concept of the pLSA model is to map the high- dimensional word distribution vector of a document to a lower dimensional topic vector (also called aspect vector). Therefore pLSA introduces a latent, i.e. unobservable, topic layer between the documents (i.e., images here) and the words. It is assumed that each document consists of a mix- ture of multiple topics and that the occurrences of words (i.e., visual words in images or tags of images, respectively) is a result of the topic mixture. This generative model is expressed by the following probabilistic model: P (di, wj) = P (di) X K P (zk|di)P (wj |zk) (1) where P (di) denotes the probability of a document di of the database to be picked, P (zk|di) the probability of a topic zk given the current document, and P (wj |zk) the probability of a visual word wj given a topic. The model is graphically depicted in Fig. 1. Ni denotes the number of words of which each of the M documents consists. It is important not to confuse Ni, the number of words in document di, with N , the number of words in the vocabulary. Once a topic mixture P (zk|di) is derived for each docu- ment di, a high-level representation based on the respective mode the words belong to has been found. At the same time this representation is of low dimensionality as we commonly choose the number of concepts in our model to be much smaller than the number of words. The K-dimensional topic vector can be used directly in a query-by-example retrieval d M z w Ni Figure 1: Standard pLSA-Model. 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 0 0 1 41 0 0 3 5 0 12 0 18 document vector term vector images d1 dN v1 vW visual words d i v j di vjn( ), Figure 2: Term-document matrix. task, if we measure document similarity by computing the L1 or cosine distance between topic vectors of different doc- uments. 2.2 Training and Inference Computing the term-document matrix of the training cor- pus is a prerequisite for deriving the pLSAmodel (see Fig. 2). Each row i in the term-document matrix [n(di, wj)]i,j de- scribes the absolute count with which word wj (also called a term) occurs in document di. The terms are taken from a predefined dictionary consisting of N terms. The number of documents is M . Thus n(di, wj) specifies the number of times the word wj occurred in document di. Note that by normalizing each document vector to 1 using the L1-norm, the document vector (n(di, w1), ..., n(di, wN )) of di becomes the estimated mass probability distribution P (wj |di). We learn the unobservable probability distribution P (zk|di) and P (wj |zk) from the data using the Expectation-Maximi- zation-Algorithm (EM-Algorithm) [7, 13]: E-Step: P (zk|di, wj) = P (wj |zk)P (zk|di)PK l=1 P (wj |zl)P (zl|di) (2) M-Step: P (wj |zk) = PM i=1 n(di, wj)P (zk|di, wj)PN j=1 PM i=1 n(di, wj)P (zk|di, wj) (3) P (zk|di) = PN j=1 n(di, wj)P (zk|di, wj) n(di) (4) Given a new test image dtest, we estimate the topic proba- bilities from the observed words. The sole difference between inference and learning is that theK learned conditional word distributions P (wj |zk) are never updated during inference. Thus, only eq. (2) and eq. (4) are iteratively updated during inference. 2.3 Visual pLSA-Model The first step in building a bag-of-words representation for the visual content of images is to extract visual features from each image. In our case we apply dense sampling with a vertical and horizontal step size of 10 pixels across the image pyramid created with a scale factor of 1.2 to extract local image features at regular grid keypoints. SIFT descriptors [19] computed over a region of 41 × 41 pixels are used to describe the grayscale image region around each keypoint in a scale and orientation invariant fashion. Although we use SIFT features in this work, any other feature could be used instead. Next the 128-dimensional real-valued local image features have to be quantized into discrete visual words to derive a finite vocabulary. Quantization of the features into visual words is performed by using a vocabulary tree [21] in order to support large vocabulary sizes. The vocabulary tree is computed by repeated k-means clusterings that hierarchi- cally partition the feature space. This hierarchical approach overcomes two major problems of the traditional direct k- means clustering in cases where k is large. Firstly clustering is more efficient during visual word learning and secondly the mapping of visual features to discrete words is way faster than using a plain list of visual words. Once the visual vocabulary of size Nv is determined we map each feature vector of an image to its closest visual word. Therefore we query the vocabulary tree for each ex- tracted feature, and the best matching visual word ID is returned. This ID in turn is used to compute the document vector that holds the counts of the occurrences of visual words in the corresponding image by incrementing the asso- ciated word count (see Fig. 3). Note that this image content description does not preserve any spatial relationship between the occurrences of the vi- sual words. The co-occurrence vectors of all training images are used to train the pLSA model. Once the pLSA model is learned it can be applied to all images in the database, hence deriving a vector representation for each image, where the vector elements denote the degree to which an image depicts a certain visual topic. 2.4 Tag-based pLSA-Model The second modality we considered are tags – the free-text annotations provided by the image author. We assume that all of the images in our database have been tagged by their authors. Besides a single word, a tag can also be a phrase or a sentence. However, in this work we treat each word of the image annotations separately. Thus, in the following the term tag denotes a single word and is used interchangeably with ”word” and ”term”. As we use Flickr images to evaluate our multilevel multi- modal pLSA model, it is important to note that these tags reflect the photographer/author’s personal view with respect to the uploaded image. Thus, in contrast to carefully anno- tated image databases traditionally used for learning com- bined image and tag models [1], these image tags from Flickr are in many cases subjective, ambiguous, and do not neces- sarily describe the image content shown [17, 18]. This makes it difficult to use the tags directly for retrieval purposes and thus some preprocessing is required. breakfast eat food meal house construction home object structure love emotion state Table 1: Examples for hypernyms (right) found in Wordnet for the words left. document vector query for hypernyms 0 1 0 0 0 1 0 1, , , , , , , count++ Wordnet hypernyms tag specified by photographer count++ count++ Figure 4: Building a document vector from tags and their hypernyms by assuming that both tag and hy- pernyms are present in the vocabulary. To apply a pLSA model to tags we need to define a finite vocabulary first. Building the vocabulary starts with listing all tags that have been used more than N thresh1 times and by at least N thresh2 different users. This heuristics enforces that all rarely used tags are neglected. Note that a tag is also rarely used, if only a few users have used it independent of the actually count. We further filter the list by discarding all tags that contain numbers and by splitting tags at under- scores into separate words. In a last filtering step all words within the vocabulary are checked whether they are known by Wordnet [8]. Wordnet is a lexical database of English. Only words that exist according to Wordnet are assimilated into the final vocabulary. Once the tag vocabulary of sizeN t is defined, a co-occurrence table is built by counting the tag occurrences for each image. However, before forming the document vector we expand the list of tags associated with an image by using Wordnet to enrich the annotations. For each image we query Wordnet for the semantic parents of the tags specified by the author. This is done to emphasize semantic features rather than just the simple word count features. Semantic parents for a word can easily be extracted from Wordnet, as each word in Wordnet is associated with hyper- nyms1. A word’s hypernyms denote its parents that express the specific concept of the tag more generally. Table 1 shows hypernyms (right) for some example tags (left). As these hy- pernyms build a hierarchy and form a tree structure, we add the hypernyms up to three levels above in the hierarchy of the tag itself into the tag list of the corresponding image. Thus, while counting tag occurrences for each image in or- der to build the document vector, these parents are included in our model by counting them as if they were present in the list of tags. This procedure is visualized in Fig. 4. In case the vocabulary does not contain a tag or hypernyms used for annotation, the word is simply ignored. In our experiments we set the parameters for the tag vo- cabulary to N thresh1 = 18 and N thresh 2 = 10 resulting in a vocabulary size of 2421 words. Most images in our database have between 5 and 15 tags plus hypernyms as can be de- rived from Fig. 5. The number of tags for some images is however unreasonably large as users labeled images with whole sentences or phrases. 1Y is a hypernym of X if every X is a (kind of) Y . Vocabulary Tree 0 1 2 3 4 5 6 7 visual word SIFT feature 0 1 2 3 4 5 6 7 0 0 0 0 0 1 0 0, , , , , , ,document vector: 5 count++ assign visual word Figure 3: Quantization of features into discrete visual words. Figure 5: Histogram of the total number of tags plus hypernyms for each image within our database. 3. MULTILAYER MULTIMODAL PLSA 3.1 Motivation and Model In recent years pLSA has been applied successfully to unimodal data such as text [13], image tags [20], or visual words [16]. However, combining two modes such as visual words and image tags is challenging. The obvious approach of simply concatenating the two associated term-document matrices NM×Nv and NM×Nt into NM×(Nv+Nt) and then applying standard pLSA usually does not lead to the desired retrieval improvements. One reason is the difference in the order of magnitude with which words occur in the respective mode. For instance, a few thousand to ten thousand features per image are usually computed from images that are resized to having roughly the same number of dense samples while preserving the image’s aspect ratio. In contrast, most im- ages are annotated with fewer than 20 tags. Compensating between the differences in the order of the magnitude by some kind of normalization is possible, but will require a lot of testing to determine an appropriate weighting factor between the different modes since the actual importance of each mode must also be taken into account. Another reason may be the difference in the size of the respective vocabu- laries. In contrast, a well-founded mathematical approach with top-level topics will solve this issue effectively and ef- ficiently. Some empirical evidence for these claims will be given in section 4. Our basic idea is to apply pLSA in a first step to each mode separately, and in a second step concatenate the de- rived topic vectors of each mode to learn another pLSA on top of that (see Fig. 8). While we describe this layering of multiple pLSAs only for two leaf-pLSAs and a node pLSA, it is obvious that the proposed pLSA layering can be extended to more than two layers and applied to more than just two leaf-pLSAs. The smallest possible multilayer multimodal pLSA model (mm-pLSA) considering two modes with their respective ob- servable word occurrences and hidden topics as well as a sin- gle top-level of hidden aspects is graphically depict in Fig. 6. Every word of mode x (here: x ∈ {v, t} with v standing for visual and t for text) occurring in document di is generated by an unobservable model document: • Pick a document di with prior probability P (di) • For each visual word in the document: – Select a latent top-level concept ztopl with proba- bility P (ztopl |di) – Select a visual topic zvk with probability P (z v k |z top l ) – Generate a visual word wvm with probability P (w v m|z v k) • For each tag associated with the document: – Select a latent top-level concept ztopl with proba- bility P (ztopl |di) – Select an image tag topic ztp with probabilityP (z t p|z top l ) – Generate an image tag wtn with probability P (w t n|z t p) Thus the probability of observing a visual word wvm or a tag wtn in document di is P (di, w v m) = LX l=1 KX k=1 P (di)P (z top l |di)P

                    本文档为【Multilayer pLSA】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

Multilayer pLSA

你可能还喜欢