by G Csurka · Cited by 6224 — vector, and thus determine which category or categories to assign to the image. Ideally these steps are designed to maximize classification accuracy while minimiz

91 KB – 16 Pages

PAGE – 1 ============
Visual Categorization wi th Bags of Keypoints Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray Xerox Research Centre Europe 6, chemin de Maupertuis 38240 Meylan, France {gcsurka,cdance} Abstract. We present a novel method for generic visual catego- rization: the problem of identifying the object content of natural images while generalizing across variations inherent to the ob- ject class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We propose and compare two alternative implementations using different classifiers: Naïve Bayes and SVM. The main advan- tages of the method are that it is simple, computationally effi- cient and intrinsically invariant. We present results for simulta- neously classifying seven semantic visual categories. These re- sults clearly demonstrate that the method is robust to back- ground clutter and produces good categorization accuracy even without exploiting geometric information. 1. Introduction The proliferation of digital imaging sensors in mobile phones and consumer-level cameras is producing a growing number of large digital image collections. To manage such collections it is useful to have access to high-level information about objects contained in the image. Given an appropriate categorization of image contents, one may efficiently search, recommend, react to or reason with new image instances. We are thus confronted with the problem of generic visual categorization . We should like to identify processes that are sufficiently generic to cope with many object types simultaneously and which are readily extended to new object types. At the same time, these processes should handle the variations in view, imaging, lighting and oc- clusion, typical of the real world, as well as the intra-class variations typical of seman- tic classes of everyday objects. The task-dependent and evolving nature of visual categories motivates an example- based machine learning approach. This paper presents a bag of keypoints approach to visual categorization. A bag of keypoints corresponds to a histogram of the number of occurrences of particular image patterns in a given image. The main advantages of the

PAGE – 2 ============
method are its simplicity, its computational efficiency and its invariance to affine transformations, as well as occlusion, lighting and intra-class variations. It is important to understand the distinction of visual categorization from three re- lated problems: Recognition: This concerns the identification of particular object instances. For in- stance, recognition would distinguish between images of two structurally distinct cups, while categorization would place them in the same class. Content Based Image Retrieval: This refers to the process of retrieving images on the basis of low-level image features, given a query image or manually constructed de- scription of these low-level features. Such descriptions frequently have little rela- tion to the semantic content of the image. Detection: This refers to deciding whether or not a member of one visual category is present in a given image. Most previous work on detection has centered on ma- chine learning approaches to detecting f aces, cars or pedestrians [1]-[6] While it would be possible to perform generic categ orization by applying a detector for each class of interest to a given image, this approach becomes inefficient given a large number of classes. In contrast to the technique proposed in this paper, most existing detection techniques require precise manual alignment of the training images and the segregation of these images into different views [5], neither of which is neces- sary in our approach. Our bag of keypoints approach can be motivated by an analogy to learning methods using the bag-of-words representation for text categorization [7]-[10]. The idea of adapting text categorization appr oaches to visual categorization is not new. Zhu et al [11] investigated the vector quantization of small square image windows, which they called keyblocks. They showed that these features produced more fisemantics- orientedfl results than color and texture ba sed approaches, when combined with ana- logues of the well-known vector-, histogram-, and n-gram-models of text retrieval. In contrast to our approach, their keyblocks do not possess any invariance properties. The idea of clustering invariant descriptor s of image patches has previously been applied to the problem of texture classification [12]-[14]. Clearly the problem of texture classification is rather different from that of generic categorization. Therefore it is natural that these approaches differ from ours. While our method uses clustering to obtain quite high-dimensional feature vectors for a classifier, these texture recog- nizers use clustering to obtain relatively low-dimensional histograms and evaluate the similarity of these histograms to previously seen probability densities. In [12]-[13] filter responses are clustered and the recognition is done using the closest model measured by a 2 test. Lazebnik et al [14] cluster affine invariant interest points in each image individually and summarize the distribution of the descriptors in form of a signature composed of representative cluster members and weights proportional to cluster sizes. Signatures of different images are compared using the Earth Mover™s Distance [15]. Recently Fergus et al [16] proposed a visual categorization method based on in- variant descriptors of image patches. Their method exploits a probabilistic model that combines likelihoods for appearance, relative scale and position, as well as a model of

PAGE – 3 ============
the statistics of their patch detector. This elegant approach has a number of limita- tions. Firstly the method is not efficient: even when models are restricted to 6 image patches and training images only contain up to 30 patches, days of CPU time are re- quired to learn several categories. Secondly , views of objects used for training must be segregated, for instance into cars (rear) and cars (side). This is unsurprising given the use of an explicit 2D model of relative positions. In section 2 we explain the categorizatio n algorithms and the choice of their com- ponents. In section 3 we present results from applying of the algorithms to the dataset of Fergus et al and to a more challenging seven class dataset. We demonstrate that our approach is robust to the presence of bac kground clutter and produces state-of-the-art recognition performance. 2. The method The main steps of our method are: • Detection and description of image patches • Assigning patch descriptors to a set of predetermined clusters (a vocabu-lary ) with a vector quantization algorithm • Constructing a bag of keypoints , which counts the number of patches as- signed to each cluster • Applying a multi-class classifier, treating the bag of keypoints as the fea- ture vector, and thus determine which category or categories to assign to the image. Ideally these steps are designed to maxi mize classification accuracy while minimiz- ing computational effort. Thus, the descriptors extracted in the first step should be invariant to variations that are irrelevant to the categorization task (image transforma- tions, lighting variations and occlusions) but rich enough to carry enough information to be discriminative at the category level. The vocabulary used in the second step should be large enough to distinguish relevant changes in image parts, but not so large as to distinguish irrelevant variations such as noise. We refer to the quantized feature vectors (cluster centres) as fikeypointsfl by anal- ogy with fikeywordsfl in text categorization. However, in our case fiwordsfl do not necessarily have a repeatable meaning such as fieyesfl, or ficar wheelsfl, nor is there an obvious best choice of vocabulary. Rather, ou r goal is to use a vocabulary that allows good categorization performance on a given training dataset. Therefore the steps in- volved in training the system allow consideration of multiple possible vocabularies: • Detection and description of image patches for a set of labeled training images • Constructing a set of vocabularies: each is a set of cluster centres, with re- spect to which descriptors are vector quantized. • Extracting bags of keypoints for these vocabularies • Training multi-class classifiers using the bags of keypoints as feature vec- tors

PAGE – 4 ============
• Selecting the vocabulary and classifier giving the best overall classifica- tion accuracy. We now discuss the choices made for each step in more detail. 2.1 Feature extraction In computer vision, local descriptors (i.e. features computed over limited spatial sup- port) have proved well-adapted to matching and recognition tasks, as they are robust to partial visibility and clutter. Such tasks require descriptors that are repeatable. Here, we mean repeatable in the sense that if th ere is a transformation between two instances of an object, corresponding point s are detected and (ideally ) identical descriptor val- ues are obtained around each. This has motivated the development of several scale and affine invariant point detectors, as well as descriptors that are resistant to geomet- ric and illumination variations [17]-[21]. It was shown in [21] that if we have affine transformation between two images a scale invariant point detector is not sufficient to have the stability of the point™s loca- tion. Therefore we preferred to work with the Harris affine detector described in [21]. However, the reader should be aware that th e benefits of this choice are not clear cut: firstly because most real world objects have three-dimensional structures whose varia- tions are not well captured by affine transformations; secondly since attempts to in- crease the invariance of a feature typically result in a loss of discriminative informa- tion. Harris affine points are detected by an iterative process. Firstly, positions and scales of interest points are determined as local maxima (in position) of a scale- adapted Harris function, and as local extrem a in scale of the Laplacian operator. Then an elliptical (i.e. affine) neighborhood is determined. This has a size given by the selected scale and a shape given by the eigenvalues of the image™s second moment matrix. The selection of position/scale and the elliptical neighborhood estimation are then iterated and the point is kept only if the process converges within a fixed number of iterations. The affine region is then mapped to a circular region, so normalizing it for affine transformations. Scale Invariant Feature Tr ansform (SIFT) descriptors [18] are com- puted on that region. SIFT descriptors are multi-image representations of an image neighborhood. They are Gaussian derivatives computed at 8 orientation planes over a 4×4 grid of spatial locations, giving a 128-dimension vector. Fig. 1 shows an example of the maps of gradient magnitude corresponding to the 8 orientations.

PAGE – 5 ============
Fig. 1. (From left to right) A Harris affine region; the normalized region; and the 8 maps of gradient magnitude constituting the SIFT descriptor. We prefer SIFT descriptors to alternatives such as steered Gaussian derivatives or differential invariants of the local jet for the following reasons: 1. They are simple linear Gaussian derivatives. Hence we expect them to be more stable to typical image perturbations such as noise than higher Gaussian deriva- tives or differential invariants. 2. The use of a simple Euclidean metric in the feature space seems justified. In the case of differential invariants obtained by a combination of the components of the local jet, the use of a Mahalanobis distance is more appropriate. For instance, one would expect a second derivative feature to have a higher variance than a first de- rivative. Selecting an appropriate Mahalanobis distance a priori seems challeng- ing. It would not be appropriate to use the covariance matrix of SIFT descriptors over the entire dataset, since this is predominantly influenced by inter-class varia- tions (or more precisely, by variations between keypoints that we do not wish to ignore). Measuring a Mahalanobis distance would probably requiring manual specification of multiple homologous matching points between different images of objects of the same category, seriously working against our objective of pro- ducing a simple and automated categorization system. 3. There are far more components to these feature vectors (128 rather than 12 to 16). Hence we have a richer and potentially more discriminative representation. Recently Mikolajczyk et al [22] have compared several descriptors for matching and found that SIFT descriptors perform best. 2.3 Visual vocabulary construction In our method, the vocabulary is a way of constructing a feature vector for classifi- cation that relates finewfl descriptors in query images to descriptors previously seen in training. One extreme of this approach woul d be to compare each query descriptor to all training descriptors: this seems impractical given the huge number of training de- scriptors involved (over 600 000 for our data set). Another extreme would be to try to identify a small number of large ficlustersfl that are good at discriminating a given class: for instance [16] operates with 6 parts per category. In practice we find that the best tradeoffs of accuracy and computationa l efficiency are obtained for intermediate sizes of clustering. Most clustering or vector quantization algorithms are based on iterative square- error partitioning or on hierarchical techniques. Square-error partitioning algorithms attempt to obtain the partition which minimizes the within-cluster scatter or maximizes the between-cluster scatter. Hierarchical techniques organize data in a nested sequence of groups which can be displayed in the form of a dendrogram or a tree. They need

PAGE – 6 ============
some heuristics to form clusters and hence are less frequently used than square-error partitioning techniques in pattern recognition. We chose to use the simplest square-error partitioning method: k-means [23]. This algorithm proceeds by iterated assignments of points to their closest cluster centers and recomputation of the cluster centers. Two difficulties are that the k-means algo- rithm converges only to local optima of the squared distortion, and that it does not determine the parameter k. There exist methods allowing to automatically estimating the number of clusters. For example, Pelleg et al [24] use cluster splitting to do it, where the splitting decision is done by computing the Bayesian Information Criterion. However, in the present case we do not really know anything about the density or the compactness of our clusters. Moreover, we are not even interested in a ficorrect clus- teringfl in the sense of feature distributions, but rather in accurate categorization. We therefore run k-means several times with different number of desired representative vectors ( k) and different sets of initial cluster centers. We select the final clustering giving the lowest empirical risk in categorization [25]. 2.3 Categorization Once descriptors have been assigned to clusters to form feature vectors, we reduce the problem of generic visual categorization to that of multi-class supervised learning, with as many classes as defined visual categories. The categorizer performs two sepa- rate steps in order to predict the classes of unlabeled images: training and testing. During training, labeled data is sent to the classifier and used to adapt a statistical decision procedure for distinguishing categories. Among many available classifiers, we compared the Naïve Bayes classifier for its simplicity and its speed, and the Sup- port Vector Machine since is it often known to produce state-of-the-art results in high- dimensional problems. 2.3.2 Categorization by Naïve Bayes Naïve Bayes [26] is a simple classifier used often in text categorization. It can be viewed as the maximum a posteriori probability classifier for a generative model in which: 1) a document category is selected according to class prior probabilities; 2) each word in the document is chosen independently from a multinomial distribution over words specific to that class. While independence is a naïve assumption, the accu- racy of the Naïve Bayes classification is typically high [27]. Now, considering visual categ orization, assume we have a set of labeled images {}iI=I and a vocabulary {}iv=V of representative keypoints (i.e. cluster centers). Each descriptor extracted from an image is labeled with the keypoint to which it lies closest in feature space. We count the number N(t,i) of times keypoint iv occurs in image iI. To categorize a new image, we apply Bayes™s rule and take the largest a posteriori score as the prediction:

PAGE – 8 ============
to the separating hyperplane. In our case, the input features ix are the binned histo- grams formed by the number of occurrences of each keypoint iv from the vocabulary V in the image Ii . In order to apply the SVM to multi-class problems we take the one-against-all ap- proach. Given an m-class problem, we train m SVM™s, each distinguishes images from some category i from images from all the other m-1 categories j not equal to i. Given a query image, we assign it to the class with the largest SVM output. 3 Experiments We present results from three experiments. In the first we explore the impact of the number of clusters on classifier accuracy and evaluate the performance of the Naïve Bayes classifier. We then explore the performance of the SVM on the same problem. The first two experiments were conducted on an in-house seven class dataset. In the last experiment we describe results on the four class dataset employed in [16]. Our in-house database contains 1776 images in seven classes 1: faces, buildings, trees, cars, phones, bikes and books. Fig. 2 shows some examples from this dataset. Fig. 2. Example images from our in-house database It is a challenging dataset, not only because of the large number of classes, but also because it contains images with highly variable poses and significant amounts of 1 We hope to make this dataset publicly available soon. It contains 792 f aces, 150 buildings, 150 trees, 201 cars, 216 phones, 125 bikes and 142 books.

PAGE – 9 ============
background clutter, sometimes presence of objects from multiple classes although a large proportion of each image™s area is occupied by the target category. The images have resolutions between 0.3 and 2.1 megapixels and were acquired with a diverse set of cameras. The images are color but only the luminance component is used in our method. These were gathered by XRCE and Graz University, excepting faces which were downloaded from the Web. We used three performance measures to evaluate our multi-class classifiers. Œ The confusion matrix: (){}jkjkCiIh:CI==ijM (6) where {}cN1,,ji,, jC is the set of test images from category j and ()kIhis the category which obtained the highest classifier output for image kI. Œ The overall error rate: ==−=ccN1jjN1jjCC1jjMR (7) Œ The mean ranks. These are the mean position of the correct labels when labels output by the multi-class classifier are sorted by the classifier score. Each performance metric was evaluated with 10-fold cross validation. 3.2 Naïve Bayes Results In Fig. 3 we present the overall error rates using Naïve Bayes as a function of the number of clusters k. Each point in Fig. 3 is the best of 10 random trials of k-means. The standard-error on the maximum is in the range [1, 3]%. 2 The error rate only im- proves slightly as we move from k = 1000 to k = 2500. We therefore assert that k = 1000 present a good trade-off between accuracy and speed 3. 2 That is, if the experiment of taking the maximum of 10 is repeated multiple times with differ- ent random clusterings, the standard deviations of the result for the different classes are in the range [1,3]%. 3 It takes about one minute to get predicted labels on the whole database using the Naive Bayes classifier with k = 1000 on a 2 GHz processor.

PAGE – 10 ============
Fig. 3. The lowest overall error rate (percentage) found for different choices of k Table 1 shows the performance as a function of category obtained with this k. Table 1. Confusion matrix and the mean rank for the best vocabulary ( k=1000). In the clustering process, there is a risk of bias since images from different catego- ries contain different numbers of interest points. We therefore used random samples of the training data in the clustering step, each sample containing 5000 interest points randomly chosen from each class (there are around 640 000 interest points extracted from 1776 training images). Fig. 4 illustrates example clusters obtained for the selected fibestfl vocabulary. True classes facesbuildings trees cars phones bikes books faces 76 4 2 3 4 4 13 buildings 2 44 5 0 5 1 3 trees 3 2 80 0 0 5 0 cars 4 1 0 75 3 1 4 phones 9 15 1 16 70 14 11 bikes 2 15 12 0 8 73 0 books 4 19 0 6 7 2 69 Mean ranks 1.49 1.88 1.33 1.33 1.63 1.57 1.57

PAGE – 11 ============
Fig. 4. Left all patches detected for this image. Right patches from two selected clus- ters occurring in this image (yellow and magenta ellipses). The algorithm handles easily multiple objects in the same images (see Fig. 5), oc- clusion, partial view, any orientation (Fig. 6). Fig. 5. Images correctly classified containing multiple objects of the same category. Fig. 6. Profile face, partial view of a car, roof of a house correctly classified as f ace, cars, building. In Fig. 7 we show examples where images were well classified even if most of the detected interest points were on the fibackgroundfl.

91 KB – 16 Pages