
Object Categorization by Learned Universal Visual Dictionary

J. Winn, A. Criminisi and T. Minka

Microsoft Research, Cambridge, UK

http://research.microsoft.com/vision/cambridge/recognition/

Figure 1: Exemplar snapshots of our interactive object categorization demo application. A user selects (sloppily) a region of interest and our algorithm associates an object class label with it. Despite large differences in pose, size, illumination and visual appearance, the correct class label (e.g. cow, building, car, ...) is automatically associated with each selected object instance. Some of these test images were downloaded from the web and none were part of the training set. A video of the interactive demo may be found at the above web site.

Abstract

This paper presents a new algorithm for the automatic recognition of object classes from images (categorization). Compact and yet discriminative appearance-based object class models are automatically learned from a set of training images.

The method is simple and extremely fast, making it suitable for many applications such as semantic image retrieval, web search, and interactive image editing. It classifies a region according to the proportions of different visual words (clusters in feature space). The specific visual words and the typical proportions in each object are learned from a segmented training set. The main contribution of this paper is two-fold: i) an optimally compact visual dictionary is learned by pairwise merging of visual words from an initially large dictionary. The final visual words are described by GMMs. ii) A novel statistical measure of discrimination is proposed which is optimized by each merge operation.

High classification accuracy is demonstrated for nine object classes on photographs of real objects viewed under general lighting conditions, poses and viewpoints. The set of test images used for validation comprises: i) photographs acquired by us, ii) images from the web and iii) images from the recently released Pascal dataset. The proposed algorithm performs well on both texture-rich objects (e.g. grass, sky, trees) and structure-rich ones (e.g. cars, bikes, planes).

1. Introduction

This paper studies the problem of constructing compact and discriminative models of object classes and presents a novel algorithm for the automatic recognition of objects from images. An example is shown in fig. 1, where the objects in the manually selected test regions (marked as rectangles) have correctly been recognized by the proposed algorithm as instances of the classes cow, aeroplane, car, face, etc.

Object categorization is difficult because differing pose, scale, illumination and intrinsic visual differences produce highly different images for objects of the same class. For example, fig. 1 shows deformable objects (sitting/standing cows) and extreme partial occlusions (in the car and bike images). Existing shape-based modeling techniques are not designed to deal with these large variations. Thus, we have built our algorithm upon appearance-based models drawn from the material classification literature. Specifically, we have borrowed the texton-based models developed in the context of texture recognition [9, 16] and extended by [1].

The challenge in object categorization is to find class models that are invariant enough to incorporate naturally-occurring intra-class variations and yet discriminative enough to distinguish between different classes. In this paper we propose a supervised learning algorithm which automatically finds such models. Additionally, we require the learned models to be compact and light-weight so as to enable efficient classification.

The learned models specify the typical proportions of textons in each class, regardless of spatial layout. To our surprise, we have found that the learned models perform extremely well both with shape-free objects (sky, grass and trees) and with highly structured object classes (faces, cars, aeroplanes and bikes).


2. Previous work

Object class recognition is a well-studied vision problem, with approaches ranging from voting with independent patches to full models of spatial layout and deformation. For example, constellation models [3, 4, 5], fragment-based models [14] and pictorial structures [8] try to locate distinctive object parts and determine constraints on their spatial arrangement. While these approaches are potentially very powerful, the spatial models which are typically used cannot handle significant deformations such as large out-of-plane rotations. They also do not consider objects with variable numbers of parts, such as buildings and trees. Our approach can be viewed as a simplified parts model in which the parts can be arbitrarily rearranged but tend to occur in particular proportions, such as leaves on trees or windows on buildings. This approach runs the risk of not being able to discriminate shapes, but surprisingly it seems sufficient to recognize a wide range of object classes without explicit shape modeling.

A similar image labeling task was considered in [2] and [10]. These systems used machine learning techniques to classify regions found by automatic segmentation. However, such segmentations often do not correlate with semantic objects; for example, an object in shadow may be divided into a shadowed and a non-shadowed part. Our solution to this problem is to: i) take the region as input from the user, or ii) test a variety of regions and pick the one that is most likely from the point of view of the classifier, as opposed to a separate segmentation algorithm.

The approach proposed here can be considered an extension of the method of [1]. In that work, images were described by histograms over a dictionary (of selected size) of visual words¹. The visual words were chosen by K-means clustering, and features were computed only on a sparse set of interest points. We build upon [1] by automatically learning the optimal visual words and dictionary size. Unlike [1], our approach is dense, i.e. we process every pixel, avoiding early removal of potentially useful regions, such as texture-less blue/grey regions which can be distinctive of sky.

3. Training set and visual features

The training image set. Our class models are learned from a set of 240 manually segmented and annotated photographs (fig. 2). Those photographs depict different objects in completely general positions, lighting conditions and viewpoints. The objects belong to nine classes: building, grass, tree, cow, sky, aeroplane, face, car and bicycle.

The training images were manually segmented (quickly and sloppily) into object-defined regions by means of a "paint"-like interface (fig. 2d), with the assigned colours acting as indices into the list of object classes.

¹ In the literature the terms "textons", "keypoints" or "visual words" have been used with approximately the same meaning, i.e. clusters of filter responses/feature vectors in a high-dimensional space.

Figure 2: The labeled training set. (Columns a-c) A selection of images from the 240-image training set (image size is 320×213). Notice the large within-class variability. (Column d) Ground-truth annotation for column c. Labeling has been achieved for all training images by a simple, interactive "paint" interface. Same colours correspond to same object class.

The face images were downloaded from the Caltech dataset² while the other photographs were taken by us. The entire annotated database is available on our web site.

Textons and texton histograms. Each training image is convolved with a filter-bank to generate a set of filter responses [9, 16]. These filter responses are aggregated over all the images in the entire training set (independently of class labels) and clustered using a K-means approach. Mahalanobis distance between features is used during clustering. Then, the set of estimated cluster centres (textons/visual words) and their associated covariances defines a universal visual dictionary (UVD). In this initial step large values of K are employed (in the order of thousands), but the next sections will show how to reduce the size of the UVD without loss of class discrimination. Given a UVD, any image can be filtered and each pixel associated with the closest texton in the dictionary, thus generating a map of indices into the UVD. At this point, normalized histograms of textons can readily be computed on a region or image basis.

² http://www.vision.caltech.edu/html-files/archive.html
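To make this pipeline concrete, the following is a minimal sketch of the dictionary-building and histogram steps, assuming per-pixel feature vectors have already been computed (one P-dimensional row per pixel). Whitening plus Euclidean K-means stands in for the Mahalanobis-distance clustering described above, and the helper names (`build_dictionary`, `texton_histogram`) are ours, not the paper's:

```python
import numpy as np

def build_dictionary(features, k, n_iter=20, seed=0):
    """Cluster pooled per-pixel feature vectors (rows of `features`,
    collected over all training images) into K textons with plain
    Lloyd's K-means. Whitening makes Euclidean distance approximate
    a Mahalanobis distance under a shared diagonal covariance."""
    rng = np.random.default_rng(seed)
    mean, std = features.mean(0), features.std(0) + 1e-8
    white = (features - mean) / std
    centres = white[rng.choice(len(white), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every feature vector to its nearest texton centre.
        d = ((white[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = white[labels == j].mean(0)
    return centres, mean, std

def texton_histogram(pixel_features, centres, mean, std):
    """Map each pixel of a region to its closest texton and return the
    normalized histogram of texton indices (the region descriptor h)."""
    white = (pixel_features - mean) / std
    idx = ((white[:, None, :] - centres[None, :, :]) ** 2).sum(-1).argmin(1)
    h = np.bincount(idx, minlength=len(centres)).astype(float)
    return h / h.sum()
```

In practice the distance matrix would be computed in chunks, since K is in the thousands and every pixel of every image is processed.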


Filter-banks. In this paper we have tested a number of different filter-banks made of combinations of Gaussians, first and second order derivatives of Gaussians, and Gabor kernels. Many filter-banks produced comparable results, with the best one made of 3 Gaussians, 4 Laplacians of Gaussian (LoG) and 4 first order derivatives of Gaussians. The three Gaussian kernels (with σ = 1, 2, 4) are applied to each CIE L, a, b channel [7], thus producing 9 filter responses. The four LoGs (with σ = 1, 2, 4, 8) were applied to the L channel only, thus producing 4 filter responses. The four derivatives of Gaussians were divided into two x- and y-aligned sets, each with two different values of σ (σ = 2, 4). Derivatives of Gaussians were also applied to the L channel only, thus producing the 4 final filter responses. Therefore, each pixel in each image has an associated 17-dimensional feature vector. Note that first order derivatives of Gaussian kernels are not rotationally invariant. However, rather than deciding a priori whether to remove rotational dependency or not, we let our supervised learning algorithm decide for us. In addition to this filter-bank, we also investigated the performance of raw 5×5 colour patches (5×5×3 = 75-dimensional feature vectors). In our experiments, using colour and intensity alone (only the 9 Gaussian filter responses) performed poorly.
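As a sketch of this 17-dimensional filter-bank, the following uses scipy's Gaussian filters; boundary handling and any response normalization are not specified in the text, so scipy defaults are assumed:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def filter_bank_responses(L, a, b):
    """Return an (H, W, 17) array of per-pixel filter responses for one
    image, given its CIE L, a, b channels as 2-D float arrays."""
    responses = []
    # 3 Gaussians (sigma = 1, 2, 4) applied to each of L, a, b -> 9.
    for sigma in (1, 2, 4):
        for chan in (L, a, b):
            responses.append(gaussian_filter(chan, sigma))
    # 4 Laplacians of Gaussian (sigma = 1, 2, 4, 8) on L only -> 4.
    for sigma in (1, 2, 4, 8):
        responses.append(gaussian_laplace(L, sigma))
    # First derivatives of Gaussian (sigma = 2, 4), x- and y-aligned,
    # on L only -> 4. order=(0, 1) differentiates along columns (x).
    for sigma in (2, 4):
        responses.append(gaussian_filter(L, sigma, order=(0, 1)))
        responses.append(gaussian_filter(L, sigma, order=(1, 0)))
    return np.stack(responses, axis=-1)
```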

4. Modeling object classes

This section describes the main contribution of this paper: a statistical algorithm for learning a compact and yet discriminative representation of object classes.

4.1. Objects as texture conglomerates

In texture classification [9, 16], classes are modeled by histograms of textons (visual words), the assumption being that similar distributions of textons (from a unique dictionary) correspond to similar textures. In this paper we represent objects as conglomerates of different texture regions and thus apply the same histogram-of-textons modeling technique. Note that we never need to explicitly recognize each component texture (each "part"), only the overall distribution of the "words" from the dictionary.

Interestingly, the size and nature of the dictionary affect the class models and thus the discrimination power. In [16] it was noticed that there is an optimal dictionary size K for which classification accuracy is maximal; both larger and smaller visual dictionaries perform less well.

Unlike previous techniques, which manually fix the dictionary size and then estimate the textons by unsupervised clustering, here we infer both the best visual words and the dictionary size from the training data in a supervised fashion. In fact, we propose a new statistical generative technique that, by merging textons from a large initial dictionary, estimates a new, considerably smaller target dictionary without loss of class discriminability. The two driving forces of our supervised clustering technique are high class discriminability and compactness of the dictionary. Not only do we estimate the appropriately small size of the UVD, but we also make sure that we maintain high classification accuracy by only merging visual words which do not need to be kept separate.

The next section will describe inference of the optimal UVD and object class models.

4.2. Learning an optimal visual dictionary and modeling compact object classes

Each image in the training database is convolved with each of the P filters in the selected filter-bank. Then, each pixel position is associated with a P-dimensional feature vector p. Note that here all available image data is processed, rather than only some specified interest locations.

The whole set of feature vectors is then clustered using K-means with a large value of K (in the order of thousands).

The set of resulting P-dimensional clusters (textons) and the associated covariances constitutes the initial dictionary F. The goal here is to "manipulate" F and come up with a new discriminative dictionary T with size T ≪ K.

We are given a set of N annotated training regions, with ground-truth class labels c ∈ {1···C}. Each training region has an associated texton distribution h (a histogram over the initial dictionary F) and a corresponding histogram of "target" textons H. All histograms are normalized to sum to one. The aim is to find the best mapping H = φ(h). The strategy used here is to define φ as a pairwise merging operation acting on textons. The intuition is that by merging textons which do not help distinguish between classes, one can produce a much more compact and yet discriminative visual dictionary T.
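A merge mapping φ of this kind can be represented by an assignment vector sending each of the K initial bins to one of the T merged bins; applying it just accumulates histogram mass, so a normalized input stays normalized. A minimal sketch (the helper name `apply_mapping` is ours):

```python
import numpy as np

def apply_mapping(h, assignment):
    """Apply a bin-merging mapping phi: H = phi(h).
    `assignment[i]` is the merged-dictionary bin (0..T-1) that initial
    texton i belongs to; merged bins simply sum the h entries."""
    H = np.zeros(assignment.max() + 1)
    np.add.at(H, assignment, h)  # scatter-add h into merged bins
    return H
```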

4.2.1 The generative model for texton histograms

We wish our model to prefer that histograms over the final dictionary are similar for regions of the same class (reducing intra-class variation). Hence, we model the set of histograms for each class using a Gaussian distribution with mean H̄_c and a diagonal covariance whose diagonal entries form the vector β_c. Thus, a key assumption is that whenever an object of class c appears in an image, the corresponding region histogram H is close to the mean class histogram H̄_c, in terms of a Mahalanobis distance given by β_c.

We define this relationship probabilistically, for a class with parameters θ = (H̄, β), by

$$P(H \mid \theta) = \prod_{i=1}^{T} \mathcal{N}\!\left( H_i^{1/2} \;\middle|\; \bar{H}_i^{1/2},\; \beta_i^{-1} \right) \qquad (1)$$

where H_i denotes the i-th bin of a histogram. Note that the value in each bin is raised to the power of one half. This is the variance stabilizing transformation [6] of a multinomial (or equivalently a Poisson distribution), which has the effect of


making the variance constant, rather than linearly dependent on the mean H̄. Hence, it makes the assumption of a Gaussian model with constant variance more accurate for this multinomial data.
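For concreteness, eq. (1) can be evaluated directly as a sum of per-bin Gaussian log-densities over square-rooted bins. A sketch (the helper name is ours; the learning algorithm below never evaluates this directly, since the parameters are integrated out analytically):

```python
import numpy as np

def log_p_hist(H, H_bar, beta):
    """Log of eq. (1): independent Gaussians on sqrt-transformed bins.
    H, H_bar are normalized histograms over the final dictionary;
    beta holds the per-bin precisions (inverse variances)."""
    x, m = np.sqrt(H), np.sqrt(H_bar)
    return float(np.sum(0.5 * np.log(beta / (2.0 * np.pi))
                        - 0.5 * beta * (x - m) ** 2))
```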

Applying Bayesian methodology, we define a common prior over the parameters θ for each class:

$$P(\theta) = \prod_{i=1}^{T} \mathcal{N}\!\left( \bar{H}_i^{1/2} \;\middle|\; \mu,\; (\lambda \beta_i)^{-1} \right) \mathcal{G}(\beta_i \mid a, b) \qquad (2)$$

where the hyper-parameters are fixed to {μ = 0, λ = 0.1, a = 0.01, b = 0.01}, with G denoting the gamma distribution.

Each training image region is assumed to contain a single instance of an object class, and so c̃ = [c̃_1 ··· c̃_N] is the vector of training labels associated with all of the N training regions. A particular mapping φ defines new texton histograms H_1, ···, H_N. From (1) and (2), the distribution over these histograms conditioned on the ground truth labels is

$$P(\{H_n\} \mid \tilde{c}) = \prod_{c=1}^{C} \int \prod_{n \in R_c} P(H_n \mid \theta_c) \, P(\theta_c) \, d\theta_c \qquad (3)$$

where R_c is the set of regions with object label c and we have marginalised out the class parameters θ_c.

4.2.2 Compactness vs. discrimination trade-off

If we attempted to find the mapping that maximises the probability of the histograms (3), we would end up merging all the bins together into a single bin, since all histograms for any class would then be identical. Of course, we would then be completely unable to discriminate between classes.

Instead, we wish to set up a trade-off between making the histograms more similar within each class and making them more discriminative between classes. Consider finding the conditional probability of the class labels; using Bayes' rule,

$$P(\tilde{c} \mid \{H_n \equiv \phi(h_n)\}) = \frac{P(\{H_n\} \mid \tilde{c}) \, P(\tilde{c})}{\sum_{c'} P(\{H_n\} \mid c') \, P(c')} \qquad (4)$$

where the sum in the denominator is over all of the C^N possible object labelings and P(c) is the prior over labelings, which we set to be uniform.

We now aim to find the mapping φ which maximizes this conditional probability (4). The term in the denominator acts to penalise mappings which reduce discriminability (i.e. which make the observed data likely under class labelings other than the true one). The numerator still favours mappings which lead to small intra-class variances, ensuring that the texton histograms are similar for regions of the same object class. This double pressure enables learning of the correct level of intra-class compactness in relation to inter-class discrimination power, and represents the main contribution of this paper. As we are compressing the histogram whilst preserving meaningful information about the class labels, our approach can be considered an application of the information bottleneck method [13] to the problem of object categorization.

We consider a mapping φ which merges bins rather than dropping them. However, if some bins are uninformative about the class label then they will be merged together and large variances will be learned for the merged bin.

4.2.3 Learning the mapping φ

The goal of our learning algorithm is to find the mapping φ which maximises the conditional probability of the ground truth labels, given the texton histograms of all training regions. To achieve this we first need to compute (3), which can be re-written as

$$P(\{H_n\} \mid \tilde{c}) = \prod_{c=1}^{C} \prod_{i=1}^{T} \int \prod_{n \in R_c} P(H_{ni} \mid \theta_{ci}) \, P(\theta_{ci}) \, d\theta_{ci} = \prod_{c=1}^{C} \prod_{i=1}^{T} E_{ci} \qquad (5)$$

where θ_ci = (H̄_i, β_i) and E_ci is an evidence term for a particular class c and histogram bin i. The integral for E_ci can be found analytically to be

$$E_{ci} = (2\pi)^{-|R|/2} \left( \frac{\lambda}{\lambda'} \right)^{1/2} \frac{b^{a}}{b'^{\,a'}} \, \frac{\Gamma(a')}{\Gamma(a)} \qquad (6)$$

where $\lambda' = \lambda + |R|$, $\mu' = \left( \mu\lambda + \sum_{n \in R} H_n^{1/2} \right) / \lambda'$, $a' = a + |R|/2$ and $b' = b + \left( \lambda\mu^2 - \lambda'\mu'^2 + \sum_{n \in R} H_n \right) / 2$.

It follows that we can evaluate the conditional probability of the histograms {H_n} given a labeling c by finding the product of (6) over both bins and classes.
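Since learning repeatedly evaluates (6), it is best computed in log space. The sketch below takes the values of one histogram bin across all |R| training regions of one class, with the hyper-parameters fixed to the values above, and follows the update equations verbatim (the helper name is ours):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(H_bin, mu=0.0, lam=0.1, a=0.01, b=0.01):
    """Log of the evidence term E_ci of eq. (6) for one class and one
    histogram bin. `H_bin` holds that bin's value for each of the |R|
    training regions of the class."""
    R = len(H_bin)
    s = np.sqrt(H_bin)
    lam_p = lam + R                                    # lambda'
    mu_p = (mu * lam + s.sum()) / lam_p                # mu'
    a_p = a + R / 2.0                                  # a'
    b_p = b + (lam * mu**2 - lam_p * mu_p**2 + H_bin.sum()) / 2.0  # b'
    return (-0.5 * R * np.log(2.0 * np.pi)
            + 0.5 * (np.log(lam) - np.log(lam_p))
            + a * np.log(b) - a_p * np.log(b_p)
            + gammaln(a_p) - gammaln(a))
```

Summing this over bins and classes gives log P({H_n}|c̃), and the two such sums needed by eq. (7) below give the score of a candidate merge.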

To evaluate the conditional probability of the labels (4) exactly would require computing P({H_n}|c) for each of the C^N labelings, which is clearly intractable. However, we are only interested in the relative values of P({H_n}|c) as we change c: we wish it to have a high value for the ground truth c̃ and low values for other 'competitive' labelings. To achieve one-vs-all discrimination, we need only consider the alternate labeling where all regions are given the same label c_same. Hence, we make the approximation that we can maximise (4) by instead maximising the quantity P̃ given by

$$\tilde{P}(\tilde{c} \mid \{H_n\}) = \frac{P(\{H_n\} \mid \tilde{c})}{P(\{H_n\} \mid \tilde{c}) + P(\{H_n\} \mid c_{\text{same}})}$$

where the prior terms have canceled because of the choice of a uniform prior distribution. Now we can rewrite P̃ in terms of the mapping φ to give

$$\tilde{P}(\phi) = \frac{P(\{\phi(h_n)\} \mid \tilde{c})}{P(\{\phi(h_n)\} \mid \tilde{c}) + P(\{\phi(h_n)\} \mid c_{\text{same}})}. \qquad (7)$$

The algorithm we use to learn the mapping which maximises P̃ consists of the following stages:


Figure 3: Reducing the dictionary by merging visual words (texton bins). Texton pairs which do not contribute to class discriminability are merged together by our learning algorithm (the pair ab in the figure).

1. Initialise φ to the identity mapping (where no bins are merged).

2. Let φ_ij be the mapping that merges the pair of bins i and j in φ. Compute P̃(φ_ij) for each pair i and j.

3. Find the mapping φ' = arg max_{i,j} P̃(φ_ij).

4. If P̃(φ') > P̃(φ), set φ = φ' and go to step 2. Otherwise return φ as the learned mapping.

The mapping φ given by this algorithm defines a grouping of the words in the original dictionary F. This grouping defines a more compact visual dictionary T which remains discriminative between object classes, with the optimal dictionary size T determined automatically; a code sketch of the greedy search follows below.
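The four stages above amount to a greedy agglomerative search. A minimal sketch, assuming a `score(assignment)` callable that returns log P̃ of eq. (7) for a candidate mapping (it can be assembled from the `log_evidence` helper sketched earlier); the exhaustive rescoring here ignores the caching trick noted under "Algorithm efficiency" below:

```python
import numpy as np

def learn_mapping(K, score):
    """Greedy pairwise bin merging (steps 1-4 above).
    Returns an assignment vector: assignment[i] is the merged bin that
    initial texton i ends up in."""
    assignment = np.arange(K)                 # step 1: identity mapping
    best = score(assignment)
    while True:
        bins = np.unique(assignment)
        cand, cand_score = None, best
        for x in range(len(bins)):            # step 2: score every pair
            for y in range(x + 1, len(bins)):
                trial = assignment.copy()
                trial[trial == bins[y]] = bins[x]   # merge bin y into x
                s = score(trial)
                if s > cand_score:            # step 3: keep best merge
                    cand, cand_score = trial, s
        if cand is None:                      # step 4: no merge improves
            return assignment
        assignment, best = cand, cand_score
```

The returned labels can be made contiguous with `np.unique(assignment, return_inverse=True)[1]` before use with a bin-merging helper such as `apply_mapping` above.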

One step of the merging algorithm is illustrated in the toy example in fig. 3. We have only the two classes blue and red, and the original dictionary F consists of the three words a, b and c. Pixels in regions of the blue class tend to lie only in clusters a and b, whilst pixels of the red class lie predominantly in cluster c, leading to class texton-histograms of the form shown in fig. 3a. Our learning algorithm has three possible merges to consider: ab, ac and bc. Merging either ac or bc will lower P̃ because it will reduce the discriminability between the two classes. However, merging ab (projection along the green arrows in fig. 3b) will increase P̃ because it makes the points in the blue class closer together in histogram space without affecting discrimination between the two classes (fig. 3b). Iterating this basic step produces dictionary size reduction with no loss in class discriminability.

Algorithm efficiency. The computation of P̃(φ_ij) for each pair of bins can be carried out efficiently, since only the terms in (5) relating to bins i and j differ. The number of single-bin evidence computations required for the entire learning process is O(CK²). The efficiency could be further improved by considering only a subset of the possible merges at early stages of the algorithm. We did not find this necessary, since we were able to apply the full algorithm to initial dictionary sizes up to K = 5000.

4.3. Classification

Once a compact UVD has been obtained, we can choose to model object classes in a number of ways. For instance, we could describe a class as the set of histograms associated with the training regions labeled with that class, and use nearest neighbour classification. This is clearly a highly multi-modal, non-parametric representation. Given an input test region, the closest training histogram (e.g. in terms of Euclidean or Mahalanobis distance) is found and the corresponding object class label returned.

Alternatively, we could use the Gaussian class models with the posterior over the parameters θ_c that we have learned from the training data. We classify each new (test) histogram H' by finding the setting of c which maximises

$$\int P(H' \mid c, \theta_c) \, P(\theta_c \mid \{H_n\}, \tilde{c}) \, d\theta_c.$$

The unimodality of Gaussian distributions may be seen as a disadvantage; however, the results section will show how the multi-modal nature of the data is captured by the supervised word-merging process. Advantages of the Gaussian class models over nearest neighbour ones are: i) their compactness (storing all training examples can be avoided), and ii) the fact that Gaussian models provide proper parametric densities. In the following sections we evaluate and compare the behaviour of parametric and non-parametric class models for object class recognition.
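For the Gaussian route, the integral above need not be evaluated explicitly: under the conjugate model of section 4.2.1, the posterior predictive of a test histogram is a ratio of two analytic evidences, one computed with and one without the test region. A sketch, reusing the `log_evidence` helper from above and assuming a uniform class prior (so maximising the predictive equals maximising the posterior):

```python
import numpy as np

def classify_gaussian(H_test, class_hists, log_evidence):
    """Return the class label maximising the posterior predictive
    density of the compact test histogram H_test.
    class_hists[c] is an (n_regions_c, T) array of training histograms
    for class c over the learned compact dictionary."""
    scores = {}
    for c, train in class_hists.items():
        lp = 0.0
        for i in range(train.shape[1]):       # bins are independent
            with_test = np.append(train[:, i], H_test[i])
            # predictive = evidence(train + test) / evidence(train)
            lp += log_evidence(with_test) - log_evidence(train[:, i])
        scores[c] = lp
    return max(scores, key=scores.get)
```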

5. Results

In this section we assess the effectiveness of the proposed class-modeling technique by: i) measuring object class recognition accuracy with respect to different image databases, ii) comparing the compactness of class models, iii) measuring the effect of our learning algorithm on the class discrimination ratio, and iv) showing results from our interactive categorization demo application.

Accuracy of classification on the in-house dataset. In order to measure classification accuracy we have split the 240-image in-house database (fig. 2) into a 50% training set and a 50% test set. The training images are used to estimate both the visual dictionary and the nine class models.

Classification accuracies for the different class models are shown in table 1. If the test image region boundaries were determined from our ground-truth segmentation (ignoring the ground-truth class labels), then nearest neighbour classification (with or without dictionary compression) and the Gaussian class models reached much the same accuracy of about 93%. However, the combination of learned Gaussian models and learned UVD achieves the highest compactness of modeling and thus the highest classification efficiency. For this training set, classification via multi-class SVM techniques or Gaussian mixture models produced inferior results.


Table 1: Accuracy of classification for the in-house dataset. The Gaussian method is over 140 times faster than nearest neighbours with K = 2000.

Method             Dict. size   Accuracy   Accuracy (bbox)
Nearest neighbour  K = 2000     93.4%      76.3%
Nearest neighbour  T = 216      92.7%      78.5%
Gaussian           T = 216      93.4%      77.4%

Table 2: Confusion matrix for the in-house data with learned Gaussian class models; final dictionary size T = 216. Each row lists the true label's non-zero counts in inferred-label order, the largest entry in each row being the correct-class (diagonal) count:

Building:   38, 2, 1, 1, 2, 1
Grass:      66, 1
Tree:       1, 1, 30, 1
Cow:        2, 12
Sky:        46
Aeroplane:  4 (as building), 11
Face:       15
Car:        15
Bicycle:    1, 14

As the last column of table 1 shows, using the regions' bounding boxes rather than the more accurate delineation reduced the classification performance. However, we expect more clutter-robust histogram distances [12] to improve the performance even in the case of bounding-box region selection. Furthermore, automatic region segmentation [11] may also bring large improvements in this case.

The confusion matrix for classification using the learned Gaussian class models is reported in table 2. It can be seen that most image regions have been classified correctly into one of the nine object classes (numbers on the main diagonal). However, a few mistakes were made, e.g. four aeroplanes were incorrectly classified as buildings.

Accuracy of classification on the Pascal dataset. Similar experiments were run on the Pascal Visual Object Classes challenge training dataset³ (587 images⁴). In order to test our algorithm we have once again split the dataset into two equally large training and test sets. The measured classification accuracy is reported in table 3. Notice that in the Pascal dataset only bounding boxes of image regions are provided. Comparing the results in tables 3 and 1, we would expect classification accuracy to increase substantially if better region delineation was provided, e.g. through GrabCut automatic segmentation [11].

³ http://www.pascal-network.org/challenges/VOC/voc/index.html

⁴ We have ignored the grey-level car-only UIUC images, since the simple classification rule "grey image → car" would have artificially boosted our classification results.

Table 3: Accuracy of classification for the Pascal dataset. The Gaussian method is over 310 times faster than nearest neighbours with K = 1200.

Method             Dict. size   Accuracy (bbox)
Nearest neighbour  K = 1200     76.9%
Nearest neighbour  T = 134      74%
Gaussian           T = 134      73.3%

Table 4: Confusion matrix for the Pascal data with learned Gaussian class models; final dictionary size T = 134 textons, with bounding-box-only region selection. Rows are true labels, columns inferred labels:

True label   Car   Bicycle   Motorbike   Person
Car          65    4         4           2
Bicycle      9     36        4           10
Motorbike    1     11        28          14
Person       1     10        4           24

The corresponding confusion matrix for Gaussian class models is shown in table 4. Unsurprisingly, motorbikes and bicycles are confused with one another. Furthermore, a high level of confusion is detected for the class "person", due to the high variability of people's clothing. However, the large majority of object instances have been classified correctly.

Comparing performance of Gaussian class models before and after learning. Figure 4 shows the improvement in classification accuracy when using Gaussian class models before and after learning, for different sizes of the initial visual dictionary. It can be observed that our supervised learning algorithm improves the classification accuracy dramatically, especially for larger numbers of visual words, where the highest accuracy is reached. Notice that, without our learned dictionary, it would be impossible to achieve above 90% accuracy with the Gaussian model (red curve). With the learned dictionary, performance comparable to nearest neighbour classification is achieved.

Figure 5 compares the accuracy of Gaussian class models and nearest neighbour classification for different initial dictionary sizes K. For small initial dictionaries, nearest neighbours is slightly superior to Gaussian models, though the difference has similar magnitude to the differences between different runs of K-means. Then, for K ≥ 2000, their performance becomes the same (if not inverted), which suggests that the merged textons are absorbing the multi-modal nature of the visual object classes.

Figure 4: Classification performance for Gaussian class models, before and after learning.

Figure 5: Comparing classification performance for Gaussian class models vs nearest neighbour classification.

Model selection. Observing fig. 5, we notice that for large values of K the performance of nearest neighbour classification starts to degrade. However, ignoring K-means noise, the blue performance curve is monotonically non-decreasing. This effect is advantageous since we need not worry about choosing the 'optimal' dictionary size: selecting a sufficiently large value (e.g. K > 3000) suffices.

Learning features. As an alternative to the hand-crafted 17-dimensional filter responses, we have also tested our algorithm using 75-dimensional, 5×5 colour patches, in the hope of automatically learning discriminative features from raw input pixel data. Interestingly, we found that comparable classification accuracy was achieved if the initial dictionary size was large enough (K > 1000) (cf. [15]). However, the larger dimensionality of the feature vectors affected both training and testing efficiency.

Information summarization. Reducing the dictionary size to increase classification efficiency without compromising accuracy is fundamental when dealing with large numbers of object classes (e.g. in the order of hundreds or thousands). Moreover, even for a limited number of classes, speed may be important: i) when scanning entire photographs to detect and classify all the objects contained within, ii) for content-based clustering of web images, or iii) for the analysis of videos. Figure 6 illustrates the compression effect by plotting the automatically computed final UVD size T in relation to the original size K. The relationship is highly non-linear, with the ratio T/K shrinking as K grows (the line T = K is drawn for comparison). As K becomes very large, T asymptotically approaches a maximum UVD size of ≈ 230.

Figure 7: Ratio between inter- and intra-class distances on the test set, with initial and learned dictionaries. In this experiment the initial dictionary size was fixed at K = 2000.

Discrimination ratio. It is informative to look at how our algorithm affects the ratio between average inter-class and intra-class distances. In general, greater classification accuracy is achieved for a greater distance ratio. Figure 7 compares the distance ratio before and after learning. The intra-class distance is the log-probability of a region known to be in the class; the inter-class distance is the log-probability of a region known not to be in the class. We take the average of both and then the ratio. As can be seen, learning increases the discriminability of all nine classes, in many cases quite considerably. The least effect is on the grass and sky classes, probably because these homogeneous 'objects' are already modeled well by only a small number of textons in the initial dictionary. The most positive effect is on structured objects such as bicycles, faces and cars.

Interactive classification application. In order to further test the findings of this paper, we have built an interactive object recognition demo application where a user selects a rectangular region in an image and the system instantly estimates the associated class label. Twelve exemplar snapshots are shown in fig. 1. Thanks to our appearance-based models, selecting just a portion of the object of interest suffices, thus demonstrating high robustness with respect to occlusions and missing parts.

Figure 8: Applications and extensions. a) Single-click object categorization: the user "touches" an object and the algorithm associates a category label. b-d) Multi-class object detection: our algorithm automatically lists the object classes contained in the input image. No user interaction is required here.

Single click categorization. High algorithmic efficiency allows us to: i) select a single image point x, ii) run a whole range of classification tests for different sizes and shapes of the regions of interest centred at x, and iii) determine the MAP object class at the selected location in real time. An example of single-click class recognition is shown in fig. 8a; a code sketch of the procedure appears at the end of this section.

Applications. Useful applications of our class modeling algorithm include: i) multi-class object detection, ii) object localization, iii) content-based image segmentation and iv) content-based clustering. For instance, applying single-click classification to a regular grid of image positions enables automatic detection of all objects within an image, as shown in fig. 8b-d. Space restrictions negate a more detailed explanation. The reader is kindly invited to browse our web pages for more examples and a video of our interactive recognition demo.
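A sketch of the single-click procedure follows; the square windows and the particular size set are our assumptions, as the text does not enumerate the region shapes and sizes tested:

```python
def single_click_class(x, y, texton_map, hist_and_classify,
                       half_sizes=(16, 32, 48, 64)):
    """Classify the object at a clicked pixel (x, y) by scoring regions
    of several sizes centred there and keeping the most probable label.
    `hist_and_classify(window)` is assumed to build the compact texton
    histogram of a texton-map window and return (label, log_prob)."""
    best_label, best_lp = None, float("-inf")
    for s in half_sizes:
        # Crop a (2s x 2s) window around the click, clipped to the image.
        window = texton_map[max(0, y - s):y + s, max(0, x - s):x + s]
        label, lp = hist_and_classify(window)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```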

6. Conclusion

This paper has studied the problem of defining and estimating descriptive and compact visual models of object classes for efficient object class recognition. A new supervised learning algorithm has been proposed for estimating appearance-based models from training images. The algorithm is designed to produce highly compact class descriptions with large discrimination power; accuracy and efficiency of classification are essential prerequisites for semantic image retrieval, clustering and editing.

In contrast to previous work, here we have avoided focusing only on sparse sets of interest points or parts. Instead, all pixels are taken into account, with the discriminative features learned automatically. This enables treating both texture-rich and texture-less objects in a unified way.

Surprisingly, our learned Gaussian class models have performed comparably to multi-modal nearest neighbour classification. Advantages of the Gaussian models are their compactness and the fact that they are parametric densities.

Finally, our appearance-based models have turned out to be surprisingly powerful for categorizing both "texture-rich" and "structure-rich" objects.

Currently, we are investigating the integration of appearance with local shape information, to maintain high class discriminability while increasing the number of object classes. The statistical framework developed in this paper readily allows such integration.

References

[1] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Proc. of the 8th ECCV, Prague, May 2004.

[2] P. Duygulu, N. de Freitas, K. Barnard, and D. A. Forsyth. Object recognition as machine translation. In Proc. ECCV, Copenhagen, 2002.

[3] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In Proc. of the 9th IEEE ICCV, Nice, France, pages 1134-1141, October 2003.

[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. of IEEE CVPR, Madison, WI, June 2003.

[5] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proc. of the 8th ECCV, Prague, May 2004.

[6] P. Fryzlewicz and G. P. Nason. A Haar-Fisz algorithm for Poisson intensity estimation. Journal of Computational and Graphical Statistics, 13:621-638, 2004.

[7] J. M. Kasson and W. Plouffe. An analysis of selected computer interchange color spaces. In ACM Transactions on Graphics, volume 11, pages 373-405, October 1992.

[8] M. P. Kumar, P. H. S. Torr, and A. Zisserman. Extending pictorial structures for object recognition. In Proc. of BMVC, London, 2004.

[9] T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In Proc. IEEE ICCV, Kerkyra, Greece, 1999.

[10] R. W. Picard and T. P. Minka. Vision texture for annotation. Multimedia Systems, 3(1):3-14, 1995.

[11] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. In ACM Trans. on Graphics (SIGGRAPH), August 2004.

[12] Y. Rubner and C. Tomasi. Texture-based image retrieval without segmentation. In Proc. IEEE ICCV, Kerkyra, Greece, 1999.

[13] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In 37th Allerton Conf. on Communications and Computation, 1999.

[14] S. Ullman, E. Sali, and M. Vidal-Naquet. A fragment-based approach to object representation and classification. In Proc. 4th Intl. Workshop on Visual Form, IWVF4, Capri, Italy, 2001.

[15] M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In Proc. of IEEE CVPR, Madison, WI, June 2003.

[16] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. IJCV, 62(1-2):61-81, April 2005.

