
Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-based Classification

Enrique G. Ortiz, Alan Wright, and Mubarak Shah

Center for Research in Computer Vision, University of Central Florida, Orlando, FL

Abstract

This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular $\ell_1$-minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm, Mean Sequence SRC (MSSRC), that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the $\ell_1$-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.

1. Introduction

Face recognition has received widespread attention for the past three decades due to its wide applicability. Only recently has this interest spread into the domain of video, where the problem becomes more challenging due to the person's motion and changes in both illumination and occlusions. However, video also has the benefit of providing many samples of the same person, thus providing the opportunity to convert many weak examples into a strong prediction of the identity.

As video search sites like YouTube have grown, video content-based search has become increasingly necessary. For example, a capable retrieval system should return all videos containing specific actors upon a user's request. On sites like YouTube, where a cast list or script may not be available, the visual content is the key to accomplishing this task successfully. The main drawback is the limited availability of annotated video face tracks.

Figure 1. This paper addresses the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals.

With the advent of social networking and photo-sharing, computer vision tasks on the Internet have become increasingly fascinating and viable. This avenue is little exploited by video face recognition. Although large collections of annotated individuals in videos are not freely available, collecting data of annotated still images is easily doable, as witnessed by datasets like Labeled Faces in the Wild (LFW) [12] and Public Figures (PubFig) [16]. Due to their wide availability, we employ large databases of still images to recognize individuals in videos, as depicted in Figure 1.

Existing video face recognition methods tend to perform classification on a frame-by-frame basis and later combine those predictions using an appropriate metric. A straightforward application of $\ell_1$-minimization in this fashion is very computationally expensive. In contrast, we propose a novel method, Mean Sequence Sparse Representation-based Classification (MSSRC), that performs a joint optimization over all faces in the track at once. Though this seems expensive, we show that this optimization reduces to a single $\ell_1$-minimization over the mean face track, thus reducing many classification problems to one, with inherent computational and practical benefits.

2013 IEEE Conference on Computer Vision and Pattern Recognition

Figure 2. Video Face Recognition Pipeline. With a video as input, we perform face detection and track a face throughout the video clip. Then we extract three features (Gabor, LBP, and HOG), reduce each with PCA, and concatenate them. Finally, we perform face recognition using our novel algorithm MSSRC with an input face track and dictionary of still images.

Our proposed method aims to perform video face recognition across domains, leveraging thousands of labeled still images gathered from the Internet, specifically the PubFig and LFW datasets, to perform face recognition on real-world, unconstrained videos. To do this, we collected 101 movie trailers from YouTube and automatically extracted and tracked faces in the video to create a dataset for video face recognition. Furthermore, we explore the often little-studied open-universe scenario in which it is important to recognize and reject unknown identities, i.e., we identify famous actors appearing in movie trailers while rejecting background faces that represent unknown extras. We show our method outperforms existing methods in precision and recall, exhibiting the ability to better reject unknown or uncertain identities.

The contributions of this paper are summarized as follows: (1) We develop a fully automatic end-to-end system for video face recognition, which includes face tracking and recognition leveraging information from both still images for the known dictionary and video for recognition. (2) We propose a novel algorithm, MSSRC, that performs video face recognition using an optimization leveraging all of the available video data. (3) We show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Faces, YouTube Celebrities, and Buffy) and our unconstrained Movie Trailer Face Dataset.

The rest of this paper is organized as follows: Section 2 discusses the related work on video face recognition. Section 3 describes our entire framework for video face recognition, from tracking to recognition. Next, in Section 4, we describe our unconstrained Movie Trailer Face Dataset. Section 5 exhaustively evaluates our method on existing video datasets and our new dataset. Finally, we end with a summary of conclusions and future work in Section 6.

2. Related Work

For a complete survey of video-based face recognition, refer to [18]; here we focus on an overview of the most related methods. Current video face recognition techniques fall into one of three categories: key-frame based, temporal model based, and image-set matching based.

Key-frame based methods generally perform a prediction on the identity of each key-frame in a face track, followed by probabilistic fusion or majority voting to select the best match. Due to the large variations in the data, key-frame selection is crucial in this paradigm [4]. Zhao et al.'s [25] work is most similar to ours in that they use a database of still images collected from the Internet. They learn a model over this dictionary by learning key faces via clustering. These cluster centers are compared to test frames using a nearest-neighbor search followed by majority, probabilistic voting to make a final prediction. We, on the other hand, use a classification scheme that enhances robustness by finding an agreement amongst the individual frames in a single optimization.

Temporal model based methods learn the temporal, facial dynamics of the face throughout a video. Several methods employ Hidden Markov Models (HMM) to this end, e.g. [14]. Most related to our work, Hadid et al. [10] use a still-image training library by imposing motion information on it to train an HMM, and Zhou et al. [26] probabilistically generalize a still-image library to do video-to-video matching. Generally, training these models is prohibitively expensive, especially when the dataset size is large.

Image-set matching based methods allow the modeling of a face track as an image-set. Many methods, like [24], compute a mutual subspace distance, where each face track is modeled in its own subspace and a distance is computed between each pair of subspaces. They are effective with clean data, but these methods are very sensitive to the variations inherent in video face tracks. Other methods take a more statistical approach, like [5], which used Logistic Discriminant-based Metric Learning (LDML) to learn a relationship between images in face tracks, where the inter-class distances are maximized. LDML is very computationally expensive and focuses more on learning relationships within the data, whereas we directly relate the test track to the training data.

Character recognition methods have been very popular due to their application to movies and sitcoms. The methods of [8, 19] perform person identification using all available information, e.g. clothing appearance and audio, to identify the cast rather than the facial information alone. Another method [3] used a small user-selected sample of characters in the given movie and a pixel-wise Euclidean distance to handle occlusion. Others [2] use a manifold for known characters, which successfully clusters input frames. While character recognition is suitable for a long-running series, the use of clothing and other contextual clues is not helpful in the task of identifying actors between movies, TV shows, or non-related video clips. In these scenarios, our approach of focusing on the facial recognition aspect from still images is more adept in unconstrained environments.

Still-image based literature is vast, but a popular approach is Wright et al.'s [23] Sparse Representation-based Classification (SRC), in which they present the principle that a given test image can be represented by a linear combination of images from a large dictionary of faces. The key concept is enforcing sparsity, since a test face can be reconstructed best from a small subset of the large dictionary, i.e. training faces of the same class. A straightforward adaptation of this method would be to perform estimation on each frame and fuse results probabilistically, similarly to key-frame based methods. However, $\ell_1$-minimization is known to be computationally expensive, thus we propose a constrained optimization with the knowledge that the images within a face track are of the same person. We show that imposing this fact reduces the problem to computing a single $\ell_1$-minimization over the average face track.

3. Video Face Recognition Pipeline

In this section, we describe our end-to-end video face recognition system. First, we detail our algorithm for face tracking based on face detections from video. Next, we describe the features we use to represent the faces and handle variations in pose, lighting, and occlusion. Finally, we derive our optimization for video face recognition that classifies a video face track based on a dictionary of still images.

3.1. Face Tracking

Our method performs the difficult task of face tracking based on face detections extracted using the high-performance SHORE face detection system [15] and generates a face track based on two metrics. To associate a new detection to an existing track, our first metric determines the ratio of the size of the bounding box encompassing both face detections to the size of the larger bounding box of the two detections. The formulation is as follows:

$$d_{\text{spatial}} = \frac{w \cdot h}{\max(h_1 \cdot w_1,\; h_2 \cdot w_2)}, \tag{1}$$

where $(x_1, y_1, w_1, h_1)$ and $(x_2, y_2, w_2, h_2)$ are the $(x, y)$ location and the width and height of the detections in the previous and current frames, respectively. The overall width $w$ and height $h$ are computed as $w = \max(x_1 + w_1, x_2 + w_2) - \min(x_1, x_2)$ and $h = \max(y_1 + h_1, y_2 + h_2) - \min(y_1, y_2)$. Intuitively, this metric encodes the dimensional similarity of the current and previous bounding boxes, intrinsically considering the spatial information.
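As a rough illustration of Eq. 1, the following Python sketch computes the spatial ratio for two detections given as (x, y, w, h) tuples. The function name `spatial_ratio` is hypothetical and not taken from the paper, and no acceptance threshold is included because the text does not state one.

```python
def spatial_ratio(prev_box, curr_box):
    """Eq. 1: area of the box enclosing both detections, divided by
    the area of the larger of the two detections."""
    x1, y1, w1, h1 = prev_box
    x2, y2, w2, h2 = curr_box
    # Width and height of the smallest box that contains both detections.
    w = max(x1 + w1, x2 + w2) - min(x1, x2)
    h = max(y1 + h1, y2 + h2) - min(y1, y2)
    return (w * h) / max(h1 * w1, h2 * w2)


# Identical boxes give a ratio of 1; the ratio grows as boxes drift apart.
same = spatial_ratio((0, 0, 10, 10), (0, 0, 10, 10))
side_by_side = spatial_ratio((0, 0, 10, 10), (10, 0, 10, 10))
```

Under this reading, a ratio close to 1 indicates that the two detections nearly coincide in both position and size.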

The second tracking metric takes into account the appearance information via a local color histogram of the face. We compute the distance as a ratio of the histogram intersection of the RGB histograms, with 30 bins per channel, of the last face of a track and the current detection to the total summation of the histogram bins:

$$d_{\text{appearance}} = \frac{\sum_{i=1}^{n} \min(a_i, b_i)}{\sum_{i=1}^{n} (a_i + b_i)}, \tag{2}$$

where $a$ and $b$ are the histograms of the current and previous face and $n$ is the number of histogram bins. We compare each new face detection to existing tracks; if the location and appearance metrics are similar, the face is added to the track, otherwise a new track is created. Finally, we use a global histogram for the entire frame, encoding scene information, to detect scene boundaries, and impose a lifespan of 20 frames of no detection to end tracks.
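The appearance metric of Eq. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `rgb_histogram` and `appearance_similarity` are hypothetical, pixels are assumed to be 8-bit (r, g, b) tuples, and the three per-channel histograms are simply concatenated, which the paper does not specify.

```python
def rgb_histogram(pixels, bins=30):
    """Concatenated per-channel histogram of 8-bit (r, g, b) pixels.
    Concatenating the channels is an assumption of this sketch."""
    hist = [0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            index = min(value * bins // 256, bins - 1)
            hist[channel * bins + index] += 1
    return hist


def appearance_similarity(a, b):
    """Eq. 2: histogram intersection over the summed mass of both histograms."""
    if len(a) != len(b):
        raise ValueError("histograms must have the same number of bins")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(min(ai, bi) for ai, bi in zip(a, b)) / total


face = rgb_histogram([(0, 0, 0), (255, 255, 255)])
score = appearance_similarity(face, face)
```

With the denominator read as the sum over both histograms, identical histograms score 0.5 and disjoint histograms score 0, so any threshold would have to be chosen on that scale.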

3.2. Feature Extraction

Because real-world datasets contain pose variations even after alignment, we use three fast and popular local features: Local Binary Patterns (LBP) [1], Histogram of Oriented Gradients (HOG) [7], and Gabor wavelets [17]. More features aid recognition, but at a higher computational cost.

Algorithm 1. Mean Sequence SRC (MSSRC)

1. Input: training gallery $A$, test face track $Y = [y_1, y_2, \ldots, y_M]$, and sparsity weight parameter $\lambda$.

2. Normalize the columns of $A$ to have unit $\ell_2$-norm.

3. Compute the mean of the track, $\bar{y} = \sum_{m=1}^{M} y_m / M$, and normalize it to unit $\ell_2$-norm.

4. Solve the $\ell_1$-minimization problem

$$\hat{x}_{\ell_1} = \arg\min_{x} \|\bar{y} - A x\|_2^2 + \lambda \|x\|_1$$

5. Compute residual errors for each class $j \in [1, C]$

$$r_j(\bar{y}) = \|\bar{y} - A_j x_j\|_2$$

6. Output: identity $I$ and confidence $P(I \mid \bar{y})$

$$I(\bar{y}) = \arg\min_{j} r_j(\bar{y})$$

$$P(I \in [1, C] \mid \bar{y}) = \frac{C \cdot \max_j \|x_j\|_1 / \|\hat{x}\|_1 - 1}{C - 1}$$

Before feature extraction, all images are first eye-aligned using eye locations from SHORE and normalized by subtracting the mean, removing the first-order brightness gradient, and performing histogram equalization. Gabor wavelets were extracted with one scale $\lambda = 4$ at four orientations $\theta = \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$ with a tight face crop at a resolution of 25x30 pixels. A null Gabor filter includes the raw pixel image (25x30) in the descriptor. The standard $\mathrm{LBP}^{U2}_{8,2}$ and HOG descriptors are extracted from 72x80 loosely cropped images with a histogram size of 59 and 32 over 9x10 and 8x8 pixel patches, respectively. All descriptors were scaled to unit norm, dimensionality reduced with PCA to 1536 dimensions each, and zero-meaned.

3.3. Mean Sequence Sparse Representation-based Classification (MSSRC)

Given a test image $y$ and training set $A$, we know that the images of the same class to which $y$ should match are a small subset of $A$, and their relationship is modeled by $y = A x$, where $x$ is the coefficient vector relating them. Therefore, the coefficient vector $x$ should only have non-zero entries for those few images from the same class and zeros for the rest. Imposing this sparsity constraint upon the coefficient vector $x$ results in the following formulation:

$$\hat{x}_{\ell_1} = \arg\min_{x} \|y - A x\|_2^2 + \lambda \|x\|_1, \tag{3}$$

where the $\ell_1$-norm enforces a sparse solution by minimizing the absolute sum of the coefficients.
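The paper does not say which solver is used for Eq. 3. As one possible illustration, the sketch below uses the iterative shrinkage-thresholding algorithm (ISTA), a standard proximal-gradient method that is swapped in here as an assumption. The names `soft_threshold` and `ista_lasso`, the fixed iteration count, and the list-of-rows matrix layout are all choices of this sketch rather than details from the paper.

```python
import math


def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal map of t * ||.||_1)."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def ista_lasso(A, y, lam, iters=500):
    """Approximately minimize ||y - A x||_2^2 + lam * ||x||_1.
    A is a list of rows; each column plays the role of one training image."""
    n_rows, n_cols = len(A), len(A[0])
    # The gradient 2 A^T (A x - y) is Lipschitz with constant at most
    # 2 * ||A||_F^2, so 1 / (2 * ||A||_F^2) is a safe step size.
    step = 1.0 / (2.0 * sum(a * a for row in A for a in row))
    x = [0.0] * n_cols
    for _ in range(iters):
        resid = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
                 for i in range(n_rows)]
        grad = [2.0 * sum(A[i][j] * resid[i] for i in range(n_rows))
                for j in range(n_cols)]
        x = soft_threshold([x[j] - step * grad[j] for j in range(n_cols)],
                           step * lam)
    return x


# With an identity dictionary the minimizer is soft_threshold(y, lam / 2):
# the small second coefficient is driven exactly to zero.
x_hat = ista_lasso([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.1], lam=0.4)
```

The toy call shows the sparsifying effect of the $\ell_1$ term: the weakly supported coefficient vanishes while the strongly supported one is only shrunk.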

The leading principle of our method is that all of the images $y$ from the face track $Y = [y_1, y_2, \ldots, y_M]$ belong to the same person. Because all images in a face track belong to the same person, one would expect a high degree of correlation amongst the sparse coefficient vectors $x_j \; \forall j \in [1 \ldots M]$, where $M$ is the length of the track. Therefore, we can look for an agreement on a single coefficient vector $x$ determining the linear combination of training images $A$ that make up the unidentified person. In fact, with sufficient similarity between the faces in a track, one might expect nearly the same coefficient vector to be recovered for each frame. This provides the intuition for our approach: we enforce a single coefficient vector for all frames. Mathematically, this means the sum of squared residual errors over the frames should be minimized. We enforce this constraint on the $\ell_1$ solution of Eq. 3 as follows:

$$\hat{x}_{\ell_1} = \arg\min_{x} \sum_{m=1}^{M} \|y_m - A x\|_2^2 + \lambda \|x\|_1, \tag{4}$$

where we minimize the $\ell_2$ error over the entire image sequence, while assuming the coefficient vector $x$ is sparse and the same over all of the images.

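Focusing on the first part of the equation, more specifically the $\ell_2$ portion, we can rearrange it as follows:

$$\sum_{m=1}^{M} \|y_m - A x\|_2^2 = \sum_{m=1}^{M} \|y_m - \bar{y} + \bar{y} - A x\|_2^2 = \sum_{m=1}^{M} \left( \|y_m - \bar{y}\|_2^2 + 2 (y_m - \bar{y})^T (\bar{y} - A x) + \|\bar{y} - A x\|_2^2 \right), \tag{5}$$

where $\bar{y} = \sum_{m=1}^{M} y_m / M$. However,

$$\sum_{m=1}^{M} 2 (y_m - \bar{y})^T (\bar{y} - A x) = 2 \left( \sum_{m=1}^{M} y_m - M \bar{y} \right)^T (\bar{y} - A x) = 0^T (\bar{y} - A x) = 0.$$

Thus, Eq. 5 becomes:

$$\sum_{m=1}^{M} \|y_m - A x\|_2^2 = \sum_{m=1}^{M} \|y_m - \bar{y}\|_2^2 + M \|\bar{y} - A x\|_2^2, \tag{6}$$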

where the first part of the sum is a constant with respect to $x$. Therefore, we obtain the final simplification of our original minimization:

$$\hat{x}_{\ell_1} = \arg\min_{x} \sum_{m=1}^{M} \|y_m - A x\|_2^2 + \lambda \|x\|_1 = \arg\min_{x} M \|\bar{y} - A x\|_2^2 + \lambda \|x\|_1 = \arg\min_{x} \|\bar{y} - A x\|_2^2 + \lambda \|x\|_1, \tag{7}$$

where $M$, by division, is absorbed by the constant weight $\lambda$. By this sequence, our optimization reduces to the $\ell_1$-minimization of $x$ for the mean face track $\bar{y}$.

This conclusion, that enforcing a single, consistent coefficient vector $x$ across all images in a face track $Y$ is equivalent to a single $\ell_1$-minimization over the average of all the frames in the face track, is key to keeping our approach robust yet fast. Instead of performing $M$ individual $\ell_1$-minimizations over each frame and classifying via some voting scheme, our approach performs a single $\ell_1$-minimization on the mean of the face track, which is not only a significant speed up, but theoretically sound. Furthermore, we empirically validate in subsequent sections that our approach outperforms other forms of temporal fusion and voting amongst individual frames.
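The decomposition behind this conclusion can be checked numerically. The sketch below draws a random dictionary, track, and coefficient vector, then compares the summed per-frame squared error with the constant spread term plus $M$ times the squared error of the mean frame. The helper names and the random test sizes are hypothetical choices of this sketch, not values from the paper.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(vi * vi for vi in v)


def residual(y, A, x):
    """Return y - A x, with A stored as a list of rows."""
    return [yi - sum(aij * xj for aij, xj in zip(row, x))
            for yi, row in zip(y, A)]


def check_mean_reduction(seed=0, d=4, n=3, M=5):
    """Return both sides of the identity for random A (d x n), Y (M frames), x."""
    rng = random.Random(seed)
    A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
    Y = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    y_bar = [sum(y[i] for y in Y) / M for i in range(d)]
    # Left side: summed squared error over every frame of the track.
    lhs = sum(sq_norm(residual(y, A, x)) for y in Y)
    # Right side: spread of the frames around the mean (independent of x)
    # plus M times the squared error of the mean frame.
    spread = sum(sq_norm([yi - bi for yi, bi in zip(y, y_bar)]) for y in Y)
    rhs = spread + M * sq_norm(residual(y_bar, A, x))
    return lhs, rhs


lhs, rhs = check_mean_reduction()
```

Because the spread term does not depend on $x$, the two objectives share the same minimizer once $\lambda$ is rescaled by $M$, which is the step the derivation relies on.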

Finally, we classify the average test track $\bar{y}$ by determining the class of training samples that best reconstructs the face from the recovered coefficients:

$$I(\bar{y}) = \arg\min_{j} r_j(\bar{y}) = \arg\min_{j} \|\bar{y} - A_j x_j\|_2, \tag{8}$$

where the label $I(\bar{y})$ of the test face track is the class with the minimal residual or reconstruction error $r_j(\bar{y})$, and $x_j$ are the recovered coefficients from the global solution $\hat{x}_{\ell_1}$ that belong to class $j$. Confidence in the determined identity is obtained using the Sparsity Concentration Index (SCI), which is a measure of how distributed the coefficients are across classes:

$$\mathrm{SCI} = \frac{C \cdot \max_j \|x_j\|_1 / \|\hat{x}\|_1 - 1}{C - 1} \in [0, 1], \tag{9}$$

ranging from 0 (the test face is represented equally by all classes) to 1 (the test face is fully represented by one class).
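To make the last two steps concrete, the following sketch takes an already recovered coefficient vector and computes the class residuals of Eq. 8 and the SCI of Eq. 9. It is a minimal illustration: the function name `classify_track`, the per-column `labels` list, and the list-of-rows matrix layout are assumptions of this sketch.

```python
import math


def classify_track(y_bar, A, labels, x):
    """Return (identity, sci) for the mean track y_bar.
    A is a list of rows, labels[j] is the class of column j,
    and x holds the recovered sparse coefficients."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("SCI needs at least two classes")
    residuals, l1_mass = {}, {}
    for c in classes:
        cols = [j for j, label in enumerate(labels) if label == c]
        # Reconstruct y_bar using only the coefficients of class c.
        recon = [sum(row[j] * x[j] for j in cols) for row in A]
        residuals[c] = math.sqrt(
            sum((yi - ri) ** 2 for yi, ri in zip(y_bar, recon)))
        l1_mass[c] = sum(abs(x[j]) for j in cols)
    identity = min(classes, key=lambda c: residuals[c])
    total = sum(l1_mass.values())
    C = len(classes)
    sci = (C * max(l1_mass.values()) / total - 1) / (C - 1) if total > 0 else 0.0
    return identity, sci


A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["a", "a", "b"]
identity, sci = classify_track([1.0, 0.0, 0.0], A, labels, [1.0, 0.0, 0.0])
_, sci_split = classify_track([0.5, 0.0, 0.5], A, labels, [0.5, 0.0, 0.5])
```

In the first call all coefficient mass falls on one class, so the SCI is 1; in the second the mass is split evenly between two classes, so the SCI is 0 and such a prediction would be a natural candidate for rejection.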

4. Movie Trailer Face Dataset

Existing datasets do not capture the large-scale identification scope we wish to evaluate. The YouTube Celebrities Dataset [14] has unconstrained videos from YouTube; however, they are very low quality and only contain 3 unique videos per person, which they segment. The YouTube Faces Dataset [22] and Buffy Dataset [5] also exhibit more challenging scenarios than traditional video face recognition datasets; however, YouTube Faces is geared towards face verification (same vs. not same), and Buffy only contains 8 actors; thus, both are ill-suited for the large-scale face identification of our proposed video retrieval framework.

Figure 3. The distribution of face tracks across the identities in PubFig+10 (x-axis: classes; y-axis: number of tracks).

We built our Movie Trailer Face Dataset using 101 movie trailers from YouTube from the 2010 release year that contained celebrities present in the supplemented PubFig+10 dataset. These videos were then processed to generate face tracks using the method described above. The resulting dataset contains 4,485 face tracks, 65% consisting of unknown identities (not present in PubFig+10) and 35% known. The class distribution is shown in Fig. 3, with the number of face tracks per celebrity in the movie trailers ranging from 5 to 60 labeled samples. The fact that half of the public figures do not appear in any of the movie trailers presents an interesting test scenario in which the algorithm must be able to distinguish the subject of interest from within a large pool of potential identities.

5. Experiments

In this section, we first compare our tracking method to a standard method used in the literature. Then, we evaluate our video face recognition method on three existing datasets: YouTube Faces, YouTube Celebrities, and Buffy. We also evaluate several algorithms, including MSSRC (ours), on our new Movie Trailer Face Dataset, showing the strengths and weaknesses of each and thus proving experimentally the validity of our algorithm.

5.1. Tracking Results

To analyze the quality of our automatically generated face tracks, we ground-truthed five movie trailers from the dataset: 'The Killer Inside', 'My Name is Khan', 'Biutiful', 'Eat, Pray, Love', and 'The Dry Land'. Based on tracking literature [13], we use two CLEAR MOT metrics, Multiple Object Tracking Accuracy and Precision (MOTA and MOTP), for evaluation, which better consider issues faced by trackers than standard accuracy, precision, or recall. The MOTA tells us how well the tracker did overall with regard to all of the ground-truth labels, while the MOTP appraises how well the tracker performed on the detections that exist in the ground-truth.

Video                  Metric   KLT [8]   Ours
'The Killer Inside'    MOTP     68.93     69.35
                       MOTA     42.88     42.16
'My Name is Khan'      MOTP     65.63     65.77
                       MOTA     44.26     48.24
'Biutiful'             MOTP     61.58     61.34
                       MOTA     39.28     43.96
'Eat Pray Love'        MOTP     56.98     56.77
                       MOTA     34.33     35.60
'The Dry Land'         MOTP     64.11     62.70
                       MOTA     27.90     30.15
Average                MOTP     63.46     63.19
                       MOTA     37.73     40.02

Table 1. Tracking Results. Our method outperforms the KLT-based [8] method in terms of MOTA by 2%.

Method          Accuracy ± SE   AUC    EER
MBGS [22]       75.3 ± 2.5      82.0   26.0
MSSRC (Ours)    75.3 ± 2.2      82.9   25.3

Table 2. YouTube Faces Dataset. Results for the top performing video face verification algorithm MBGS and our competitive method MSSRC. Note: MBGS results are different from those published, but they are the output of default settings in their system.

Although our goal is not to solve the tracking problem, in Table 1 we show our results compared to a standard face tracking method. The first column shows a KLT-based method [8], where the face detections are associated based on a ratio of overlapping tracked features, and the second shows our method. Both methods are similarly precise; however, our metrics have a larger coverage of total detections/tracks by 2% in MOTA with a 3.5x speedup. Results are available online.

5.2. YouTube Faces Dataset

Although face identification is the focus of our paper, we evaluated our method on the YouTube Faces Dataset [22] for face verification (same/not same), to show that our method can also work in this context. To the best of our knowledge, there is only one paper [9] that has done face verification using SRC; however, it was not in the context of video face recognition, but that of still images from LFW. The YouTube Faces Dataset consists of 5,000 video pairs, half same and half not. The videos are divided into 10 splits, each with 500 pairs. The results are averaged over the ten splits, where for each split one is used for testing and the remaining nine for training. The final results are presented in terms of accuracy, area under the curve, and equal error rate. As seen in Table 2, we obtain competitive results with the top performing method MBGS [22], within 1% in terms of accuracy, and MSSRC even surpasses it in terms of area under the curve (AUC) by just below 1%, with a lower equal error rate by 0.7%. We perform all experiments with the same LBP data provided by [22] and a τ value of 0.0005.

Method          Accuracy (%)
HMM [14]        71.24
MDA [20]        67.20
SANP [11]       65.03
COV+PLS [21]    70.10
UISA [6]        74.60
MSSRC (Ours)    80.75

Table 3. YouTube Celebrities Dataset. We outperform the best reported result by 6%.

Method          Accuracy (%)
LDML [5]        85.88
MSSRC (Ours)    86.27

Table 4. Buffy Dataset. We obtain a slight gain in accuracy over the reported method.

5.3. YouTube Celebrities Dataset

The YouTube Celebrities Dataset [14] consists of 47 celebrities (actors and politicians) in 1,910 video clips downloaded from YouTube and manually segmented to the portions where the celebrity of interest appears. There are approximately 41 clips per person segmented from 3 unique videos per actor. The dataset is challenging due to pose, illumination, and expression variations, as well as high compression and low quality. Using our tracker, we successfully tracked 92% of the videos, as compared to the 80% tracked in their paper [14]. The standard experimental setup selects 3 training clips, 1 from each unique video, and 6 test clips, 2 from each unique video, per person. In Table 3, we summarize reported results on YouTube Celebrities, where we outperform the state-of-the-art by at least 6%.

5.4. Buffy Dataset

The Buffy Dataset consists of 639 manually annotated face tracks extracted from episodes 9, 21, and 45 from different seasons of the TV series "Buffy the Vampire Slayer". They generated tracks using the KLT-based method [8] (available on the author's website). For features, we compute SIFT descriptors at 9 fiducial points as described in [5] and use their experimental setup with 312 tracks for training and 327 for testing. They present a Logistic Discriminant-based Metric Learning (LDML) method that learns a subspace. In their supervised experiments, they tried several classifiers, with each obtaining similar results. However, using our classifier, there is a slight improvement.

Method               AP (%)   Recall (%)
NN                   9.53     0.00
SVM                  50.06    9.69
LDML [5]             19.48    0.00
L2                   36.16    0.00
SRC (First Frame)    42.15    13.39
SRC (Voting)         54.88    23.47
MSSRC (Ours)         58.70    30.23

Table 5. Movie Trailer Face Dataset. MSSRC outperforms all of the non-SRC methods by at least 8% in AP and 20% in recall at 90% precision.

Figure 4. Precision vs. Recall for the Movie Trailer Face Dataset. MSSRC rejects unknowns or distractors better than all others. (Curves: NN, SVM, LDML, L2, SRC (1 Frame), SRC (Voting), MSSRC (Ours); x-axis: Recall (%); y-axis: Precision (%).)

5.5. Movie Trailer Face Dataset

In this section, we present results on our unconstrained Movie Trailer Face Dataset, which allows us to test larger scale face identification, as well as each algorithm's ability to reject unknown identities. In our test scenario, we chose the Public Figures (PF) [16] dataset as our training gallery, supplemented by images collected of 10 actors and actresses from web searches for additional coverage of face tracks extracted from movie trailers. We also cap the maximum number of training images per person in the dataset at 200 for better performance, because predictions are otherwise skewed towards the people with the most examples. The distribution of face tracks across all of the identities in the PubFig+10 dataset is shown in Fig. 3. In total, PubFig+10 consists of 34,522 images and our Movie Trailer Face Dataset has 4,485 face tracks, which we use to conduct experiments on several algorithms.

5.5.1. Algorithmic Comparison

The tested methods include NN, LDML, SVM, L2, SRC, and our method MSSRC. For the experiments with NN, LDML, SVM, L2, and SRC, we test each individual frame of the face track and predict its final identity via probabilistic voting, and its confidence is an average over the predicted distances or decision values. The confidence values are used to reject predictions to evaluate the precision and recall of the system. Note all MSSRC experiments are performed with a λ value of 0.01. We present results in terms of precision and recall as defined in [8].
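To illustrate how confidence-based rejection leads to a figure such as "recall at 90% precision", the sketch below sweeps a confidence threshold over ranked predictions. It rests on assumptions: precision is taken as the fraction of accepted predictions that are correct, and recall as the number of correct accepted predictions divided by the number of tracks with a known identity. The exact definitions follow [8] in the paper and may differ, and the function name `recall_at_precision` is hypothetical.

```python
def recall_at_precision(predictions, num_known, min_precision=0.9):
    """predictions: list of (confidence, is_correct) pairs, one per track.
    Returns the best recall over all confidence thresholds whose
    precision is at least min_precision (definitions assumed, see text)."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    best_recall = 0.0
    correct = 0
    # Accepting the top-k predictions corresponds to one threshold setting.
    for accepted, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        if correct / accepted >= min_precision:
            best_recall = max(best_recall, correct / num_known)
    return best_recall


tracks = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
strict = recall_at_precision(tracks, num_known=4)
relaxed = recall_at_precision(tracks, num_known=4, min_precision=0.75)
```

In this toy example, only the two most confident predictions can be accepted before precision drops below 90%, so the strict setting recovers half of the known tracks.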

Table 5 presents the results for the described methods on the Movie Trailer Face Dataset in terms of two measures, average precision and recall at 90% precision. NN performs very poorly in terms of both metrics, which explains why NN-based methods have focused on finding “good” key-frames to test on. LDML struggles with the larger number of training classes compared to the Buffy experiment, reaching only 19.48% average precision. The L2 method performs surprisingly well for a simple method. We also tried Mean L2 with similar performance. The SVM and SRC-based methods perform very closely at high recall, but not in terms of AP and recall at 90% precision, with MSSRC outperforming SVM by 8% and 20%, respectively. In Fig. 4, the SRC-based methods reject unknown identities better than the others.

The straightforward application of SRC on a frame-by-frame basis and our efficient method MSSRC perform within 4% of each other, thus experimentally validating that MSSRC is computationally equivalent to performing standard SRC on each individual frame. Instead of computing SRC on each frame, which takes approximately 45 minutes per track, we reduce a face track to a single feature vector for $\ell_1$-minimization (1.5 min/track). Surprisingly, MSSRC obtains better recall at 90% precision by 7% and better average precision by 4%. Instead of fusing results after classification, as done in the frame-by-frame methods, MSSRC benefits from better rejection of uncertain predictions. In terms of timing, the preprocessing steps run identically for SRC and MSSRC: tracking runs at 20 fps and feature extraction at 30 fps. For identification, MSSRC classifies at 20 milliseconds per frame, whereas SRC on a single frame takes 100 milliseconds. All other methods classify in less than 1 ms, however with a steep drop in precision and recall.

5.5.2. Effect of Varying Track Length

The question remains, do we really need all of the images? To answer this question, we select the first $m$ frames for each track and test the two best performing methods from the previous experiments: MSSRC and SVM. Fig. 5 shows that performance plateaus just after 20 frames, which is close to the average track length of 22 frames. Most importantly, the results show that using multiple frames is beneficial, since moving from using 1 frame to 20 frames results in a 5.57% and 16.03% increase in average precision and recall at 90% precision, respectively, for MSSRC. Furthermore, Fig. 5 shows that the SVM's performance also increases with more frames, although MSSRC outperforms the SVM method in its ability to reject unknown identities.

Figure 5. Effect of Varying Track Length. (a) Average Precision. (b) Recall at 90% Precision. We see that performance levels out at about 20 frames (close to the average track length). MSSRC outperforms SVM by 8% on average in terms of AP. (Curves: SVM, MSSRC (Ours); x-axis: number of frames, from 1 to All.)

6. Conclusions and Future Work

In this paper, we have presented a fully automatic end-to-end system for video face recognition, which includes face tracking and identification leveraging information from both still images for the known dictionary and video for recognition. We proposed a novel algorithm, Mean Sequence SRC (MSSRC), that performs a joint optimization using all of the available image data to perform video face recognition. We showed that our method outperforms the state-of-the-art on real-world, unconstrained videos in our new Movie Trailer Face Dataset. Furthermore, we showed that our method especially excels at rejecting unknown identities, outperforming the next best method in terms of average precision by 8%. Video face recognition presents a very compelling area of research with difficulties unseen in still-image recognition. In the future, we would explore the effect of selecting key-frames, or less noisy frames. Furthermore, there is a whole area of domain transfer for transferring knowledge from the still-image domain to the videos.

Acknowledgement

We acknowledge Brian C. Becker, Niels da Vitoria Lobo, and Xin Li for their feedback and help.

References

[1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: Application to face recognition. TPAMI, 2006.

[2] O. Arandjelovic and R. Cipolla. Automatic Cast Listing in Feature-Length Films with Anisotropic Manifold Space. In CVPR, 2006.

[3] O. Arandjelovic and A. Zisserman. Automatic face recognition for film character retrieval in feature-length films. In CVPR, 2005.

[4] S. Berrani and C. Garcia. Enhancing face recognition from video sequences using robust statistics. AVSS, 2005.

[5] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. ICCV, 2011.

[6] Z. Cui, S. Shan, H. Zhang, S. Lao, and X. Chen. Image sets alignment for Video-Based Face Recognition. In CVPR, 2012.

[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[8] M. Everingham and J. Sivic. Taking the bite out of automated naming of characters in TV video. CVIU, 2009.

[9] H. Guo, R. Wang, J. Choi, and L. S. Davis. Face verification using sparse representations. CVPR Workshop, 2012.

[10] A. Hadid and M. Pietikainen. From still image to video-based face recognition: an experimental analysis. FG, 2004.

[11] Y. Hu, A. S. Mian, and R. Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.

[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst, 2007.

[13] R. Kasturi, D. Goldgof, Padmanabhan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. TPAMI, 2009.

[14] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real-world videos. In CVPR, 2008.

[15] C. Kueblbeck and A. Ernst. Face detection and tracking in video sequences using the modified census transformation. JIVC, 2006.

[16] N. Kumar, A. Berg, P. Belhumeur, and S. Nayar. Describable visual attributes for face verification and image search. TPAMI, 2011.

[17] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. TIP, 2002.

[18] C. Shan. Face recognition and retrieval in video. Video Search and Mining, 2010.

[19] M. Tapaswi and M. Bäuml. “Knock! Knock! Who is it?” Probabilistic Person Identification in TV-Series. CVPR, 2012.

[20] R. Wang and X. Chen. Manifold Discriminant Analysis. In CVPR, 2009.

[21] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR, 2012.

[22] L. Wolf, T. Hassner, and Y. Taigman. Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. TPAMI, 2011.

[23] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. TPAMI, 2009.

[24] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image sequence. In FG, 1998.

[25] M. Zhao, J. Yagnik, H. Adam, and D. Bau. Large scale learning and recognition of faces in web videos. FG, 2008.

[26] S. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces from video. CVIU, 2003.
