
Dictionaries for image and video-based face recognition [Invited]

Vishal M. Patel, Yi-Chen Chen, Rama Chellappa, and P. Jonathon Phillips


JOSA A, Vol. 31, Issue 5, pp. 1090-1103 (2014)
http://dx.doi.org/10.1364/JOSAA.31.001090



Abstract

In recent years, sparse representation and dictionary-learning-based methods have emerged as powerful tools for efficiently processing data in nontraditional ways. A particular area of promise for these theories is face recognition. In this paper, we review the role of sparse representation and dictionary learning for efficient face identification and verification. Recent face recognition algorithms from still images, videos, and ambiguously labeled imagery are reviewed. In particular, discriminative dictionary learning algorithms as well as methods based on weakly supervised learning and domain adaptation are summarized. Some of the compelling challenges and issues that confront research in face recognition using sparse representations and dictionary learning are outlined.

© 2014 Optical Society of America

1. INTRODUCTION

Face recognition is a challenging problem that has been actively researched for over two decades [1]. Current systems work very well when the test image is captured under controlled conditions [2]. However, their performance degrades significantly when the test image contains variations that are not present in the training images. Some of these variations include illumination, pose, expression, cosmetics, and aging.

It has been observed that since human faces have a similar overall configuration, face images can be described by a relatively low-dimensional subspace. As a result, holistic dimensionality reduction subspace methods such as principal component analysis (PCA) [3], linear discriminant analysis (LDA) [4,5], and independent component analysis (ICA) [6] have been proposed for the task of face recognition. These approaches can be classified into either generative or discriminative methods. An advantage of using generative approaches is their reduced sensitivity to noise [1,7].

In recent years, generative and discriminative approaches based on sparse representations have been gaining traction in biometrics recognition [8]. In sparse representation, given a signal and a redundant dictionary, the goal is to represent the signal as a sparse linear combination of elements (also known as atoms) from the dictionary. Finding a sparse representation entails solving a convex optimization problem. Using sparse representation, one can extract semantic information from the signal. For instance, one can sparsely represent a test sample in an overcomplete dictionary whose elements are the training samples themselves, provided that sufficient training samples are available from each class [7]. An interesting property of sparse representations is that they are robust to noise and occlusion. For instance, good performance under partial occlusion, missing data, and background variations has been demonstrated by many sparsity-based methods [7,9]. The ability of sparse representations to extract meaningful information is due in part to the fact that face images belonging to the same person lie on a low-dimensional manifold.

In order to successfully apply sparse representation to face recognition problems, one needs to choose an appropriate dictionary. Rather than using a predetermined dictionary, e.g., wavelets, one can train an overcomplete data-driven dictionary. An appropriately trained data-driven dictionary can simultaneously span the subspace of all faces and support optimal discrimination of the classes. Such dictionaries tend to provide better classification accuracy than a predetermined dictionary [10].

Data-driven dictionaries can produce state-of-the-art results in various face recognition tasks. However, when the target data has a different distribution than the source data, the learned sparse representation may not be optimal. As a result, one needs to adapt these learned representations from one domain to the other. The problem of transferring a representation or classifier from one domain to the other is known as domain adaptation or domain transfer learning [11,12].

In this paper, we summarize some of the recent advances in still- and video-based face recognition using sparse representation and dictionary learning. Discriminative dictionary learning algorithms as well as methods based on weakly supervised learning and domain adaptation are summarized. These examples show that sparsity and dictionary learning are powerful tools for face recognition. Understanding how well these algorithms work can greatly improve our insights into some of the most compelling challenges in still- and video-based face recognition.

A. Organization of the Paper

This paper is organized as follows. In Section 2, we briefly review the idea behind sparse representation and dictionary learning. Section 3 presents some recent sparse representation and dictionary-based face recognition algorithms for still images. Sections 4 and 5 present interesting applications of dictionary-based methods for low-resolution images and joint expression and face recognition, respectively. Video-based face recognition methods are presented in Section 6. Weakly supervised dictionary learning and domain adaptive dictionary learning methods for face recognition are presented in Sections 7 and 8, respectively. Finally, concluding remarks are made in Section 9.

2. BACKGROUND

In recent years, sparse representation and dictionary learning have undergone rapid development, both in theory and in algorithms [13–15]. In this section, we briefly review sparse representation (also known as sparse coding) and dictionary learning.

A. Sparse Coding

Let $D$ be a redundant (overcomplete) dictionary with $N$ atoms in $\mathbb{R}^d$:
$$D = [d_1, \ldots, d_N] \in \mathbb{R}^{d \times N}.$$
The elements of $D$ are normalized to unit Euclidean norm, i.e., $\|d_i\| = 1 \; \forall i$. Given a signal $y_t \in \mathbb{R}^d$, finding the sparsest representation of $y_t$ in $D$ entails solving the following optimization problem:
$$x_t = \arg\min_{x} \|x\|_0 \quad \text{subject to} \quad y_t = Dx, \tag{1}$$
where $\|x\|_0 := \#\{j : x_j \neq 0\}$ is a count of the number of nonzero elements in $x$. Equation (1) is NP-hard and cannot be solved in polynomial time. Hence, approximate solutions are usually sought. For instance, basis pursuit [16] offers the solution via $\ell_1$ minimization as
$$x_t = \arg\min_{x} \|x\|_1 \quad \text{subject to} \quad y_t = Dx, \tag{2}$$
where $\|\cdot\|_p$ for $0 < p < \infty$ is the $\ell_p$ norm defined as
$$\|x\|_p = \left(\sum_{j=1}^{d} |x_j|^p\right)^{\frac{1}{p}}.$$
The sparsest recovery is possible provided that certain conditions are met [15]. One can adapt the above framework to noisy settings, where the measurements are contaminated with an error $n$ obeying $\|n\|_2 < \epsilon$, that is,
$$y_t = Dx + n \quad \text{for} \quad \|n\|_2 < \epsilon. \tag{3}$$
A stable solution can be obtained by solving the following optimization problem [15]:
$$x_t = \arg\min_{x} \|x\|_1 \quad \text{subject to} \quad \|y_t - Dx\|_2 < \epsilon. \tag{4}$$
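To make the $\ell_1$ sparse-coding step concrete, below is a minimal numpy sketch (not the solver used in the works cited here) of the iterative soft-thresholding algorithm (ISTA) applied to the Lagrangian form of Eq. (4), $\min_x \tfrac{1}{2}\|y_t - Dx\|_2^2 + \lambda\|x\|_1$; the dictionary, the signal, and the helper names are illustrative toy choices.

```python
import numpy as np

def soft_threshold(v, tau):
    # Elementwise soft-thresholding operator: prox of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_sparse_code(D, y, lam=0.1, n_iter=200):
    """Approximately solve min_x 0.5*||y - D x||_2^2 + lam*||x||_1 via ISTA."""
    # Step size 1/L, where L upper-bounds the largest eigenvalue of D^T D.
    L = np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)              # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy usage: a random unit-norm dictionary and a signal that is truly 3-sparse.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                # normalize atoms to unit Euclidean norm
x_true = np.zeros(256)
x_true[rng.choice(256, 3, replace=False)] = rng.standard_normal(3)
y = D @ x_true + 0.01 * rng.standard_normal(64)
x_hat = ista_sparse_code(D, y, lam=0.05)
print("nonzeros in estimate:", np.sum(np.abs(x_hat) > 1e-3))
```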

B. Dictionary Learning

Traditionally, the dictionary $D$ in Eq. (1) is predetermined, e.g., wavelets. It has been observed that learning a dictionary directly from the training data rather than using a predetermined dictionary usually leads to a more compact representation and hence can provide improved results in many practical image processing applications such as restoration and classification [14,15].

Several algorithms have been developed for the task of learning a dictionary. Two of the most well-known algorithms are the method of optimal directions (MOD) [17] and the KSVD algorithm [18]. Given a set of examples $Y = [y_1, \ldots, y_n]$, the goal of the KSVD and MOD algorithms is to find a dictionary $D$ and a sparse matrix $X$ that minimize the following representation error:
$$(\hat{D}, \hat{X}) = \arg\min_{D, X} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \|x_i\|_0 \le T_0 \; \forall i, \tag{5}$$
where $x_i$ are the columns of $X$, $\|A\|_F$ denotes the Frobenius norm of $A$, and $T_0$ denotes the sparsity level. Both MOD and KSVD are iterative methods that alternate between sparse-coding and dictionary-update steps. First, a dictionary $D$ with $\ell_2$-normalized columns is initialized. Then the main iteration is composed of the following two stages:
  • Sparse coding: In this step, $D$ is fixed, and the following optimization problem is solved to compute the representation vector $x_i$ for each example $y_i$:
    $$\min_{x_i} \|y_i - D x_i\|_2^2 \quad \text{s.t.} \quad \|x_i\|_0 \le T_0, \qquad i = 1, \ldots, n.$$
  • Dictionary update: This is where the MOD and KSVD algorithms differ. The MOD algorithm updates all the atoms simultaneously by solving an optimization problem whose solution is given by $D = Y X^{\dagger}$, where $X^{\dagger}$ denotes the Moore–Penrose pseudo-inverse of $X$. Even though the MOD algorithm is very effective and usually converges in a few iterations, it suffers from the high complexity of the matrix inversion, as discussed in [18].

In the case of KSVD, the dictionary update is performed atom-by-atom in a computationally efficient way rather than using a matrix inversion. It has been observed that the KSVD algorithm requires fewer iterations to converge than the MOD method.
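As an illustration of the MOD iteration just described, the following sketch (a simplified stand-in for the methods in [17,18], with toy dimensions and hypothetical helper names) alternates a greedy orthogonal matching pursuit (OMP) sparse-coding step with the pseudo-inverse update $D = Y X^{\dagger}$.

```python
import numpy as np

def omp(D, y, t0):
    """Greedy OMP: pick at most t0 atoms that best explain y."""
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(t0):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x

def mod_dictionary_learning(Y, n_atoms, t0, n_iter=20):
    """Alternate OMP sparse coding with the MOD update D = Y X^+ (Eq. 5)."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = np.column_stack([omp(D, y, t0) for y in Y.T])   # sparse-coding step
        D = Y @ np.linalg.pinv(X)                            # MOD dictionary update
        D /= np.linalg.norm(D, axis=0) + 1e-12               # renormalize atoms
    return D, X

# Toy usage: 500 signals in R^20 approximated with a 40-atom dictionary, T0 = 3.
Y = np.random.default_rng(1).standard_normal((20, 500))
D, X = mod_dictionary_learning(Y, n_atoms=40, t0=3)
print("representation error:", np.linalg.norm(Y - D @ X))
```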

Dictionaries can be trained for reconstruction and classification applications. In the late 1990s, Etemad and Chellappa proposed a linear discriminant analysis (LDA) based basis selection and feature extraction algorithm for classification using wavelet packets [19]. Recently, similar algorithms for simultaneous sparse signal representation and discrimination have also been proposed [20–24].

Kernel-based nonlinear sparse coding and dictionary learning methods have also been proposed in the literature [25–27]. These methods essentially map the input data onto a high-dimensional feature space using a predetermined kernel function. Sparse codes and dictionaries are then trained on the feature space for better representation and discrimination. Additional techniques for discriminative and kernel-based dictionary learning may be found within these references.

3. DICTIONARY-BASED FACE RECOGNITION FROM STILL IMAGES

Equipped with the above notation on sparse coding and dictionary learning, we present some sparsity and dictionary-based face recognition algorithms for dealing with still images.

A. Sparse Representation-based Classification [7,9]

Suppose that we are given $C$ distinct faces (or classes) and a set of $N_c$ training images per class. We identify an $l \times p$ grayscale image with a $d$-dimensional vector, which can be obtained by stacking its columns. Let $Y_c = [y_1^c, \ldots, y_{N_c}^c] \in \mathbb{R}^{d \times N_c}$ be the matrix of training images from the $c$th class. Define a new matrix $Y$ as the concatenation of training samples from all the classes:
$$Y = [Y_1, \ldots, Y_C] \in \mathbb{R}^{d \times N} = [y_1^1, \ldots, y_{N_1}^1 \,|\, y_1^2, \ldots, y_{N_2}^2 \,|\, \cdots \,|\, y_1^C, \ldots, y_{N_C}^C] \triangleq [y_1, y_2, \ldots, y_N],$$
where $N = \sum_c N_c$. We consider an observation vector $y_t \in \mathbb{R}^d$ of unknown class as a linear combination of the training vectors,
$$y_t = \sum_{c=1}^{C} \sum_{i=1}^{N_c} x_i^c y_i^c, \tag{6}$$
with coefficients $x_i^c \in \mathbb{R}$. Equation (6) can be written more compactly as
$$y_t = Yx, \tag{7}$$
where
$$x = [x_1^1, \ldots, x_{N_1}^1 \,|\, x_1^2, \ldots, x_{N_2}^2 \,|\, \cdots \,|\, x_1^C, \ldots, x_{N_C}^C]^T \triangleq [x_1, x_2, \ldots, x_N]^T \tag{8}$$
and $(\cdot)^T$ denotes transposition. One can assume that, given sufficient training samples of the $c$th class, $Y_c$, any new test image $y_t \in \mathbb{R}^d$ that belongs to the same class will approximately lie in the linear span of the training samples from class $c$. This implies that most of the coefficients not associated with class $c$ in Eq. (8) will be close to zero. As a result, assuming that observations are noisy, one can recover this sparse vector by solving the following optimization problem:
$$x_t = \arg\min_{x} \|x\|_1 \quad \text{subject to} \quad \|y_t - Yx\|_2 \le \epsilon, \tag{9}$$
or equivalently,
$$x_t = \arg\min_{x} \|y_t - Yx\|_2 + \lambda \|x\|_1, \tag{10}$$
where $\lambda$ is a parameter. The sparse code $x_t$ can then be used to determine the class of $y_t$ by computing the following error for each class:
$$e_c = \|y_t - Y_c x_t^c\|_2, \tag{11}$$
where $x_t^c$ is the part of the coefficient vector $x_t$ that corresponds to $Y_c$. Finally, the class $c^*$ associated with the test sample $y_t$ is declared to be the one that produces the smallest approximation error:
$$c^* = \text{class of } y_t = \arg\min_{c} e_c. \tag{12}$$
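The SRC decision rule of Eqs. (9)–(12) can be sketched in a few lines. The example below is a simple illustration on synthetic class data (not real face images) that uses scikit-learn's Lasso for the $\ell_1$-regularized step; its alpha parameter plays the role of $\lambda$ only up to a scaling convention, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(Y_blocks, y_t, lam=0.01):
    """SRC: sparse-code y_t over all training samples, pick class with smallest residual."""
    Y = np.hstack(Y_blocks)                            # concatenated training matrix Y
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(Y, y_t)                                  # l1-regularized least squares, cf. Eq. (10)
    x_t = lasso.coef_
    errors, start = [], 0
    for Y_c in Y_blocks:                               # per-class residual e_c = ||y_t - Y_c x_t^c||_2
        n_c = Y_c.shape[1]
        x_c = x_t[start:start + n_c]
        errors.append(np.linalg.norm(y_t - Y_c @ x_c))
        start += n_c
    return int(np.argmin(errors)), errors

# Toy usage: 3 classes, each a noisy 1D subspace in R^50, 10 training samples per class.
rng = np.random.default_rng(0)
bases = [rng.standard_normal(50) for _ in range(3)]
Y_blocks = [np.column_stack([b * rng.uniform(0.5, 1.5) + 0.05 * rng.standard_normal(50)
                             for _ in range(10)]) for b in bases]
y_t = bases[1] * 0.9 + 0.05 * rng.standard_normal(50)
label, errs = src_classify(Y_blocks, y_t)
print("predicted class:", label)
```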

B. Dictionary-Learning-based Methods for Face Recognition

There are a number of challenges that one has to overcome when designing face recognition systems based on sparse representation. One is designing algorithms that are robust to changes in illumination and pose; a second is that algorithms need to efficiently scale as the number of people enrolled in the system increases. In sparse representation-based classification (SRC), the challenges mentioned above are met by collecting a set of images of each person that spans the space of expected variations in illumination and pose. The SRC approach recognizes faces by solving an optimization problem over the set of images enrolled into the database. This solution trades robustness and size of the database against computational efficiency.

In [28], a dictionary-based algorithm was proposed to perform face recognition across varying illumination and pose by learning small-sized face-specific dictionaries. This method consists of two main stages. In the first stage, given training samples from each face, face-specific dictionaries are trained with some fixed number of atoms. In the second stage, a novel test face image is projected onto the span of the atoms in each learned face-specific dictionary. The residual vectors are then used for classification. Furthermore, an image relighting approach based on the Lambertian reflectance model for the surface of a face is integrated within this framework to deal with the illumination problem [29].

In [28], rather than finding a sparse representation based on training samples, $C$ face-specific dictionaries are trained by solving the following optimization problem:
$$(\hat{D}_i, \hat{X}_i) = \arg\min_{D_i, X_i} \|Y_i - D_i X_i\|_F^2 \quad \text{s.t.} \quad \|x_j\|_0 \le T_0 \; \forall j, \tag{13}$$
for $i = 1, \ldots, C$. The above optimization problem can be solved by the KSVD algorithm as discussed earlier. Given a test sample $y_t$, it is projected onto the span of the atoms in each $D_i$ using the orthogonal projector
$$P_i = D_i (D_i^T D_i)^{-1} D_i^T. \tag{14}$$
The approximation and residual vectors can then be calculated as
$$\hat{y}_i = P_i y_t = D_i \alpha_i \tag{15}$$
and
$$r_i(y_t) = y_t - \hat{y}_i = (I - P_i) y_t, \tag{16}$$
respectively, where $I$ is the identity matrix and
$$\alpha_i = (D_i^T D_i)^{-1} D_i^T y_t \tag{17}$$
are the coefficients. Since the KSVD algorithm finds the dictionary $D_i$ that leads to the best representation for each example in $Y_i$, one expects $\|r_i(y_t)\|_2$ to be small if $y_t$ belongs to the $i$th class and large for the other classes. Based on this, we can classify $y_t$ by assigning it to the class $d \in \{1, \ldots, C\}$ that gives the lowest reconstruction error $\|r_i(y_t)\|_2$:
$$d = \text{identity}(y_t) = \arg\min_{i} \|r_i(y_t)\|_2. \tag{18}$$
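The classification rule of Eqs. (14)–(18) reduces to projecting the test sample onto the span of each face-specific dictionary and picking the smallest residual; a minimal numpy sketch follows, assuming the dictionaries $D_i$ have already been learned (e.g., with KSVD or the MOD sketch above) and using random subspaces as stand-ins.

```python
import numpy as np

def dictionary_residual_classify(dictionaries, y_t):
    """Assign y_t to the class whose dictionary span approximates it best (Eqs. 14-18)."""
    residuals = []
    for D_i in dictionaries:
        # Orthogonal projection of y_t onto span(D_i) via least squares:
        # alpha_i = (D_i^T D_i)^{-1} D_i^T y_t
        alpha_i, *_ = np.linalg.lstsq(D_i, y_t, rcond=None)
        r_i = y_t - D_i @ alpha_i                      # r_i(y_t) = (I - P_i) y_t
        residuals.append(np.linalg.norm(r_i))
    return int(np.argmin(residuals)), residuals

# Toy usage: two classes whose dictionaries span different random 5D subspaces of R^100.
rng = np.random.default_rng(2)
dictionaries = [rng.standard_normal((100, 5)) for _ in range(2)]
y_t = dictionaries[0] @ rng.standard_normal(5) + 0.01 * rng.standard_normal(100)
label, res = dictionary_residual_classify(dictionaries, y_t)
print("assigned class:", label, "residuals:", np.round(res, 3))
```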

Images of the same person can vary significantly due to variations in illumination conditions. To deal with this problem, a relighting method was introduced in [29] so that the training samples capture the illumination conditions that might occur in the test sample. Assuming the Lambertian reflectance model for the facial surface, one can relate the surface normals, albedo, and the intensity image by an image formation model. The diffuse component of the surface reflection is given by
$$y_{i,j} = \rho_{i,j} \max(n_{i,j}^T s, 0), \tag{19}$$
where $y_{i,j}$ is the pixel intensity at position $(i,j)$, $s$ is the light source direction, $\rho_{i,j}$ is the surface albedo at position $(i,j)$, $n_{i,j}$ is the surface normal of the corresponding surface point, and $1 \le i \le l$, $1 \le j \le q$. The max function in Eq. (19) accounts for the formation of attached shadows. Neglecting the attached shadows, Eq. (19) can be linearized as
$$y_{i,j} = \rho_{i,j} \max(n_{i,j}^T s, 0) \approx \rho_{i,j} n_{i,j}^T s. \tag{20}$$
Let $n_{i,j}^{(0)}$ and $s^{(0)}$ be the initial values of the surface normal and illumination direction. These initial values can be domain-dependent average values. The Lambertian assumption imposes the following constraint on the initial albedo:
$$\rho_{i,j}^{(0)} = \frac{y_{i,j}}{n_{i,j}^{(0)} \cdot s^{(0)}}, \tag{21}$$
where $\cdot$ denotes the standard dot product operation. Using Eq. (20), Eq. (21) can be rewritten as
$$\rho_{i,j}^{(0)} = \rho_{i,j} \frac{n_{i,j} \cdot s}{n_{i,j}^{(0)} \cdot s^{(0)}} = \rho_{i,j} + \frac{n_{i,j} \cdot s - n_{i,j}^{(0)} \cdot s^{(0)}}{n_{i,j}^{(0)} \cdot s^{(0)}} \rho_{i,j} = \rho_{i,j} + \omega_{i,j}, \tag{22}$$
where
$$\omega_{i,j} = \frac{n_{i,j} \cdot s - n_{i,j}^{(0)} \cdot s^{(0)}}{n_{i,j}^{(0)} \cdot s^{(0)}} \rho_{i,j}. \tag{23}$$
This can be viewed as a signal estimation problem in which $\rho$ is the original signal, $\rho^{(0)}$ is the degraded signal, and $\omega$ is signal-dependent noise. Using this model, the albedo can be estimated using the minimum mean squared error criterion [29]. Then, using the estimated albedo map, one can generate new images for a given light source direction using the image formation model in Eq. (19). This can be done by combining the estimated albedo map and light source direction with the average facial information [30]. Figure 1 shows some relighted images and the corresponding input images.

Fig. 1. Examples of the original images (first column) and the corresponding relighted images with different light source directions from the PIE data set [31].
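The relighting idea of Eqs. (19)–(21) can be sketched as follows: estimate an initial albedo map by dividing the image by the shading predicted from an average normal map and an assumed light direction, and then re-render with a new light source. The normal map and light directions below are placeholders, and the MMSE albedo estimation of [29] is omitted.

```python
import numpy as np

def initial_albedo(image, avg_normals, s0, eps=1e-6):
    """Eq. (21): rho0 = y / (n0 . s0), using average face normals as the initial estimate."""
    shading = avg_normals @ s0                         # (H, W) map of n0 . s0
    return image / np.clip(shading, eps, None)

def relight(albedo, avg_normals, s_new):
    """Eq. (19): y = rho * max(n . s, 0) with a new light source direction."""
    shading = np.clip(avg_normals @ s_new, 0.0, None)  # attached-shadow clamp
    return albedo * shading

# Toy usage: a flat 8x8 "face" with placeholder normals pointing toward the camera.
H = W = 8
avg_normals = np.zeros((H, W, 3))
avg_normals[..., 2] = 1.0                              # placeholder average normal map
s0 = np.array([0.0, 0.0, 1.0])                         # assumed frontal light for Eq. (21)
image = 0.8 * np.ones((H, W))                          # observed intensities
rho0 = initial_albedo(image, avg_normals, s0)
s_new = np.array([0.3, 0.0, 0.95])
relit = relight(rho0, avg_normals, s_new / np.linalg.norm(s_new))
print("relit intensity range:", relit.min(), relit.max())
```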

The method presented above can be generalized to handle pose variations [32]. Let $\bar{n}_{i,j}$, $\bar{s}$, and $\bar{\Theta}$ be initial estimates of the surface normals, the illumination direction, and the pose, respectively, and let $\bar{n}_{i,j}^{\bar{\Theta}}$ denote the initial estimate of the surface normals in pose $\bar{\Theta}$. Then the initial albedo at pixel $(i,j)$ can be obtained by
$$\bar{\rho}_{i,j} = \frac{y_{i,j}}{\bar{n}_{i,j}^{\bar{\Theta}} \cdot \bar{s}}. \tag{24}$$
Using this model, we can reformulate the problem of recovering the albedo as a signal estimation problem. Using arguments similar to Eq. (21), we get the following formulation for the albedo estimation problem in the presence of pose:
$$\bar{\rho}_{i,j} = \rho_{i,j} h_{i,j} + \omega_{i,j},$$
where
$$\omega_{i,j} = \frac{n_{i,j}^{\Theta} \cdot s - n_{i,j}^{\Theta} \cdot \bar{s}}{\bar{n}_{i,j}^{\bar{\Theta}} \cdot \bar{s}} \rho_{i,j}, \tag{25}$$
$$h_{i,j} = \frac{n_{i,j}^{\Theta} \cdot \bar{s}}{\bar{n}_{i,j}^{\bar{\Theta}} \cdot \bar{s}}, \tag{26}$$
$\rho_{i,j}$ is the true albedo, and $\bar{\rho}_{i,j}$ is the degraded albedo. When the pose is known accurately, $\bar{\Theta} = \Theta$ and $h_{i,j} = 1$; hence, this can be viewed as a generalization of Eq. (22) to the case of unknown pose. Using this model, a stochastic filtering framework was presented in [32] to estimate the albedo from a single nonfrontal face image. Once pose and illumination have been normalized, one can use the relighting method described above to generate multiple frontal images with different lighting to achieve illumination- and pose-robust recognition [28]. Figure 2 shows some examples of pose-normalized images obtained using this method. One can also use the Radon-based dictionary learning method presented in [33] to deal with in-plane rotated face images.

Fig. 2. Pose-robust albedo estimation. Left column: original input images. Middle column: recovered albedo maps corresponding to frontal face images. Right column: pose-normalized relighted images [28].

C. Discriminative Dictionary-based Face Recognition

Given a data matrix $Y$, the general cost function for learning a dictionary takes the following form:
$$\min_{D,X} \|Y - DX\|_F^2 + \lambda \Psi(X), \tag{27}$$
where $\lambda$ is a parameter, and the columns of $Y$, $D$, and $X$ contain the training signals, the dictionary atoms, and their coefficients, respectively. While such approaches are purely generative, the design of supervised discriminative dictionaries has also gained a lot of traction in recent years [10]. The design of such dictionaries entails modifying the function $\Psi(X)$ in Eq. (27) so that not only sparsity is enforced but discrimination is also maintained. This is often done by introducing an LDA-type discrimination term on the sparse coefficients, which essentially enforces separability among dictionary atoms of different classes [21–24]. Manipulating $\Psi(X)$ so that it enforces group sparsity can also lead to the design of hierarchical dictionaries.
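As a concrete, hypothetical example of such a $\Psi(X)$, one can add a Fisher/LDA-style term on the sparse coefficients (small within-class scatter, large between-class scatter) to an $\ell_1$ penalty. The sketch below only evaluates this cost for given coefficients and labels; the joint optimization over $D$ and $X$ is left to solvers such as those in [20–24], and all names are illustrative.

```python
import numpy as np

def discriminative_psi(X, labels, eta=1.0):
    """Psi(X) = ||X||_1 + eta * (within-class scatter - between-class scatter) of X's columns."""
    l1 = np.sum(np.abs(X))
    mu = X.mean(axis=1, keepdims=True)                 # global mean coefficient vector
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        within += np.sum((Xc - mu_c) ** 2)             # trace of within-class scatter S_W
        between += Xc.shape[1] * np.sum((mu_c - mu) ** 2)   # trace of between-class scatter S_B
    return l1 + eta * (within - between)

def dictionary_cost(Y, D, X, labels, lam=0.1, eta=1.0):
    """Eq. (27) with a discriminative Psi: ||Y - DX||_F^2 + lam * Psi(X)."""
    return np.linalg.norm(Y - D @ X) ** 2 + lam * discriminative_psi(X, labels, eta)

# Toy usage with random data just to exercise the cost.
rng = np.random.default_rng(0)
Y, D, X = rng.standard_normal((30, 40)), rng.standard_normal((30, 60)), rng.standard_normal((60, 40))
labels = np.repeat([0, 1], 20)
print("cost:", dictionary_cost(Y, D, X, labels))
```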

The results obtained by different sparsity-based and dictionary-based face recognition algorithms on the extended YaleB dataset [34] are compared in Table 1. The extended YaleB dataset contains 2414 frontal face images of 38 individuals. This dataset is challenging due to varying illumination conditions and expressions. The experimental setup is adapted from [21]. Note that SRC uses the training samples themselves as the dictionary. For a fair comparison, one needs to constrain the total number of training samples used in SRC based on the dictionary size in the other compared methods. For completeness, we have also included results on SRC with all training samples, denoted SRC*.

Table 1. Identification Rate (%) on the Extended YaleB Face Dataset [24]


As can be seen from this table, both generative and discriminative dictionary-based methods produce comparable results on this dataset.

4. SYNTHESIS-BASED LOW RESOLUTION FACE RECOGNITION

Fig. 3. Typical image in remote face recognition.

In practical scenarios, the resolution change is also coupled with other variations due to pose, illumination, and expression. Algorithms specifically designed to deal with low-resolution images quite often fail in the presence of these variations. Hence, it is essential to account for these parameters while designing a robust method for low-resolution face recognition. In [36], a generative dictionary-based approach was proposed for low-resolution face recognition, based on learning class-specific dictionaries, that is also robust to illumination variations.

Fig. 4. Overview of the dictionary-based low-resolution face recognition [36].

The results in Fig. 5 demonstrate that the dictionary-based SLRFR method gives better performance than the existing low-resolution face-recognition methods. The CLPM algorithm [38] performs close to the SLRFR method at 7×6 and 10×8 resolutions, but its performance decreases at 12×10, which shows that the method is not stable over different resolutions. The SLRFR method, however, gives a consistent performance over all the resolutions.

Fig. 5. Recognition rates for FRGC data with probes at low resolutions [36].

5. DICTIONARY-BASED JOINT FACE AND EXPRESSION RECOGNITION

Most existing methods for the recognition of faces and expressions consider either the expression-invariant face recognition problem or the identity-independent facial expression recognition problem. A joint face and facial expression recognition method using a dictionary-based component separation (DCS) algorithm was recently proposed in [39]. In this approach, the given expressive face is viewed as a superposition of a neutral face component and a facial expression component, which is sparse with respect to the whole image (see Fig. 6). This assumption leads to a DCS algorithm, which benefits from the ideas of sparsity and morphological diversity. This entails building data-driven dictionaries for the neutral and expressive components. The DCS algorithm then uses these dictionaries to decompose an expressive test face into its constituent components. The sparse codes obtained as a result of this decomposition are then used for joint face and expression recognition. In what follows, we briefly review this method for facial component separation-based recognition.

Fig. 6. Facial component separation. The original face image (a) is viewed as the superposition of a neutral component (b) with a component containing the expression (c) [39].

A face image $y$ containing an expression can be viewed as a superposition of a neutral face component $y_n$ with a facial expression component $y_e$. In other words,
$$y = y_n + y_e. \tag{28}$$
We assume that $y_n$ is sparse in a dictionary $D_n$ and, similarly, that $y_e$ is sparse in a dictionary $D_e$. Given $M_n, M_e \ge N$, the dictionaries $D_n \in \mathbb{R}^{N \times M_n}$ and $D_e \in \mathbb{R}^{N \times M_e}$ are chosen such that they provide sparse representations of the neutral and expression contents, respectively. That is, we assume there are coefficient vectors $x_n \in \mathbb{R}^{M_n \times 1}$ and $x_e \in \mathbb{R}^{M_e \times 1}$ such that $y_n = D_n x_n$ and $y_e = D_e x_e$.

One can recover the face image $y$ by estimating the components $y_n$ and $y_e$ via $x_n$ and $x_e$, solving the following optimization problem:
$$\hat{x}_n, \hat{x}_e = \arg\min_{x_n, x_e} \lambda \|x_n\|_1 + \lambda \|x_e\|_1 + \frac{1}{2} \|y - D_n x_n - D_e x_e\|_2^2. \tag{29}$$
The two components can then be obtained as $\hat{y}_n = D_n \hat{x}_n$ and $\hat{y}_e = D_e \hat{x}_e$. Figure 6 shows an example of this separation in the testing part of the algorithm.
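Equation (29) is a standard $\ell_1$ problem over the stacked dictionary $[D_n \; D_e]$, so any sparse-coding solver can be used. Below is a brief sketch using scikit-learn's Lasso, with random unit-norm placeholder dictionaries standing in for trained neutral and expression dictionaries; the alpha parameter corresponds to $\lambda$ only up to a scaling convention.

```python
import numpy as np
from sklearn.linear_model import Lasso

def separate_components(D_n, D_e, y, lam=0.01):
    """Solve Eq. (29) over [D_n D_e] and return the neutral and expression components."""
    D = np.hstack([D_n, D_e])
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(D, y)
    x = lasso.coef_
    x_n, x_e = x[:D_n.shape[1]], x[D_n.shape[1]:]
    return D_n @ x_n, D_e @ x_e                        # y_n_hat, y_e_hat

# Toy usage with random unit-norm dictionaries standing in for trained D_n and D_e.
rng = np.random.default_rng(0)
D_n = rng.standard_normal((100, 150)); D_n /= np.linalg.norm(D_n, axis=0)
D_e = rng.standard_normal((100, 150)); D_e /= np.linalg.norm(D_e, axis=0)
y = D_n[:, 0] + 0.5 * D_e[:, 3] + 0.01 * rng.standard_normal(100)
y_n_hat, y_e_hat = separate_components(D_n, D_e, y)
print("reconstruction error:", np.linalg.norm(y - y_n_hat - y_e_hat))
```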

Finding dictionaries that represent the neutral and expression components of faces is critical, as it affects how well the components are separated through successive iterations. Given enough training samples from each class, a low-rank and sparse approximation algorithm [40] is used to decompose these training samples into neutral and expressive components. Once these components are found, component-specific dictionaries are learned to represent the expressive and neutral components well. These dictionaries are then used in Eq. (29) to separate the facial components.

Since expression and subject class labels are known, one can use the SRC algorithm to recognize both the subject label and expression label of a test image by representing these separated components as sparse linear combinations of their corresponding training samples. An overview of this component-based joint face and expression recognition algorithm is shown in Fig. 7.

Fig. 7. Component-based recognition of faces and facial expressions algorithm overview [39].

We highlight some of the results presented in [39] using the CK+ dataset [41]. In the S3 setup, one-expression-out face recognition is performed to evaluate the effect of various expressions on face recognition performance. Figure 8 shows the effects of various expressions on the face recognition results using the S3 setup. As the figure shows, while angry and sad faces are the easiest expressive faces to recognize (since these expressions are more subtle compared to others and so present fewer challenges for face recognition), the surprise face is the most challenging one. See [39] for more results on various face and expression recognition datasets.

Fig. 8. Effects of various expressions on the face recognition results on the CK+ dataset using the S3 setup. Each bar shows the face recognition rate obtained when all the faces with the corresponding expression are held out for testing and the rest are used for training [39].

6. DICTIONARY-BASED FACE RECOGNITION FROM VIDEO

Face recognition research has traditionally concentrated on recognition from still images [1,7,28,42]. With the inclusion of video cameras in mobile devices, face recognition from video is gaining attention. In video-based face recognition, a key challenge is exploiting the extra information available in a video. In addition, different video sequences of the same subject may contain variations in resolution, illumination, pose, and facial expression. These variations contribute to the challenges in designing an effective video-based face-recognition algorithm.

To deal with some of these challenges, a generative approach based on dictionary learning methods, which is robust to changes in illumination and pose, was recently proposed in [43]. One major advantage of this method is that it is robust to some variations in video sequences. Figure 9 shows an overview of this dictionary-based face recognition from video (DFRV) approach.

Fig. 9. DFRV algorithm overview [43].

For each frame in a video sequence, the face regions are first detected and cropped using the Viola–Jones object-detection framework [44]. All the cropped face images are partitioned into K different partitions using a k-means-type clustering algorithm inspired by a video summarization algorithm. By partitioning the original video sequence, one obtains K separate sequences, each containing images with specific pose and/or lighting conditions. To remove the temporal redundancy while capturing variations due to changes in pose and illumination, a dictionary is constructed for each partition. Each dictionary is learned to have the minimum representation error under a sparseness constraint. Thus, K sub-dictionaries are built to represent a video sequence. In the recognition phase, frames from a given query video sequence are projected onto the span of the atoms in every sequence-specific dictionary. From the projections onto the atoms, the residuals are computed and combined to perform recognition or verification.
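A minimal sketch of the partition-then-learn step described above is shown below. It is a simplification of the DFRV pipeline: plain k-means on vectorized face crops followed by a small dictionary per partition, here learned with scikit-learn's DictionaryLearning rather than the summarization-based partitioning and KSVD used in [43]; all data and names are toy placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import DictionaryLearning

def build_sequence_dictionaries(frames, n_partitions=3, n_atoms=5, t0=3):
    """Partition vectorized face frames into K clusters and learn one sub-dictionary per cluster."""
    labels = KMeans(n_clusters=n_partitions, n_init=10, random_state=0).fit_predict(frames)
    dictionaries = []
    for k in range(n_partitions):
        part = frames[labels == k]                     # frames in partition k (rows are samples)
        dl = DictionaryLearning(n_components=min(n_atoms, len(part)),
                                transform_algorithm='omp',
                                transform_n_nonzero_coefs=t0,
                                random_state=0)
        dl.fit(part)
        dictionaries.append(dl.components_.T)          # columns are atoms, as in the text
    return dictionaries

# Toy usage: 60 synthetic "frames" of dimension 64 drawn around three pose/lighting modes.
rng = np.random.default_rng(0)
modes = rng.standard_normal((3, 64))
frames = np.vstack([m + 0.1 * rng.standard_normal((20, 64)) for m in modes])
subdicts = build_sequence_dictionaries(frames)
print("atoms per sub-dictionary:", [D.shape[1] for D in subdicts])
```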

To make the sequence-level decision, $p^*$ is selected such that
$$p^* = \arg\max_{p} \left(\sum_{k=1}^{K} w_k C_{p,k}\right), \tag{33}$$
where $C_{p,k}$ is the total number of votes from partition $k$ for sequence $p$, and $w_k$ is the weight associated with partition $Q_k^{(m)}$. Finally, using the knowledge of the correspondence $m(\cdot)$ between subjects and sequences, the query video sequence $Q^{(m)}$ is assigned to subject $i^* = m(p^*)$.

For verification, given a query video sequence and any gallery video sequence, the goal is to correctly determine whether the two belong to the same subject. The well-known receiver operating characteristic (ROC) curve, which describes the relation between the false acceptance rate (FAR) and the true acceptance rate (TAR), is used to evaluate the performance of verification algorithms. As the TAR increases, so does the FAR. Therefore, one would expect an ideal verification framework to have TARs equal to 1 for all FARs. The ROC curves can be computed given a similarity matrix. The residual between a query $Q^{(m)}$ and a dictionary $D^{(p)}$ is used to fill in the $(m,p)$ entry of the similarity matrix. Denoting the residual by $R(m,p)$, we have
$$R(m,p) = \min_{k \in \{1,2,\ldots,K\}} R_k(m,p), \tag{34}$$
where
$$R_k(m,p) \triangleq \min_{l \in \{1,2,\ldots,n_k\}} \left\|q_{k,l}^{(m)} - D^{(p)} {D^{(p)}}^{\dagger} q_{k,l}^{(m)}\right\|_2. \tag{35}$$
In other words, the minimum residual over all $l \in \{1,2,\ldots,n_k\}$ and all $k \in \{1,2,\ldots,K\}$ is selected as the similarity between the query video sequence $Q^{(m)}$ and dictionary $D^{(p)}$. The DFRV algorithm was extended in [45] using the sparsity-based fusion framework proposed in [46].
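The similarity computation of Eqs. (34) and (35) amounts to a minimum, over partitions and frames, of a projection residual. A small numpy sketch follows, with a single gallery sub-dictionary and the query partitions passed in as plain arrays (a simplification of the full DFRV setup; names and data are illustrative).

```python
import numpy as np

def projection_residual(D, q):
    """||q - D D^+ q||_2: residual of q after projection onto the span of the atoms of D."""
    coeffs, *_ = np.linalg.lstsq(D, q, rcond=None)
    return np.linalg.norm(q - D @ coeffs)

def sequence_similarity(query_partitions, gallery_dict):
    """Eqs. (34)-(35): min over partitions k and frames l of the projection residual."""
    return min(projection_residual(gallery_dict, q)
               for frames in query_partitions          # partitions Q_k^(m)
               for q in frames.T)                      # frames q_{k,l}^(m) (columns)

# Toy usage: one gallery sub-dictionary and a query split into two partitions of 4 frames each.
rng = np.random.default_rng(0)
gallery_dict = rng.standard_normal((64, 6))
query_partitions = [rng.standard_normal((64, 4)) for _ in range(2)]
print("R(m, p) =", sequence_similarity(query_partitions, gallery_dict))
```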

To show the effectiveness of the DFRV method, we summarize several experimental results reported in [43]. The video challenge of the Face and Ocular Challenge Series (FOCS) [47] is designed to match "frontal versus frontal," "frontal versus nonfrontal," and "nonfrontal versus nonfrontal" video sequences. We highlight some of the results on the FOCS UT Dallas video dataset using the DFRV method. The FOCS UT Dallas dataset contains 510 walking (frontal face) and 506 activity (nonfrontal face) video sequences recorded from 295 subjects with frame size 720×480 pixels. The sequences were acquired on different days. In the walking sequences, the subject is initially positioned far away from the video camera, walks toward it with a frontal pose, and finally turns away from the video camera, showing the profile face. In the activity sequences, the subject stands and talks with another person with a nonfrontal face view to the video camera. These sequences contain normal head motions that occur during a conversation, e.g., the head turning up to 90 deg, hand raising, and/or pointing.

Leave-one-out tests are conducted on three subsets of the UT-Dallas walking videos: S2 (189 subjects, 404 videos), S3 (19 subjects, 64 videos), and S4 (six subjects, 25 videos). Table 2 shows the identification results. The DFRV algorithm has the best identification rates among all the compared algorithms.

Table 2. Identification Rates of Leave-One-Out Testing Experiments on the FOCS UT-Dallas Walking Videos; DFRV Method Performs Best [43]


FOCS specifies a verification protocol as well: 1A (walking versus walking), 2A (activity versus walking), and 3A (activity versus activity). In these experiments, 481 walking videos and 477 activity videos are chosen as query videos. The size of the target sets ranges from 109 to 135 video sequences. Figure 10 shows the ROC curves of the verification experiments, where the DFRV algorithm is compared with Wrapped Gaussian Common Pole (WGCP) [48]. In all three experiments, the DFRV algorithm is superior to the WGCP algorithm.

Fig. 10. ROC curves of FOCS experiments on UT-Dallas video [43].

The multiple biometric grand challenge (MBGC) video version 1 dataset (Notre Dame dataset) [51,52] contains 399 walking (frontal-face) and 371 activity (profile-face) video sequences recorded from 146 subjects. Both types of sequences were collected in standard definition (SD) format (720×480 pixels) and high-definition (HD) format (1440×1080 pixels). The 399 walking sequences consist of 201 sequences in SD and 198 in HD. Of the 371 activity video sequences, 185 are in SD and 186 are in HD.

Following the experimental design in [48], a leave-one-out identification experiment on three subsets of the cropped face images from the walking videos was conducted in [43]. These three subsets are S2 (subjects that have at least two video sequences: 144 subjects, 397 videos), S3 (subjects that have at least three video sequences: 55 subjects, 219 videos), and S4 (subjects that have at least four video sequences: 54 subjects, 216 videos). Table 3 lists the percentages of correct identifications for this experiment. The DFRV method outperforms the statistical pattern recognition methods reported in [48,49] and the sparse approximated nearest points (SANP) method [50].

Table 3. Identification Rates of Leave-One-Out Testing Experiments on the MBGC Walking Videos [43]


In the second set of experiments on the MBGC dataset, the videos associated with subjects appearing in at least two videos (i.e., S2) were selected. All these videos were divided into SD and HD videos to conduct "SD versus HD" (SD as probe; HD as gallery) and "HD versus SD" (HD as probe; SD as gallery) experiments. Correct identification rates are shown in Table 4. The DFRV method performed best, outperforming the other methods significantly. See [43] for more details regarding the DFRV method and experimental results on several other datasets.

Table 4. Identification Rates of "SD versus HD" and "HD versus SD" Experiments on the MBGC Walking Video Subset S2


7. DICTIONARY LEARNING FROM AMBIGUOUSLY LABELED DATA

In many practical image and video applications, one has access only to ambiguously labeled data [53,54]. For example, given a picture with multiple faces and a caption specifying who is in the picture, the reader may not know which face goes with which name in the caption. The problem of learning identities from examples, each of which is associated with multiple labels of which only one is correct, is often known as ambiguously labeled learning.

In [53], a dictionary-based learning method was proposed for such ambiguously labeled multiclass classification, where each training sample has multiple labels and only one of them is correct. The dictionary learning problem is solved using an iterative alternating algorithm. At each iteration of the algorithm, two alternating steps are performed: a confidence update and a dictionary update. The confidence of each sample is defined as the probability distribution over its ambiguous labels. The dictionaries are updated using either soft (EM-based) or hard decision rules. Figure 11 shows an overview of this method.

Fig. 11. Dictionary-based face recognition from ambiguously labeled data algorithm overview [53].

Let $\mathcal{L} = \{(x_i, L_i), i = 1, \ldots, N\}$ be the training data. Here, $y_i$ denotes the $i$th training sample, $L_i \subset \{1,2,\ldots,K\}$ the corresponding multiple label set, and $N$ the number of training samples. There are a total of $K$ classes. The true label $z_i$ of the $i$th training sample is in the multilabel set $L_i$. Let $x_i \in \mathbb{R}^d$ denote the lexicographically ordered vector representing the sample $y_i$. For each feature vector $y_i$ and for each class $j$, we define a latent variable $p_{i,j}$, which represents the confidence of $y_i$ belonging to the $j$th class. By definition, we have $\sum_j p_{i,j} = 1$, and
$$p_{i,j} = 0 \;\; \text{if} \; j \notin L_i, \qquad p_{i,j} \in (0,1] \;\; \text{if} \; j \in L_i, \qquad i = 1, \ldots, N. \tag{36}$$
Let $P$ be the confidence matrix with entry $p_{i,j}$ in the $i$th row and $j$th column. Define $C_j$ to be the collection of samples in class $j$ represented as a matrix, and let $C = [C_1, C_2, \ldots, C_K]$ be the concatenation of all samples from different classes. Similarly, let $D_j$ be the dictionary learned from the data in $C_j$ and $D = [D_1, D_2, \ldots, D_K]$ be the concatenation of all dictionaries.

The dictionary learning hard decision (DLHD) approach learns dictionaries directly from the class matrices $\{C_i\}_{i=1}^{K}$, which are determined by making a hard decision on the class label of each sample $y_i$, selecting the class with the maximum $p_{i,c}$ among all $c \in L_i$. One iteration of the algorithm consists of the following stages.

Confidence Update: We use the notation $D^{(t)}$, $P^{(t)}$ to denote the dictionary matrix and confidence matrix, respectively, at the $t$th iteration. Keeping the dictionary $D^{(t)}$ fixed, the confidence of a feature vector belonging to classes outside its label set is fixed to 0 and is not updated. To update the confidence of a sample belonging to classes in its label set, we first observe that a sample $y_i$ that is well represented by the dictionary of class $j$ should have high confidence. In other words, the confidence of a sample $y_i$ belonging to a class $j$ should be inversely related to the reconstruction error that results when $y_i$ is projected onto $D_j$. This can be done by updating the confidence matrix $P^{(t)}$ as follows:
$$p_{i,j}^{(t)} = \frac{\alpha_j^{(t)} \exp\left(-\frac{e_{ij}^{(t)}}{\sigma^2}\right)}{\sum_{c \in L_i} \alpha_c^{(t)} \exp\left(-\frac{e_{ic}^{(t)}}{\sigma^2}\right)}, \tag{37}$$
where $\sigma$ and $\alpha_j^{(t)}$ are parameters and
$$e_{ij}^{(t)} = \left\|y_i - D_j^{(t)} \bar{D}_j^{(t)} y_i\right\|_2 \tag{38}$$
is the reconstruction error when $y_i$ is projected onto $D_j^{(t)}$, $j \in L_i$, and $\bar{D}_j^{(t)} \triangleq \left((D_j^{(t)})^T D_j^{(t)}\right)^{-1} (D_j^{(t)})^T$ is the pseudo-inverse of $D_j^{(t)}$. Equation (37) is derived under the assumption that the likelihood of each sample $y_i$ is a mixture of Gaussian densities, and $\alpha_j^{(t)}$ is a normalized weight associated with the density of label $j$.

Cluster Update: Once the confidence matrix $P^{(t)}$ is updated, we use it to update the class matrices $C^{(t+1)}$. Each training sample $y_i$ is assigned to the class $j_i$ that gives the maximum confidence. That is,
$$j_i = \arg\max_{c \in L_i} p_{i,c}^{(t)}. \tag{39}$$
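The confidence-update and cluster-update steps of Eqs. (37)–(39) are straightforward to sketch. The version below assumes uniform weights $\alpha_j$ (a simplifying assumption) and makes a hard assignment of each sample to its most confident admissible label, as in DLHD; the data and helper names are illustrative.

```python
import numpy as np

def reconstruction_error(D_j, y):
    """Eq. (38): ||y - D_j D_j^+ y||_2, projection residual onto the span of D_j."""
    coeffs, *_ = np.linalg.lstsq(D_j, y, rcond=None)
    return np.linalg.norm(y - D_j @ coeffs)

def confidence_and_assignment(dictionaries, samples, label_sets, sigma=1.0):
    """Eqs. (37) and (39) with uniform alpha_j: softmax of -error/sigma^2 over each label set."""
    n, K = len(samples), len(dictionaries)
    P = np.zeros((n, K))
    assignments = np.zeros(n, dtype=int)
    for i, (y, L_i) in enumerate(zip(samples, label_sets)):
        errs = np.array([reconstruction_error(dictionaries[j], y) for j in L_i])
        weights = np.exp(-errs / sigma**2)
        P[i, list(L_i)] = weights / weights.sum()       # confidences over the ambiguous labels
        assignments[i] = L_i[int(np.argmax(P[i, list(L_i)]))]   # hard decision (DLHD)
    return P, assignments

# Toy usage: two class dictionaries, three ambiguously labeled samples.
rng = np.random.default_rng(0)
dictionaries = [rng.standard_normal((30, 4)) for _ in range(2)]
samples = [dictionaries[0] @ rng.standard_normal(4) for _ in range(2)] + \
          [dictionaries[1] @ rng.standard_normal(4)]
label_sets = [[0, 1], [0, 1], [0, 1]]
P, z = confidence_and_assignment(dictionaries, samples, label_sets)
print("confidences:\n", np.round(P, 2), "\nassignments:", z)
```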

Dictionary Update: The updated class matrices $C^{(t+1)}$ are then used to train class-specific dictionaries. Given a class matrix $C_j^{(t+1)}$, we seek a dictionary $D_j^{(t+1)}$ that provides the sparsest representation for each example feature in this matrix by solving the following optimization problem:
$$(D_j^{(t+1)}, X_j^{(t+1)}) = \arg\min_{D, X} \left\|C_j^{(t+1)} - DX\right\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0 \; \forall i, \tag{40}$$
where $x_i$ represents the $i$th column of $X$ and $C_j^{(t+1)}$ is a matrix whose columns are the feature vectors assigned to the $j$th class at iteration $(t+1)$. The optimization problem in Eq. (40) can be solved using the KSVD algorithm discussed earlier. See [53] for the derivation of dictionary learning using soft decisions (DLSD) as well as the performance of DLSD and DLHD on various face-recognition datasets.

8. DOMAIN ADAPTIVE DICTIONARY LEARNING

When designing dictionaries for face recognition tasks, we are often confronted with situations where the conditions in the training set are different from those present during testing. For example, in the case of face recognition, more than one familiar view may be available for training. Such training faces may be obtained from live or recorded video sequences, where a range of views is observed. However, the test images can contain conditions that are not necessarily present in the training images, such as a face in a different pose. The problem of transforming a dictionary trained in one visual domain to another can be viewed as a problem of domain adaptation [55–57].

Several dictionary-based methods have been proposed in the literature to deal with this domain shift problem in visual recognition. A function learning framework for the task of transforming a dictionary learned in one visual domain to another, while maintaining a domain-invariant sparse representation of a signal, was proposed in [55]. Domain dictionaries are modeled by a linear or nonlinear parametric function. The dictionary function parameters and domain-invariant sparse codes are then jointly learned by solving an optimization problem. In [57], a domain adaptive dictionary learning framework was proposed that generates a set of intermediate dictionaries which smoothly connect the source and target domains. One of the important properties of this approach is that it allows the synthesis of data associated with the intermediate domains while exploiting the discriminative power of generative dictionaries. The intermediate data can then be used to build a classifier for recognition under domain shifts.

In [56], a domain adaptive dictionary learning framework is proposed for learning a single dictionary to optimally represent both source and target data. As the features may not be correlated well in the original space, one can project data from both domains onto a common low-dimensional space while maintaining the manifold structure of the data. Learning the dictionary in a low-dimensional space makes the algorithm faster, and irrelevant information in the original features can be discarded. Moreover, jointly learning the dictionary and the projections ensures that the common internal structure of the data in both domains is extracted, which can be represented well by sparse linear combinations of dictionary atoms.

In what follows, we briefly review the generalized domain adaptive dictionary learning framework proposed in [56]. An overview of this method is shown in Fig. 12.

Fig. 12. Overview of domain adaptive latent space dictionary learning framework [56].

The classical dictionary learning approach minimizes the representation error of a given set of data samples subject to the sparsity constraint in Eq. (5). Now consider a special case in which we have data from two domains, $Y_1 \in \mathbb{R}^{n_1 \times N_1}$ and $Y_2 \in \mathbb{R}^{n_2 \times N_2}$. We wish to learn a shared $K$-atom dictionary $D \in \mathbb{R}^{n \times K}$ and mappings $W_1 \in \mathbb{R}^{n \times n_1}$, $W_2 \in \mathbb{R}^{n \times n_2}$ onto a common low-dimensional space that minimize the representation error in the projected space. Formally, we wish to minimize the cost function
$$C_1(D, W_1, W_2, X_1, X_2) = \|W_1 Y_1 - D X_1\|_F^2 + \|W_2 Y_2 - D X_2\|_F^2,$$
subject to sparsity constraints on $X_1$ and $X_2$. We further assume that the rows of the projection matrices $W_1$ and $W_2$ are orthogonal and normalized to unit norm. This prevents the solution from becoming degenerate, leads to an efficient optimization scheme, and makes kernelization of the algorithm possible.

To ensure that the projections do not discard too much of the information available in the original domains, a PCA-like regularization term that preserves energy in the original signals is added:
$$C_2(W_1, W_2) = \|Y_1 - W_1^T W_1 Y_1\|_F^2 + \|Y_2 - W_2^T W_2 Y_2\|_F^2.$$
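As a concrete reading of these two costs, the short NumPy sketch below evaluates $C_1$ and $C_2$ for given data matrices, projections, dictionary, and sparse codes (the function and variable names are placeholders, not taken from [56]).

```python
import numpy as np

def cost_c1(D, W1, W2, Y1, Y2, X1, X2):
    """Representation error in the shared latent space:
    C1 = ||W1 Y1 - D X1||_F^2 + ||W2 Y2 - D X2||_F^2."""
    return (np.linalg.norm(W1 @ Y1 - D @ X1, "fro") ** 2
            + np.linalg.norm(W2 @ Y2 - D @ X2, "fro") ** 2)

def cost_c2(W1, W2, Y1, Y2):
    """PCA-like regularizer: energy of each signal lost by the projection,
    C2 = ||Y1 - W1^T W1 Y1||_F^2 + ||Y2 - W2^T W2 Y2||_F^2."""
    return (np.linalg.norm(Y1 - W1.T @ W1 @ Y1, "fro") ** 2
            + np.linalg.norm(Y2 - W2.T @ W2 @ Y2, "fro") ** 2)
```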
After some algebraic manipulation, and ignoring terms that are constant in $Y$, the costs $C_1$ and $C_2$ can be written as
$$C_1(D, \tilde{W}, \tilde{X}) = \|\tilde{W}\tilde{Y} - D\tilde{X}\|_F^2, \qquad (41)$$
$$C_2(\tilde{W}) = -\operatorname{trace}\big((\tilde{W}\tilde{Y})(\tilde{W}\tilde{Y})^T\big), \qquad (42)$$
where
$$\tilde{W} = [W_1 \;\; W_2], \quad \tilde{Y} = \begin{pmatrix} Y_1 & 0 \\ 0 & Y_2 \end{pmatrix}, \quad \text{and} \quad \tilde{X} = [X_1 \;\; X_2].$$
Hence, the overall optimization is
$$\{D^{*}, \tilde{W}^{*}, \tilde{X}^{*}\} = \arg\min_{D, \tilde{W}, \tilde{X}} \; C_1(D, \tilde{W}, \tilde{X}) + \lambda C_2(\tilde{W}), \quad \text{s.t. } W_i W_i^T = I, \; i = 1, 2, \;\; \text{and } \|\tilde{x}_j\|_0 \le T_0 \;\; \forall j, \qquad (43)$$
where $\lambda$ is a positive constant and $\tilde{x}_j$ denotes the $j$th column of $\tilde{X}$. See [56] for details on the optimization of this problem.
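As a rough illustration of how a problem of the form of Eq. (43) can be approached, the sketch below holds the projections fixed at given orthonormal-row matrices and alternates OMP sparse coding with a least-squares (MOD-style) dictionary update on the projected, stacked data. This is only a simplified stand-in under those assumptions: the actual solver in [56] also updates the projections, and scikit-learn's `orthogonal_mp` is used here merely as a convenient sparse coder.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp  # OMP sparse coder

def latent_dictionary_sketch(Y1, Y2, W1, W2, K=64, T0=3, n_iter=10, seed=0):
    """Simplified alternating scheme for Eq. (43) with W1, W2 held fixed.

    Assumes W1, W2 have orthonormal rows (W_i W_i^T = I). Only the shared
    dictionary D and the sparse codes are updated; [56] also optimizes the
    projections, which this sketch omits.
    """
    rng = np.random.default_rng(seed)
    Z = np.hstack([W1 @ Y1, W2 @ Y2])              # projected stacked data, ~ W~ Y~
    D = rng.standard_normal((Z.shape[0], K))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    for _ in range(n_iter):
        X = orthogonal_mp(D, Z, n_nonzero_coefs=T0)   # sparse coding, ||x_j||_0 <= T0
        D = Z @ np.linalg.pinv(X)                     # MOD-style least-squares update
        D /= np.linalg.norm(D, axis=0) + 1e-12        # renormalize atoms
    X = orthogonal_mp(D, Z, n_nonzero_coefs=T0)       # codes for the final dictionary
    return D, X
```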

To show the effectiveness of this method, a pose-alignment experiment was performed in [56] using the CMU Multi-PIE dataset [31], a comprehensive face dataset of 337 subjects with images taken across 15 poses, 20 illuminations, six expressions, and four sessions. For this experiment, the 129 subjects common to Sessions 1 and 2 were used. The experiment was conducted on five poses, ranging from frontal to 75°. Frontal faces were taken as the source domain, while the off-frontal poses were taken as the target domains. Dictionaries were trained per subject using illuminations {1,4,7,12,17} from the source and target poses in Session 1. All the illumination images of the target pose from Session 2 were taken as probe images.

Pose alignment is challenging due to the highly nonlinear changes induced by 3D rotation of the face. Images at the extreme pose of 60° were taken as the target pose. First, a shared discriminative dictionary was learned. A given probe image was then projected onto the latent subspace and reconstructed using the dictionary, and the reconstruction was back-projected onto the source pose domain to give the aligned image. Figure 13 shows the synthesized images for various conditions. The best alignment is achieved when K is equal to 5. It can be seen from rows 2 and 3 that the dictionary-based method is robust even at high levels of noise and missing pixels; moreover, denoised and inpainted synthesized images are produced, as shown in rows 2 and 3 of Fig. 13, respectively. This experiment clearly demonstrates the effectiveness of the domain adaptive dictionary learning method for pose alignment [56].
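A minimal sketch of the synthesis step just described, assuming learned projections `W_src` for the frontal (source) domain and `W_tgt` for the probe pose, together with the shared dictionary `D` (all names are placeholders):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def align_pose(y_probe, D, W_tgt, W_src, T0=5):
    """Project a probe onto the latent space, reconstruct it with the shared
    dictionary, and back-project into the source (frontal) domain.

    Since W_src is assumed to have orthonormal rows, its transpose serves as
    the back-projection.
    """
    z = W_tgt @ y_probe                              # latent-space representation
    x = orthogonal_mp(D, z, n_nonzero_coefs=T0)      # sparse code on shared dictionary
    return W_src.T @ (D @ x)                         # pose-aligned (frontal) synthesis
```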

Fig. 13. Examples of pose-aligned images. Synthesis under various conditions demonstrates the robustness of the domain adaptive dictionary learning method [56].

9. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper, we reviewed recent approaches to face recognition based on the theories of sparse representation and dictionary learning. In particular, dictionary learning algorithms for weakly supervised learning and domain adaptation for still- and video-based face recognition were reviewed. Although the main emphasis was on face biometrics, these methods can offer compelling solutions for other biometrics, such as gait, iris [9], speech, palmprint, and fingerprint, as well as for multibiometric fusion [46].

Although significant progress has been made, a number of challenges and issues confront face recognition based on sparse representation and dictionary learning. Below we list a few.
  • Extraction of salient features is one of the main problems in face biometrics. Can we develop a nonlinear dictionary-based framework that provides features invariant to some of the nonlinearities present in face biometrics (e.g., pose variations)? Reference [58] presents view-dependent dictionaries using the notion of view saliency. Can similar geometric and physics-based methods be developed to deal with some of the other challenges in face biometrics, such as blur, low resolution, and occlusion?
  • The frames of a video can be modeled as a linear dynamical system (LDS). To capture a moving face by exploiting the sparse properties of the face and the temporal correlation of frames, one can develop an adaptive dictionary learning framework based on an LDS. Can an LDS-based dictionary learning framework provide a better solution to some of the challenges in video-based face recognition?
  • Most of the work on sparse representation and dictionary learning so far has been carried out in the context of synthesis models [59]. Recently, a growing body of work has proposed using an analysis model for learning sparse representations. Can an analysis (cosparse) model [59] provide a better understanding of some of the issues in biometric recognition using dictionary learning?
  • Most dictionary learning algorithms solve optimization problems containing nonsmooth objective terms such as the ℓ1 norm. Efficient convex optimization algorithms have been developed to solve these problems. However, it has been shown that nonconvex optimization can dramatically outperform convex methods in theory and in practice, and such methods can be implemented very efficiently. Extending some of the methods presented in this paper using nonconvex optimization is an interesting avenue for future research.
  • Uniqueness and stability of the learned kernel dictionary have not been established. More theoretical work is needed in the area of nonlinear dictionary learning.

ACKNOWLEDGMENTS

This paper was partially supported by an ONR MURI grant N00014-08-1-0238 and an ARO MURI grant W911NF-09-1-0383. The work of VMP was partially supported by an ONR grant N00014-12-1-0124. PJP was supported by the Federal Bureau of Investigation. The identification of any commercial product or trade name does not imply endorsement or recommendation by NIST or the University of Maryland. We would like to acknowledge the contributions made by Drs. Wu Tao, Hien Van Nguyen, Sima Taheri, Soma Biswas, Mr. Ashish Shrivastava, and Mr. Sumit Shekhar for developing the dictionary-based methods discussed in this paper.

REFERENCES

  1. W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: a literature survey,” ACM Computing Surveys 35, 399–458 (2003). [CrossRef]
  2. P. J. Phillips, “Improving face recognition technology,” Computer 44, 84–86 (2011). [CrossRef]
  3. M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cogn. Neurosci. 3, 71–86 (1991).
  4. P. Belhumeur, J. Hespanda, and D. Kriegman, “Eigenfaces versus fisherfaces: recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell. 19, 711–720 (1997). [CrossRef]
  5. K. Etemad and R. Chellappa, “Discriminant analysis for recognition of human face images,” J. Opt. Soc. Am. A 14, 1724–1733 (1997). [CrossRef]
  6. M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski, “Face recognition by independent component analysis,” IEEE Trans. Neural Netw. 13, 1450–1464 (2002). [CrossRef]
  7. J. Wright, A. Y. Yang, A. A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell. 31, 210–227 (2009). [CrossRef]
  8. V. M. Patel, R. Chellappa, and M. Tistarelli, “Sparse representations and random projections for robust and cancelable biometrics,” in International Conference on Control, Automation, Robotics and Vision, Guangzhou, December, 2010, pp. 1–6.
  9. J. K. Pillai, V. M. Patel, R. Chellappa, and N. Ratha, “Secure and robust iris recognition using random projections and sparse representations,” IEEE Trans. Pattern Anal. Mach. Intell. 33, 1877–1893 (2011). [CrossRef]
  10. V. M. Patel and R. Chellappa, “Sparse representations, compressive sensing and dictionaries for pattern recognition,” in Asian Conference on Pattern Recognition (ACPR), Beijing, 2010.
  11. J. Jiang, “A literature survey on domain adaptation of statistical classifiers,” Tech. Rep. (2008).
  12. K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in European Conference on Computer Vision (Springer-Verlag, 2010), Vol. 6314, pp. 213–226.
  13. J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proc. IEEE 98, 1031–1044 (2010). [CrossRef]
  14. R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proc. IEEE 98, 1045–1057 (2010). [CrossRef]
  15. M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing (Springer, 2010).
  16. S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput. 20, 33–61 (1998). [CrossRef]
  17. K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, 1999, Vol. 5, pp. 2443–2446.
  18. M. Aharon, M. Elad, and A. M. Bruckstein, “The k-svd: an algorithm for designing of overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process. 54, 4311–4322 (2006). [CrossRef]
  19. K. Etemand and R. Chellappa, “Separability-based multiscale basis selection and feature extraction for signal and image classification,” IEEE Trans. Image Process. 7, 1453–1465 (1998). [CrossRef]
  20. J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern Anal. Mach. Intell. 34, 791–804 (2012). [CrossRef]
  21. Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent k-svd: learning a discriminative dictionary for recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 35, 2651–2664 (2013). [CrossRef]
  22. Q. Zhang and B. Li, “Discriminative k-svd for dictionary learning in face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010.
  23. M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in International Conference on Computer Vision, Barcelona, 2011, pp. 543–550.
  24. Q. Qiu, V. M. Patel, and R. Chellappa, “Information-theoretic dictionary learning for image classification,” IEEE Trans. Pattern Anal. Mach. Intell. (to be published).
  25. H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Design of nonlinear kernel dictionaries for object recognition,” IEEE Trans. Image Process. 22, 5123–5135 (2013). [CrossRef]
  26. A. Shrivastava, H. V. Nguyen, V. M. Patel, and R. Chellappa, “Design of nonlinear discriminative dictionaries for image classification,” in Asian Conference on Computer Vision (ACCV) (Springer-Verlag, 2013), pp. 660–674.
  27. L. Zhang, W.-D. Zhou, P.-C. Chang, J. Liu, Z. Yan, T. Wang, and F.-Z. Li, “Kernel sparse representation-based classifier,” IEEE Trans. Signal Process. 60, 1684–1695 (2012). [CrossRef]
  28. V. M. Patel, W. Tao, S. Biswas, P. J. Phillips, and R. Chellappa, “Dictionary-based face recognition under variable lighting and pose,” IEEE Trans. Inf. Forensics Secur. 7, 954–965 (2012). [CrossRef]
  29. S. Biswas, G. Aggarwal, and R. Chellappa, “Robust estimation of albedo for illumination-invariant matching and shape recovery,” IEEE Trans. Pattern Anal. Mach. Intell. 31, 884–899 (2009). [CrossRef]
  30. V. Blanz and T. Vetter, “Face recognition based on fitting a 3d morphable model,” IEEE Trans. Pattern Anal. Mach. Intell. 25, 1063–1074 (2003). [CrossRef]
  31. R. Gross, I. Matthews, J. F. Cohn, T. Kanade, and S. Baker, “Multi-PIE,” Image Vis. Comput. 28, 807–813 (2010).
  32. S. Biswas and R. Chellappa, “Pose-robust albedo estimation from a single image,” in IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010.
  33. Y.-C. Chen, C. S. Sastry, V. M. Patel, P. J. Phillips, and R. Chellappa, “In-plane rotation and scale invariant clustering using dictionaries,” IEEE Trans. Image Process. 22, 2166–2180 (2013). [CrossRef]
  34. A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell. 23, 643–660 (2001). [CrossRef]
  35. J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010, pp. 3360–3367.
  36. S. Shekhar, V. M. Patel, and R. Chellappa, “Synthesis-based recognition of low resolution faces,” in International Joint Conference on Biometrics, Washington, D.C., 2011, pp. 1–6.
  37. P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,” in IEEE Conference on Computer Vision and Pattern Recognition, San Diego, 2005, Vol. 1, pp. 947–954.
  38. B. Li, H. Chang, S. Shan, and X. Chen, “Low-resolution face recognition via coupled locality preserving mappings,” IEEE Signal Process. Lett. 17, 20–23 (2010). [CrossRef]
  39. S. Taheri, V. M. Patel, and R. Chellappa, “Component-based recognition of faces and facial expressions,” IEEE Trans. Affective Comput. (2013).
  40. E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM 58(3), 1–37 (2011). [CrossRef]
  41. P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, San Francisco, 2010, pp. 94–101.
  42. P. J. Phillips, “Matching pursuit filters applied to face identification,” IEEE Trans. Image Process. 7, 1150–1164 (1998). [CrossRef]
  43. Y.-C. Chen, V. M. Patel, P. J. Phillips, and R. Chellappa, “Dictionary-based face recognition from video,” in European Conference on Computer Vision, October 2012.
  44. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, June, 2001, pp. 511–518.
  45. Y.-C. Chen, V. M. Patel, S. Shekhar, R. Chellappa, and P. J. Phillips, “Video-based face recognition via joint sparse representation,” in IEEE International Conference on Automatic Face and Gesture Recognition, Shanghai, 2013, pp. 1–8.
  46. S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Joint sparse representation for robust multimodal biometrics recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 36, 113–126 (2014). [CrossRef]
  47. A. J. O’Toole, J. Harms, S. L. Snow, D. R. Hurst, M. R. Pappas, J. H. Ayyad, and H. Abdi, “Recognizing people from dynamic and static faces and bodies: dissecting identity with a fusion approach,” Vis. Res. 51, 74–83 (2011). [CrossRef]
  48. P. K. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa, “Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 33, 2273–2286 (2011). [CrossRef]
  49. P. K. Turaga, A. Veeraraghavan, and R. Chellappa, “Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, 2008, pp. 1–8.
  50. Y. Hu, A. S. Mian, and R. Owens, “Sparse approximated nearest points for image set classification,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2011), pp. 27–40.
  51. P. J. Phillips, P. J. Flynn, J. R. Beveridge, W. T. Scruggs, A. J. O’Toole, D. Bolme, K. W. Bowyer, B. A. Draper, G. H. Givens, Y. M. Lui, H. Sahibzada, J. A. Scallan III, and S. Weimer, “Overview of the multiple biometrics grand challenge,” in International Conference on Biometrics (Springer, 2009), pp. 705–714.
  52. Information Technology Laboratory, NIST, Multiple Biometric Grand Challenge, http://www.nist.gov/itl/iad/ig/mbgc.cfm.
  53. Y.-C. Chen, V. M. Patel, J. K. Pillai, R. Chellappa, and P. J. Phillips, “Dictionary learning from ambiguously labeled data,” in IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon, 2013, pp. 353–360.
  54. A. Shrivastava, J. K. Pillai, V. M. Patel, and R. Chellappa, “Learning discriminative dictionaries with partially labeled data,” in IEEE International Conference on Image Processing, Orlando, 2012, pp. 3113–3116.
  55. Q. Qiu, V. M. Patel, P. Turaga, and R. Chellappa, “Domain adaptive dictionary learning,” in European Conference on Computer Vision (2012), Vol. 7575, pp. 631–645.
  56. S. Shekhar, V. M. Patel, H. V. Nguyen, and R. Chellappa, “Generalized domain-adaptive dictionaries,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2013), pp. 361–368.
  57. J. Ni, Q. Qiu, and R. Chellappa, “Subspace interpolation via dictionary learning for unsupervised domain adaptation,” in Computer Vision and Pattern Recognition (CVPR), Washington, D.C., 2013, pp. 692–699.
  58. Y.-C. Chen, V. M. Patel, R. Chellappa, and P. J. Phillips, “Salient views and view-dependent dictionaries for object recognition,” Comput. Vis. Image Underst. (to be published).
  59. M. Elad, “Sparse and redundant representation modeling—what next?” IEEE Signal Process. Lett. 19, 922–928 (2012). [CrossRef]

OCIS Codes
(100.0100) Image processing : Image processing
(100.5010) Image processing : Pattern recognition
(150.0150) Machine vision : Machine vision
(150.1135) Machine vision : Algorithms

ToC Category:
Image Processing

History
Original Manuscript: January 6, 2014
Revised Manuscript: March 16, 2014
Manuscript Accepted: March 18, 2014
Published: April 25, 2014

Citation
Vishal M. Patel, Yi-Chen Chen, Rama Chellappa, and P. Jonathon Phillips, "Dictionaries for image and video-based face recognition [Invited]," J. Opt. Soc. Am. A 31, 1090-1103 (2014)
http://www.opticsinfobase.org/josaa/abstract.cfm?URI=josaa-31-5-1090

