Generic Spatio-Temporal FR System
The generic system for spatio-temporal FR in VS is depicted in Figure 1.1. Each video camera captures the scene, and the segmentation and preprocessing module first detects faces and isolates the ROIs in video frames. A face track is then initiated for each new person appearing in the scene. Next, the feature extraction/selection module extracts an invariant and discriminative set of features, which are assembled into ROI patterns and processed by the classification module. Classification compares probe ROI patterns against the facial models of individuals enrolled in the system to generate matching scores. Finally, the outputs of the classification and tracking components are fused by the spatio-temporal fusion module to produce the final detections (Chellappa et al., 2010; Pagano et al., 2012).
Figure 1.1 Generic system of spatio-temporal FR in video surveillance.
This system comprises six main modules, briefly described in the following items:
• Surveillance camera: Each surveillance camera in a distributed network of IP cameras captures the video streams of environment in its FoV that may contain one or more individuals appearing in the scene.
• Segmentation and preprocessing: The task of this module is to detect faces in video frames and isolate the ROI(s). The Viola-Jones face detection algorithm (Viola & Jones, 2004) is typically employed, mostly due to its simplicity and speed. After obtaining the bounding box containing the position and pixels of the face(s), histogram equalization and resizing of faces may be performed as preprocessing steps.
• Feature extraction/selection: Extracting robust features is an important step that converts each ROI into a compact representation and may improve recognition performance. Once segmentation is carried out, features are extracted from each ROI to generate the face models (for template matching). These features can be extracted from the entire face image (holistic) or from local patches of it.
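The preprocessing operations named above (histogram equalization and resizing of face ROIs) can be sketched in a few lines. This is a minimal sketch assuming an 8-bit grayscale ROI; the `equalize_histogram` and `resize_nearest` helpers are illustrative simplifications, and practical systems would typically call a library such as OpenCV instead.

```python
import numpy as np

def equalize_histogram(roi):
    """Spread the intensity histogram of an 8-bit grayscale ROI over [0, 255]."""
    hist = np.bincount(roi.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    if cdf.max() == cdf.min():            # degenerate (all-zero) image
        return roi.copy()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)         # intensity lookup table
    return lut[roi]

def resize_nearest(roi, out_h, out_w):
    """Nearest-neighbour resize to a canonical ROI size (e.g., 48x48)."""
    h, w = roi.shape
    rows = np.arange(out_h) * h // out_h   # source row index for each output row
    cols = np.arange(out_w) * w // out_w
    return roi[rows[:, None], cols]
```

After these two steps, every ROI has the same size and a comparable intensity range, which simplifies the feature extraction that follows.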
State-of-the-Art Still-to-Video Face Recognition
Many systems have been proposed in the literature for video-based FR, but very few are specialized for FR in VS (Barr et al., 2012). Systems for FR in VS are typically modeled as a modular individual-specific detection problem, where each detector is implemented to accurately detect one individual of interest (Pagano et al., 2012). In such modular architectures, individuals can easily be added or removed over time, and a different decision threshold, feature subset, and classifier can be selected for each individual. Multi-classifier systems (MCS) are often used for FR in VS, where non-target samples greatly outnumber the target samples of the individuals of interest (Bengio & Mariéthoz, 2007). An individual-specific approach based on one- or two-class classifiers, organized as a modular system with one detector per individual, was proposed in (Jain & Ross, 2002). A TCM-kNN matcher was proposed in (Li & Wechsler, 2005) to design a multi-class classifier that employs transductive inference to generate class predictions for open-set problems in video-based FR, where a rejection option is defined for individuals who have not been enrolled in the system. Ensembles of 2-class classifiers per target individual were designed in (Pagano et al., 2012) as an extension of modular approaches, with one ensemble for each individual of interest in the watch-list, for video-based person re-identification.
In that system, a diversified pool of ARTMAP neural networks is co-jointly trained using a dynamic particle swarm optimization-based training strategy; some of the networks are then selected and combined in the ROC space with Boolean combination. Another modular system, based on SVM classifiers, was proposed in (Ekenel et al., 2010) for real-time FR and door monitoring in real-world surveillance settings. Furthermore, an adaptive ensemble-based system has been proposed to self-update the facial models, where an individual-specific ensemble is updated if its recognition over a trajectory has high confidence (De la Torre Gomerra et al., 2015). A probabilistic tracking-and-recognition approach called sequential importance sampling (Zhou et al., 2003) has been proposed for still-to-video FR, converting the still-to-video problem into a video-to-video one using frames that satisfy the required scale and pose criteria during tracking. Similarly, a probabilistic mixture-of-Gaussians learning algorithm using expectation-maximization (EM) on sets of static images was presented for video-based FR that is partially robust to occlusion, orientation, and expression changes (Zhang & Martínez, 2004). A matching-based algorithm employing several correlation filters was proposed for still-to-video FR from a gallery of a few still images in (Xie et al., 2004), where it was assumed that the poses and viewpoints of the ROIs in video sequences are the same as in the corresponding training images.
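The modular individual-specific architecture discussed above can be sketched as a set of independent detectors, one per enrolled individual, each with its own template and decision threshold. The `WatchList` and `IndividualDetector` names are hypothetical, and a plain cosine-similarity matcher stands in for the trained classifiers (SVMs, ARTMAP ensembles) used in the cited systems; the point is only that enrollment and removal touch a single detector.

```python
import numpy as np

class IndividualDetector:
    """One detector per enrolled individual (illustrative sketch).

    Each detector keeps its own template and decision threshold, so
    individuals can be added or removed without retraining the others.
    """
    def __init__(self, template, threshold):
        self.template = template / np.linalg.norm(template)
        self.threshold = threshold

    def score(self, probe):
        probe = probe / np.linalg.norm(probe)
        return float(self.template @ probe)   # cosine similarity

    def detect(self, probe):
        return self.score(probe) >= self.threshold

class WatchList:
    def __init__(self):
        self.detectors = {}

    def enroll(self, name, template, threshold=0.9):
        self.detectors[name] = IndividualDetector(template, threshold)

    def remove(self, name):
        del self.detectors[name]

    def screen(self, probe):
        """Return the names of all individuals whose detector fires."""
        return [n for n, d in self.detectors.items() if d.detect(probe)]
```

Because each detector is self-contained, an individual-specific feature subset or a differently calibrated threshold can be plugged in per person, which is exactly the flexibility the modular approaches above exploit.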
To match image sets in unconstrained environments, a regularized least-squares regression method has been proposed in (Wang et al., 2015), based on heuristic assumptions (i.e., still faces and video frames of the same person are identical in the identity space) and on synthesizing virtual face images. In addition, a point-to-set correlation learning approach has been proposed in (Huang et al., 2015) for both still-to-video and video-to-still FR tasks, where Euclidean points are matched against Riemannian elements in order to learn maximum correlations between the heterogeneous data. Recently, a Grassmann manifold learning method has been proposed in (Zhu et al., 2016) to address still-to-video FR by generating multiple geodesic flows that connect the subspaces constructed between the still images and video clips. Finally, a specialized feed-forward neural network, which uses morphing to synthetically generate variations of a reference still, is trained for each target individual for watch-list surveillance, where human perceptual capability is exploited to reject previously unseen faces (Kamgar-Parsi et al., 2011).
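The idea of enlarging a single reference still, as in the morphing-based synthesis above, can be illustrated with a minimal sketch. Morphing itself is beyond a short example, so simple photometric and geometric perturbations, via a hypothetical `synthesize_variations` helper, stand in for it here; this is not the cited method.

```python
import numpy as np

def synthesize_variations(still, n=4, seed=0):
    """Generate simple synthetic variations of one 8-bit grayscale still.

    A crude stand-in for morphing-based synthesis: random horizontal
    flips model pose symmetry, and a global gain models illumination.
    """
    rng = np.random.default_rng(seed)
    variations = []
    for _ in range(n):
        v = still.astype(np.float64)
        if rng.random() < 0.5:
            v = v[:, ::-1]                 # horizontal flip
        v *= rng.uniform(0.8, 1.2)         # global illumination change
        variations.append(np.clip(v, 0, 255).astype(np.uint8))
    return variations
```

Even such trivial perturbations turn a single-sample gallery into a small training set, which is the underlying motivation shared by the synthesis-based methods surveyed here.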
Recently, partial and local linear discriminant analysis has been proposed in (Huang et al., 2013a), using samples containing one high-quality still and a set of low-resolution video sequences of each individual, as a baseline for still-to-video FR on the COX-S2V dataset. Similarly, coupling quality and geometric alignment with recognition has been proposed (Huang et al., 2013b), where the best-qualified frames from a video are selected to match against well-aligned high-quality face stills of the most similar quality. Low-rank regularized sparse representation is adopted in a unified framework in which quality alignment, geometric alignment, and face recognition interact. Since the characteristics of stills and videos differ, building a common discriminant space can be inefficient. As a result, a weighted discriminant analysis method has been proposed in (Chen et al., 2014) to learn a separate mapping for stills and for videos, with intra-class compactness and inter-class separability as the learning objective.
Recently, sparse representation-based classification (SRC) methods have been shown to provide a high level of performance in FR (Wright et al., 2009). The conventional SRC method cannot operate with only one reference still, so an auxiliary training set is exploited in extended SRC (ESRC) (Deng et al., 2012) to enhance robustness to intra-class variation. Similarly, an auxiliary training set has been exploited along with the gallery set to develop sparse variation dictionary learning (SVDL), where an adaptive projection is jointly learned to connect the generic set to the gallery set and to construct a sparse dictionary with sufficient variations of representations (Yang et al., 2013). In addition, an ESRC approach through domain adaptation (ESRC-DA) has lately been proposed in (Nourbakhsh et al., 2016) for still-to-video FR, incorporating matrix factorization and dictionary learning. Despite their capability to handle the SSPP problem, these methods are not fully adapted to still-to-video FR systems. Indeed, they are relatively sensitive to variations in capture conditions (e.g., considerable changes in illumination, pose, and especially occlusion). In addition, samples in the generic training set are not necessarily similar to samples in the gallery set, because they are captured by different cameras; hence, the intra-class variation of the training set may not translate into discriminative information about samples in the gallery set. These methods may also suffer from high computational complexity, because of the sparse coding and the large, redundant dictionaries (Deng et al., 2012; Yang et al., 2013).
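As a rough sketch of the SRC principle cited above: a probe is coded as a sparse linear combination of dictionary atoms, then assigned to the class whose atoms yield the smallest reconstruction residual. The solver here is plain ISTA (a simple proximal-gradient method for the lasso problem) and the dictionary is tiny; the cited methods use much larger dictionaries and more elaborate optimization, so this is an illustrative sketch only.

```python
import numpy as np

def ista_lasso(D, x, lam=0.05, n_iter=300):
    """ISTA for min_a 0.5 * ||x - D a||^2 + lam * ||a||_1."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a - step * (D.T @ (D @ a - x))     # gradient step on the quadratic
        a = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
    return a

def src_classify(D, class_ids, x):
    """Assign x to the class with the smallest class-wise reconstruction residual."""
    a = ista_lasso(D, x)
    residuals = {}
    for c in np.unique(class_ids):
        mask = class_ids == c
        residuals[c] = np.linalg.norm(x - D[:, mask] @ a[mask])
    return min(residuals, key=residuals.get)
```

In the SSPP setting, the gallery contributes only one atom per class, which is why ESRC and SVDL augment the dictionary with a generic set that carries the intra-class variation the gallery lacks.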
Table of Contents
CHAPTER 1 SYSTEMS FOR STILL-TO-VIDEO FACE RECOGNITION IN VIDEO SURVEILLANCE
1.1 Generic Spatio-Temporal FR System
1.1.1 State-of-the-Art Still-to-Video Face Recognition
1.1.2 Challenges
1.2 Multiple Face Representations
1.2.1 Feature Extraction Techniques
1.2.2 Patch-Based Approaches
1.2.3 Random Subspace Methods
1.3 Domain Adaptation
1.4 Ensemble-based Methods
1.4.1 Generating and Combining Classifiers
1.4.2 Classification Systems
1.4.3 Dynamic Selection and Weighting of Classifiers
1.5 Deep Learning Architectures
CHAPTER 2 EXPERIMENTAL METHODOLOGY
2.1 Video Dataset
2.2 Protocol for Validation
2.3 Performance Metrics
CHAPTER 3 ROBUST WATCH-LIST SCREENING USING DYNAMIC ENSEMBLES OF SVMS BASED ON MULTIPLE FACE REPRESENTATIONS
3.1 Dynamic Ensembles of SVMs for Still-to-Video FR
3.1.1 Enrollment Phase
3.1.1.1 Extraction of Multiple Face Representations
3.1.1.2 Generation of Diverse SVM Classifiers
3.1.2 Operational Phase
3.1.2.1 Dynamic Classifiers Selection
3.1.2.2 Spatio-Temporal Fusion
3.2 Experimental Results and Discussions
3.2.1 Experimental Protocol
3.2.2 Results and Discussion
CHAPTER 4 DYNAMIC ENSEMBLES OF EXEMPLAR-SVMS FOR STILL-TO-VIDEO FACE RECOGNITION
4.1 Dynamic Individual-Specific Ee-SVMs Through Domain Adaptation
4.1.1 System Overview
4.1.2 Design Phase (First Scenario)
4.1.2.1 Patch-Wise Feature Extraction
4.1.2.2 Training Patch-Wise E-SVM Classifiers
4.1.2.3 Ranking Patch-Wise and Subspace-Wise e-SVMs
4.1.2.4 Pruning Subspace-Wise e-SVMs
4.1.3 Design Phase (Second Scenario)
4.1.4 Operational Phase (Dynamic Classifier Selection and Weighting)
4.2 Experimental Results and Discussions
4.2.1 Experimental Protocol
4.2.2 Computational Complexity
4.3 Results and Discussion
4.3.1 Number and Size of Feature Subspaces
4.3.2 Training Schemes
4.3.3 Number of Training and Testing Stills and Trajectories
4.3.4 Design Scenarios
4.3.5 Dynamic Selection and Weighting
CHAPTER 5 DEEP FACE-FLOW AUTOENCODERS FOR STILL-TO-VIDEO FACE RECOGNITION FROM A SINGLE SAMPLE PER PERSON
5.1 Face-Flow Autoencoder CNN
5.1.1 Reconstruction Network
5.1.2 Classification Network
5.1.3 Training FFA-CNN
5.2 Experimental Results and Discussions
5.2.1 Experimental Protocol
5.2.2 Experimental Results
CONCLUSION AND RECOMMENDATIONS
LIST OF PUBLICATIONS
BIBLIOGRAPHY