Patch-based Pose Invariant Face Feature Classification NME Mokoena orcid.org 0000-0002-8815-0390 Dissertation submitted in fulfilment of the requirements for the degree Master of Science in Computer and Electronic Engineering at the North-West University Supervisor: Prof ASJ Helberg Co-supervisor: Dr HD Tsague Graduation May 2018 Student number: 18047459 Abstract Real world face recognition systems are now successfully developed to recognise faces in frontal view. One of the most challenging tasks facing state-of-the-art face recognition algorithms is how to handle variations caused by the direction of the face image in terms of angles, that are between the probe and the gallery images. This research work treats the problem caused by variations in pose as a classification problem. We conduct face classification on the FERET database. Firstly, we extract the SIFT features at different scale spaces (σ); by extracting these features at different levels will help us determine which values of (σ) give a better representation of our data. Secondly, we train these features using four machine-learning algorithms: k-Nearest Neighbor (kNN), Support Vector Machine (SVM), decision trees and neural network pattern recognition. The experiments demonstrate that by increasing the blur (σ) parameter, the classification rate decreases. Keywords: Image processing, Face recognition, Pose invariant, Feature extraction, Pose classification i Acknowledgements I would like to express my deepest appreciation to The Almighty, for entrusting me with this project, His mercies are new to me every day. • Greatest appreciation to my supervisor Prof. Albert Helberg, for the support and sound advice he gave me during my studies. His patience and work ethics with students were amazing. God bless Prof dearly. • I sincerely acknowledge my co-supervisor from the CSIR Mr. Hippolyte Djonon Tsague for his patience, encouragement and eternal cheerfulness. In the same breath, I would like to acknowledge Mr. Rethabile Khutlang and Mr. Johan van der Merwe for their open door policy, great leadership from colleagues. • I am equally grateful to the DST and the CSIR for awarding me a lifetime opportunity, a gift none will take away. Thank you. • Portions of the research in this paper use the FERET database of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office • I thank my colleagues and friends from MDS for their support and words of encouragement, (Mahlatse and Gugu), most of all my sincere appreciation goes to Ofetswe Lebogo, a beautiful mind, great soul, always willing to help. You are with such sound knowledge of coding and willing to share and help all the time. Am grateful to have met you, God bless you dearly. • Lastly, I would like to thank my family, that has always been my pillar and support system. My mom Dimakatso Mokoena and dad Mohanoe Mokoena, my son Keorapetse ’Kay wa Mommy’ (mommy loves you so much), my brother and his wife, together with their two daughters (Orape- leng and Moleboheng). Accept my heartfelt appreciation for your prayers and support. Thank you! To God be the Glory. ii Contents Abstract i Acknowledgements ii 1 Introduction 1 1.1 Problem Statement and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.6 Conclusion and Project Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background Theory 6 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Face database, facial feature points detection and feature extraction . . . . . . . . . . 8 2.2.1 Face database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Feature detection and feature extraction . . . . . . . . . . . . . . . . . . . . . . 10 2.3 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1 Creating a scale space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 iii CONTENTS iv 2.3.2 Locating keypoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3.3 Removing unreliable keypoints . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4 Keypoint orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5 Generating keypoint descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.6 Supervised machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.6.1 Supervised machine learning algorithm selection . . . . . . . . . . . . . . . . . 21 2.6.2 SVM kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.7 k-Nearest neighbour (k-NN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.8 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.9 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.9.1 Feed-forward artificial neural network . . . . . . . . . . . . . . . . . . . . . . . 27 2.10 Performance evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.10.1 Receiver Operating Characteristic (ROC) curve . . . . . . . . . . . . . . . . . . 29 2.10.2 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.11 Prior art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3 Methodology 34 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.1 Patch cropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.2 Pixel brightness transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.3 Image filtering using guided filter . . . . . . . . . . . . . . . . . . . . . . . . . . 38 CONTENTS v 3.3 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 Feature Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4.1 Calculating the Mean (µ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4.2 Calculate the Standard Deviation (σ) . . . . . . . . . . . . . . . . . . . . . . . 41 3.5 Supervised Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.5.1 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.5.2 k-Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.5.3 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.5.4 Neural network pattern recognition . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4 Analysis of results 46 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.4 Supervised classification, Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.4.1 Classification at σ = 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.4.2 Classification at σ = 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.3 Classification at σ = 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.4 Neural Network Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . 55 4.5 Evaluation of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5 Summary, future work and conclusion 61 5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Bibliography 63 vi List of Tables 2.1 Approaches for distance metrics [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.1 Overall Classification Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 vii List of Figures 2.1 Different face poses from The FERET Database. . . . . . . . . . . . . . . . . . . . . . 9 2.2 FERET image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Three SIFT octaves at level three. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Laplacian of Gaussian (LoG) [23, 25]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Image gradient and keypoint descriptor [23, 25]. . . . . . . . . . . . . . . . . . . . . . . 20 2.6 Support Vectors and margin for a simple problem. . . . . . . . . . . . . . . . . . . . . 22 2.7 Decision tree classifying from a set of attributes. . . . . . . . . . . . . . . . . . . . . . 25 2.8 Single layered artificial neuron model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.9 Feed-forward ANN [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1 Proposed approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2 Eye, mouth and nose cropped patches. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Representation of the original and grey-scale images. . . . . . . . . . . . . . . . . . . . 38 3.4 Model training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.5 Model training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Guided filtered image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 The DoG patches at σ = 9, G(x, y, σ). . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 viii 4.3 Keypoints localisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.4 Raw SIFT Descriptor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.5 SIFT Descriptors normalised using linear scaling to unit variance method. . . . . . . . 50 4.6 kNN Trained Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.7 Confusion Matrix at σ = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.8 ROC Curve at σ = 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.9 Confusion Matrix at σ = 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.10 SVM ROC curve at σ = 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.11 SVM Confusion Matrix at σ = 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.12 Neural Network Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.13 NN validation Performance Plot at σ = 3. . . . . . . . . . . . . . . . . . . . . . . . . . 57 ix Chapter 1 Introduction Face recognition of frontal (relating to the front of the face) view images is a well researched area [1, 2, 3, 4], but recognition of face images in a non-frontal (relating to the side of the face) view remains a great challenge for face recognition technology (FRT) [5]. In comparison to other biometrics techniques (technological systems that use biological information of a subject), for example fingerprint recognition, FRT has the advantage of allowing subjects to be scanned without active response. For instance, if a FRT is mounted at the airport, as a means of a surveillance security system, it is expected that good FRT should be able to identify a known impostor even when the subject is uncooperative, i.e. the subject may be trying to hide their face. It would be a difficult task for a FRT to identify a known criminal using surveillance cameras at the airport if half of the subject’s face were not shown. This generality of situations and environment brings major problems to a FRT [6]. Recently, researchers in the field of machine learning (ML) and computer vision; have with great effort been developing algorithms that attempt to solve the problem of face recognition in non-frontal view [6]. These algorithms are referred to as pose-invariant face recognition (PIFR) algorithms, i.e. algorithms that solve the difficulty of a FRT to identify a subject who is in a non-frontal view [5]. These algorithms seek to solve the problem of face pose. Throughout this research work and the works of other researchers in FRT [6], ”pose” refers to the position of the face in the image in terms of degree angles (°) with respect to a face facing straight, e.g. -22°,+45°. According to [5], variations caused by pose bring about the problems, not limited to: • Self-occlusion: the rotation of the head results in occlusion, which causes loss of information during recognition. 1 2• Loss of semantic correspondence in 2D face images: This refers to the loss of interpretation of the face image due to changes in position. The position of facial image texture varies non-linearly because of variations caused by pose. Xiaoyang et al [7] in their survey recognise the importance of the size of a face database (data sample) when solving PIFR and face recognition in general. Collecting data samples (face images) to create a database of faces can be a tedious job, in terms of time (to collect face images), storage and also processing these images. Hence, a lot of FRT algorithms are affected in the representation (e.g. features) and the size of the database, so that in most cases; many of these algorithms experience a great decline in performance when only a few samples per subject are available in the database [7]. Furthermore, having one sample per subject (one image per person) can be a great challenge both in the real world and in theory, because a FRT would have learned only one image of a person, and fail to recognise other images in different face poses. One way to make sure that less storage is used in the database, with less processing time and while maintaining satisfactory accuracy in recognition is to apply modern intelligent tools, such as pattern recognition, machine learning (ML) and computer vision. Consequently, machine learning (ML) classification is one of the methods used in pursuit of developing PIFR algorithms so that they achieve a better performance rate, in terms of both classifying the direction of the face in the image and face recognition. Classification in statistics and ML is a problem of trying to establish into which set of categories a new instance (observation) falls, given observations whose categories are previously known (training) [8]. A classification rule or a classifier is explained as a function h which can be evaluated for any possible values of x; given the data (x1, y1), (x2, y2), ..., (xn, yn), h(x) will yield a similar classification yˆ = h(x). The primary goal of a classification learner is that it should be highly accurate given a particular test data set. The objective of this study is to implement a classification algorithm, which will classify a received input image into which pose (degree angle) it falls under. The classification model that we implemented classified the degree angle of the face, without performing face recognition. 1.1. Problem Statement and Motivation 3 1.1 Problem Statement and Motivation The face recognition process would be much easier if users could be asked to stand straight facing the scanner. In that way the probe (input) image alignment would match the gallery (database) image, only if the subject of interest is already captured and stored for verification. However, in computer vision applications such as face recognition, such images may be captured from different viewpoints, and it is difficult to classify these images with existing machine learning algorithms, because the feature points from these (input and gallery) images differ. In our research work, we address the problem of PIFR by classifying under which pose angle the input image falls. The classification is performed using a still (mugshot) face image. Our approach to face classification is similar to [9], but does not include recognition of faces, it seeks to classify the pose that the input image falls under. Similar to [9], the motivation behind implementing a classification model of face images in different poses, is the realisation that pose-invariant face recognition has a wide range of applications in machine learning and computer vision, e.g. airport surveillance. 1.2 Research Questions This research work seeks to answer the following questions: 1. How can we model a pose-invariant face image? 2. What tools (algorithms) can be used to evaluate the performance of pose-invariant features? 3. What are the conditions, such as illumination, noise and image quality, that can affect the classification (positively or negatively) of pose invariant features? 1.3 Research Goals 1. To implement a classification model that would classify a still input face image according to different poses. 2. To determine a classifier that gives high classification accuracy on patch-based face images. 1.4. Limitations 4 1.4 Limitations For the purposes of our study, we assume that the face images are mugshots of still faces, with no form of surgical procedure visible on the faces. Furthermore, we assume that the non-frontal images vary between degree angles of ±22,5°(horizontal) and ±45°(horizontal), to allow us to cut-off (crop) facial areas that we need on the face (nose, mouth and eyes). Additionally, the face images that are going to be used are not totally occluded, e.g. there is no form of clothing on the face, and the facial hair is only limited where one could see the eyes, nose and mouth. 1.5 Research Methodology The research methodology of this study is based on [9]. A pose classification algorithm was used in their face normalisation model. By classifying whether a given face is in a frontal or non-frontal pose, the performance of their model was improved. In order to meet the objectives of this study, the following outline was followed: 1. Background and literature survey - In this section we investigate and survey state-of-the-art studies in pose-invariant feature extraction methods. The intention of this section is to provide a good understanding of the basement or platform of the research goal. 2. Feature extraction - After a thorough investigation of the existing work in literature, we will look at how to better represent pose invariant features for better classification. 3. Classification models - Our data will then be run through different classification models to determine a better classification of poses. 4. Results - results will be analysed and presented. 1.6. Conclusion and Project Summary 5 1.6 Conclusion and Project Summary In this chapter we have introduced the concept of pose-invariant face recognition (PIRF), i.e. what is meant by pose with regards to face recognition and face classification. Also, we have discussed the problems that are caused by pose and machine learning classification. In the next chapter we will discuss in-depth the main concepts regarding PIFR and classification. The remaining parts of this study are organised as follows: Chapter 2 represents the background of the research work and various components for image classifi- cation. The chapter will also represent the works that has been done by other researchers, in relation to the project topic. Chapter 3 represents the proposed to this project and the algorithms behind the proposed approach. Chapter 4 presents the implementation and results analysis that were obtained from classification models that were tested. Chapter 5 presents a summary of the entire research project, conclusion and recommendations for future work. Chapter 2 Background Theory The focus of this chapter will be on the main concepts regarding how pattern recognition and ML are used to address database representation and the classification of pose-invariant face images. Fur- thermore, we will look at the application of these theoretical concepts to pose-invariant face images to increase the value of our research work. 2.1 Introduction To be able to understand how face images are represented in the database, it is important to first understand the concept of a local feature of an image in image processing. In practice, an image exists in a computer’s read-only memory (RAM), as a rectangular grid of units called pixels [10]. In a pixel, brightness and intensity play an important role because they are the cause of distinctiveness between pixels. Intensity is denoted by the value of a pixel. For example, in an eight-bit (8) greyscale image, there are 256 grey levels, meaning any one pixel in a greyscale image can have a value ranging from 0 to 255 intensity. So, brightness is a relative term, the higher the intensity the brighter the pixel. Assuming we have three pixels, X,Y and Z, with brightness values X = 25, Y = 74 and Z = 230, then Z is brighter than X and Y is darker than Z [10]. On the other hand, a true colour image consists of 24 bits, where there are 8 bits each for Red, Green and Blue, i.e. a total of 8 x 3 = 24 bits. Moreover, a local feature of an image is an image pixel which differs from its close neighbourhood (pixels in the same area) [11]. This local feature is commonly associated with changes in image pixel properties, such as texture, colour, brightness and intensity. A local feature can either be represented 6 2.1. Introduction 7 as some measurement (e.g. distance) made on an image, and transformed into a descriptor which describes an image as a whole. Or, a local feature can be represented by randomly selected small areas on an image which are called image patches, edges or even points. By the same token, one might be interested in extracting a local feature because of the following reasons [11]: • Local features provide a limited set of well localised and identifiable local points. The visual representation of a local feature is not necessarily important, what is important is to accurately extract the same features at a certain location over time. • Another reason why one might be interested in extracting local features is that there is no need for image segmentation (the process of dividing an image into smaller segments). Local features offer a resilient representation of an image without taking into consideration the actual representation of the features, where the aim is to analyse their image pixel statistics not to match features on an individual basis. • The third reason why local features are of interest is that they have a certain semantic represen- tation in a limited context of a certain application. For example, corners of the eyes, mouth and nose tip in a face image, offer a limited representation of a face rather than of the whole face. In the same context, Tinne et al [11] outlines the properties of good features as those which are (not limited to), firstly, repeatable given two images of the same scene (gallery and probe), the amount of features extracted in both images should be equally high for both images. Secondly, features should be local in order to minimize the possibility of occlusion (obstruction of the scene), this will allow simple model approximation of the geometric deformation between the two images. Thirdly, the quality of features extracted matters. The number of features extracted should be acceptable such that they provide a compact representation of the image. Fourthly, good features are invariant to large deformation and can be modelled mathematically if possible. Lastly, the features should be accurately located, with respect to scale and shape. 2.2. Face database, facial feature points detection and feature extraction 8 2.2 Face database, facial feature points detection and feature ex- traction The challenge of detecting and extracting features arises when the features which are to be extracted are pose-variant. For instance, a surveillance camera that is mounted at a border control will capture a known criminal differently, simply because the subjects are not instructed to look at the camera when they are at the border control post. The problem will arise when features that are captured by the surveillance camera are matched with those that the law enforcement have in their database. The changes in head pose will bring differences in semantic similarity between the two images, hence it is important to detect and extract pose-invariant features for feature classification. In the following subsections we will discuss the choice of a face database and methods which are used for facial feature points detection (FFPD), and also look at methods used for feature extraction in order to attain pose-invariant features. 2.2.1 Face database Before images can be properly classified or recognised, one needs a collection of images (data set) that are similar to the subject of the matter. For instance, for face classification at different angles (poses), it is required that images in the same pose as that being classified are pre-stored in the database. The choice of a database to be used for a specific research project should be based mainly on the problem at hand, i.e. whether the problem to be solved is recognising faces of different ages, or classifying different face poses. Again, the choice of a data set should depend on the property to be tested, i.e. how a certain algorithm behaves given a data set. It is recommended in research areas such as ML and pattern recognition to use a standard data set when comparing different algorithms. For face recognition, there are various sets of databases that are available to test different algorithms. Xiaoyang et al [7] analyse and discuss issues concerning face data collection, system evaluation and the influence of a small sample size with respect to FRT. In their work, the authors provide a descriptive direction of research in FRT and their suggested popular databases for testing a given algorithm. Their survey presents work from other researcher which shows that the most commonly used database (amongst others such as the Yale database) for changes in face appearance, and which has also achieved a high success rate in terms of performance rate is the Face Recognition Technology (FERET) database. Figure 2.1 shows a montage of some of the images found in the colour FERET database. The database 2.2. Face database, facial feature points detection and feature extraction 9 consists of mugshots of both genders, males and females of different races (Blacks, Whites and Asians). Also,these images are in different face pose, e.g. -22,5°and +22,5°. Figure 2.1: Different face poses from The FERET Database. The Face Recognition Technology (FERET) program is a database whose main aim is to address three issues [12]: 1. Identify future areas of research in pattern recognition and machine learning. 2. To measure algorithm performance, and also 3. To assess the state-of-the-art in face recognition. The popularity of the FERET database is due to its being a large face database. The database contains a total of 14126 face images, which includes 1199 individuals, and 365 duplicate images. On the other hand, the Yale Database B; which is also one of the popular databases in face recognition has a total of 5760 images, for 10 individuals. The main advantage of the FERET database is that images were acquired in sets of 5 to 11 per subject, under different poses (head direction). Additionally, some of the subjects have facial hair (beard), facial expressions (smile) and also images of a subject taken under different illuminations (lighting conditions). Even though the FERET database has proven to produce good performance results, the images appear to be large in dimension, which requires some pre-processing steps (resizing and segmentation) when performing face image processing. 2.2. Face database, facial feature points detection and feature extraction 10 2.2.2 Feature detection and feature extraction In some generic face recognition technologies (FRT), the process of facial feature detection and feature extraction are performed simultaneously [3]. Facial feature points detection (FFPD) refers to a process of using machine learning (ML) algorithms, to locate facial components [13]. These facial feature points describe the most often used areas on the face such as the corner of the eyes, the corner of the lips, the tip of the nose and the area around the face. Cootes et al [14] classify facial feature points in three categories, namely: • points describing parts of the face with application-dependent significance, such as the corners of the lips; • points describing application-dependent components such as the highest point along the bridge of the nose. • points that are incorporated from points of the two categories mentioned above, like those under the chin. Different numbers of facial feature points can be detected for different kinds of face models [13]. These models can consist of 17, 29 or 68 points; where the points selected can be combined to represent the shape of the face image x = (x1, ..., xN , y1, ..., yN ) TEven though more points would suggest significant information, it is time consuming for a model to detect all points. In face recognition literature, there exist a number of state-of-the-art models for feature points detection, which model the shape and appearance of the face image. Such methods include, amongst others, the statistical shape models named the Active Shape Models (ASM) [14]; Active Appearance Model (AAM) [15] which models the appearance variations of a face as a whole by minimising errors that occur from texture synthesis; and regression based models which estimate the shape from the appearance, without learning any shape or appearance model. Other methods which do not fall under any of the methods mentioned are discussed in [3, 14]. Cootes et al [15] present a model of facial appearance by combining a model of shape and texture variations. In order to build their model, they used a database of 400 face images as examples of their model, where each image is manually labelled with 68 corresponding points around the main features, i.e. around the eyes, nose and mouth, also along the face edge. They then apply algorithms 2.2. Face database, facial feature points detection and feature extraction 11 which align the sets of chosen points to determine the shape of the face images. After obtaining the appearance and a rough estimate of the position, orientation and scale of the face image, they adjust the model parameters to match the images as closely as possible. The problem with their model is that the number of selected points reduce at a lower image resolution. Also the models were used for variations caused by appearance, not for variations caused by pose. On the other hand, [16] presents statistical models of appearance that capture both shape and texture. Their models deal with variations caused by pose, on three sample data sets which consist of profile, half-profile and frontal-face images. These images then form the total data set to 234 manually labelled landmark images. Their work demonstrated that even with a small number of appearance models, a face can still be represented from a wide range of viewing angles. One way of representing a pose-variant face image is by extracting important information from the image called features. These features can either be extracted by trained models from machine learning algorithms, or by manually designing a descriptor that will perform feature extraction. Both these methods of feature extraction help in representing a face image in a compact order, thereby saving database space and reducing processing time. The success of a classification model depends mainly on the features extracted. Hence, feature extrac- tion becomes an important part of machine learning (ML). The reason behind performing this step is to reduce the input data for further processing, i.e. face classification or recognition. Additionally, the assumption is that the selected features will carry relevant information from the input data-set. There are several methods of extracting viewpoint-invariant features in face images that have been published and well researched such as [11]: • Multi-scale methods - These methods use points extracted over a range of scales as features. • Scale-invariant methods - The methods determine both the scale and location of the feature. • Affine-invariant methods - These are viewed as generalisations of the scale-invariant methods, with a different scaling factor in two orthogonal directions and without preserving angles. Moreover, algorithms that develop a descriptor manually, focus on designing a representation that is naturally invariant to pose, while accurate to the identity of subject. There are two methods for pose-invariant feature extraction. The first method is by extracting engineered features in order to 2.2. Face database, facial feature points detection and feature extraction 12 create a face description [5]. Engineered feature extraction methods are algorithms which re-establish the lost information in the image. The purpose of engineered feature extraction is to retain semantic correspondence (similarities) between images, which is lost due to pose variations. The process of retaining this information can be achieved by locating facial landmarks, i.e. the selected points on the face image that are used to construct features, e.g., corner of the eyes, the center of the eyes, the tip of the nose and the corners of the mouth. Retaining lost information on the face for engineered features can also be obtained by landmark-free algorithms. Landmarks are defined as face-image regions which contain points of interest where features are to be extracted [5, 11]. For example, locating the corners of the eyes, mouth and the center of the nose, each of these components are called face landmarks. Algorithms which extract pose-invariant features by locating facial landmarks give a better explanation of the area of the image that has been located, thereby resulting in better semantic correspondence (similarities) of images. If the same image scene is located from two images which are taken in different poses (degree of angle), chances are these located areas would not contain the same texture, scale and rotation, which will result in poor comparison of images. Algorithms which locate landmarks in pose-invariant images such as [17] and others discussed in [13], seeks to restore texture, scale and rotation properties of the selected landmarks, for better similarity in images. Using landmarks to extract pose-invariant features, Biswas et al [17] exploit low-resolution face images from a surveillance camera to correct pose. Before comparing the low-resolution probe face image, in different poses, with high-resolution frontal gallery images, facial landmarks are located, from images using a data set from the Multi-PIE database. The Multi-PIE face database was compiled by researchers at the Carnegie Mellon University. The database contains a data set with high resolution (HR) face images from different view points. Seven facial landmarks are selected, viz; the corners of the eyes, the tip of the nose and the corners of the lips. These landmarks are located by using an approach called tensor analysis, which estimates the facial landmarks and pose of the face image. These landmarks are then used to represent a face, by extracting scale-invariant descriptors discussed in the next section. Thereafter, these descriptors are combined to form one global descriptor that describes a face image. A transformation learning is performed using multidimensional scaling (MDS). The central idea of applying a transformation-based learning is to start with a simple solution to the problem, and apply transformation at each step. The largest benefit is selected to solve the problem. The learning stops when the selected transformation does not modify the data anymore, or when there are no more 2.2. Face database, facial feature points detection and feature extraction 13 transformations left to be applied to the problem. MDS transforms both low-resolution probe image and high-resolution gallery image to the same image space for better comparison. In order to learn the desired transformation, the authors in [17] use the iterative majorisation algorithm, which is an optimisation method that finds the maxima or minima of a function. In their method they did not use any pose classification method, and obtained a recognition rate of 96,1% for face images kept fixed at 5°(degree angle). On the other hand, still using high-resolution (HR) information (images with good visual information) based on pore-scale facial features for face verification, Li et al [18] proposed a landmark free method which is invariant to pose and alignment for face verification. Their method uses only one frontal image per person in the Gallery. Instead of using landmarks to detect facial components such as the eyes and nose, they use a method which detects blob-like regions (image regions that differ in image properties) called pore-scale facial features. The motivation for their work comes from a biological viewpoint, namely that different people should have similar skin quality of facial pores. These regions are likely to include fine wrinkles, hair and pores. From these regions, they extract scale-invariant face- pore keypoints. To determine feature matching of the two face images, they measure the Euclidean distance between their corresponding descriptors. Other methods include the use of learning-based features, such as [19, 20], where features are extracted using pre-trained machine learning models. These are linear models based on multi-view subspace learning [20, 21]. Extracting features for better semantic correspondence To extract and match features across different images is a common problem in pattern recognition and computer vision. It becomes simpler when extracting features of the same scene in the same scale and rotation, but more difficult when extracting matching features of the same scene from images which differ in scale and rotation [5]. In recent years, there has been a huge interest in extracting multi-scale sampling such as Local Binary Patterns (LBP) and scale-invariant Feature Transform (SIFT). LBP and SIFT are engineered features that are largely extracted for better semantic correspondence of face images. Lowe’s [22, 23] method of feature extraction named SIFT, exhibits a powerful detection and recognition rate for objects. The main advantage of this algorithm is its accuracy, scale and rotation invariance. SIFT based descriptors have been proven to do well as local descriptors in terms of scale and rotation 2.3. SIFT 14 of the scene as well as illumination. Even though feature extraction methods such as PCA-SIFT and SURF [24] out-perform SIFT in terms of processing time (given the same parameters), SIFT has been proven to find most matches of keypoints [24]. SIFT features are not only invariant to scale but also to changes in viewpoint, rotation and illumina- tion. For example, take an image of a person’s face in a straight position; meaning, facing straight towards the camera as the original image. Then change the same image and tilt the head to 25,5°and change the image size, SIFT is able to extract and compare (match) the same features, hence scale- invariant. SIFT was first introduced by Lowe [22] as a general object-recognition algorithm. The algorithm is now widely used in face recognition to extract features that are invariant to pose. Once these kinds of features are extracted, a face can be represented in a compact descriptor which is invari- ant to not only scale, rotation, pose and illumination but also affine, where affine means everything that is related to the geometry of affine spaces. An example of an affine property is the average area of a random triangle chosen inside a given triangle. In the next section we will discuss in details how SIFT features are extracted and how other researchers use this algorithm to extract pose-invariant features for face classification. 2.3 SIFT The author of the SIFT [22] algorithm illustrates the four computation stages as: scale-space extrema detection, key points localisation, orientation assignment and keypoint descriptors. For better under- standing of how to extract SIFT keypoints we will divide the stages into seven main processes (which will include sub-processes), namely, creating scale space, approximating the Laplacian of Gaussian (LoG), locating keypoints, removing unreliable keypoints, keypoint orientation assignment and for- mulate keypoints descriptors. The algorithm is patented and can only be used for academic purposes [22, 25]. 2.3.1 Creating a scale space Generally, in image processing, scale spaces are created by generating progressively blurred out images. This is done in order to identify important information on the image (edges or corners). Blurring (blur), also called smoothing [26], is an image processing operation which is commonly used to remove noise 2.3. SIFT 15 (unwanted information) from an image. To perform a blurring operation, one needs to apply a filter, i.e. a function which modifies the pixels in an image based on some function of a local neighbourhood of each pixel. Most common filters are linear: g(i, j) = ∑ k,l f(i+ k, j + j + l)h(k, l) (2.1) where an output pixel’s value g(l, j) is determine by a weighted sum of input pixel values f(i+ k, j + l) and h(k, l) are the coefficients of a filter, also called kernel [26]. Gaussian blur has been proven to be an effective blur estimator to remove noise [26], without introducing additional false details. A Gaussian filter alone, is able to reduce contrast and blur points at which the image brightness changes sharply (edges). In order to identify edges, two Gaussian filters can be used, and subtracted. This is the method that the SIFT algorithm uses to create a scale space. The SIFT algorithm performs scale spaces by taking an image, generating progressive blur, then re- sizing the image to half of the original size. This step is repeated according to the number of octaves (images of the same size but different scales) selected. For an example, if one chooses three octaves, the image selected will be blurred and re-sized on three different scale spaces, this is in order to identify edges at different scale spaces. Figure 2.2 shows an image from the FERET database before creating a scale space, while Figure 2.3 depicts the same face image after applying the SIFT algorithm on 3 octaves and 3 levels, the face image loses detail at each level, and gets blurry. The number of octaves and scales entirely depends on the size of the image. However, Lowe [22] suggested that four octaves and five levels are the ideal for the original SIFT algorithm. Having explained the logical part of the algorithm, in the next section we will discuss the mathematical part of SIFT. 2.3. SIFT 16 Figure 2.2: FERET image. Scale space mathematics We learn that the scale space function is formed by a particular expression applied to each pixel. This function is a convolution of a variable scale Gaussian, G(x, y, σ), for an input image I(x, y) [22, 25]. L(x, y, σ) = G(x, y, σ) ∗ I(x, y). (2.2) where G(x, y, σ) = 1 2 piσ2e −(x2+y2)/(2σ2). (2.3) is the actual Gaussian blur operator in two dimension. Here, • L represents a blurred image • G is a Gaussian blur operator • I is an image • (x,y) the location coordinates on the image • σ is the scale parameter, the value that represents the amount of blur. The larger the value, the greater the blur, which may result in edges not identified correctly because of the level of blur. • * denotes the convolution operation in (x,y) that determines the Gaussian blur. 2.3. SIFT 17 Figure 2.3: Three SIFT octaves at level three. Laplacian of Gaussian (LoG) After progressively creating blur images to create the scale space, the following step is to use the blurred images to generate another set of images, using a Difference of Gaussian (DoG) function. This set of images create a better platform to locate keypoints. Also, DoG is used for computational efficiency, i.e. to efficiently identify stable keypoints so as to estimate the normalised Laplacian, σ2∇2 with σ2 factor for automatic scale selection [23, 25]. Theoretically, the Laplacian of Gaussian operation takes an image, slightly blurs it, and calculates the second-order derivative on the LoG images. The process is done so as to locate edges and corners on the image. The normalised Laplacian function after convolution is represented by: O(x, y, σ) = σ2∇2G ∗ I(x, y). (2.4) where O is the normalised image and σ2∇2G represent the scale-invariant Laplacian of Gaussian. The scale space is used to efficiently generate Laplacian of Gaussian, by calculating the difference between two successive scale spaces. Figure 2.4 depicts how to get the scale space of images that are combined with Gaussian operators. The left side of the figure shows how two Gaussian successive images are selected, and subtracted from each other, and the following successive pair is subtracted and so on. 2.3. SIFT 18 These Gaussian images produce the DoG, which then creates Laplacian of Gaussian approximations which are scale-invariant[23, 25]. Figure 2.4: Laplacian of Gaussian (LoG) [23, 25]. 2.3.2 Locating keypoints Finding keypoints is divided into two parts in the SIFT algorithm, namely, locating maxima and minima in DoG images and locating subpixel maxima and minima. The maxima/minima of DoG of images are detected by comparing a pixel to its neighbours at the current scale, the scale below and the scale above it. An approximated sample point is selected if the scale is smaller than the neighbours or if the scale is larger than the neighbours [23]. The maxima and minima of scales are ’approximated’ because they are not always located exactly on a pixel, they can be located between the pixels, and the information between the pixels cannot be accessed, hence the location of the subpixel. Mathematically, subpixels are generated as [25]: D(x) = D + ∂DT ∂x 1 2 xT ∂2D ∂x2 x (2.5) where D and its derivatives are evaluated at a sample point x = (x, y, σ)T . The above is the Taylor series (expansion of a function into an infinite sum of terms) of the image calculated at the approx- imated keypoint. Upon solving the equation, we obtain the location of subpixel keypoints, which increases the chances of algorithm stability as well as the chances of matching images. 2.4. Keypoint orientation 19 2.3.3 Removing unreliable keypoints Some of the keypoints selected above are of no use, viz., keypoints that lie at the edge of the image and those which are low in contrast. The solution is to discard them. Firstly, the low contrast keypoints are removed by the magnitude of the intensity. That is, if the magnitude of the current pixel’s intensity (of the DoG image) is less than a certain threshold, then the keypoint is rejected. Secondly, removing keypoints which lie along the edges requires calculating two gradients at a given keypoint. As such, the image around which keypoints lie can either be a corner (meaning both gradients will be big), or an edge (where the perpendicular gradient will be big and the gradient along the edge will be small), or the image can be a flat region where, both gradients are small. Corners make good keypoints, so the idea is to extract corners. To check if a point is a corner or not, Lowe [23, 25] computes something similar to Harris corner detector [27], the Hessian matrix at a given location and scale. Efficiency here is increased by calculating the ration of two values [23, 25]. Tr(H) = Dxx +Dyy = α+ β (2.6) Det(H) = DxxDyy − (Dxy)2 = αβ (2.7) where H is a 2 x 2 Hessian matrix that computes the principal curvatures (maximum and minimum of the normal curvature at a given point on a surface) at the location and scale of a keypoint, α the eigenvalues with the largest magnitude and β gives the eigenvalues with the smaller magnitude. 2.4 Keypoint orientation Now that strong keypoints have been selected, and they have been tested to be scale-invariant, the next step is to assign an orientation to these keypoints. The main idea of this stage is to collect the magnitude and the gradient direction of each keypoint. Therefore, the size of the orientation region depends on the scale, where orientation region is the region on the image that provides rotation invariance. The bigger the scale the bigger the orientation region [25]. Let L be the Gaussian smooth image, where m(x, y) denotes the magnitude and θ(x, y) the orientation of each image sample L(x, y). Then, orientation and gradient magnitude are calculated using: 2.5. Generating keypoint descriptor 20 m(x, y) = √ (L(x+ 1, y)− L(x− 1, y))2 + (L(x, y + 1)− L(x, y − 1))2. (2.8) θ(x, y) = tan− 1((L(x, y + 1)− L(x, y − 1))/(L(x+ 1, y)− L(x− 1, y))). (2.9) 2.5 Generating keypoint descriptor The idea behind this step is to generate a unique description for each keypoint, which is easy to calculate and to compare to other keypoints. The first step in constructing a keypoint is to create a 16 x 16 window around the keypoint, then further divide this window into sixteen 4 x 4 windows. Figure 2.5 shows the second step, which is a 4 x 4 window in which gradient and magnitude are calculated. Then from this 4 x 4 window a histogram of 8 bins is generated. Each bin corresponds to 0-44°, 45-89°and so on [23]. Finally, the gradient orientation of 4 x 4 block is put into the 8 bins to create a normalised 128 dimension descriptor. Figure 2.5: Image gradient and keypoint descriptor [23, 25]. Our view from the above discussed is that the SIFT algorithm is computationally expensive, some of the calculations are intense, such as creating scale space and locating keypoints. Nonetheless, the algorithm has been proven to be scale, orientation and viewpoint-invariant. 2.6 Supervised machine learning Supervised machine learning involves classification algorithms that reason from given class labels (cor- responding outputs), to produce general hypotheses in order to make predictions on future instances [8]. The resulting classifier from these supervised methods is then tested by assigning class labels, 2.6. Supervised machine learning 21 where the values of the predictor features are known, but the class label values are unknown. The classifier’s performance evaluation is most of the time based on the prediction accuracy [8, 28]. Pre- diction accuracy is calculated by the percentage of correct predictions divided by the total number of predictions. Supervised classifiers can be trained using one of three techniques, namely: • Using all data for training or no validation method, then compute the error rate on the same data. This method of validation can be computationally expensive, but can come in handy when the most accurate error rate is required. Also, the estimated error rate for this technique can be unrealistically low. • Another technique is to split the training set into two portions, two thirds are used for training and the other third will be used for testing and performance estimation. • In cross validation, the training set is divided into equal portions and mutually exclusive subsets. For each subset, the classifier is then trained on the union of all the other subsets. The error of the classifier is then estimated by the average error rate of each subset. 2.6.1 Supervised machine learning algorithm selection The option of choosing a classifier is mainly based on the prediction accuracy, which is explained as the percentage of correct predictions divided by the total number of predictions [8]. If a supervised learning algorithm’s error rate evaluation is unsatisfactory, a number of factors that are affecting the error rate must be investigated, possibly [8]: • A large training set is needed; • the selected algorithm is unsuitable for the kind of data used; • parameter tuning is needed; • the dimensionality of the problem is too high. There are two ways in which one may select a supervised ML algorithm: Firstly, if there is adequate data supply, a number of training sets of size N can be sampled, then run the selected algorithms on each of the training sets. Thereafter estimate the difference in accuracy for each classifier on a large data set, such that the mean(average) total of these differences is an estimate of the expected difference 2.6. Supervised machine learning 22 in generalization error across all possible training sets of size N , and the variance is an estimate of the classifier in the total set. Secondly, in common cases, a supervised ML algorithm can be selected by performing statistical comparison of the accuracies of the trained classifiers on a selected data set [8]. One of the disadvantages of supervised ML algorithms is that these techniques are costly and time consuming in terms of data labelling [29]. However, the principal advantage of these techniques is that an operator can detect errors and correct them where necessary. The notion behind Support Vector Machines (SVM’s) is the distance from the decision surface to the closest data point which is known as a margin, i.e. two data classes are separated by a hyperplane. An upper bound of the expected generalisation of data has been proven to maximise the margin, also creating the largest possible distance between the separating hyperplane and the the instance on both sides of the plane [8]. Figure 2.6 shows that the function of the SVM is to search for a decision surface that is maximally far away from the data points. The margin of the classifier is determined by the distance from the decision surface to the closest point. The decision function of the SVM is defined by a subset (data points) of the data which defines the location of the separator. The six red and blue data points which are on the separator are called support vectors, the rest of the data points take no part in determining the decision surface [8]. It is for this reason that we chose the SVM as one of the supervised method for this research work. Figure 2.6: Support Vectors and margin for a simple problem. Furthermore, the training data set is said to be linearly separable when a pair (w, b) exists such that [8]: 2.7. k-Nearest neighbour (k-NN) 23 wᵀxi + b ≥ 1, for all xi ∈ ρ wᵀxi + b ≤ −1, for all xi ∈ ν (2.10) where w is the weight vector (quantity with magnitude and direction), b is the bias (decision hyper- plane) or −b the threshold and xi is the i-th component of x. Moreover, SVM’s are able to learn tasks where the number of features is larger than the number of training instances. Thus the complexity of the SVM algorithm is unaffected by the number of features encountered in the training data [8]. 2.6.2 SVM kernel functions Most of the real-world problems consist of non-separable data, for which there is no hyperplane that can successfully separate the negative from the positive data [29]. Kernel functions are a class of functions which are responsible for solving such problems. These classes of functions map the input data to a higher-dimensional space which is called the transformed feature space. The kernel functions calculate the inner products directly from the feature space. Selecting a kernel function is a crucial step, as it defines the transformed feature space into which the training set will be classified. The SVM will still operate accurately as long as the kernel function is working properly [8]. The process of selecting a kernel setting is closely similar to choosing hidden nodes in neural networks, which we will look at in the succeeding sections. Some of the kernels are Gaussian (also known as radial basis) functions and polynomial functions. 2.7 k-Nearest neighbour (k-NN) k-Nearest Neighbor (k-NN) also called lazy learning, is instant-based supervised learning. The instant- based algorithms are the kinds of learning methods which delay the process of generalisation until classification is performed. Informally, k-NN is defined as an instant-based learning (IBL) algorithm that stores the entire training set into memory, such that when a test set is presented, k-NN selects the k nearest training instances (data points) to the particular test instance. From these k nearest instance that have been selected, k-NN predicts the dominating class of the k instances as the class 2.7. k-Nearest neighbour (k-NN) 24 which falls under the test instance [30]. In other words, the instance within the data set will closely be approximated to other instances with similar properties [8, 31]. The distance is determined by a distance metric. Table 2.1 shows some of kNN distance metrics: Euclidean:D(x, y) = ( m∑ i=1 |xi − yi|r )1/r Camberra: D(x, y) = m∑ i=1 |xi−yi| |xi+yi| Manhattan: D(x, y) = m∑ i=1 |xi − yi| Table 2.1: Approaches for distance metrics [8]. where r is 1 or 2, and where x = (x1, x2, ..., xm) and y = (y1, y2, ..., ym) are vectors. Some of the distance metrics that are used in k-NN are tabulated in Table 2.1. The sole purpose of a distance metric is to minimise the distance between similar classes and to maximise the distance between different classes [8]. Moreover, the choice of k affects the performance of a k-NN algorithm, since incorrect classification of query instances is possible because of the following reasons [8]: • The region defining the class or some part of the class is so small, that the instance of that particular class overpowers the region. This problem can be solved by choosing a smaller k. • Noise present on the query instance also causes misclassification, because the noisy instance overpowers the information to be classified. A higher k can solve this problem. To support the above two facts, Okamoto and Yugami [30] present their analyses using instance-based learning (IBL) algorithms. The authors used a k-NN classifier to learn the algorithm’s behaviour when handling noise. The tests were done on three different types of noise, namely, irrelevant attribute noise, relevant attribute noise and class attribute noise. These attributes are the information sources that the quality of data can be classified under. Where relevant and irrelevant attributes noise are Boolean-attributes noise. Their analyses were presented on both noisy and no noise domains in order to analyse the learning behaviour of the k-NN algorithm. In their analyses, they deal with m-of-n concepts, (where m is threshold value in target concepts and n is number of relevant attributes), whose positives are defined by having m or more than n relevant attributes. They also include the size of training set, the number of irrelevant and relevant attributes, the probability x of attribute, the threshold value and the value of k. In order to analyse the effects of noise, they plot the predicted 2.8. Decision trees 25 behaviour of k-NN in each domain. The effect of each domain characteristic on the expected accuracy of k-NN was predicted. In addition the required number of instances to achieve a particular accuracy was also predicted. Lastly they presented the most favourable value of k against the training set. Their study shows that the expected accuracy of the k-NN algorithm noticeably decreases as each value of k increases where noise is present. 2.8 Decision trees Decision trees are supervised classifiers that learn simple decision rules from the data points, in order to predict the value of the target. The goal of these models is to sort instances based on features for classification [8, 32]. These learning methods select which variables are important for classification instances, as well as indicate to which class a particular instance belong. Figure 2.7: Decision tree classifying from a set of attributes. Figure 2.7 is a decision tree that classifies from a set of attributes (poses). Each level on the decision tree divides the data according to different specified attributes. The main aim of a decision tree is to achieve a perfect classification score with the minimal number of decisions to be made. Therefore, the feature that best classifies the training data is said to be the root node of the tree [8]. Even though there is no one method that is known to be best in finding the feature that best divides data points, information gain is proven to be one of the popular methods that is used to find the feature that best divides the data points [8]. In machine learning the concept of information gain is used to define the measure of impurity in a data set. Information gain increases with the average purity of a particular subset, i.e., it involves partitioning data into subsets which contain instances of similar values. Further advantages of decision trees are: 2.9. Neural Networks 26 • Decision trees are simple to understand. • They are able to handle categorical and numerical data. • It is possible to validate a model of a decision tree using statistical data. • Decision trees require little data preparation. A decision tree can be interpreted into a set of different rules. This is achieved by creating a separate rule for each path. Rules are created from the root to the leaf of the tree [8]. In classification, each class is represented by a disjunctive normal form (DNF). A DNF means, any Boolean condition that can be used to classify the data and can be expressed in ∨ ”or” or ∧ ”and”. Therefore, a k-DNF expression is of the form: (X1∧X2∧...∧X2n)∨(Xm+1∧Xn+2∧...X2n)∨...∨(X(k−1)n+1∧X(k−1)n+2∧...∧Xkn), where k represents the number of disjunctions (joining of two statements with the connector ∨ ”OR”), n the number of conjunctions (joining of two statements with the connector ∧ ”AND”)in every disjunction and Xn is defined over alphabet X1, X2, ..., Xj∪ ∼ X1,∼ X2, ...,∼ Xj [8]. The goal of learning a particular set of rules is to build the smallest rule set that matches the training data. 2.9 Neural Networks Neural Networks (NN) machine learning algorithms are based on the notion of perceptrons. A per- ceptron can be seen as a computerised machine, which is built to represent a human brain in order to recognise objects. They are an attempt of modelling the data capability of a nervous system [33]. Figure 2.8 depicts a single layered artificial neuron model. If x1 up to xn are input feature values, w1 up to wn are prediction vectors or connection weights, and t is the adjustable threshold, then the sum of weighted inputs is computed using perceptron: if ∑ i wixi ≥ t then y = 1 else (if ∑ i wixi < t) then y = 0 (2.11) The common way used for a single-layered perceptron algorithm to learn a linearly separable pattern, is to run the algorithm over and over through a given training set, until it finds the prediction vector which is correct on all of the training set. The prediction rule will then be used to predict the test 2.9. Neural Networks 27 Figure 2.8: Single layered artificial neuron model. set [8]. However, if the instances are not linearly separable, Artificial Neural Networks (ANN) are used to try and solve the problem of inseparability. ANN also known as multilayered perceptrons, are neural networks of a large number of neurons (units) that are joined in a pattern of connection. These neurons are divided into [8, 34]: • input nodes - whose function is to receive information that is to be processed. • output nodes - these are neurons which are the results of processing. • hidden nodes - these are neurons in between input and output. 2.9.1 Feed-forward artificial neural network The feed-forward ANN is a multilayered perceptron which allows the signals to travel from input to output [8]. Figure 2.9 illustrates a feed-forward ANN where input-output mapping is determined by training a network on a set of paired data. The connection weights between neurons are fixed, then the network is used to determine classification of new instances [8, 34]. Mathematically, let the activation function (output) be aij of the j th neuron in the ith layer, where the input vector is determined by the jth element of a1j . Moreover, to determine the activation value at all the output neurons, the signal travels through the network during the classification process [8]. Every input node consists of an activation value, which represents a feature external to the net. These inputs then send the activation value to every hidden node which is connected, where each of the hidden nodes will calculate its own activation value. Then the signal is passed on to the output. The activation value for each receiving node is 2.9. Neural Networks 28 Figure 2.9: Feed-forward ANN [8]. calculated according to its activation function. Thereafter, the function calculates the sum of all the contributions of the sending nodes, where the contribution of a node is defined as the weight of the connection between the sending and the receiving node x the sending nodes activation value [8]: The sum is further modified by adjusting the activation sum to a value between 0 and 1, and/or setting the activation value to 0, except if a threshold is reached for that particular sum. Additionally, equation 2.12 is used to establish a relation between the next layer’s input and its previous layer [8]: aij = σ (∑ k (wijk.a i−1 k ) + b i j ) (2.12) where: • σ is the activation function, • wijk is the weight from the k th neuron in the (i− 1)th layer to the jth neuron in the ith layer, • bij represents the bias of the j th neuron of the ith, • aij is the activation value of the i th layer. In order to determine how well the network performs with respect to its given training samples and expected inputs a cost function is used. The cost function C is a single-valued number and not a vector, it is of the form: C(W,B, Sr, Er) (2.13) 2.10. Performance evaluation metrics 29 where W is the neural network weights, B is the network’s biases, Sr is the network’s input of a single training sample and Er is the desired output of the training sample. 2.10 Performance evaluation metrics There exist different types of metrics that are implemented in order to evaluate algorithm performance. In this section we look at proposed methods for evaluation of performance of the classification models to be trained. 2.10.1 Receiver Operating Characteristic (ROC) curve The Receiver Operating Characteristic (ROC) curve was initially developed by an engineer during the World War II; its main purpose was to detect enemy objects in the battlefields [35]. Over the years, this method has been widely implemented in fields such as data-mining and machine learning, as a graphical representation of the performance of a particular classification algorithm. The aim of the ROC curve is to [35, 36]. • Compare the efficiency of two or more classification algorithms; • Study the inter-observer variability when two or more observers measure the same continuous instances; • Find an optimal cut-off point to least misclassify the two group subjects, and • Evaluate the discriminatory ability of a continuous marker to correctly assign into a two group classification. 2.10.2 Confusion Matrix Confusion Matrix is one of the evaluation measures that can be extracted from a ROC curve. The confusion matrix is the evaluation measure for binary classification problem, i.e. a classification problem which has two classes, positive or negative. Positive and negative classes can be further characterised as true positive, false negative and false positive, true negative respectively. Table 2.2 tabulates a confusion matrix with the following evaluation measures: 2.11. Prior art 30 Predicted Class True Class Positive Negative Positive True Positive False Negative Negative False Positive True Negative Table 2.2: Confusion Matrix. • True positive (TP) - These are instances correctly predicted in relation to the positive class. • False negative (FN) - Instances predicted as negative, whereas their true class is positive. • False positive (FP) - Instances predicted as positive, when they are actually of the negative class. • True negative (TN) - Instances predicted correctly as negative class. Se = |TP | |TP |+ |FN | Spe = |TN | |FP |+ |TN | (2.14) where Se and Spe are sensitivity and specificity of a classification model respectively. These are statistical measures in ML models to determine the usefulness of a test [35]. In terms of face pose, Se is the probability that a subject’s face image will be correctly classified under its rightful pose class, and Sp is the probability that an incorrect face pose will be truly classified as incorrect based on the trained classes. 2.11 Prior art It has been proven that variation caused by pose forms one of the factors that affect the full func- tionality of an FRT [3]. In fact Abiantun et al [37] in their work, give as reason that the problem of pose remains overlooked in real-world applications. This section looks at some of the research work which attempts to learn pose-invariant face features, in order to achieve the full functionality of an FRT system [38]. The motivation behind the selection of prior art in this section is to learn about the methods that are used in order to solve the problem of pose. Additionally, to learn the different 2.11. Prior art 31 degrees of angle (°) that pose variation images have been tested on, the different databases that are used to solve variations caused by pose, and also how pose-invariant images are being represented in different methods. Yan et al [39] proposed a novel learning framework which uses Multi-task Learning approach (MTL). MTL approaches in classification/regression, aims at solving multiple task at the same time. These methods simultaneously learn classification models for a given task, for example methods which classify head pose. For a given head pose (45°), their method learns appearance across partitions which operates on a grid. Additionally, it also learns partition-specific appearance varia- tions. To select the appropriate partitioning of the head pose, they use two graphs from a trained appearance similarity model, one which uses grid partitions based on camera geometry and the other which uses head pose classes, to efficiently cluster appearance-wise grid partitions. The difference of this approach from our approach is that [39] employs the MTL head-pose classification method under target motion. On the other hand, one study has shown that pose-invariant features can be achieved by Principal Components. Kumar et al [40] in their study use Principal Components as features. Taking a large set of features, Principal Components involve a smaller number of dissimilar features. They use the eigen vectors of the covariance matrix as their basis vectors for these Principal Components. In order for them to learn similarities, learn dissimilarities and reduce the dimensionality of these Principal Components features, they make use of PCA. Furthermore, the significance of a feature is determined by the eigen vectors that have the most energy as interpreted from their corresponding eigen values. The images are then projected into a reduced vector space, viz. the vector space that is formed only by the selected significant eigen vectors. Finally, their feature vector is formed by the projection coefficients. Only then do they estimate a linear transformation function from the feature vector which will generate a frontal-face feature from a non-frontal face. The work that is done by [40] uses the Indian Institute of Technology Kanpur (IITK) Database, where each subject has images in poses: 0° (being a frontal pose), ±15° and as well as ±30°. The IITK database is a multimodal project which is sponsored by the Ministry of Communication and Information Technology, New Delhi. In this project IITK develops other biometric systems such as fingerprint and iris recognition, including face recognition. Finally, [40] uses 50 images to learn the transformation function and 59 for testing. To learn pose-invariant features, Zhang et al [41] use a high-level feature learning scheme, which incorporates random faces (RF) and sparse many-to-one encoders (SME). Autoencoders are methods which are used for unsupervised learning in artificial neural networks (ANN), for efficiency in coding 2.11. Prior art 32 in order to learn representations (features) on a particular data set. They first select facial landmarks on face images at ±45°and ±75°. Then according to the similarities defined by the landmarks, they use an engineered feature extraction method to extract pose-invariant facial features. Their method of feature extraction is based on a single-hidden-layer neural network, which they use to extract discriminative features, and train an input facial image that is in different poses (many). In contrast to the traditional feed-forward neural network, where the hidden layer is the output of a weight function followed by an activation function, their model uses the hidden layer as pose-invariant high- level feature representation, assuming that images of the same subject in different poses share the same high-level feature representation. They then use these features to compare the input to gallery of face images which are in the frontal pose, to confirm the identity of the received probe image. A similar method of extracting engineered features was earlier used in [42]. Ensuring better semantic correspondence across poses in face images is achieved by combining landmark- level and component-level features in [43], through extracting Multi-Directional Multi-Level Dual- Cross Patterns (MDML-DCPs). Using the first derivative of the Gaussian operator, MDML-DCP computes the face image by reducing the impact of illumination. DCP’s are descriptors that are used to represent a face image. These descriptors are based on the textural structure of a human face. In order to build a pose-invariant face representation they follow three steps, namely, image filtering, local sampling and pattern encoding. In order to perform local sampling and pattern encoding, they use Dual-Cross Patterns (DCPs). These patterns are formed by landmarks from facial components, i.e. eyebrows, eyes, nose, mouth and the forehead. To encode the patterns that will form a face representation, they combine textural information and DCPs belonging to the same facial component. Furthermore, to construct a pose-invariant face representation, they use MDML-DCPs to convert grey- scale face images into multi-directional gradients that are invariant to illumination. The conversion is performed by the first derivative of the Gaussian operator. The next step is to build pose-invariant descriptors by normalising the face images using geometric rectification based on the similarity and affine transformation. The reason they use a similarity transformation is that it restores the original information of facial components and; facial contours; and also retains their configuration, while the use of affine transformation reduces differences in appearances caused by pose variations [43]. Their method proves that by combining both component-level and holistic-level features the representation generated is complementary and appropriate, thereby promoting robustness to variations. The reason is that component-level features are independent of changes in face pose; for they focus on a single fa- cial component, whereas even though holistic-level is sensitive to changes in pose, it captures complete 2.12. Conclusion 33 information on both facial contour and facial components. 2.12 Conclusion From the methods discussed in this chapter, our method performs classification of input face images as the first step towards correcting the pose of the face image. This chapter looked at the fundamental terminology and calculations regarding the extraction of features and feature classification learners. Also, it looked at some of the work done by researchers to extract pose-invariant features. In the next chapter we will go through the methods that were used in this research work in order to achieve a representation for pose-invariant face images. Chapter 3 Methodology 3.1 Introduction This chapter specifies the formal steps leading to image pre-processing, feature extraction, feature normalisation, supervised classification and comparison and evaluation of classification models. The methodology of this study is based on the model of Ho et al [9]; for face pose classification. Unlike [9], the purpose of this study is not to use Markov random fields for matching face images, but to implement a classification model that will classify a still face image according to different poses. The chapter further explains corresponding architectural illustrations. Our approach focuses on creating a pose-estimation model by classifying images in different poses. Figure 3.1 illustrates a flowchart that shows the proposed methodology of this research study. First an image dataset will be constructed from the FERET database. These images will be in different poses and randomly selected. Secondly, image pre-processing will be performed to improve and enhance image data, in order to remove unwanted distortions, such as noise. The next step will be to extract features in order to create representation of the face images that are going to be used in this research work. Following the features extraction will be the normalisation step, where the range of the extracted features will be standardised. After normalisation, the normalised features will be divided into training data and test data, then a supervised classification model will be prepared which is required to make predictions (of poses). Finally, from these trained models, a decision will be made on which model best classifies our data. The model that best classifies our data will be selected to be the pose classification model. 34 3.2. Pre-processing 35 Figure 3.1: Proposed approach. 3.2 Pre-processing Pre-processing is a process that includes several functions. The main purpose of pre-processing is to improve or enhance the image (in most cases presented in pixels) data that cause distortion, for both input and output, and it is also important for further processing of the images [29]. Figure 2.1 in the previous chapter, presents some of the FERET images that are used in this research work. One of the main advantages of the FERET dataset is that it is available as open source, and although the images are not of high resolution (HR), there are a reasonable number of face images (14126) that address the problem of face pose. For this research work, a hundred images from each of the four different poses (P1 - P4) are selected to build a classifier, giving a total of 400 images for both training and testing. The computation time is directly proportional to the number of images used. The images range from qr (+22,5°) quarter right, ql (-22,5°)quarter left, random images where subject faces to photographer’s left, rd (-45°) and random images where the subject faces to the photographer’s right, ra (+45°). Furthermore, rd and ra are random but consistent images whose precise angles are unknown. For example, both the first and the second images Figure 2.1 are of the same pose, qr. The images are in colour and of size 768 x 512 pixels. These images form the basis of our dataset. 3.2.1 Patch cropping As seen from the montage in Figure 2.1, and from the size of the images mentioned in the previous paragraph, these images are relatively large (mugshots, showing the shoulder area) to be directly pro- cessed, hence we do not use the whole face area (global), but select only the local landmarks (eyes, nose and mouth). Also, because of the chosen face pose (non-frontal), we could not automatically detect 3.2. Pre-processing 36 the landmarks using the Viola-Jones algorithm [44], the first object detection framework proposed by Paul Viola and Michael Jones, that provided competitive object detection rates in real life. Besides object detection, this algorithm is also used for frontal face detection. As a result, we used a snipping tool [45], to capture clips of an image. This tool helps to manually select the patches that are desired (eyes, nose and mouth). Our patch size was not fixed, the size varied from patch to patch, i.e. one subject’s patch size of the mouth is not the same size as the next subject’s patch size. The idea is to find a rectangle that covers the area of the chosen landmark, as seen in Figure 3.2, without cropping any of the landmarks around it. Algorithm 1 shows the described procedure of cropping the patches from the image Algorithm 1 Patch cropping face image← PatchSize 1: if PatchSize = m x n matrix (width × height) then imageCropping = save patch 2: else imageCropping = discard patch 3: end if We then used the image set function from Matrix Laboratory (MATLAB). This function is for loading the patches on to MATLAB, where they will be processed as a collection of image datasets rather than individually. MATLAB is a tool that scientists and engineers use to analyse and design systems. This tool has been used in different research areas including machine learning, computer vision, image processing and robotics. The main reason MATLAB was selected to pursue this research work is that it provides a number of useful tools with regard to image processing. Furthermore, MATLAB provides a platform that is interactive, i.e. one is able to monitor images by means of visual display. In addition to image processing tools, MATLAB offers statistical as well as classification tools. When loaded on to MATLAB, these patches are then converted to matrices, the format in which MATLAB reads images. The smallest patch that is loaded in the image set is of size 39 x 164 x 3 pixel unit8, where: • 39 x 164 (width x height) pixel represents the size of the patch. • x 3 represents the RGB colour space, i.e. the patches are first loaded in true colour defined by the three chromaticities of red, green and blue primaries. • unit8 is a MATLAB data type, which stands for unsigned integers, i.e. the values contained in a particular matrix are all positive. 3.2. Pre-processing 37 Figure 3.2: Eye, mouth and nose cropped patches. 3.2.2 Pixel brightness transformation When programming an image, we need to consider that pixels have width and height therefore we use the formula: • Assume an image with a given WIDTH and HEIGHT, measured in pixels. • We know that the pixel array has a total number of elements equalling WIDTH * HEIGHT. • For any given X, Y point in the image window, the location in our pixel array is: LOCATION = X + Y * WIDTH Furthermore, from the pixel properties discussed in the previous chapter, we further learned that image enhancement covers pixel brightness transformation, where pixel brightness has two classes, namely, brightness corrections and grey-scale transformation. Brightness correction changes the brightness of the image without taking into consideration the position of the image pixel, and also retains the luminance of the image. Grey-scale simply applies the irreversible conversion (brightness) of a photo, rather than applying adjustment (changing the photo completely). The main aim of grey-scaling the images is to attain images with equally distributed brightness levels over the whole brightness scale. The distinction between black and white (BW) and grey-scale images is that BW have only two pixel colours, i.e. black and white pixels; whereas, grey-scale intensity is stored as an 8-bit integer, i.e. 256 possible shades of grey from black and white. The transformation function is given as: s = T (r) (3.1) where s is the pixel value of the output image, r is the pixel value of the input image and T is a transformation function that maps each value of r to each value of s. 3.2. Pre-processing 38 In this study, we used the linear grey level transformation. For the linear transform we used a simple identity transformation function, so as to enhance our images in order to provide more detail about the image, and also for better contrast. We had initially selected the power - law transformation, but later observed that we did not extract as many strong keypoints as we would desire to, for better classification, i.e. some of the patches contained a lot of 0’s thereby resulting in less information about the patches. Nonetheless, in identity transition the mapping is direct for each value of the input image to the output image, resulting in the same input image and output image. Figure 3.4 shows a representation of a true colour and a grey-scaling image. The two images are still the same in position and scale even after grey-scale, the difference is the brightness of the images. Figure 3.3: Representation of the original and grey-scale images. 3.2.3 Image filtering using guided filter Guided image filtering has been proven to be efficient and also effective in a larger diversity of machine learning and computer vision applications [46]. We chose the guided filter as a method of filtering because it allows edges to be preserved on the patches, considering that in our feature extraction method, we are required to compute image edges. Also this method of image filtering has proven to have better behaviours near edges, and it can transfer the structure of the guidance image (be it the input, output or a different image). The guided filter uses the contents of the guidance image to influence the filtering, by creating a less smoothed and more structured output image. In this research work, the filtering output patch at a pixel i is a weighted average of: qi = ∑ j Wxy(I)ρj (3.2) where x and y are the pixels at a given point, and I is the guidance image whose function is a filter kernel Wij and is independent of ρ. 3.3. Feature extraction 39 3.3 Feature extraction For this work, we selected the SIFT algorithm to convert the face patches into keypoints for better database representation. The reason we chose this algorithm is that we are not only getting the parameters x and y (the position of a pixel) and t which is the scale of the image, but SIFT finds keypoints which are rotation, viewpoint and illumination invariant. Algorithm 2 shows a step-by-step method of how this algorithm was used in this work to create a descriptor. While the original scale parameter (σ value) of SIFT is zero (0), we extract the SIFT keypoints at three different scale parameters; σ = 3, σ = 6 and σ = 9. This is done so that we observe the value of σ where we are able to extract enough keypoints for accurate classification of face pose, i.e. the level of blur that can give us better classification of face images at different poses. Algorithm 2 Compute SIFT descriptor SET octave SET level SET σ procedure –Create scale space for each octave do Create Gaussian blur intervals Create DoG intervals Compute edges for each interval procedure –Finding Keypoints Search each octave for stable extrema Create keypoints at dominant orientations of extrema procedure –Generate a descriptor for each keypoint do Rotate sample grid to match keypoint orientation Sample the rotated region and create descriptor Save descriptor The goal of the above process is to determine the value of σ that will be suitable for our dataset, which will assist in extracting meaningful descriptors for each patch. Again in chapter 2 we learned that to remove unwanted keypoints, i.e. those which are low in contrast and those which lie at the edge of the image, a certain threshold has to be set that will determine the magnitude of the current pixel. We will again regulate the threshold value for better localisation of keypoints. Lastly, to create a 128-dimension descriptor, we keep the original window at 16 x 16 around the keypoint and divide the window into sixteen 4 x 4 windows. Therefore, the magnitude (m(xi, yi)) and orientation (θ(xi, yi)) of each patch sample (pixel) will be calculated as shown in equations 2.8 and 3.4. Feature Normalisation 40 2.9 for all classes (±22, 5°,±45°). The objective of this feature extraction method is to extract pose invariant features, by locating meaningful keypoints and constructing significant descriptors which can also be used for texture classification in future research work. 3.4 Feature Normalisation Feature normalisation is a step which is performed during the image processing stage. This method is used to standardise the data from the extracted features. Each feature component is normalised to a range of [0 1]. There are various normalisation methods used which are scale invariant such as residual data, mean data and standard deviation data. The normalisation procedure used in this research work is the Linear scaling to unit variance. The main purpose of choosing this procedure over others is to transform the feature component x to a random variable with mean 0 and unit variance. The procedure taken to achieve scaled features is described in subsequent sections: 3.4.1 Calculating the Mean (µ) Calculating the Mean (µ) means we calculate the average of all the feature vectors extracted from each patch. We have 100 patches per pose, i.e. P1 - P4, the mean is calculated per pose as the average of features (100 x 128), 100 images each having 128 feature vectors. We then use the formula: X¯j = ∑n i=1Xij n (3.3) where: • n is the keypoint of all the patches, n = 100; • i is each keypoint in a feature vector per patch (column), such that i = 1, ..., n; • j is each keypoint in a feature vector per patch (row), j = 1, ..., 128; • Xij is the extracted keypoint at position ij . In other words, µ of a pose class will be a feature vector of 128 elements representing the mean of any given class of the four classes we have. The purpose of this step is to find the average per pose class. The next step is to calculate the spread of features per class. 3.4. Feature Normalisation 41 3.4.2 Calculate the Standard Deviation (σ) The definition of Standard Deviation is given as, the average distance from the mean of the data set to a point [47]. Here we calculate how spread out our features are, by calculating the squares of the distance from each data point, to the mean of the set. Thereafter, we added all the calculated squares, divided by n− 1, then took the positive square root. The formula used is: sj = √∑ (Xij − X¯ij)2 n− 1 (3.4) Where: • n is the keypoint of all the patches, n = 100 • i is each keypoint in a feature vector per patch (column), such that i = 1, ..., n • j is each keypoint in a feature vector per patch (row), j = 1, ..., 128 • sj is the keypoint at the jth position • Xij is the raw feature • X¯ij is the mean at a position Similar to the mean, the σ of a pose will be a 128-dimensional feature vector representing the Standard Deviation of any given class of the four classes we have. The next step following the two discussed subsections, is to normalise our features vector for each patch by using equation 3.5. After determining the mean and standard deviation of the four poses, we then calculate the normalised feature vector for each patch: x˜i = |xi − µj | σj (3.5) In the above equation, we let: • xi be the extracted keypoint at the ithposition; • µj be the calculated mean at the jth position; 3.5. Supervised Classification 42 • σj be the calculated standard deviation at the jth position; • x˜i be the normalised keypoint at the ith position. 3.5 Supervised Classification We use a multiclass classification method for training our models. In a multiclass classification, the given inputs X represent a matrix of predictors, and a set of n training observations, ~xi, which consist of p-dimensional vector of features each, where each column represents one predictor and each row represents one observation. For each training observation, there exist a corresponding class label, yi ∈ {1,−1}, represented by Y which is an array of class labels (P1 - P4) where each row corresponds to the value of corresponding xi. Therefore, n normalised SIFT feature vectors of dimension 128 that contain training observations of (~xi, yi) which represent the feature at position i and the corresponding class label at yi. Furthermore, test observations which will be used at a later stage in order to test the classification performance of our model are represented by ~x∗ = (x∗1, ..., x∗p). With this information, our goal was to classify into which pose (°) the input image falls. 3.5.1 Support Vector Machine (SVM) Essentially, there are two techniques in which one may train a multiclass classifier, namely, One- against-All (1AA) and One-Against-One (1A1), where 1A1 is a classifier that performs pairwise com- parison between n classes, and each classifier trained on two of these classes, giving a total of N(N-1)2 models. On the other hand, 1AA builds model with N different binary classifiers [48]. Upon testing our data on both these techniques, we opted to use the 1AA technique. Our reason for choosing this approach was not only based on the data, but that it was the most commonly used approach [48]. In addition we had observed that there was no significant difference in validation accuracy percentage when we trained our data on both models. Hence we opted to train the data using the One-Against-All approach. Figure 3.4; tabulates how each model is trained and how a class is tested against other classes. For instance, when Pose 1 is trained, all other poses become negative and Pose 1 becomes positive, and so on. Furthermore, we used an adjustable kernel in order to be able to adjust the kernel scale to improve the model learning. 3.5. Supervised Classification 43 Figure 3.4: Model training. 3.5.2 k-Nearest Neighbours We used k-Nearest Neighbours (k-NN) to solve a classification problem. The first step will be to determine the value of k. In Chapter 2; we learned that with an increase in the k value, the separating boundary becomes smoother; the intention is to increase the value of k until we get a smoother boundary. Also, from the same value of k; we can determine the two parameters which are crucial for our research work, namely, the validation error rate and the training error rate. These error rates together with the training curve will help us determine a suitable value of k. 3.5.3 Decision trees Considering that our data are categorical, i.e. either ”true” or ”false”, the aim is to use a decision tree algorithm that performs binary splits. The decision tree that is going to be trained will set apart the descriptor to be trained based on all the four poses that are chosen. Figure 2.7 in the previous chapter depicts the structure of the decision-tree model for this research work. The normalised SIFT descriptors will be the root node of the entire population on the tree. Thus, our root node will have 400 samples of dimension 128. This will then be further divided into four nodes (±22,5°and ±45°) for training, and be labeled accordingly. The first step will be to set limitations on the tree size to ensure that there is no over-fitting while modelling a decision tree. The over-fitting problem can be eliminated by using various parameters 3.5. Supervised Classification 44 which are used to define a decision tree. Parameters that we have to consider include: • Minimum required number of observations to split a node. Minimising the observations will also help in avoiding the problem of over-fitting our model. • Controlling the depth of the tree. Using K-fold validation, we would determine at which value of K we obtain better classification without over-fitting our model. Also considering that higher values of K will allow our model to learn similarities of face pose. • Maximum number of surrogates per node. Considering that in a binary tree, a depth of n; would produce a maximum of 2n nodes, this will help to determine the maximum number of surrogates per node. • Maximum features to consider when splitting the tree. 3.5.4 Neural network pattern recognition In order to train the extracted features using neural networks, we chose a two-layer feed-forward neural network. The reason we chose this classification algorithm is that it can perform the classification of vectors with less computational complexity, yet achieve satisfactory classification accuracy, if the network is given enough neurons in its hidden layer. Figure 3.5 depicts our network architecture. The network will consist of an input layer; its inputs are SIFT descriptors from the four different poses that we chose, i.e. ± 22,5 ° and ± 45°. The data will then be passed through the next layer which is a hidden layer that consists of a sigmoid activation function. The sigmoid activation function takes the mathematical form of θ(x) = 1 (1+e−x) which makes learning of connectors in a neural network easier. The sigmoid function is used in neural networks and machine learning because it uses less computational power to process the neurons in a network. In our work, the sigmoid function is incorporated with softmax output neurons, which is a function used in the final layer of our neural network classifier. The figure also shows the weights denoted by (W ). These weights determine connection from the current neuron to the next layer’s neuron, each connection carrying different weights. There is no connection among neurons of the same layer. To ensure better performance of our network, we plan to increase the number of neurons during the training phase. 3.6. Conclusion 45 Figure 3.5: Model training. 3.6 Conclusion In this chapter we looked at mathematical equations and algorithms which forms part of the method- ology for this research work. In the following chapter we will look at the performance of our data, when methods discussed above are applied, these results will also be analysed. Chapter 4 Analysis of results 4.1 Introduction This chapter discusses the experiments that were performed to determine which angle class the input face falls. The main purpose of our experiments was twofold, namely: • To determine the angle of the input; thereby accomplishing the research objective (1). • To determine the classifier that gave the highest classification accuracy on patch-based face images; thereby accomplishing the research objective (2). Given the type of problem we are addressing (face pose), model evaluation was not only done to learn how well a specific model could predict a given pose, but also to learn how well to choose between different models, by tuning different parameters. Furthermore, to quantify the model’s performance, by using both classification accuracy and a given metric. 4.2 Pre-processing The dataset used for experiments, which were conducted in this study are randomly chosen mugshots from the FERET database. The database is distributed by the National Institute of Standards and Technology (NIST), in order for researchers to develop better automated face recognition algorithms, technology and techniques. Also, this database is used to test and evaluate different face recognition 46 4.3. Feature extraction 47 algorithms. The release agreement for the download of the FERET database can be downloaded at the following link: https://www.nist.gov/itl/iad/image-group/color-feret-database. The mugshots were selected by going through the images in the database, and selecting the images that have all required poses per subject. The subject’s set of images had to have all four poses used to create training classes for this research work, i.e. ql, qr, ra and rd face poses. The mugshots that were excluded are those which had some or none of the poses we selected for the training set. The selected mugshots are from subjects of different races, i.e. Black, White and Asian. The subjects have wrinkles, are bearded (facial hair) and some of the subject’s’ eyes are half open. Essentially, these subjects are in four different poses, ±22,5°and ± 45°, i.e. P1 = -22,5°, P2 = +22,5°, P3 = +45°and P4 = -45°. Upon obtaining the patches using Algorithm 1 in Chapter 3, pre-processing was performed. Firstly, the patches were set to the same size (200 x 200 (wxh)), in order to be able to combine them. The re-sizing was further useful when we extracted features, as any size lower than 200 x 200 was not sufficient for extraction of features for some patches. After performing a guided filter by applying equation 3.2, we obtain an image as shown in Figure 4.1. This image is a patch for one subject in which features will be extracted in order to form a representation. Figure 4.1: Guided filtered image. 4.3 Feature extraction Initially, using the equation 2.2 discussed in Chapter 2, L(xi, yi, σ) = G(xi, yi, σ) ∗ I(xi, yi) to create a scale space and Gaussian blurs, we set octave and level to the original Lowe’s values, which are 4 and 5 respectively. The problem was that when keypoints were extracted, there were more fields of the 4.3. Feature extraction 48 128-descriptor that contained ’0′s’. As a matter of fact, we realised that the classification percentage drops drastically when classification is performed, and also the algorithm takes long to execute, giving an average of 14.41 seconds elapsed time at run-time. We then decreased both values (octave and level) to 3 at each σ value (3,6,9). This change of the original SIFT algorithm gave us better representation of L(x, y, σ) also giving us the average run-time of 1.14 seconds, thereby computing Difference of Gaussian (DoG) images and locating keypoints more efficiently. Following the procedure in Algorithm 2, the SIFT keypoints used in this research work were extracted at different scale parameters (σ), as a result, creating different scale spaces, i.e. scale space are created at (σ = 3, σ = 6 and σ = 9) for each pose. The choice of different scale parameters was performed so as to identify the ideal scale space for our dataset. For instance, in Figure 4.2 σ is set to 9 to extract features for Pose 1, 2, 3 and 4, thereby obtaining the patches in Figure 4.2, where σ = 9, i.e. G(x, y, σ) Difference of Gaussian patches. These patches are a result of both equations 2.2 and 2.3, where a Gaussian blur operator G(x, y, σ) is applied to obtain the image in Figure 4.2. Figure 4.2: The DoG patches at σ = 9, G(x, y, σ). Furthermore, the next step is to locate keypoints by using the DoG patches obtained in Figure 4.2. We apply equation 2.5 in order to locate the subpixel keypoints for each patch, and also apply equation 2.6 and 2.7 for accurate localisation of keypoints on the DoG patches. This process of localisation of keypoint was depicted using the G(x, y, σ) patches for clear visualisation in Figures 4.3 (a) and (b). Figure 4.3 (a) shows the low contrast keypoints that were initially located in green colour. Some of these keypoints are either poorly localised or low in contrast therefore we needed to eliminate them. So, looking closely at Figure 4.3 (b) we observe that some of the keypoints have fallen off, only the ones that are accurately localised (in blue colour) remain. These keypoints are the ones which were used to construct a pose invariant descriptor for better classification of face images. Moreover, the 4.3. Feature extraction 49 threshold value for accurate localisation of meaningful keypoints is set to be 0,1, thereby allowing elimination of poorly constructed keypoints or those with low contrast. These keypoints were useful for construction of a descriptor. The reason we chose 0,1 was that we were unable to construct a descriptor at a higher threshold value because of our patch size. (a) Low contrast (b) Accurate keypoints Figure 4.3: Keypoints localisation. Even though the SIFT descriptor that we extracted was normalised, it had uneven keypoints, i.e. some keypoints were slightly if not much higher than others, i.e. negative values were extracted; as a result, the positive keypoints appeared to be more significant in weight than negatives. Initially, we took raw SIFT keypoints and put them in a scatter plot; we obtained a scatter plot as depicted in Figure 4.4. The horizontal values represent the x-axis, while the vertical values represent the y-axis of the raw SIFT descriptor. As seen in the figure, the values of the descriptors are well between 0 and 1, meaning they are normalised. These features were trained using a random SVM classifier, with pre-defined parameters. We obtained a percentage of 29,8% classification accuracy. As a consequence of the above classification accuracy, we then applied linear scaling to unit variance normalisation method discussed in the previous chapter. Upon applying this method, we obtained data as shown on the scatter plot in Figure 4.5. We then took these descriptors and trained them with the same SVM classifier the raw descriptors were trained with. The classification percentage obtained with the same pre-defined parameters of the classifier was 54,3%. Our observation regarding this increase in the classification percentage is that, the classifier learns better with scattered keypoints, as opposed to when the keypoints are clustered in one place as in the raw keypoints, where keypoints are clustered just below 1 and most of the extracted keypoints were ’0’. The objective of the normalisation process was to scale the keypoints to have an equal mean and standard deviation using equation 3.3. By normalising the descriptor, we ensure that each keypoint that was extracted and accurately localised, is equally significant for better classification of patches. 4.3. Feature extraction 50 Figure 4.4: Raw SIFT Descriptor. Figure 4.5: SIFT Descriptors normalised using linear scaling to unit variance method. 4.4. Supervised classification, Comparison 51 4.4 Supervised classification, Comparison This section and the proceeding sub-sections will present pose classification results of the experiments performed in this study. We will present different results for different classifiers, obtained from ex- tracting SIFT features at different scale spaces, i.e. σ (3, 6 and 9). Firstly, the training data sets are given as inputs to the classification algorithms in order to develop a classification model. To obtain the training and test set, we used a K-fold cross validation method for all data sets, because when using this method it does not matter how the data gets to be divided, a data point gets in the training set (k-1) times, also we can obtain an indication of how well the learner will perform when given new inputs to predict. As a result, we used 10-fold of the total data to form a training set across all data sets. The remaining set was used for validation. Initially, all four classes (face poses) were trained on all different types of classifiers, then we chose a classifier that performed better given a specific pose. The choice of the classifier depends mainly on the performance of a trained classification model. The results were analysed according to different levels of σ values as shown in Table 4.1; together with the kernels and kernel parameters that obtained the highest classification results. As well as their respective performance metrics, evaluating the sensitivity (the ability of a classifier to correctly identify those which belong to a class) and specificity (the ability of a classifier to correctly classify those elements which do not belong to the group). 4.4.1 Classification at σ = 3 Features were extracted and trained at σ = 3, normalised using linear scaling to unit variance. These features were trained using distance-based measures shown in Figure 4.6. The Hamming distance metric shows a better classification rate as compared to other distance metrics that were trained, obtaining a 98,8% classification rate, this means the likelihood that a face pose will correctly be classified for the trained pose is high. However, the classification rate on its own does not communicate much about the data that was classified. We then used a 4 x 4 confusion matrix (because there are four classes) as shown in Figure 4.7. The confusion matrix shows that for both Pose 1 and 2 the classifier predicted 99 features (in green) correctly. These are True Positives which correspond to 99% accuracy, whereas one image (in red) was misclassified (False Negative) as pose three for both classes. Similarly, the confusion matrix also shows that 98 features from Pose 3 and 4 were correctly classified, corresponding to the 98% and two of the features were misclassified, corresponding to the 2% of the 4.4. Supervised classification, Comparison 52 data. Figure 4.6: kNN Trained Models. Figure 4.7: Confusion Matrix at σ = 3. 4.4. Supervised classification, Comparison 53 4.4.2 Classification at σ = 6 Features which were extracted at σ = 6 were classified using decision trees. With maximum of 10 splits and 10 surrogates, we obtained a classification rate of 97%. The observation made is that, even though we increased the number of splits, we still obtained the same classification percentage. Figure 4.8 shows a plot of the ROC curve for the trained tree, where we plot the TP (y-axis) against the FP (x-axis). The interpretation is that the closer the curve falls to the y-axis and to the top line of the ROC space, the better classification accuracy the trained classifier gives. The Confusion matrix shown in Figure 4.9 shows that for Pose 1, 95 features were correctly classified, which corresponds to the 95%. However, three of the 100 features were misclassified under Pose 3 and two (2) were misclassified under Pose 4. For Pose 2, 99 features were correctly classified, and one feature was misclassified under Pose 1. Three features which belong in Pose 3 and two features which belong in Pose 4 were also misclassified under Pose 1. Figure 4.8: ROC Curve at σ = 6. 4.4. Supervised classification, Comparison 54 Figure 4.9: Confusion Matrix at σ = 6. 4.4.3 Classification at σ = 9 An SVM classifier was used to train features which were extracted at σ = 9. The kernel scale of the classifier was adjusted depending on the classifier’s performance. At kernel scale 4 we obtained a classification rate of 72%, any number above caused a decrease in the classification rate. Even though we used 25-fold of validation division, the plot from the obtained ROC curve (Figure 4.10); shows that the model was not learning very well, as the graph is not as smooth as the previously discussed data, where the sigma value was low. At the same time, the confusion matrix (Figure 4.11) also shows (in red blocks) a lot of misclassified classes, which proves that at σ = 9, our data set is not performing well, even after training the data set with the previous classifiers (decision tree and k-NN). 4.4. Supervised classification, Comparison 55 Figure 4.10: SVM ROC curve at σ = 9. Figure 4.11: SVM Confusion Matrix at σ = 9. 4.4.4 Neural Network Pattern Recognition Next, as an alternative to SVM and decision trees, the normalised SIFT features were trained using a two-layer feed-forward neural network. The input data set for each consists of a 400 x 128 matrix, which is 400 subjects by 128 SIFT feature vector for each patch and the target is a 4 x 128 matrix where each column consists of 0’s and 1’s. The presence of the 1 in a row indicates which of the poses the patch falls under. We initially trained all the poses per σ value. Features which were extracted 4.4. Supervised classification, Comparison 56 at σ = 3 gave better results irrespective of a small training data set. Below are two ROC curve graphs which show how well the model learned. Figure (a) shows the blue, green, yellow and red lines representing the four classes that we trained. It shows how the model learns better on pose 1 and 3 (±45°), as it covers over 0.7 (70%) of the area under the curve (AUC). When the AUC is at 1 (one) it means the model is learning well. The further the lines move away from the left border, the more difficult it gets for the model to learn. Furthermore, the curves also show that the probability of the model to misclassify a face pose is high, but the curves are not below the diagonal line, which may have suggested that the model is performing poorly. (a) ROC curve for Training set. (b) ROC curve for overall network. An equally significant aspect is the confusion matrix on Figure 4.12 which shows that after 23 iterations the performance of our model stood at 80%. While pose 2 obtained a good classification rate, there was no pose that was correctly matched for pose 3, having a misclassification rate of 2,3% in red. Lastly, the performance plot in Figure 4.13 shows the error against the number of iterations for training, validation and test data, this error is based on the total system performance rate in Figure 4.12. It also show how the network’s mean square error drops continually while it learns. The lines (blue, red and green) shows the decreasing error as the network learns. Training and validation converge at a point, indicating how well the network will generalise to new data. 4.4. Supervised classification, Comparison 57 Figure 4.12: Neural Network Confusion Matrix. Figure 4.13: NN validation Performance Plot at σ = 3. 4.5. Evaluation of algorithms 58 4.5 Evaluation of algorithms The Confusion matrix used above is helpful and useful to understand the performance of our classifier, given a set of features. However, it is not a model evaluation metric, we need to be able to choose a better performing model between models that were trained, and determine which model classifies our features better. In order to evaluate the performance of the trained models, we used Sensitivity (True Positive rate) and Specificity (True Negative rate). These performance evaluation metrics were calculated using the formula 2.14 in Chapter 2. The aim is to get both rates high, in that way we knew that the model would correctly classify the pose. Firstly, evaluation of Specificity and Sensitivity was performed for each class, i.e. one against all. Using the information from each class, the second step was to calculate the overall performance of a given classification model, using real numbers not percentages. At σ = 3, Sensitivity was 0.98 and Spe = 0.98. Whereas, σ = 6 obtained Se = 0.94 and Spe = 0.97. Lastly, σ = 9 obtained Se = 0.64 and Spe = 0.64. From these evaluations we concluded that kNN performs better where σ = 3. While on the other hand, SVM at σ = 9, gave us undesirable classification rate, with the features that were extracted. Table 4.1 evidently shows a summary of the results obtained, where using kNN a classification rate of 98% was obtained as the better performing classifier. 4.6. Summary 59 Classification Learners SVM Sigma (σ) Validation Division Kernel function Kernel Scale Classification rate 9 25 folds Quadratic 4 72% Decision Tree Sigma (σ) Validation Division Split method Max. splits Classification rate 6 10 folds Gini’s di- versity index 10 96,8% kNN Sigma (σ) Validation Division Distance Metric Num. of Neighbors Classification rate 3 10 folds Hamming distance 3 98,8% NN Pattern Recognition Sigma (σ) Validation Division Activation Function Epoch Classification rate 3 90 samples training, 19 samples validation and testing respectively Sigmoid and soft- max 23 itera- tions 79% Table 4.1: Overall Classification Performance 4.6 Summary This chapter presented different steps of image processing that were performed in this work. Firstly, we manually created the patches by selecting the eyes, nose and mouth. These patches were then combined to form one image; this was to ensure that we had sufficient information to describe a face at a specific pose. Following this step pre-processing was performed, to remove interfering noise on the image. Secondly, features were extracted with different scale values using the SIFT algorithm. The extraction of these features at different scale values was done to determine the scale value that gives us better classification of our features. Our observation on feature extraction was that we could not extract enough features with the original octave and level of 4 and 5 respectively. So, we changed the 4.6. Summary 60 octave and level values to 3. Most of the features with low contrast and features along the edges were eliminated. Even though SIFT extracts normalised features, we obtained a low classification accuracy. The next step was to normalise the features using linear scaling to unit variance. Upon feature normalisation, the features were trained using SVM, decision trees, kNN and neural network pattern recognition. At sigma (σ) = 3, for both kNN and neural network pattern recognition a classification accuracy percentage of 98,8% and 78% was obtained respectively. This means our features performed a better classification at σ = 3 on these classifiers. Furthermore, σ = 6 decision trees appeared to give us a better classification accuracy, as we obtained 96.8% classification accuracy. Lastly, at the higher sigma value, where σ = 9, the SVM obtained a 72% classification rate, which was the highest classification rate we obtained at a high sigma. Having trained all the features, we observed that the classifiers perform better at a lower sigma value, in our case where σ = 3. The next section will give the summary of our study, suggestions for future work and concludes the research done. Chapter 5 Summary, future work and conclusion This section gives the overall summary of the study conducted, followed by possible future study, and lastly a brief conclusion. 5.1 Summary Firstly, we reiterate that the main goals of this research work were: 1. To implement a classification model that would classify a still input face image according to different poses. 2. To determine a classifier that gives high classification accuracy on patch-based face images. Having considered our research goals, which ultimately lead to solving research questions which posed in the first chapter, we trained four classification models. Below is a summary of the accomplishment of our research goals, starting with the first research goal: • Firstly, we manually collected face images in four different poses (± 22,5° and ± 45°). From these images we manually selected landmarks (eyes, nose and mouth) which we called the patches. These patches were combined to form one image of a single subject (automated), in this way we would obtain a compact representation with less computational power. 61 5.2. Future work 62 • The next steps were all automated, with pre-processing of patches being performed so as to remove noise and to rescale the patches to the same patch size. • Thereafter, the SIFT features were extracted from these patches at different scale spaces (3, 6, 9) so as to observe the performance of our classifiers. • The next step was to normalise the features to ensure standardised feature vectors. • Four machine learning models based on SVM, k-NN, decision trees and neural network pattern recognition were trained to test the accuracy of the extracted features. • The overall classification results obtained by using the Hamming distance metric in the kNN classifier, obtained a classification rate of 98,8%. We then conclude that k-NN is the suitable machine learning algorithm that can accurately classify a face pose at ± 22,5° and ± 45°. The classification results in [9] were not fully discussed, but it was mentioned that the classification algorithm that was used improved the performance of their normalisation model. In the literature discussed in chapter 2, there was no face classification model implemented before recognition of faces was done. We therefore believe that pose classification models are an important models to be integrated in face recognition models. 5.2 Future work The following is recommended for further study based on the findings from this study: 1. Lateral pose variations as opposed to vertical pose variation and combinations. 2. Effects of lighting, resolution and age. 3. To further explore the features that were extracted to determine a function that will transform a non-frontal face image into a frontal face image for face recognition in different viewpoints. 5.3 Conclusion The results of this research work show that all the classifiers performed differently, and also for some classifiers performance is highly dependent on the training data set. For example, a NN with more 5.3. Conclusion 63 data set performs better classification. Nonetheless, k-NN gave good results irrespective of the size of the data set. In this research work, we demonstrated that with k-NN at σ = 3 we achieved a better face classification rate of 98% for pose invariant features. At σ = 9 SVM performed worst obtaining a classification rate of 72%. Our conclusion is that pose invariant feature extraction for the images used on this research work gives better classification at σ = 3 using kNN to train the features. These findings will help towards recognition of faces in different poses for real world applications. The ability of an FRT to classify images in different poses will also help improve the accuracy in classification. Bibliography [1] Matthew A Turk and Alex P Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991. [2] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with local binary pat- terns: Application to face recognition. IEEE transactions on pattern analysis and machine intel- ligence, 28(12):2037–2041, 2006. [3] Wenyi Zhao, Rama Chellappa, P Jonathon Phillips, and Azriel Rosenfeld. Face recognition: A literature survey. ACM computing surveys (CSUR), 35(4):399–458, 2003. [4] John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence, 31(2): 210–227, 2009. [5] Changxing Ding and Dacheng Tao. A comprehensive survey on pose-invariant face recognition. ACM Transactions on Intelligent Systems and Technology (TIST), 7(3):37, 2016. [6] Xiaozheng Zhang and Yongsheng Gao. Face recognition across pose: A review. Pattern Recogni- tion, 42(11):2876–2896, 2009. [7] Xiaoyang Tan, Songcan Chen, Zhi-Hua Zhou, and Fuyan Zhang. Face recognition from a single image per person: A survey. Pattern recognition, 39(9):1725–1745, 2006. [8] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A review of classification techniques, 2007. [9] Huy Tho Ho and Rama Chellappa. Pose-invariant face recognition using markov random fields. IEEE transactions on image processing, 22(4):1573–1584, 2013. 64 BIBLIOGRAPHY 65 [10] Michael P Ekstrom. Digital image processing techniques, volume 2. Academic Press, 2012. [11] Tinne Tuytelaars and Krystian Mikolajczyk. Local invariant feature detectors: a survey. Foun- dations and trends® in computer graphics and vision, 3(3):177–280, 2008. [12] P Jonathon Phillips, Hyeonjoon Moon, Syed A Rizvi, and Patrick J Rauss. The feret evaluation methodology for face-recognition algorithms. IEEE Transactions on pattern analysis and machine intelligence, 22(10):1090–1104, 2000. [13] Nannan Wang, Xinbo Gao, Dacheng Tao, Heng Yang, and Xuelong Li. Facial feature point detection: A comprehensive survey. Neurocomputing, 2017. [14] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models-their training and application. Computer vision and image understanding, 61(1):38–59, 1995. [15] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001. [16] Timothy F Cootes, Gavin V Wheeler, Kevin N Walker, and Christopher J Taylor. View-based active appearance models. Image and vision computing, 20(9):657–664, 2002. [17] Soma Biswas, Gaurav Aggarwal, Patrick J Flynn, and Kevin W Bowyer. Pose-robust recognition of low-resolution face images. IEEE transactions on pattern analysis and machine intelligence, 35(12):3037–3049, 2013. [18] Dong Li, Huiling Zhou, and Kin-Man Lam. High-resolution face verification using pore-scale facial features. IEEE transactions on image processing, 24(8):2317–2327, 2015. [19] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning identity-preserving face space. In Proceedings of the IEEE International Conference on Computer Vision, pages 113–120, 2013. [20] Meina Kan, Shiguang Shan, Hong Chang, and Xilin Chen. Stacked progressive auto-encoders (spae) for face recognition across poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1883–1890, 2014. [21] Simon Prince, Peng Li, Yun Fu, Umar Mohammed, and James Elder. Probabilistic models for BIBLIOGRAPHY 66 inference about identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34 (1):144–157, 2012. [22] David G Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999. [23] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. [24] Luo Juan and Oubong Gwun. A comparison of sift, pca-sift and surf. International Journal of Image Processing (IJIP), 3(4):143–152, 2009. [25] Cong Geng and Xudong Jiang. Sift features for face recognition. In Computer Science and Information Technology, 2009. ICCSIT 2009. 2nd IEEE International Conference on, pages 598– 602. IEEE, 2009. [26] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010. [27] Konstantinos G Derpanis. The harris corner detector. York University, 2004. [28] Xiaojin Zhu. Semi-supervised learning literature survey. 2005. [29] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image pre-processing. In Image Processing, Analysis and Machine Vision, pages 56–111. Springer, 1993. [30] Seishi Okamoto and Nobuhiro Yugami. Effects of domain characteristics on instance-based learn- ing algorithms. Theoretical Computer Science, 298(1):207–233, 2003. [31] Christopher JC Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2):121–167, 1998. [32] Yongheng Zhao and Yanxia Zhang. Comparison of decision tree methods for finding active objects. Advances in Space Research, 41(12):1955–1959, 2008. [33] Rau´l Rojas. Neural networks: a systematic introduction. Springer Science & Business Media, 2013. BIBLIOGRAPHY 67 [34] Guoqiang Peter Zhang. Neural networks for classification: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 30(4):451–462, 2000. [35] Luzia Gonc¸alves, Ana Subtil, M Rosa´rio Oliveira, and PZ Bermudez. Roc curve estimation: An overview. REVSTAT–Statistical Journal, 12(1):1–20, 2014. [36] JV Candy and EF Breitfeller. Receiver operating characteristic (roc) curves: An analysis tool for detection performance. LLNL Report: LLNL-TR-642693, 2013. [37] Ramzi Abiantun, Utsav Prabhu, and Marios Savvides. Sparse feature extraction for pose-tolerant face recognition. IEEE transactions on pattern analysis and machine intelligence, 36(10):2061– 2073, 2014. [38] Soma Biswas, Gaurav Aggarwal, Patrick J Flynn, and Kevin W Bowyer. Pose-robust recognition of low-resolution face images. IEEE transactions on pattern analysis and machine intelligence, 35(12):3037–3049, 2013. [39] Yan Yan, Elisa Ricci, Ramanathan Subramanian, Oswald Lanz, and Nicu Sebe. No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion. In Proceedings of the IEEE international conference on computer vision, pages 1177–1184, 2013. [40] Jeet Kumar, Aditya Nigam, Surya Prakash, and Phalguni Gupta. Pose Invariant Face Recogni- tion. PhD thesis, Indian Institute of Technology Kanpur, 2011. [41] Yizhe Zhang, Ming Shao, Edward K Wong, and Yun Fu. Random faces guided sparse many-to-one encoder for pose-invariant face recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2416–2423, 2013. [42] Simon JD Prince, James H Elder, Jonathan Warrell, and Fatima M Felisberti. Tied factor analysis for face recognition across large pose differences. IEEE Transactions on pattern analysis and machine intelligence, 30(6):970–984, 2008. [43] Changxing Ding, Jonghyun Choi, Dacheng Tao, and Larry S Davis. Multi-directional multi- level dual-cross patterns for robust face recognition. IEEE transactions on pattern analysis and machine intelligence, 38(3):518–531, 2016. [44] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of com- puter vision, 57(2):137–154, 2004. BIBLIOGRAPHY 68 [45] M Milton Joe and Dr B Ramakrishnan. A survey of various security issues in online social networks. International Journal of Computer Networks and Applications, 1(1):11–14, 2014. [46] Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. IEEE transactions on pattern analysis and machine intelligence, 35(6):1397–1409, 2013. [47] Lindsay I Smith et al. A tutorial on principal components analysis. Cornell University, USA, 51 (52):65, 2002. [48] Ryan Rifkin. Multiclass classification. Lecture Notes, Spring08. MIT, USA, page 59, 2008.