Face recognition is an important task in both academia and industry. With the development of deep convolutional neural networks, many deep face recognition methods have been proposed and have achieved remarkable results. However, these methods differ greatly in their datasets, network architectures, loss functions, and parameter learning strategies. For practitioners who want to apply these technologies to build a deep face recognition system, it is bewildering to evaluate which improvements are more suitable and effective. This study systematically summarizes and evaluates state-of-the-art face recognition methods. Unlike general reviews, however, it builds on the survey to present a comprehensive evaluation framework and measures the effects of multifarious settings in five components: data augmentation, network architecture, loss function, network training, and model compression. Based on the experimental results, the influences of these five components on deep face recognition are summarized as follows. In terms of the datasets, a high sample-to-identity ratio is conducive to generalization but makes training convergence more difficult. For the network architecture, deep ResNet has an advantage over other designs, and the various normalization operations in the network are also necessary. The performance of the loss function is closely related to the network design and training conditions: the angle-margin loss has a higher performance upper bound, whereas the traditional Euclidean-margin loss performs stably under limited training conditions and with shallower networks. In terms of the training strategy, a step-declining learning rate and a large batch size are recommended for recognition tasks. Furthermore, this study compares several popular model compression methods and shows that MobileNet outperforms the others in terms of both compression ratio and robustness.
Finally, a detailed list of recommended settings is provided.
Face recognition (FR) has a wide range of applications, such as security and electronic payments, and has drawn much attention in computer vision in recent decades. In the early stage, many traditional methods encountered performance bottlenecks due to the limitations of computing power and model capability[1, 2]. With the advent of deep convolutional neural networks (DCNNs) and increased hardware capability, these restrictions were rapidly eliminated, and many DCNN-based FR methods have been proposed[3–7]. However, these DCNN-based methods show great diversity in their implementation settings, which makes it difficult to determine which settings of a specific method are worth learning. For instance, Parkhi et al. designed a VGGNet-based model trained with face images from 2,622 identities, whereas Schroff et al. proposed the GoogLeNet-based FaceNet, trained with images from 8M different identities. Although the latter model achieved a better result than the former, we cannot simply assert that GoogLeNet-based models are more suitable for face feature extraction than VGGNet-based models. Moreover, even within a single method, the effectiveness of individual operations is difficult to confirm. Taking ArcFace as an example, the original paper stated that the ArcFace loss was favorable to network training, but it lacked comparison experiments; thus, it remains questionable whether that loss function is better than the conventional Center Loss. A complete FR system has a few fixed components. Figure 1 shows the general pipeline, including detection, alignment, feature extraction, and similarity calculation[13, 14]. In this article, we focus on feature extraction, which is a key factor for improving the performance of FR systems. (Figure 1 describes both traditional and DCNN-based recognition systems; we primarily discuss the latter in this paper. Unless otherwise specified, the term "model" in the following refers to the DCNN model.)
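To make the ArcFace discussion above concrete, the following is a minimal NumPy sketch of an additive angular-margin logit computation in the spirit of ArcFace; the function name, the hyperparameter values (scale s and margin m), and the toy inputs are illustrative assumptions, not the original implementation.

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """Additive angular-margin (ArcFace-style) logits, sketched.

    embeddings: (N, d) feature vectors; weights: (d, C) class centers;
    labels: (N,) ground-truth class indices. A margin m is added to the
    angle between each sample and its own class center, then all cosine
    logits are rescaled by s. Hyperparameter values are illustrative.
    """
    # L2-normalize features and class centers so logits become cosines.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w                                   # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))    # angles in radians
    # Penalize only the target-class angle with the additive margin.
    theta[np.arange(len(labels)), labels] += m
    return s * np.cos(theta)
```

The margin shrinks the target-class logit relative to plain softmax logits, forcing the network to pull features closer to their class center; Center Loss pursues intra-class compactness in Euclidean space instead.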
In this paper, we analyze the model design process and summarize five components that have a great influence on the final performance: data augmentation, network architecture, loss function, network training, and model compression.