Abstract
Training deep convolutional neural networks (CNNs) often incurs high computational cost and a large number of learnable parameters. One way to overcome this limitation is to compute predefined convolution kernels from training data. In this paper, we propose a novel three-stage approach for filter learning as an alternative. It learns filters with multiple structures, including standard filters, channel-wise filters, and point-wise filters, inspired by the variations of convolution operations in CNNs. By analyzing the linear combinations between the learned filters and the original convolution kernels in pre-trained CNNs, we minimize the reconstruction error to determine the most representative filters in the filter bank. These filters are used to build a network, followed by HOG-based feature extraction, for feature representation. The proposed approach shows competitive performance on color face recognition compared with other deep CNN-based methods. Moreover, it offers a perspective for interpreting CNNs by introducing the concepts of advanced convolutional layers into unsupervised filter learning.
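To make the filter-selection step above concrete, the following is a minimal NumPy sketch, not the authors' implementation, of choosing representative filters from a candidate bank by minimizing the least-squares reconstruction error of pre-trained convolution kernels; the greedy selection loop, array shapes, and all names (`reconstruction_error`, `select_filters`) are illustrative assumptions.

```python
import numpy as np

def reconstruction_error(bank, kernels):
    """Least-squares error of expressing each pre-trained kernel
    as a linear combination of the filters in `bank`.

    bank    : (m, d) array, m candidate filters flattened to length d
    kernels : (n, d) array, n pre-trained kernels flattened to length d
    """
    # Solve bank.T @ C ~= kernels.T for the coefficient matrix C
    coeffs, *_ = np.linalg.lstsq(bank.T, kernels.T, rcond=None)
    residual = kernels.T - bank.T @ coeffs
    return np.linalg.norm(residual)

def select_filters(bank, kernels, k):
    """Greedily pick k filters from `bank` that best reconstruct `kernels`."""
    selected, remaining = [], list(range(len(bank)))
    for _ in range(k):
        best = min(remaining,
                   key=lambda i: reconstruction_error(bank[selected + [i]], kernels))
        selected.append(best)
        remaining.remove(best)
    return bank[selected]

# Toy usage: 32 candidate 3x3 filters, 64 pre-trained kernels, keep 8
rng = np.random.default_rng(0)
bank = rng.standard_normal((32, 9))
kernels = rng.standard_normal((64, 9))
print(select_filters(bank, kernels, k=8).shape)  # (8, 9)
```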
Introduction
With the development of deep learning in recent years, deep neural networks, and deep convolutional neural networks (CNNs) in particular, have achieved state-of-the-art performance in many image-based applications [1], e.g., image classification [2, 3], face recognition [4, 5], fine-grained image categorization [6, 7], and depth estimation [8, 9]. Compared with traditional visual recognition methods, CNNs have the advantage of learning both low-level and high-level feature representations automatically instead of relying on hand-crafted feature descriptors [10, 11]. Owing to these powerful features, CNNs have revolutionized the computer vision community and become one of the most popular tools in many visual recognition tasks [7, 12, 13].

Generally, CNNs are composed of three types of layers: convolutional layers, pooling layers, and fully-connected layers. Features are extracted by stacking many convolutional layers on top of each other, and backpropagation proceeds from the loss function back to the input to learn the weights and biases contained in these layers. However, how this mechanism works on images remains an open question that has yet to be fully explored. Moreover, learning powerful feature representations requires a large amount of labeled training data, otherwise performance may deteriorate [14, 15], whereas such training data are often not readily available in practical applications.

To address these problems, some researchers have proposed learning convolutional layers by alternative means that are independent of training data. In [16], ScatNet was proposed, which uses wavelet transforms as convolutional filters. These predefined wavelet transforms are cascaded with nonlinear and pooling operations to build a multilayer convolutional network, so no learning is needed to compute the image representation. Different from ScatNet, the structured receptive field network introduced in [17] combines the flexible learning property of CNNs with fixed basis filters.
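To illustrate the "predefined filters, no learning" idea behind ScatNet-style networks, here is a small PyTorch sketch of one convolutional stage whose kernels are fixed and never updated by backpropagation. The Gabor-like filter bank is a stand-in for ScatNet's actual wavelets, and the sizes and parameters are assumptions chosen for illustration only.

```python
import torch
import torch.nn.functional as F

def gabor_bank(num_orientations=4, size=7, sigma=2.0, freq=0.5):
    """Build a fixed bank of Gabor-like filters at several orientations.
    A stand-in for ScatNet's wavelets; all parameters are illustrative."""
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    y, x = torch.meshgrid(coords, coords, indexing="ij")
    filters = []
    for k in range(num_orientations):
        theta = torch.tensor(k * torch.pi / num_orientations)
        u = x * torch.cos(theta) + y * torch.sin(theta)
        envelope = torch.exp(-(x**2 + y**2) / (2 * sigma**2))
        filters.append(envelope * torch.cos(2 * torch.pi * freq * u))
    return torch.stack(filters).unsqueeze(1)  # (num_orientations, 1, size, size)

def fixed_conv_stage(images, bank):
    """One 'no-learning' stage: fixed convolution, modulus nonlinearity,
    then average pooling -- the weights are never touched by backprop."""
    responses = F.conv2d(images, bank, padding=bank.shape[-1] // 2)
    return F.avg_pool2d(responses.abs(), kernel_size=2)

images = torch.randn(8, 1, 32, 32)            # batch of grayscale images
features = fixed_conv_stage(images, gabor_bank())
print(features.shape)                          # torch.Size([8, 4, 16, 16])
```

Stacking several such stages, each cascading a fixed filter bank with a nonlinearity and pooling, yields a multilayer representation with no trainable convolution weights, which is the essential property the cited works exploit.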