In this paper, we describe a novel deep convolutional neural network (CNN) that is deeper and wider than other existing deep networks for hyperspectral image classification. Unlike current state-of-the-art approaches to CNN-based hyperspectral image classification, the proposed network, called the contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting the local spatio-spectral relationships of neighboring individual pixel vectors. The joint exploitation of spatio-spectral information is achieved by a multi-scale convolutional filter bank used as the initial component of the proposed CNN pipeline. The initial spatial and spectral feature maps obtained from the multi-scale filter bank are then combined to form a joint spatio-spectral feature map. The joint feature map, representing rich spectral and spatial properties of the hyperspectral image, is then fed through a fully convolutional network that eventually predicts the corresponding label of each pixel vector. The proposed approach is tested on three benchmark datasets: the Indian Pines dataset, the Salinas dataset, and the University of Pavia dataset. Performance comparisons show enhanced classification performance of the proposed approach over the current state of the art on all three datasets.
RECENTLY, deep convolutional neural networks (DCNNs) have been extensively used for a wide range of visual perception tasks, such as object detection/classification and action/activity recognition. Behind the remarkable success of DCNNs on image/video analytics are their unique capabilities of extracting underlying nonlinear structures of image data as well as discerning the categories of semantic data contents by jointly optimizing the parameters of multiple layers.
Lately, there have been increasing efforts to use deep learning based approaches for hyperspectral image (HSI) classification. However, large-scale HSI datasets are not currently commonly available, which leads to sub-optimal learning of DCNNs with large numbers of parameters due to the lack of sufficient training samples. The limited access to large-scale hyperspectral data has prevented existing CNN-based approaches for HSI classification from leveraging deeper and wider networks that could better exploit the very rich spectral and spatial information contained in hyperspectral images. Here, deeper and wider mean using relatively larger numbers of layers (depth) and of nodes in each layer (width), respectively. Therefore, current state-of-the-art CNN-based approaches mostly use small-scale networks with relatively few layers and few nodes per layer, at the expense of a decrease in performance. Accordingly, a reduction of the spectral dimension of the hyperspectral images is generally performed first to fit the input data into such small-scale networks, using techniques such as principal component analysis (PCA), balanced local discriminant embedding (BLDE), and pairwise constraint discriminant analysis with nonnegative sparse divergence (PCDA-NSD). However, leveraging large-scale networks is still desirable in order to jointly exploit the underlying nonlinear spectral and spatial structures of hyperspectral data residing in a high-dimensional feature space. In the proposed work, we aim to build a deeper and wider network, given limited amounts of hyperspectral data, that can jointly exploit spectral and spatial information. To tackle the issues associated with training a large-scale network on limited amounts of data, we leverage the recently introduced concept of "residual learning", which has demonstrated the ability to significantly enhance the training efficiency of large-scale networks.
Residual learning reformulates the learning of subgroups of layers, called modules, in such a way that each module is optimized with respect to the residual signal, which is the difference between the desired output and the module input, as shown in Figure 1a. It has been shown that the residual structure of a network allows for a considerable increase in depth and width, leading to enhanced learning and eventually improved generalization performance. Therefore, the proposed network does not require dimensionality reduction of the input data as a pre-processing step, as opposed to the current state-of-the-art techniques.
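As a minimal illustration of this reformulation, a toy two-layer residual module can be sketched in NumPy. This is only a conceptual sketch, not the layers of the proposed network; the function names, layer sizes, and weights below are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_module(x, w1, w2):
    # The stacked layers learn the residual F(x) = H(x) - x, so the
    # module output is F(x) + x via an identity shortcut connection.
    f = relu(x @ w1)      # first layer followed by a nonlinearity
    f = f @ w2            # second layer (no activation before the add)
    return relu(f + x)    # shortcut addition, then final activation

# Toy usage: a feature vector of width 4 passed through one module.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = residual_module(x,
                    rng.standard_normal((4, 4)) * 0.1,
                    rng.standard_normal((4, 4)) * 0.1)
```

Note that if the residual path contributes nothing (all-zero weights), the module reduces to the identity mapping of its (rectified) input, which is what makes very deep stacks of such modules easier to optimize.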
To achieve state-of-the-art performance for HSI classification, it is essential that spectral and spatial features be jointly exploited. The current state-of-the-art approaches to deep learning based HSI classification fall short of fully exploiting spectral and spatial information together. In some approaches, the two types of information, spectral and spatial, are acquired more or less separately during pre-processing and only then processed together for feature extraction and classification. Hu et al. also failed to jointly process the spectral and spatial information by using only individual spectral pixel vectors as input to the CNN. In this paper, we propose a novel deep learning based approach that uses a fully convolutional network (FCN) to better exploit spectral and spatial information from hyperspectral data. At the initial stage of the proposed deep CNN, a multi-scale convolutional filter bank, conceptually similar to the "inception module", is scanned over local regions of the hyperspectral image, generating initial spatial and spectral feature maps. The multi-scale filter bank is used to exploit various local spatial structures as well as local spectral correlations. The initial spatial and spectral feature maps generated by applying the filter bank are then combined to form a joint spatio-spectral feature map, which contains rich spatio-spectral characteristics of the hyperspectral pixel vectors. The joint feature map is in turn used as input to subsequent layers that finally predict the labels of the corresponding hyperspectral pixel vectors.
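The multi-scale filter bank described above can be sketched in NumPy as filters of several spatial sizes applied to the same hyperspectral cube, with their outputs concatenated along the channel axis. This is a minimal sketch under assumed sizes, not the paper's actual filter bank; the function names and the choice of 1×1, 3×3, and 5×5 kernels are illustrative:

```python
import numpy as np

def conv2d_same(cube, kernel):
    # 'Same'-padded 2D filtering of an H x W x B cube with a k x k x B
    # filter, summing over all spectral bands: the output is an H x W map.
    h, w, _ = cube.shape
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(cube, ((p, p), (p, p), (0, 0)))
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return out

def multiscale_filter_bank(cube, kernels):
    # Apply filters of several spatial sizes to the same input and stack
    # the resulting maps along the channel axis, forming the joint
    # spatio-spectral feature map.
    return np.stack([conv2d_same(cube, kern) for kern in kernels], axis=-1)

# Toy usage: a 9x9 patch with 5 bands. A 1x1 filter captures spectral
# correlations at each pixel; 3x3 and 5x5 filters also capture local
# spatial structure around it.
rng = np.random.default_rng(0)
cube = rng.standard_normal((9, 9, 5))
bank = [rng.standard_normal((k, k, 5)) for k in (1, 3, 5)]
features = multiscale_filter_bank(cube, bank)   # shape (9, 9, 3)
```

Because every scale uses 'same' padding, all output maps share the input's spatial size and can be concatenated directly, which is what allows the joint feature map to be fed to the subsequent convolutional layers pixel by pixel.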
The proposed network is an end-to-end network, optimized and tested as a whole without additional pre- or post-processing. It is a fully convolutional network (FCN) (Figure 1c) that takes input hyperspectral images of arbitrary size: it uses no subsampling (pooling) layers, which would otherwise produce an output of a different size than the input. In this work, we evaluate the proposed network on three benchmark datasets of different sizes (145×145 pixels for the Indian Pines dataset, 610×340 pixels for the University of Pavia dataset, and 512×217 pixels for the Salinas dataset). The proposed network is composed of three key components: a novel fully convolutional network, a multi-scale filter bank, and residual learning, as illustrated in Figure 1. Performance comparisons show enhanced classification performance of the proposed network over the current state of the art on the three datasets. The main contributions of this paper are as follows:
• We introduce a deeper and wider network, aided by "residual learning", to overcome the sub-optimal network performance caused primarily by limited amounts of training samples.
• We present a novel deep CNN architecture that can jointly optimize the spectral and spatial information of hyperspectral images.
• The proposed work is one of the first attempts to successfully use a very deep fully convolutional neural network for hyperspectral image classification.
The remainder of this paper is organized as follows. Section II describes related work. Details of the proposed network are explained in Section III. Performance comparisons between the proposed network and current state-of-the-art approaches are presented in Section IV. The paper is concluded in Section V.
II. RELATED WORKS
A. Going deeper with Deep CNN for object detection/classification
LeCun et al. introduced the first deep CNN, called LeNet-5, consisting of two convolutional layers, two fully connected layers, and one Gaussian connection layer, with several additional pooling layers. With the recent advent of large-scale image databases and advanced computational technology, relatively deeper and wider networks, such as AlexNet, began to be trained on large-scale image datasets, such as ImageNet. AlexNet used five convolutional layers followed by three fully connected layers. Simonyan and Zisserman significantly increased the depth of deep CNNs with VGG-16, a network with 16 weight layers. Szegedy et al. introduced a 22-layer deep network called GoogLeNet, which performs multi-scale processing realized by the concept of an "inception module." He et al. built networks substantially deeper than those used previously by means of a novel learning approach called "residual learning", which can significantly improve the training efficiency of deep networks.
B. Deep CNN for Hyperspectral Image Classification
A large number of approaches have been developed to tackle HSI classification problems. Recently, kernel methods, such as multiple kernel learning, have been widely used, primarily because they enable a classifier to learn a complex decision boundary with only a few parameters. This boundary is built by projecting the data onto a high-dimensional reproducing kernel Hilbert space, which makes kernel methods suitable for exploiting datasets with limited training samples. However, recent deep learning based approaches have shown drastic performance improvements because of their capability to exploit complex local nonlinear structures of images using many layers of convolutional filters. To date, several deep learning based approaches have been developed for HSI classification, but few have achieved breakthrough performance, due mainly to sub-optimal learning caused by the lack of sufficient training samples and the use of relatively small-scale networks.