Abstract
Deep convolutional neural networks (CNNs) have shown great potential for accurate depth estimation from stereo images. Previous work has focused on developing robust stereo matching architectures, while little attention has been paid to improving network efficiency. In this paper, we propose an efficient Siamese CNN architecture that combines low-resolution disparity estimation with depth-discontinuity-aware super-resolution. Specifically, we propose to construct, filter, and perform regression on a low-resolution cost volume through the designed stereo matching backbone network. A fast depth-discontinuity-aware super-resolution subnetwork then upsamples the low-resolution disparity map to the desired resolution. Under the guidance of intensity edge features extracted from the left color image, depth edge residuals are hierarchically learned to refine the upsampled depth map. A delayed upsampling structure is designed to ensure that the computational complexity is proportional to the spatial size of the input disparity map. We also propose to supervise the first derivative of the predicted disparity map through an additional loss term, which makes the network adaptively aware of depth discontinuity edges. Experiments show that the proposed stereo matching network achieves prediction accuracy comparable to state-of-the-art methods while running much faster.
Introduction
Depth estimated from stereo images is core information for practical vision-based applications, such as obstacle avoidance for robot navigation [1], 3D scene reconstruction for augmented and virtual reality systems [2], and 3D visual object tracking and localization [3], [4]. Given a pair of pre-rectified stereo images, the goal of stereo matching is to accurately compute a disparity value for each pixel in the reference image. According to the taxonomy of Scharstein et al. [5], traditional stereo matching algorithms typically consist of four consecutive steps: matching cost computation, cost aggregation, disparity computation, and disparity refinement. In recent years, with the rapid development of deep learning and following the milestone work of MC-CNN [6], many convolutional neural network (CNN) based methods have been proposed to solve the stereo matching problem. Early deep stereo networks were designed to learn similarity metrics from a large set of cropped image patches [6]–[10]. Regularization or global optimization approaches, such as semi-global matching (SGM) [11], left-right consistency checks, and Markov Random Fields (MRF) [10], were formulated as post-processing steps. Later deep stereo networks attempt to directly learn stereo matching regression functions end-to-end, without the need for post-processing.
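As a concrete reference point for the four-step taxonomy mentioned above, the following minimal sketch runs a toy version of the traditional pipeline on a synthetic rectified pair. The specific choices are assumptions made only for illustration and are not taken from [5] or from the proposed network: absolute-difference matching cost, box-filter aggregation, winner-take-all disparity selection, and median-filter refinement. All function names and the synthetic `texture` pattern are hypothetical.

```python
# Illustrative four-step stereo pipeline (pure Python, written for clarity, not speed).
# Function names and algorithm choices are assumptions made for this sketch.

def matching_cost(left, right, max_disp, invalid=255):
    """Step 1: absolute-difference cost volume, indexed as cost[d][y][x]."""
    h, w = len(left), len(left[0])
    return [[[abs(left[y][x] - right[y][x - d]) if x - d >= 0 else invalid
              for x in range(w)]
             for y in range(h)]
            for d in range(max_disp + 1)]


def _window(grid, y, x, radius):
    """Values of grid inside a square window around (y, x), clipped to the image."""
    h, w = len(grid), len(grid[0])
    return [grid[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]


def aggregate(cost, radius=1):
    """Step 2: average each disparity slice of the cost volume over a local window."""
    h, w = len(cost[0]), len(cost[0][0])
    aggregated = []
    for plane in cost:
        rows = []
        for y in range(h):
            row = []
            for x in range(w):
                vals = _window(plane, y, x, radius)
                row.append(sum(vals) / len(vals))
            rows.append(row)
        aggregated.append(rows)
    return aggregated


def winner_take_all(cost):
    """Step 3: pick, per pixel, the disparity with the lowest aggregated cost."""
    h, w = len(cost[0]), len(cost[0][0])
    return [[min(range(len(cost)), key=lambda d: cost[d][y][x])
             for x in range(w)]
            for y in range(h)]


def refine(disp, radius=1):
    """Step 4: median-filter the disparity map to suppress isolated outliers."""
    h, w = len(disp), len(disp[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = sorted(_window(disp, y, x, radius))
            row.append(vals[len(vals) // 2])
        out.append(row)
    return out


def stereo_match(left, right, max_disp):
    """Run the four steps in sequence and return an integer disparity map."""
    cost = aggregate(matching_cost(left, right, max_disp))
    return refine(winner_take_all(cost))


def texture(x, y):
    """Hypothetical synthetic intensity pattern with no repeats within the search range."""
    return (37 * x + 11 * y) % 101


# Synthetic rectified pair: the right view is the left view shifted by TRUE_DISP pixels.
H, W, TRUE_DISP = 8, 16, 2
left = [[texture(x, y) for x in range(W)] for y in range(H)]
right = [[texture(x + TRUE_DISP, y) for x in range(W)] for y in range(H)]
disparity = stereo_match(left, right, max_disp=4)
```

In this toy setting the recovered disparity equals `TRUE_DISP` for all columns `x >= 3`; the leftmost columns have no valid match at the true disparity and are therefore unreliable, which loosely mirrors the border and occlusion problems that refinement or post-processing steps are meant to address.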