Feature learning for RGB-D scene understanding
Date of Issue: 2016-05-26
School of Computer Engineering
Scene understanding is a fundamental problem in computer vision and is critical to applications in robotics and augmented reality. It comprises many tasks, such as scene labeling, object recognition and scene classification. Most previous scene understanding methods focus on outdoor scenes. Indoor scene understanding is more challenging because of poor illumination and cluttered objects. With the wide availability of affordable RGB-D cameras such as Kinect, indoor scene analysis has changed substantially, owing to the rich 3D geometry information provided by depth measurements.
Feature extraction is a key part of scene understanding tasks. Most early methods extract hand-crafted features. However, the performance of such feature extractors depends heavily on the particular hand-crafted design choices and on how the features are combined. The design process requires an empirical understanding of the data and is therefore hard to extend systematically to different modalities. In addition, hand-crafted features usually capture only a subset of the recognition cues in the raw data and may ignore useful information. In this research, we therefore focus on feature learning with raw data as input. In particular, we explore feature learning on three tasks of indoor scene understanding using RGB-D input:
Scene labeling: The aim is to densely assign a category label (e.g. table, TV) to each pixel in an image. Inspired by the success of unsupervised feature learning, we start by adapting an existing unsupervised feature learning technique to learn features directly from RGB-D images. Typically, better performance can be achieved by further applying feature encoding to the learned features to build "bag of words" style features. However, feature learning and feature encoding are usually performed separately, which may lead to a suboptimal solution. We propose to jointly optimize these two processes to derive more discriminative features.
Object recognition: Most feature learning methods for RGB-D object recognition either learn features for each modality independently or treat RGB-D simply as undifferentiated four-channel data; neither approach adequately exploits the complementary relationship between the two modalities. To address these issues, we propose a general Convolutional Neural Network (CNN) based multi-modal learning method for RGB-D object recognition. Our multi-modal layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities.
Scene classification: Scene classification methods that leverage local information share a similar pipeline: they first densely extract CNN features from different locations and scales of an image and then apply an encoding method. However, in state-of-the-art feature encoding techniques such as the Fisher Vector (FV), the components of the Gaussian Mixture Model (GMM) are derived from densely sampled local features, so many components are likely to be noisy and uninformative. This noisy property of local features has not been well considered in existing work. Further taking into account the FV features from different modalities, we propose a modality- and component-aware feature fusion framework for RGB-D scene classification.
In this thesis, various experiments are conducted to evaluate the proposed techniques against state-of-the-art methods on different RGB-D databases. Encouraging results show that the proposed techniques significantly boost performance in the studied scene understanding tasks.
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision