Improved deep learning techniques for recognition and labeling
Abrar Hamzeh Saad Abdul Nabi
Date of Issue: 2017
School of Electrical and Electronic Engineering
Representing images with robust, discriminative and informative features is crucial for good recognition performance on many fundamental computer vision problems such as semantic attribute prediction and semantic image segmentation. Researchers have traditionally relied on hand-crafted features such as SIFT and HOG, which are not data-adaptive; hence, feature learning has become more favorable in recent years. Deep learning methods such as CNNs and RNNs achieve impressive state-of-the-art performance in almost all vision fields. In this thesis, we present several improved deep feature learning models that mainly exploit the idea of sharing knowledge between feature classifiers and representations. We evaluate our proposed deep models on several competitive benchmarks addressing the problems of semantic attribute prediction and RGB-D and RGB semantic segmentation. Our models achieve substantial improvements over the baseline methods and comparable performance to (and in many settings, outperform) other state-of-the-art works. In the first part of the thesis, we tackle binary semantic attribute prediction. The lack of training data for some semantic attributes is often a major obstacle to learning adequately informative attribute feature representations. Hence, we propose a new joint multi-task deep CNN model that allows different CNNs to share visual knowledge simultaneously. Each CNN is assigned the task of learning the representation of a single binary semantic attribute. Sharing knowledge allows under-sampled CNN classifiers to leverage shared statistics from other CNN classifiers to further improve their performance. A natural grouping of attributes is applied such that attributes in the same group are encouraged to share more knowledge, while attributes in different groups generally compete and consequently share less knowledge.
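The group-wise sharing idea can be caricatured in a few lines: give each attribute its own weight vector and add a penalty that pulls weights within a group toward the group mean, so grouped attributes pool statistics while ungrouped ones stay independent. This is only a minimal toy sketch of the principle; the function names, grouping, and the quadratic penalty form are illustrative assumptions, not the thesis's actual model.

```python
# Toy sketch (assumed form, not the thesis's model): per-attribute weight
# vectors with a group-wise sharing regularizer. Attributes in the same
# group are pulled toward their group mean; groups do not interact.

def group_mean(weights):
    """Element-wise mean of a list of equal-length weight vectors."""
    n = len(weights)
    return [sum(w[i] for w in weights) / n for i in range(len(weights[0]))]

def sharing_penalty(task_weights, groups, lam=0.1):
    """Squared distance of each task's weights to its group mean, scaled by lam."""
    penalty = 0.0
    for group in groups:                       # e.g. [[0, 1], [2]]
        mean = group_mean([task_weights[t] for t in group])
        for t in group:
            penalty += sum((task_weights[t][i] - mean[i]) ** 2
                           for i in range(len(mean)))
    return lam * penalty

# Two related attributes grouped together, one unrelated attribute alone.
w = [[1.0, 0.0], [0.8, 0.2], [5.0, 5.0]]
print(sharing_penalty(w, groups=[[0, 1], [2]]))
```

Note how the singleton group contributes zero penalty: an attribute with no related peers is simply left to its own data.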
We evaluate our multi-task CNN model on two popular semantic attribute prediction datasets and outperform other state-of-the-art methods. In the second part of the thesis, and unlike attributes, we address a more challenging recognition problem that requires classifying detailed semantics, i.e., the pixels of input images. Semantic segmentation (also known as scene labeling) is a sequence-to-sequence prediction task (pixels to labels), and it is quite important to leverage relevant contextual information to enhance the performance of pixel classification. We address two main problems in semantic segmentation: RGB-D and RGB scene labeling. In RGB-D, an extra input data source, the depth map, is provided to be leveraged during labeling. First, we propose a multi-modal RNN model to tackle RGB-D scene labeling. Our proposed deep model allows multiple RNNs to share knowledge across different data modalities. The multi-modal RNNs exploit the depth maps beside the RGB color channels to provide more informative contextual cues for local pixel classification. The model simultaneously trains two RNNs that are crossly connected through information transfer layers. The transfer layers learn to adaptively extract relevant cross-modality features while retaining modality-specific features within each modality. We evaluate our model on two popular RGB-D scene labeling datasets, achieve competitive performance with other state-of-the-art methods, and significantly outperform the baseline models. In the third part of this thesis, we find that the contextual information contained within the RGB image itself is indeed very informative for local pixel classification. Thus, we tackle the problem of RGB semantic segmentation. We propose an improved framework based on the Fully Convolutional Network (FCN) that generates contextually-aware local feature representations for each pixel patch in the input image.
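The crossly-connected recurrent streams can be sketched with scalar states: at each step, a transfer weight mixes the other modality's hidden state into this modality's update, so each stream keeps its own features while borrowing cross-modal cues. All weights, inputs, and the scalar simplification are illustrative assumptions; the actual transfer layers in the thesis are learned, vector-valued mappings.

```python
import math

# Toy sketch (assumed form): two recurrent streams (RGB and depth) whose
# updates each include a "transfer" term from the other stream's state.
# Setting the transfer weights to zero recovers two independent RNNs.

def step(h_rgb, h_depth, x_rgb, x_depth,
         w=0.5, u=0.3, t_rgb=0.2, t_depth=0.2):
    new_rgb = math.tanh(w * x_rgb + u * h_rgb + t_rgb * h_depth)
    new_depth = math.tanh(w * x_depth + u * h_depth + t_depth * h_rgb)
    return new_rgb, new_depth

h_rgb, h_depth = 0.0, 0.0
for x_rgb, x_depth in [(1.0, 0.5), (0.2, 0.9)]:   # two time steps
    h_rgb, h_depth = step(h_rgb, h_depth, x_rgb, x_depth)
print(h_rgb, h_depth)
```

The design point is that `t_rgb` and `t_depth` are separate parameters, so the model can learn how much of each modality to transfer rather than hard-wiring the coupling.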
We introduce an episodic attention-based memory network to achieve this goal. We present a unified framework that mainly consists of an FCN and an attention-based memory module with feedback connections that performs context selection and refinement. The attention-based memory module is formed of a feed-forward neural network with an attention-based model for context selection, and an RNN with a feedback mechanism for context aggregation and refinement over multiple iterations (episodes). The full model produces a context-aware representation for each target pixel patch by aggregating the activated context with the patch's original local representation produced by the convolutional layers. We evaluate our model on three competitive scene labeling benchmarks and achieve impressive performance. Finally, in the last part of the thesis, we continue to address RGB scene labeling; unlike recent works that improve CNN performance with the aid of other models, we propose an orthogonal architecture that improves the internal design of the FCN. Our main focus is to allow the FCN to independently generate richer contextually-aware feature representations. Many recent works have proposed efficient variants of deep CNNs that mainly contain various designs of forward shortcut connections, i.e., skip connections between the input and the output layers. Lower-level convolutional layers have higher-resolution (location-preserving) or more detailed feature maps that can improve the higher-level layers, which lack these details. In terms of training, skip connections create direct, shorter paths to the supervisory error signal and thus provide more efficient propagation and better training. We observe and study the influence of backward skip connections, which run in the inverse direction to the forward shortcuts: paths from the high-level layers to the low-level layers.
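The episodic select-and-refine loop can be caricatured with scalar features: each episode scores every candidate context against the current state, takes a softmax-weighted sum, and feeds the result back into the state through a recurrent update; the final state is fused with the pixel patch's local feature. The scoring rule, episode count, and fusion by addition are illustrative assumptions, not the thesis's actual module.

```python
import math

# Toy sketch (assumed form): episodic attention over candidate contexts.
# Each episode re-attends with the refined state, so later episodes can
# select different context than the first one.

def softmax(scores):
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def episodes(local, contexts, n_episodes=3):
    state = local
    for _ in range(n_episodes):
        scores = [state * c for c in contexts]          # similarity scoring
        weights = softmax(scores)                       # context selection
        attended = sum(w * c for w, c in zip(weights, contexts))
        state = math.tanh(state + attended)             # feedback refinement
    return state + local                # fuse context with local feature

out = episodes(local=0.5, contexts=[0.1, 0.9, -0.4])
```

Because `tanh` keeps the refined state in (-1, 1), the fused output stays within one unit of the local feature regardless of how many episodes run.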
Aggregating the higher-level feature maps in low-level layers allows the low-level features to be contextually consistent with the higher-level abstraction. To achieve this, which indeed opposes the nature of feed-forward networks, we propose a new fully convolutional model that consists mainly of a pair of networks. A `Slave' network is dedicated to providing the backward connections from its top layers to the `Master' network's bottom layers. The Master network contains both forward and backward skip connections and is able to generate contextually-aware features. We also evaluate this proposed model on three of the most popular scene labeling benchmarks and achieve competitive performance with other state-of-the-art methods.
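The master/slave trick resolves the causality problem of backward skips in a feed-forward pass: the slave runs a plain bottom-up pass first, and its top-layer output is then injected into the master's bottom layers, so the master's low-level features are computed with high-level context already in hand. The sketch below is a toy scalar version under assumed, made-up layer functions; fusion by addition stands in for whatever learned combination the real model uses.

```python
# Toy sketch (assumed form): a master/slave pair. The slave's top-layer
# output feeds the master's bottom layer (backward skip); the master also
# keeps an ordinary forward skip from its bottom layer to its top.

def slave_forward(x, layers):
    h = x
    for layer in layers:
        h = layer(h)
    return h                            # high-level (top-layer) features

def master_forward(x, layers, top_context):
    h = x + top_context                 # backward skip: top feeds the bottom
    outputs = []
    for layer in layers:
        h = layer(h)
        outputs.append(h)
    return outputs[-1] + outputs[0]     # forward skip: bottom feeds the top

layers = [lambda h: 2 * h, lambda h: h + 1]     # stand-ins for conv layers
context = slave_forward(1.0, layers)            # (1.0 * 2) + 1 = 3.0
print(master_forward(1.0, layers, context))     # prints 17.0
```

Two passes over (copies of) the same layer stack is the price paid for letting low-level computation see high-level abstraction.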
DRNTU::Engineering::Electrical and electronic engineering