Scene segmentation with deep neural networks
Date of Issue: 2018
School of Electrical and Electronic Engineering
In this thesis, we address the challenging task of scene segmentation, which generally refers to parsing a scene image into a set of coherent semantic regions. Scene segmentation requires multi-level scene understanding, ranging from high-level semantic recognition to low-level boundary detection. Therefore, both high-level contextual modeling and the incorporation of low-level information are important to segmentation models. In addition, semantic classes appear with highly imbalanced frequencies in scene images, which further complicates the learning process of segmentation models. In this thesis, we focus on addressing these challenges inherent to scene segmentation. More specifically, we mainly explore how high-level context can be effectively incorporated in scene semantic segmentation. Meanwhile, we discuss how to effectively retain the low-level information that is essential for accurate low-level semantic parsing. We also explore the issue of class frequency imbalance, which is severe in scene semantic segmentation. Accordingly, we present three works, each of which addresses a part of these challenges.
First, we adopt a Patch-Convolutional Neural Network (Patch-CNN, which is trained from scratch with image patches) as our parametric model for local patch analysis. Based on the occurrence frequency distribution of classes, an ensemble of CNNs (CNN-Ensemble) is learned, in which each CNN component is trained with a different patch sampling strategy and focuses on learning different and complementary visual patterns. We notice that different CNN components perform significantly differently at parsing low-frequency semantic classes. Importantly, we observe a significant performance boost from CNN-Ensemble when parsing low-frequency classes.
Quantitatively, CNN-Ensemble achieves a significant 14% and 8% boost in average class accuracy in comparison with single CNNs. Furthermore, we leverage global scene semantics to alleviate local ambiguity. The global scene constraint is estimated in practice via a non-parametric framework. In the end, the integration of local and global beliefs gives rise to the class likelihood of pixels, based on which maximum marginal inference is performed to generate the label prediction maps. As our experiments demonstrate, the incorporation of global context significantly enhances the visual quality of the prediction maps. Specifically, it improves overall pixel accuracy by approximately 5% and 8% on the SiftFlow and Barcelona datasets respectively. Even without any post-processing, the proposed algorithm achieves very competitive results on public scene segmentation benchmarks.
Secondly, we discuss how to effectively capture the rich contextual dependencies over image regions. Specifically, we propose Directed Acyclic Graph - Recurrent Neural Networks (DAG-RNN) to perform context aggregation over locally connected feature maps. DAG-RNN is placed on top of a pre-trained CNN (the feature extractor) to embed context into local features so that their representative capability is enhanced. In comparison with a plain CNN (as in Fully Convolutional Networks - FCN), DAG-RNN is empirically found to be significantly more effective at aggregating context. Therefore, DAG-RNN demonstrates noticeable performance superiority over FCNs on scene segmentation. Besides, DAG-RNN entails dramatically fewer parameters and demands fewer computation operations, which makes it more suitable for resource-constrained embedded devices. Overall, the incorporation of DAG-RNN brings a significant 3% and 4% IOU (Intersection over Union) boost over the competitive FCN baseline on the SiftFlow and Pascal-Context datasets respectively.
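The three figures of merit quoted in this abstract (overall pixel accuracy, average class accuracy and IOU) can all be derived from a per-class confusion matrix. The following is a minimal sketch of their standard definitions, not the evaluation code used in the thesis; the function names and the toy labels are illustrative assumptions.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return (pixel accuracy, average class accuracy, mean IoU)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    class_acc, ious = [], []
    for i in range(n):
        gt = sum(cm[i])                          # pixels whose true label is i
        pred = sum(cm[r][i] for r in range(n))   # pixels predicted as i
        union = gt + pred - cm[i][i]             # true-or-predicted as i
        if gt > 0:                               # skip classes absent from ground truth
            class_acc.append(cm[i][i] / gt)
        if union > 0:
            ious.append(cm[i][i] / union)
    return (correct / total,
            sum(class_acc) / len(class_acc),
            sum(ious) / len(ious))


# Toy example (hypothetical data): class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
pixel_acc, avg_class_acc, mean_iou = segmentation_metrics(
    confusion_matrix(y_true, y_pred, 2))
```

In this toy case the pixel accuracy is high while the average class accuracy is much lower, because half of the rare class is misclassified. This is why class-averaged metrics are reported alongside pixel accuracy when class frequencies are imbalanced.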
Meanwhile, to address the imbalanced class occurrence frequency distribution, we propose a novel class-weighted loss to train the segmentation network. The loss assigns reasonably higher attention weights to infrequent classes during network training, which is essential to boost their parsing performance. In our ablation experiments, we observe a significant 3.2% ACA (average class accuracy) and 1.2% IOU disparity between DAG-RNN trained with and without the proposed loss, which clearly demonstrates its effectiveness in improving segmentation performance for rare classes. We evaluate our segmentation network on three challenging public scene segmentation benchmarks: SiftFlow, Pascal-Context and COCO Stuff. On all of them, we achieve state-of-the-art segmentation performance.
Finally, we consider that scene segmentation demands multi-level visual understanding, ranging from low-level (e.g. boundary detection) to high-level (e.g. general object recognition). We first propose a convolutional context network (CCN) and place it on top of pre-trained CNNs to aggregate context for high-level feature maps. We also discuss the limitations of the current parameterization of skip layers, which are used to retain low-level information. By slightly modifying the parameterization of skip layers, we demonstrate that a segmentation network equipped with our skip layers is a very promising architecture. In order to retain as much detailed low-level information as possible from the pre-trained CNN, we introduce a "dense skip" network architecture. We name our segmentation network the improved fully convolutional network (IFCN), based on its significantly enhanced structure over FCN. We carry out careful ablation studies to justify each contribution individually. Overall, IFCN outperforms the competitive FCN baseline by a significant margin of 7% and 8.8% IOU on the challenging ImageNet (ADE20K) and Pascal-Context datasets respectively.
We also compare their qualitative segmentation maps and demonstrate how IFCN achieves robust high-level as well as low-level parsing results. Without bells and whistles, IFCN achieves state-of-the-art results on the ADE20K, Pascal-Context and Pascal VOC 2012 segmentation datasets.
In summary, we have explored three key aspects that are of great importance for enhancing scene segmentation performance: incorporating high-level context (contextual modeling), retaining low-level information, and boosting recognition performance for rare classes. Step by step, we arrive at a state-of-the-art segmentation network, IFCN, which integrates, to the best of our research efforts, the best configuration of these three aspects.
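To illustrate the idea behind a class-weighted loss for rare classes, the sketch below weights each pixel's cross-entropy term by a per-class weight that grows as the class becomes less frequent. The abstract does not give the weighting formula, so the log-ratio rule, the `cap` parameter and all function names here are assumptions made for illustration only, not the loss proposed in the thesis.

```python
import math


def class_weights(pixel_counts, cap=10.0):
    """Hypothetical rule: 1 + ln(max frequency / class frequency), capped at `cap`."""
    total = sum(pixel_counts)
    freqs = [c / total for c in pixel_counts]
    max_freq = max(freqs)
    return [min(1.0 + math.log(max_freq / f), cap) if f > 0 else cap
            for f in freqs]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of per-pixel negative log-likelihoods.

    probs: one list of class probabilities per pixel.
    labels: ground-truth class id per pixel.
    weights: one weight per class.
    """
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


# Toy example (hypothetical data): three classes with very different frequencies.
counts = [9000, 900, 100]
weights = class_weights(counts)

probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
labels = [0, 1, 2]
plain_loss = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0])
weighted_loss = weighted_cross_entropy(probs, labels, weights)
```

With these toy numbers, the most frequent class keeps a weight of 1 while rarer classes receive larger weights, so errors on rare-class pixels contribute more to the loss than they would under plain cross-entropy.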
DRNTU::Engineering::Electrical and electronic engineering