Deep learning for video segmentation
Tan, Clement Xian Ren
Date of Issue: 2019-04-28
School of Computer Science and Engineering
In recent years, convolutional neural networks (CNNs) have become the dominant methodology for many Computer Vision tasks. Although CNNs are state-of-the-art in image classification and object detection, they face several limitations in semantic segmentation: the feature maps produced during training are usually coarse, and inference with the state-of-the-art DeepLab model runs at only about seven to eight frames per second (FPS), which is unsuitable for real-time applications such as self-driving cars. This final year project evaluates the effectiveness of atrous convolutions and the atrous spatial pyramid pooling (ASPP) module on CNNs for the task of semantic segmentation. Before training the CNN architectures, an analysis was carried out on the feature extractors and semantic segmentation architectures used in the project. Next, the DeepLabV2, DeepLabV3 and dilated MobileNetV2 architectures were trained and evaluated on the Computer Vision and Pattern Recognition (CVPR) Workshop on Autonomous Driving (WAD) 2018 Berkeley DeepDrive dataset. In addition, the Cityscapes dataset and a video recorded in Singapore were used to visualize the drivable-road segmentations. The DeepLabV3 and DeepLabV2 models used in this project achieved 84.30% and 78.83% validation mIOU respectively; these findings suggest that atrous convolution and the ASPP module boost mIOU accuracy substantially and may be reused in several other image classification architectures. When these techniques were incorporated into MobileNetV2, the resulting model achieved 76.10% validation mIOU, and the trade-off between accuracy and efficiency across the DeepLabV2 and MobileNetV2 architectures is discussed.
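To make the core idea concrete: an atrous (dilated) convolution inserts gaps between kernel taps so the receptive field grows without adding parameters or downsampling, and ASPP applies several such rates in parallel. The sketch below is purely illustrative and is not taken from the project's code; it shows a minimal 1-D atrous convolution in NumPy, with the function name and toy data being assumptions made for this example.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """Valid-mode 1-D convolution with an atrous (dilation) rate.

    rate=1 is an ordinary convolution; rate r inserts r-1 gaps between
    kernel taps, enlarging the receptive field from k to r*(k-1)+1
    while keeping the same k weights.
    """
    k = len(kernel)
    span = rate * (k - 1) + 1          # effective receptive field
    out_len = len(signal) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = sum(kernel[j] * signal[i + j * rate] for j in range(k))
    return out

x = np.arange(10.0)                    # toy input signal 0..9
w = np.array([1.0, 1.0, 1.0])          # 3-tap kernel

dense  = atrous_conv1d(x, w, rate=1)   # receptive field 3
atrous = atrous_conv1d(x, w, rate=2)   # receptive field 5, same 3 weights
print(dense[0], atrous[0])             # 3.0 6.0
```

An ASPP-style module would run this with several rates (e.g. 6, 12, 18 in DeepLab) over the same feature map and fuse the results, capturing multi-scale context without losing spatial resolution.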
DRNTU::Engineering::Computer science and engineering
Final Year Project (FYP)
Nanyang Technological University