Discovering class-specific visual patterns for visual recognition
Date of Issue: 2017-07-26
School of Electrical and Electronic Engineering
Similar to frequent patterns in data mining, a visual pattern is a recurring composition of visual content in images or videos, such as repetitive texture regions, common objects among images, or similar actions among videos. Such visual patterns capture the recurrent nature of visual data and can represent its essence. Finding them is critical to image and video data analysis. Despite the recent successes of unsupervised mining of representative visual patterns in unlabeled visual data, the patterns mined without supervision are often not discriminative enough to distinguish among different classes in visual recognition tasks. One natural way to overcome this limitation is to leverage supervised learning and discover class-specific visual patterns, which is the focus of this thesis. In particular, we aim to discover visual patterns of the following structures: (1) class-specific local spatial patterns, e.g., local texture structures that can help differentiate object images; (2) class-specific spatial layout patterns, e.g., spatial grid patterns that can help differentiate scene images; (3) class-specific visual patterns with compositional structures, e.g., conjunction (AND) and disjunction (OR) forms of individual visual features that can help differentiate scene images and action videos. To discover these class-specific visual patterns, this thesis comprises the following technical works. In the first work, we propose to mine mid-level visual phrases from low-level visual primitives, e.g., local image patches or regions, by leveraging the local spatial context of visual primitives, multi-feature fusion of visual primitives, and weakly supervised image label information. In the second work, we propose to discover class-specific spatial layouts for each scene category by formulating an l1-regularized max-margin optimization problem.
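For reference, a generic l1-regularized max-margin (hinge-loss) objective can be written as below. This is the textbook form and not necessarily the exact objective used in the second work. The symbols are assumptions introduced here for illustration: x_i is a feature vector for image i, y_i in {-1, +1} is its class label, w and b are the classifier weights and bias, and lambda controls the strength of the sparsity penalty.

```latex
\min_{\mathbf{w},\, b} \;\; \lambda \lVert \mathbf{w} \rVert_1
  + \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\bigr)
```

The l1 penalty tends to drive many entries of w to exactly zero, so the entries that remain nonzero can be read as a selected subset of the candidate features.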
In the third work, we propose a novel branch-and-bound based co-occurrence pattern mining algorithm that directly and simultaneously mines optimal conjunctions (AND) and disjunctions (OR) of individual features at arbitrary orders, with minimum classification error, for boosting. Similar to the third work, in the fourth work we aim to discover high-order AND/OR patterns of skeleton features from depth cameras for action recognition. We also propose to integrate the discovered AND/OR patterns into an attention LSTM model for temporal modeling to improve action recognition performance. Compared with unsupervised visual pattern discovery, which usually separates the steps of pattern discovery and classification, our methods jointly learn visual pattern discovery and visual recognition. Also, unlike conventional visual recognition, which focuses purely on classification performance, our class-specific visual patterns aim to capture the essence of different visual classes, so that we can not only recognize the visual classes but also explain and understand why they differ, thanks to the discovered class-specific visual patterns.
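To make the notion of AND/OR co-occurrence patterns concrete, the following minimal sketch scores conjunctions and disjunctions of binary features by their weighted classification error, as a boosting weak learner would. It is an illustration under stated assumptions, not the thesis algorithm: it uses plain exhaustive search up to a small order in place of branch-and-bound, and the function names, the toy data, and the `max_order` parameter are all hypothetical.

```python
from itertools import combinations

def pattern_response(row, features, mode):
    """Evaluate an AND or OR pattern over the binary features of one sample."""
    values = [row[f] for f in features]
    return all(values) if mode == "AND" else any(values)

def weighted_error(X, y, w, features, mode):
    """Weighted error of the rule: predict +1 if the pattern fires, else -1."""
    err = 0.0
    for row, label, weight in zip(X, y, w):
        pred = 1 if pattern_response(row, features, mode) else -1
        if pred != label:
            err += weight
    return err

def best_pattern(X, y, w, max_order=2):
    """Exhaustive search over AND/OR patterns up to max_order.

    Assumption: brute force stands in for the branch-and-bound search
    described in the abstract; it is only practical for tiny problems.
    """
    n_features = len(X[0])
    best = None
    for order in range(1, max_order + 1):
        for features in combinations(range(n_features), order):
            for mode in ("AND", "OR"):
                err = weighted_error(X, y, w, features, mode)
                if best is None or err < best[0]:
                    best = (err, mode, features)
    return best

# Toy data (hypothetical): the label is +1 exactly when
# feature 0 AND feature 1 are both present.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0]]
y = [1, -1, -1, 1, -1]
w = [1.0 / len(X)] * len(X)  # uniform sample weights, as in a first boosting round

err, mode, features = best_pattern(X, y, w)
print(err, mode, features)
```

In a boosting loop, the sample weights `w` would be updated after each round so that the next selected pattern concentrates on the samples the earlier patterns misclassified.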