Learning representations for human re-identification
Rama Varior, Rahul
Date of Issue: 2017-05-12
School of Electrical and Electronic Engineering
This thesis addresses the problem of human re-identification, the task of associating pedestrians across multiple camera views. Human re-identification is a particularly interesting problem because of its applications in visual surveillance. Given a probe image of a subject, the objective is to identify a set of matching images of the same subject from a gallery set, most of which are captured by a different camera. Instead of manually searching through images captured by various cameras, it is desirable to automate human re-identification, as this can save an enormous amount of manual labor. However, human re-identification is fundamentally a challenging problem due to cluttered backgrounds, ambiguity in visual appearance, and variations in illumination, pose, and viewpoint. The goal of this thesis is to present various feature learning architectures that tackle these challenges from different perspectives.

Public places are equipped with several thousand surveillance cameras capturing video around the clock. Since biometric cues such as fingerprints, gait, or facial features are not accessible from surveillance video, visual appearance is the main cue for re-identifying pedestrians. Intuitively, color features can be an important aspect in distinguishing pedestrians in surveillance video. However, varying illumination and environmental conditions pose a great challenge, as the perceived color of the subject may vary. In existing research, color features are used as is, i.e. features are extracted from raw pixel values or weakly corrected pixels. In the first part of the thesis, an invariant color feature learning framework is presented to efficiently map and encode the weakly corrected pixel values in an invariant space where the representations of similar colors are close to each other.

In the second part of the thesis, contextual information is incorporated into the local features.
Conventional features are extracted locally and independently of other regions, so they lack the global context of the image. It is therefore desirable to incorporate contextual information into the local features. To encode such information, Long Short-Term Memory (LSTM) cells, a variant of the Recurrent Neural Network architecture, are used. The gating mechanisms inside the LSTM cells have the flexibility to selectively propagate the relevant contextual information to the rest of the network.

To eliminate the need for hand-crafted features, an end-to-end trainable Siamese Convolutional Neural Network (S-CNN) architecture is first proposed. However, in conventional S-CNN architectures, the representations of the two images are compared only at the final stage, once the feature representations have matured. In this setting, the network is at risk of failing to capture and propagate subtle local patterns that can distinguish positive pairs from hard-negative pairs. Therefore, a novel gating mechanism is modeled to selectively boost and propagate such common patterns from the middle layers to the final layers of the network.

Extensive experimental evaluation and comparisons with baseline algorithms demonstrate the effectiveness of the proposed feature learning models.
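
The probe-to-gallery matching and the late comparison of a conventional Siamese setup described above can be illustrated with a minimal sketch. It is not the thesis's model: the toy linear-plus-ReLU `embed` function, the `WEIGHTS` matrix, the three-value "images", and the gallery identifiers are all hypothetical stand-ins chosen for illustration. The sketch only shows that both images pass through the same shared embedding and are compared once, at the final feature level.

```python
import math

def embed(image, weights):
    """Shared embedding branch: one linear layer followed by ReLU.

    Hypothetical toy stand-in for the CNN branch of a Siamese network;
    the same weights are applied to every image.
    """
    return [max(0.0, sum(w * x for w, x in zip(row, image))) for row in weights]

def distance(a, b):
    """Euclidean distance between two final-layer feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(probe, gallery, weights):
    """Rank gallery identifiers by distance to the probe.

    The two images only interact here, after both have been fully
    embedded, which is the late comparison of a conventional Siamese setup.
    """
    probe_feat = embed(probe, weights)
    scored = [(distance(probe_feat, embed(img, weights)), gid)
              for gid, img in gallery.items()]
    return [gid for _, gid in sorted(scored)]

# Assumed toy data: each "image" is a 3-value vector, not real pixels.
WEIGHTS = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
probe = [0.9, 0.1, 0.3]
gallery = {
    "cam2_a": [0.8, 0.2, 0.9],
    "cam2_b": [0.1, 0.9, 0.3],
    "cam2_c": [0.5, 0.5, 0.5],
}

ranking = rank_gallery(probe, gallery, WEIGHTS)
print(ranking)
```

Because the embedding of each image is computed independently, local patterns shared by the two images cannot influence the intermediate features; this is the limitation that the gating mechanism in the thesis is meant to address.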