Challenging issues in classification problems: sparsity control, key instance detection, and imbalanced data
Date of Issue: 2013
School of Computer Engineering
Centre for Computational Intelligence
This thesis deals with the difficulties in classification problems caused by three types of sparsity characteristics: feature, label, and instance sparsity. First, feature sparsity is usually exploited as prior knowledge by inducing sparsity in the parameters of the learned model. We show that only an appropriate degree of parameter sparsity is beneficial, and that both over-sparsity and under-sparsity are harmful for classification. Second, label sparsity means that only a fraction of the training instances are labeled, which causes classic classification methods to fail. Third, instance sparsity is caused by an imbalanced composition of categories, in which instances from one category significantly outnumber those from the other. This biases the classification boundary towards the majority category. Consequently, three contributions (sparsity control, key instance detection, and imbalanced classification) are presented to address these challenges.
Sparsity control aims to regularize the sparsity of the model parameters to an appropriate level according to the intrinsic feature sparsity in the data. It is based on the observation that parameter sparsity is not always desirable in real problems, and that only a proper degree of sparsity is beneficial. To address this issue, we propose a novel probit classifier using generalized Gaussian scale mixture (GGSM) priors. Tuning the shape parameter of the GGSM adjusts the induced sparsity, so the classifier can provide either a sparse or a non-sparse solution depending on the intrinsic feature sparsity. Model learning is carried out by an efficient modified maximum a posteriori estimation. We show how the proposed approach relates to previous methods. We also study different types of likelihood combined with the GGSM priors in a kernel-based setup, on the basis of which an improved kernel-based approach is presented. Experiments demonstrate that the proposed method has better or comparable performance in both linear and non-linear classification.
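As a rough illustration of how a shape parameter can control the sparsity of a probit classifier, the sketch below fits a linear probit model by minimizing a penalized negative log-likelihood. It rests on several assumptions that are not taken from the thesis: a plain generalized Gaussian penalty of the form sum |w_j|^q stands in for the GGSM prior, ordinary gradient descent on a smoothed penalty stands in for the modified maximum a posteriori estimation, and the names `penalty`, `neg_log_posterior`, `fit_map`, `lam`, `q`, and `eps` are hypothetical.

```python
import math
import numpy as np

# Vectorized error function built from the standard library (NumPy has no erf).
_erf = np.vectorize(math.erf)

def probit_cdf(z):
    """Standard normal CDF, the probit link."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def probit_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def penalty(w, q, eps=0.0):
    """Smoothed generalized Gaussian penalty: sum_j (w_j^2 + eps)^(q/2).

    Assumption: this stands in for the negative log of the GGSM prior.
    q = 2 gives a ridge-like (non-sparse) penalty, q = 1 a lasso-like one.
    """
    return float(np.sum((w**2 + eps) ** (q / 2.0)))

def neg_log_posterior(w, X, y, lam, q, eps=1e-6):
    """Mean probit negative log-likelihood plus the weighted penalty."""
    z = y * (X @ w)
    nll = -np.mean(np.log(np.clip(probit_cdf(z), 1e-12, 1.0)))
    return float(nll + lam * penalty(w, q, eps))

def fit_map(X, y, lam=0.01, q=1.0, lr=0.1, n_iter=500, eps=1e-6):
    """Gradient descent on neg_log_posterior; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = y * (X @ w)
        # Derivative of -log Phi(z) with respect to z is -phi(z) / Phi(z).
        ratio = probit_pdf(z) / np.clip(probit_cdf(z), 1e-12, 1.0)
        grad_nll = -(X * (y * ratio)[:, None]).mean(axis=0)
        # Derivative of (w^2 + eps)^(q/2) is q * w * (w^2 + eps)^(q/2 - 1).
        grad_pen = lam * q * w * (w**2 + eps) ** (q / 2.0 - 1.0)
        w -= lr * (grad_nll + grad_pen)
    return w
```

Calling `fit_map(X, y, q=1.0)` and `fit_map(X, y, q=2.0)` on the same data lets one compare a sparsity-inducing shape with a non-sparse one. The fixed step size and the smoothing constant `eps` are convenience choices for the sketch and would need tuning for smaller values of `q`.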
DRNTU::Engineering::Computer science and engineering::Computer applications::Computers in other systems