Semi-supervised clustering techniques for categorization of text documents
Date of Issue: 2015
School of Electrical and Electronic Engineering
Data mining has become a very important research field for knowledge discovery. Among the various data mining techniques, we focus on how a small amount of prior knowledge can be effectively incorporated into some popular clustering techniques, not only to improve the existing models but also to develop novel semi-supervised clustering methods, especially for the categorization of high-dimensional text documents. More specifically, our objective is to investigate the key performance characteristics of a good semi-supervised clustering method and to make them achievable in our proposed methods: effectively incorporating the knowledge to guide the cluster search, accurately capturing the underlying structure of the data, and handling overlaps well. In other words, our final goal is to develop simple, fast and widely applicable clustering methods that use the prior knowledge to produce high-quality results. Fuzzy co-clustering (FCC) is a type of clustering approach that has shown its capability for categorizing high-dimensional textual data by simultaneously grouping the documents and words into co-clusters. For a document that intuitively spans multiple topics, FCC is also able to capture the degree of membership of that document in each topic. Under the FCC framework, we propose three semi-supervised approaches: Semi-Supervised Fuzzy Co-clustering with Labelling (SS-FCL), Semi-Supervised Fuzzy Co-clustering with Constraints (SS-FCC), and Dual Semi-Supervised Heuristic Fuzzy Co-clustering with Ruspini’s condition (DSS-HFCR). Two types of prior knowledge from the document domain, in the form of class labels and pair-wise constraints, are incorporated into SS-FCL and SS-FCC through different additional supervised constraint terms. 
In addition to the categorization results for the documents, these two approaches also generate a group of word ranking clusters, which are useful in other data mining tasks such as text summarization. A heuristic dual-partitioning approach called DSS-HFCR is also proposed in order to make full use of the available prior knowledge, in the form of pair-wise constraints, from both the document and the word domains. Moreover, DSS-HFCR can be directly reduced to a simplified version if prior knowledge is available from only a single domain. Through an extensive experimental study on a number of benchmark textual datasets, we demonstrate how these approaches make good use of the knowledge to guide the cluster search during the clustering process, yielding better performance in terms of both accuracy and efficiency. Some useful guidelines for parameter selection are also discussed. A case study applying DSS-HFCR to sentiment data analysis demonstrates the strength of our proposed methods in that specific application area. The FCC model is suitable for handling large sparse textual datasets, as it avoids applying an explicit similarity measure between two documents. However, the similarity measure is still one of the most essential factors in many discriminative clustering approaches, and most of these approaches still use only a single reference point (viewpoint), i.e. the origin, for similarity assessment. It is interesting and challenging to explore more effective similarity measures for high-dimensional textual data, especially when some prior knowledge is available to the user. In the second part of the thesis, inspired by a recently proposed multi-viewpoint based similarity measure (MVS), we introduce another novel semi-supervised clustering framework, which is able to use multiple appropriate viewpoints for a more informative and effective similarity assessment by incorporating two types of knowledge. 
With the help of a small number of class labels or pair-wise constraints in the dataset, we formulate two MVS measures and subsequently propose two new MVS-based clustering approaches: Label-based Clustering with Multi-Viewpoint based Similarity (LMVS) and Pair-wise Constraints-based Clustering with Multi-Viewpoint based Similarity (PMVS). Compared with existing semi-supervised clustering techniques, the key strength of LMVS and PMVS is that a more effective similarity measure can be formulated directly in the MVS manner with the help of the knowledge and applied immediately to clustering, rather than being learned by an independent distance metric learning process before the actual clustering is carried out. Validity tests are conducted to show the strength of the measures, and a systematic theoretical analysis is provided to explain how the prior knowledge is used both to enhance the similarity measure and to guide the search during the clustering process. At the same time, some previously reported potential issues of MVSC can be successfully addressed. Finally, extensive experimental studies on a large number of benchmark textual datasets are presented to demonstrate the effectiveness and verify the merit of LMVS and PMVS clustering, compared with other state-of-the-art semi-supervised clustering and learning approaches.
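
To make the contrast between a single viewpoint (the origin) and multiple viewpoints concrete, the following is a minimal sketch of the general multi-viewpoint idea only. It is not the LMVS or PMVS formulation from the thesis, which the abstract does not specify. It assumes that the similarity of two documents is the average, over a set of viewpoint documents d_h, of the dot product (d_i - d_h) . (d_j - d_h). The function name `multi_viewpoint_similarity` and the toy vectors are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def multi_viewpoint_similarity(d_i: Vector, d_j: Vector,
                               viewpoints: Sequence[Vector]) -> float:
    """Average of (d_i - d_h) . (d_j - d_h) over all viewpoints d_h.

    Assumption: this is the generic multi-viewpoint form; how the
    viewpoints are chosen from labels or constraints is not shown here.
    """
    if not viewpoints:
        raise ValueError("at least one viewpoint is required")
    total = 0.0
    for d_h in viewpoints:
        # Dot product of the two documents as seen from viewpoint d_h.
        total += sum((a - h) * (b - h) for a, b, h in zip(d_i, d_j, d_h))
    return total / len(viewpoints)


# Toy 2-dimensional document vectors (hypothetical data).
doc_a = [1.0, 0.0]
doc_b = [0.8, 0.6]
doc_c = [0.0, 1.0]

# With the origin as the only viewpoint, the measure is the plain dot product.
sim_origin = multi_viewpoint_similarity(doc_a, doc_b, [[0.0, 0.0]])

# With doc_c as the viewpoint, doc_a and doc_b are compared relative to doc_c.
sim_from_c = multi_viewpoint_similarity(doc_a, doc_b, [doc_c])
```

In a semi-supervised setting, one plausible use of labels would be to pick viewpoints from documents known to belong to other classes, but that choice is an assumption of this sketch rather than something stated in the abstract.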