Representation learning for sentences and documents
Date of Issue: 2017-09-18
School of Electrical and Electronic Engineering
With the exponential growth of the Internet, more than one exabyte of data is created online each day. Among the various kinds of data, text constitutes a large portion and plays an important role in people's daily lives. Text mining and natural language processing (NLP) aim to draw insights from these text data by computer, largely by means of machine learning. It is well known that data representation determines the upper bound of the performance of machine learning algorithms. For natural language, how to define fixed-length representations for variable-length language units is the keystone of various text mining tasks, including text classification and document clustering.

In NLP, one widely adopted representation is the Bag-of-Words (BoW) model. However, it suffers from intrinsic extreme sparsity and is unable to capture the semantic information behind text data. Previous research has focused on feature extraction methods that derive dense, low-dimensional vectors from the original BoW vectors. More recently, with the development of deep learning models, representation learning for different levels of text units has been redefined. Word embeddings attempt to encode the semantic and syntactic information of words into a low-dimensional, dense vector space. For sentences and documents, multi-layer neural networks, including recursive, recurrent and convolutional neural networks, take sequences of words as input and learn to perform compositions over word embeddings to derive distributed representations.

My research in this thesis aims at learning representations for sentences and documents. The models introduced in this thesis are all developed under the two frameworks above: BoW-based models and neural compositional models. They are compared with various state-of-the-art models in comprehensive experiments including sentiment analysis and topic categorization.
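The contrast between the two frameworks above can be sketched in a few lines. The tiny vocabulary and 4-dimensional embeddings below are purely illustrative assumptions (real embeddings are learned and typically have hundreds of dimensions); the sketch only shows why a count-based BoW vector is sparse while an embedding-based representation is dense.

```python
# Sketch: sparse Bag-of-Words vector vs. a dense embedding-based vector.
# Vocabulary and embedding values are hypothetical, for illustration only.
import numpy as np

vocab = ["good", "great", "bad", "movie", "plot"]
word_index = {w: i for i, w in enumerate(vocab)}

def bow_vector(tokens):
    """Count-based BoW: one dimension per vocabulary word, mostly zeros."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in word_index:
            v[word_index[t]] += 1
    return v

# Hypothetical dense word embeddings (in practice, learned from data).
embeddings = {
    "good":  np.array([0.9, 0.1, 0.0, 0.2]),
    "great": np.array([0.8, 0.2, 0.1, 0.3]),
    "bad":   np.array([-0.7, 0.1, 0.2, 0.0]),
    "movie": np.array([0.0, 0.9, 0.4, 0.1]),
    "plot":  np.array([0.1, 0.8, 0.5, 0.0]),
}

def averaged_embedding(tokens):
    """A simple dense document vector: the mean of its word embeddings."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

doc = ["good", "movie"]
print(bow_vector(doc))          # sparse: [1. 0. 0. 1. 0.]
print(averaged_embedding(doc))  # dense:  [0.45 0.5  0.2  0.15]
```

With a realistic vocabulary of tens of thousands of words, the BoW vector has that many dimensions and almost all entries are zero, whereas the dense vector stays low-dimensional regardless of vocabulary size.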
We also investigate the incorporation of domain knowledge into text representation learning to address one specific task: cyberbullying detection. Our proposed approaches are able to learn robust and discriminative representations of bullying messages.

Chapter 1 is an introductory chapter that presents the motivation behind representation learning for text data. Chapter 2 covers the relevant techniques in this field, including the two major research directions: feature extraction over Bag-of-Words features and neural compositional models based on word embeddings.

Chapter 3 investigates the efficiency problems of feature extraction for high-dimensional data. We propose a semi-random projection framework called SRP, which retains the random feature sampling of random projection but employs a learning mechanism to determine the transformation matrix, with the goal of achieving a good balance between computational efficiency and classification accuracy in text categorization tasks.

Chapter 4 focuses on neural compositional models for sentence embeddings. Standard approaches train various forms of neural networks directly on top of word embeddings. However, by taking single-prototype word embeddings as input, they neglect the multi-sense nature of words entirely. Hence, we explore the integration of knowledge learned from topic models into neural sentence models. With the help of Latent Dirichlet Allocation (LDA), topic-specific information at the word level (before composition) and at the sentence level (after composition) is concatenated with the general embeddings. Two neural sentence models, Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), are investigated to develop two corresponding methods: Topic-Aware CNN and Topic-Aware LSTM. Comprehensive experiments over five sentence classification tasks, including sentiment analysis and topic categorization, are conducted.
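The word-level concatenation step described for Chapter 4 can be sketched as follows. All dimensions, embeddings and the soft topic-mixture scheme here are illustrative assumptions, not the thesis implementation; the point is only that each word's input to the neural sentence model becomes its general embedding concatenated with a topic-specific part.

```python
# Sketch: topic-aware input for a neural sentence model. A word's general
# embedding is concatenated with a mixture of topic vectors weighted by
# its topic posterior (e.g. from LDA). Names and sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n_topics, embed_dim, topic_dim = 3, 4, 3

# Hypothetical general word embedding and per-topic vectors.
general = {"bank": rng.normal(size=embed_dim)}
topic_vectors = rng.normal(size=(n_topics, topic_dim))

def topic_aware_input(word, topic_posterior):
    """Concatenate the general embedding with a soft mixture of topic
    vectors weighted by the word's topic posterior in this context."""
    topic_part = topic_posterior @ topic_vectors   # shape: (topic_dim,)
    return np.concatenate([general[word], topic_part])

# "bank" in a context whose posterior mass falls mostly on topic 0:
x = topic_aware_input("bank", np.array([0.8, 0.1, 0.1]))
print(x.shape)  # (7,) — embed_dim + topic_dim
```

The same word thus receives different inputs in different contexts, since its topic posterior changes with the surrounding sentence, which is what lets the model distinguish word senses.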
Chapter 5 relaxes the BoW model's assumption that words are independent. We introduce fuzzy logic into the conventional BoW model and develop the Fuzzy BoW (FBoW) and Fuzzy Bag-of-Word-Clusters (FBoWC) models. In FBoW, a fuzzy mapping from the words in a document to a set of basis terms is adopted. In FBoWC, fuzzy logic is applied not only to this mapping but also to the selection of basis terms. To implement the fuzzy logic, word embeddings are used to measure semantic similarities among words and to construct fuzzy membership functions of the basis terms over the words in the task-specific corpus. We verify the performance of our approaches on seven multi-label document categorization tasks.

In Chapter 6, we explore the application of text mining to cyberbullying detection. In this meaningful research area, the critical issue is learning robust and discriminative numerical representations of text messages. We propose two new representation learning methods to tackle this problem: the embeddings-enhanced bag-of-words model (EBoW) and the semantic-enhanced marginalized stacked denoising autoencoder (smSDA). Comprehensive experiments on two public cyberbullying corpora (Twitter and MySpace) are conducted, and the results show that our proposed approaches outperform other baseline text representation learning methods. A summary of our work and future directions is given in Chapter 7.
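The fuzzy mapping idea behind FBoW in Chapter 5 can be sketched in miniature. Instead of a hard one-hot count, each word contributes to every basis-term dimension according to a membership degree derived from word-embedding similarity. The two-dimensional embeddings and the clipped-cosine membership function below are illustrative assumptions, not the thesis's exact formulation.

```python
# Sketch: fuzzy Bag-of-Words. A word absent from the basis still activates
# semantically similar basis terms. Embeddings here are hypothetical.
import numpy as np

embeddings = {
    "excellent": np.array([0.9, 0.1]),
    "good":      np.array([0.8, 0.3]),
    "terrible":  np.array([-0.9, 0.2]),
}
basis_terms = ["good", "terrible"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def membership(word, term):
    """Fuzzy membership of `word` in basis term `term`: clipped cosine
    similarity between their embeddings (an assumed membership function)."""
    return max(0.0, cosine(embeddings[word], embeddings[term]))

def fuzzy_bow(tokens):
    """Each token adds its membership degree to every basis-term dimension,
    rather than incrementing a single hard count."""
    v = np.zeros(len(basis_terms))
    for t in tokens:
        for j, term in enumerate(basis_terms):
            v[j] += membership(t, term)
    return v

# "excellent" is not a basis term, yet it strongly activates "good":
print(fuzzy_bow(["excellent"]))
```

A conventional BoW over the same basis would map "excellent" to the all-zero vector; the fuzzy mapping preserves its semantic signal, which is exactly the independence assumption Chapter 5 relaxes.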