Concept-based embeddings for natural language processing
Date of Issue: 2018
School of Computer Science and Engineering
Rolls-Royce@NTU Corporate Lab
Concepts carry critical semantics that capture the high-level knowledge of human language. As a way to go beyond word-level analysis, representing and leveraging concept-level information is an important addition to existing natural language processing (NLP) systems. In particular, concepts are critical for understanding people's opinions. For example, in online reviews people express opinions about particular entities such as products or sentiment aspects, and these entities are mentions of concepts rather than just words. Compared with words, mentions of abstract concepts may be compound phrases (either consecutive or non-consecutive), which are likely to form a large vocabulary. Furthermore, semantic properties (e.g., relations or attributes) may be attached to the concepts, which increases their dimensionality. In short, using concepts faces the curse of dimensionality. On the other hand, information from a single level does not suffice for a thorough understanding of human language, and a meaningful representation is required to encode the correlation and dependency between abstract concepts and words. In this thesis, we therefore focus on effectively leveraging and integrating concept-level and word-level information by projecting concepts and words into a lower-dimensional space while retaining the most critical semantics. In the broad context of an opinion understanding system, we investigate the use of the fused embeddings for several core NLP tasks: named entity detection and classification, automatic speech recognition reranking, and targeted sentiment analysis. We first propose a novel method to inject entity-based information into a word embedding space. The word embeddings are learned from a set of named entity features instead of merely contextual words.
We demonstrate that the new word embedding is a better feature representation for detecting and classifying named entities in streams of telephone conversations. Beyond learning input feature embeddings, we then explore encoding the entity types (i.e., concept categories) in a label embedding space. Our label embeddings mainly leverage two types of information: label hierarchy and label prototypes. Since our label embedding is computed prior to the training process, it leaves the run-time computational complexity unchanged. We evaluate the resulting label embeddings on multiple large-scale datasets built for fine-grained named entity typing. Compared with state-of-the-art methods, our label embedding method achieves superior performance. Next, we demonstrate that a binary embedding of named entities can help rerank speech-to-text hypotheses. Named entities are encoded using a Restricted Boltzmann Machine (RBM) and used as prior knowledge in the discriminative reranking model. We also extend RBM training to work with speech recognition hypotheses. Finally, we investigate the use of embeddings of commonsense concepts for targeted sentiment analysis. This task is also entity-centered: given a target entity in a sentence, the task is to resolve the correct aspect categories and the corresponding sentiment polarity of the target. We propose a new computation structure for Long Short-Term Memory (LSTM) that incorporates embeddings of commonsense knowledge more effectively. In summary, this thesis proposes novel solutions for representing and leveraging concept-level and word-level information in a series of NLP tasks that are key to understanding people's opinions.
DRNTU::Engineering::Computer science and engineering