Automatic taxonomy construction from textual documents
Luu, Anh Tuan
Date of Issue: 2017-04-07
School of Computer Science and Engineering
The explosion of unstructured text data makes it difficult to find information of interest. To provide effective access to information, it is important to organize unstructured data in a structured and meaningful manner. Taxonomies, which serve as the backbone of structured knowledge, organize domain knowledge into a hierarchy of ‘is-a’ relations between terms and are useful for many NLP applications such as question answering and document clustering. A growing number of public hand-crafted taxonomies, such as WordNet and Freebase, are now available. In practice, however, it is more effective to use a taxonomy created specifically for the domain of interest than to reuse one created for other tasks or domains. As such, we often face the challenge of creating a brand new taxonomy for a specific domain from scratch.

In this thesis, we propose an effective framework for automatic domain-specific taxonomy construction from textual documents, which consists of three steps: domain term extraction, taxonomic relation identification and taxonomy induction. Domain term extraction aims to extract the relevant domain terms from a given text collection for a specific domain. Taxonomic relation identification aims to identify the taxonomic relations (i.e. ‘is-a’ relations) among domain terms. Taxonomy induction aims to construct the taxonomy structure from the identified taxonomic relations. We use a big data approach that combines linguistic, statistical and deep learning methods to address the challenges in these steps.

The main contributions of our research are summarized as follows:

- We proposed a Web-based method to extract domain terms from a given text collection. Building on that, we proposed a method that uses the contextual information of the terms in syntactic structures to detect taxonomic relations across sentence boundaries. In addition, we proposed a novel graph-based algorithm to organize the extracted taxonomic relations into an optimal taxonomy tree. The experimental results show that the proposed method complements previous methods based on linguistic pattern matching and significantly improves recall and F-measure.

- We studied two important aspects that can greatly affect the performance of taxonomy construction methods. The first is the trustiness of individual source texts, which is important for filtering out incorrect relations from unreliable sources. The second is the collective evidence from synonyms and contrastive terms: synonyms provide additional support for taxonomic relation identification, while contrastive terms may contradict it. We proposed an approach to incorporate these features into taxonomy construction, which improves F-measure by 4% to 10%.

- We proposed a time-aware approach to extract temporal information and integrate it into the process of identifying taxonomic relations, by employing a timestamp contribution function to measure the evidence scores of source texts at a particular time. Experimental results show that our proposed approach outperforms the state-of-the-art methods on F-measure by 7% to 20%. Furthermore, the proposed approach can incrementally and continuously update the taxonomy by adding fresh relations from new data and removing outdated relations, using a proposed information decay function. It thus avoids rebuilding the whole structure from scratch for every update and keeps the taxonomy up to date with the latest information trends.

- We proposed a novel unsupervised approach to construct taxonomies based on word embedding clustering, using three word embedding measures (semantic clusters, taxonomic centroids and relative distances from the root) to identify the semantic relationships between terms and their hypernyms. Our proposed approach significantly outperforms the state-of-the-art methods in terms of recall and F-measure.

- We proposed an approach to learn word embeddings for taxonomic relations based on the contextual words between the hypernym and hyponym, using a dynamic weighting neural network. Our proposed approach outperforms the state-of-the-art methods by 9% to 13% in terms of accuracy on both general and domain-specific datasets.
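To make the idea of incremental updating with an information decay function more concrete, the following is a minimal sketch only. The abstract does not give the form of the decay function, so exponential decay with a half-life is an assumption made here for illustration; the names `Evidence`, `decayed_score` and `update_taxonomy`, the `half_life` and `threshold` parameters, and the example terms are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence for a candidate 'is-a' relation (hypothetical type)."""
    hyponym: str
    hypernym: str
    score: float      # evidence score assigned when the source text was processed
    timestamp: float  # time of the source text, in years


def decayed_score(ev: Evidence, now: float, half_life: float = 2.0) -> float:
    """Assumed exponential decay: evidence halves in weight every `half_life` years."""
    age = max(0.0, now - ev.timestamp)
    return ev.score * 0.5 ** (age / half_life)


def update_taxonomy(evidence, now, threshold=0.5, half_life=2.0):
    """Return the relations whose total decayed evidence is at least `threshold`.

    Fresh evidence adds relations; relations backed only by old evidence
    fall below the threshold and are dropped.
    """
    totals = {}
    for ev in evidence:
        key = (ev.hyponym, ev.hypernym)
        totals[key] = totals.get(key, 0.0) + decayed_score(ev, now, half_life)
    return {rel for rel, total in totals.items() if total >= threshold}


evidence = [
    Evidence("tablet", "computer", 1.0, 2016.0),  # recent source text
    Evidence("pda", "computer", 1.0, 2006.0),     # old source text
]
current_relations = update_taxonomy(evidence, now=2017.0)
```

In this sketch only the per-relation totals need to be re-scored when new evidence arrives, which is one simple way to avoid rebuilding the whole structure on every update; the thesis itself may realize this quite differently.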