Active learning with applications in biomedical document annotation
Date of Issue: 2017-05-18
School of Computer Science and Engineering
Bioinformatics Research Centre
The rapidly increasing volume of published biomedical research literature makes it difficult for individual biomedical researchers to keep up to date with the latest developments in their own research fields. With the advancement of natural language processing and Semantic Web technologies, increasingly sophisticated biomedical natural language processing systems are becoming available to lessen the burden of information overload that individual researchers face. However, building and evaluating such biomedical text mining systems requires manually annotated corpora, and constructing these corpora is time-consuming and expensive because it demands considerable, tedious effort from human annotators.
Active learning is an approach that addresses this issue by helping annotators reduce the time and effort needed for corpus annotation. In the widely used passive learning method, documents are selected randomly and independently from the underlying distribution. In active learning, by contrast, a selection module repeatedly queries the unannotated documents to single out the most informative document for manual annotation, and then updates its learned rules, so as to maximize overall efficiency.
In this study, we first propose a document scoring based active learning method for ontological event extraction. Our method significantly reduces the amount of annotated data needed to saturate event extraction performance, compared with both random selection of documents for annotation, which is the common practice, and previous active learning methods for corpus selection. We evaluated all the active learning methods using the TEES event extraction system on the BioNLP Shared Tasks datasets, showing that our method helps the system reach its previously reported performance with only 60%-70% of the original training data.
We then propose a committee-based active learning method for event extraction and named entity recognition. The method is based on two systems. We first employ an event extraction system to filter potential false negatives among unlabeled documents, that is, documents from which the system does not extract any event. We then adopt a statistical method to rank these potential false negatives 1) by using a language model that measures the probability that a document expresses multiple events and 2) by using a named entity recognition system that locates named entities that can be event arguments (e.g., proteins). The proposed method further deals with unknown words in test data by using word similarity measures. We also apply our active learning method to the task of named entity recognition. We evaluate the proposed method on the BioNLP Shared Tasks datasets and show that it achieves better performance than previous methods, namely entropy based and Gibbs error based methods and a conventional committee-based method. We also show that incorporating named entity recognition into active learning for event extraction, together with the unknown word handling, further improves the active learning method. In addition, adapting the active learning method to named entity recognition tasks also improves document selection for manual annotation of named entities.
Finally, we propose a novel clustering based active learning method for the biomedical NER task. We show that the underlying NER system using the proposed method outperforms the same system using other state-of-the-art active learning methods, including density, Gibbs error and entropy based approaches, as well as random selection. We compare variations of our proposed method and identify the optimal design, which uses the vector representation of named entities, selects documents that are both ‘representative’ and ‘informative’, and uses the Shared Nearest Neighbor (SNN) clustering approach. In particular, the optimal variant of the proposed method achieves a deficiency gain of 36.3% over random selection.
The proposed active learning method is a promising research direction, and we will conduct further research to exploit its full potential.
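The contrast the abstract draws between passive (random) selection and active selection of the most informative document can be illustrated with a minimal sketch. It is not the document scoring, committee-based, or clustering based method proposed in the thesis. It uses plain entropy-based uncertainty, one of the baseline families the abstract mentions, and averages it over a document. The names `document_score`, `select_active`, `select_passive`, and the toy `pool` of predicted label distributions are all hypothetical and chosen only for illustration.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def document_score(distributions):
    """Hypothetical document score: mean entropy of the model's predictions.

    `distributions` is a list of label distributions, one per prediction
    made inside the document (an assumption of this sketch).
    """
    if not distributions:
        return 0.0
    return sum(entropy(d) for d in distributions) / len(distributions)


def select_active(pool):
    """Active selection: id of the unannotated document with the highest score."""
    return max(pool, key=lambda doc_id: document_score(pool[doc_id]))


def select_passive(pool, rng):
    """Passive baseline: a document id drawn uniformly at random."""
    return rng.choice(sorted(pool))


# Toy pool of unannotated documents with made-up model predictions.
pool = {
    "doc_a": [[0.90, 0.10], [0.95, 0.05]],  # model is confident
    "doc_b": [[0.50, 0.50], [0.60, 0.40]],  # model is uncertain
    "doc_c": [[0.80, 0.20], [0.70, 0.30]],
}

chosen = select_active(pool)                       # most uncertain document
baseline = select_passive(pool, random.Random(0))  # random baseline pick
```

In a full loop, the chosen document would be sent to a human annotator, added to the training set, and the model retrained before the next query. That retraining step is omitted here because it depends on the underlying extraction system.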