Budget efficient online active learning and its applications
Date of Issue: 2017-01-24
Interdisciplinary Graduate School (IGS)
NTU-UBC Research Center of Excellence in Active Living for the Elderly
Online Active Learning (OAL) is an important research area in machine learning that aims to minimize the number of labeled instances while maximizing predictive performance. OAL combines the efficiency and effectiveness of online learning with the labeling frugality of active learning. Owing to these advantages, OAL has been widely used in large-scale real-world applications such as information retrieval, data mining, and recommendation systems. However, several problems remain in current OAL designs. First, in the setting of online learning with expert advice, most existing OAL algorithms assume that all experts are comparably reliable, which is usually not true in practice; for example, noisy workers are common on crowdsourcing platforms. To relax this assumption, this study proposes two robust online active learning algorithms that consider not only the experts' predictions on the current instance but also their cumulative performance on past instances. A series of experiments shows that the proposed algorithms greatly outperform existing state-of-the-art algorithms and achieve robust performance in both normal and noisy scenarios. Second, to obtain reliable labels in crowdsourcing, most algorithms require either a set of golden questions to filter out noisy workers or several labels for each instance. Both requirements are costly in money and time. To save costs, a framework of Active Crowdsourcing for Annotation (ACA) is proposed, based on online learning with expert advice. The framework consists of two main components: “Who to label” and “When to query”. The first component actively allocates each instance to reliable workers for labeling, and the second actively decides which instances are worth using as golden questions.
Empirical studies on both simulated and real-world crowdsourcing datasets show that the proposed framework can robustly learn the reliability of each worker and allocate tasks to the more reliable workers. Third, in the typical online learning setting, most OAL algorithms adopt margin-based query strategies, which usually assume that the model is well trained and the margin value is accurate. However, this assumption often fails in practice, for example in the early training phase. To relax this assumption, a second-order online active learning algorithm is proposed that considers not only the margin value of the current instance but also the confidence of the current model. To validate the algorithm, a theoretical mistake bound is provided and a set of empirical studies is conducted on real-world datasets. Both the theoretical and the empirical studies indicate that the proposed second-order algorithm achieves the best accuracy. Last, for the online relative similarity learning problem, most studies assume that large-scale labeled triplets are available. However, labeling datasets is usually costly and time-consuming, especially for large-scale similarity learning problems. To reduce this cost, this study proposes two online active relative similarity learning algorithms: (i) the first-order Passive-Aggressive Active Similarity learning (PAAS) algorithm and (ii) the second-order Confidence-Weighted Active Similarity learning (CWAS) algorithm. To validate their effectiveness, the proposed algorithms are first analyzed theoretically and then evaluated empirically on several real-world applications. The experiments show that PAAS and CWAS can greatly reduce the labeling cost of relative similarity learning.
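As a rough illustration of the margin-based query strategies mentioned above, the following is a minimal generic sketch and not one of the algorithms proposed in this study. It assumes a linear model, binary labels in {-1, +1}, a randomized rule that queries a label with probability b / (b + |margin|), and a first-order passive-aggressive style update. The function names and the parameter `b` are hypothetical choices made only for this example.

```python
import random


def dot(w, x):
    """Inner product of two equal-length sequences of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def margin_based_oal(stream, dim, b=1.0, seed=0):
    """Generic margin-based online active learner (illustrative only).

    stream: iterable of (x, y) pairs with y in {-1, +1}. The label y is
            used for an update only when the learner decides to query it.
    b:      assumed positive smoothing parameter; a larger b queries more.
    Returns the weight vector, the number of queries, and the number of
    prediction mistakes (counted with the true labels, for evaluation).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    queries = 0
    mistakes = 0
    for x, y in stream:
        margin = dot(w, x)
        y_hat = 1 if margin >= 0 else -1
        if y_hat != y:
            mistakes += 1
        # A small |margin| means an uncertain prediction, so the label is
        # more likely to be requested.
        if rng.random() < b / (b + abs(margin)):
            queries += 1
            loss = max(0.0, 1.0 - y * margin)  # hinge loss
            norm_sq = dot(x, x)
            if loss > 0.0 and norm_sq > 0.0:
                tau = loss / norm_sq  # passive-aggressive step size
                w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w, queries, mistakes
```

In this sketch the model starts at zero, so every early margin is small and almost every label is queried; the query rule trusts the margin even though the model is not yet well trained, which is the weakness of purely margin-based strategies that the abstract points out.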
In sum, to tackle the critical challenges of existing OAL algorithms, this study proposes four main OAL algorithms, most of which are theoretically sound. All of the proposed algorithms are carefully evaluated on a large number of large-scale real-world applications and achieve promising results. Despite these results, the proposed OAL algorithms are far from perfect. Future work could explore several directions, such as OAL for concept drift, distributed OAL algorithms, and OAL for crowdsourcing.