Skip to Main content Skip to Navigation
Theses

Active Learning Methods for Interactive Exploration on Large Databases

Enhui Huang 1
1 CEDAR - Rich Data Analytics at Cloud Scale
LIX - Laboratoire d'informatique de l'École polytechnique [Palaiseau], Inria Saclay - Ile de France
Abstract : Faced with an increasing gap between fast growth of data and limited human ability to comprehend data, data analytics tools are now in high demand in many applications across a broad set of domains. In particular, for interactive data exploration systems, an "explore-by-example" framework, which aims to assist the user in performing highly effective data exploration while minimizing the human effort, is becoming increasingly popular. However, the state-of-the-art explore-by-example systems still require a large number of labeled examples to achieve the desired accuracy and cannot handle noisy labels. To address both the slow convergence problem and the label noise problem, in this thesis, we cast the explore-by-example problem in a principled "active learning" framework, and bring the properties of important classes of the user interest to bear on the design of new algorithms and optimizations for active learning-based data exploration. While recent work proposed a polytope-based data space model to filter examples returned by a traditional active learner and to compute an accuracy-based stopping criterion for active learning, our first contribution is to combine the polytope-based model and a traditional active learner into a new Dual-Space Model (\dsm), jointly offering the prediction and sample acquisition functionalities and thereby expediting model convergence. We also provide a set of techniques to improve the interactive performance such that the per-iteration time is maintained within 1 to 2 seconds. Evaluation results using real-world datasets and user interest patterns show that the accuracy of our system is on average 26\% higher than the state-of-the-art explore-by-example systems. Our second contribution is to overcome the slow convergence problem when data exploration is performed in high dimensions. We factorize the high-dimensional space into a set of low-dimensional spaces, build a \dsm\ in each subspace and combine them to be a factorized \dsm\ (\dsmf). We further prove that \dsmf\ achieves a better accuracy-based stopping criterion than \dsm. In case that the user may start exploration with more attributes than those needed in the final model, we propose an online feature selection method that adaptively selects the top-k relevant attributes. Experimental results using real-world workloads show that our system significantly outperforms the state-of-the-art explore-by-example systems in accuracy (19x$\sim$64x accuracy improvement) and convergence speed while maintaining the per-iteration time within 1 to 2 seconds and our online feature selection method improves accuracy from nearly 0 without feature selection, to above 80\%. %We formally define the class of queries that \dsmf\ supports, the decision function it utilizes, and prove that it achieves a better accuracy-based stopping criterion than \dsm. Last but not least, we address the label noise problem when the labels provided by the user are potentially corrupted. We enhance the robustness of our system by integrating advanced data cleansing methods and a refinement of the polytope-based model into \dsmf. The resulting algorithm, referred to as the Robust Dual-Space Model (\rdsmf), is compared to traditional active learning for evaluation since the state-of-the-art explore-by-example systems fail to deal with noisy labels. Experimental results using real-world datasets and user interest patterns show that our proposed algorithm substantially outperforms traditional active learning in accuracy (up to 22x accuracy improvement) while achieving reasonable efficiency for interactive data exploration (1 to 3 seconds per iteration).
Complete list of metadata

https://hal.inria.fr/tel-03339951
Contributor : Enhui Huang Connect in order to contact the contributor
Submitted on : Thursday, September 9, 2021 - 6:13:14 PM
Last modification on : Tuesday, October 12, 2021 - 3:52:12 AM

File

Enhui_thesis_final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-03339951, version 1

Citation

Enhui Huang. Active Learning Methods for Interactive Exploration on Large Databases. Computer Science [cs]. Institut Polytechnique de Paris, 2021. English. ⟨tel-03339951v1⟩

Share

Metrics

Record views

61

Files downloads

63