Conference papers

Efficiently identifying disguised nulls in heterogeneous text data

Théo Bouganim 1 Helena Galhardas 2 Ioana Manolescu 1
1 CEDAR - Rich Data Analytics at Cloud Scale
LIX - Laboratoire d'informatique de l'École polytechnique [Palaiseau], Inria Saclay - Ile de France
Abstract : Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain number of null values, denoting missing, unknown or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised nulls are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability or inapplicability of the information. This paper describes our ongoing work toward detecting disguised nulls in textual data encountered in ConnectionLens graphs. Driven by journalistic applications, we focus for now on large, semi-structured datasets, where most or all data values are free-form text. We show that the state-of-the-art methods for detecting nulls in relational databases, mostly tailored towards numerical data, do not detect disguised nulls efficiently on such data. Then, we present two alternative methods, based on (i) Information Extraction and (ii) text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets.
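
To make the notion of a disguised null concrete, the sketch below flags values that match a small list of placeholder strings. It is an illustration only and is not one of the methods described in the paper: the placeholder list and the names `PLACEHOLDERS`, `normalize` and `is_disguised_null` are assumptions introduced here.

```python
import re

# Hypothetical, hand-picked placeholder strings (an assumption, not from the paper).
PLACEHOLDERS = {"n/a", "na", "none", "unknown", "not applicable",
                "not available", "-", "--", "?", "tbd"}


def normalize(value: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", value).strip().lower()


def is_disguised_null(value: str) -> bool:
    """Return True if the normalized value is one of the known placeholders."""
    return normalize(value) in PLACEHOLDERS


values = ["Paris", "N/A", "  Unknown ", "--", "42"]
flags = [is_disguised_null(v) for v in values]
```

A fixed list like this only catches exact placeholders; a free-form phrasing such as "information not provided" would pass through unflagged, which is the kind of text the abstract is concerned with.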

https://hal.inria.fr/hal-03347947
Contributor : Théo Bouganim
Submitted on : Friday, September 17, 2021 - 4:04:42 PM
Last modification on : Sunday, September 19, 2021 - 3:24:43 AM

Identifiers

  • HAL Id : hal-03347947, version 1

Citation

Théo Bouganim, Helena Galhardas, Ioana Manolescu. Efficiently identifying disguised nulls in heterogeneous text data. BDA (Conférence sur la Gestion de Données – Principes, Technologies et Applications), Oct 2021, Paris, France. ⟨hal-03347947⟩
