Fingerprinting and Building Large Reproducible Datasets

Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large-scale mining tasks hinders the creation of new datasets. Moreover, data sources that change over time (e.g., code bases) and undocumented extraction processes make it difficult to reproduce datasets later. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach that facilitates the creation of large tailored datasets while ensuring their reproducibility. We leverage all the sources feeding the Software Heritage append-only archive, which are accessible through a unified programming interface, to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint that characterizes a dataset and that, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype, and we show how it can help reduce the limitations researchers face when creating or reproducing datasets.
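The abstract does not spell out what the fingerprint contains, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a fingerprint can be derived by hashing a canonical description of the extraction (an archive timestamp, a set of filters, and an extractor version), and it models an append-only archive as a plain list of records. All names (`DatasetSpec`, `fingerprint`, `extract`, `added_at`) are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical description of an extraction (not the paper's format)."""
    archive_timestamp: str   # only records archived up to this instant are considered
    filters: dict            # attribute -> required value
    extractor_version: str   # version of the (hypothetical) extraction tool


def fingerprint(spec: DatasetSpec) -> str:
    """Hash a canonical JSON form of the spec so equal specs give equal fingerprints."""
    canonical = json.dumps(asdict(spec), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def extract(archive: list, spec: DatasetSpec) -> list:
    """Select matching record ids from a toy append-only archive.

    Timestamps are ISO 8601 strings in one fixed UTC format, so plain
    string comparison orders them chronologically.
    """
    selected = [
        record["id"]
        for record in archive
        if record["added_at"] <= spec.archive_timestamp
        and all(record.get(key) == value for key, value in spec.filters.items())
    ]
    return sorted(selected)  # sort for a deterministic result


if __name__ == "__main__":
    archive = [
        {"id": "repo-a", "language": "Java", "added_at": "2023-01-10T00:00:00Z"},
        {"id": "repo-b", "language": "Python", "added_at": "2023-02-15T00:00:00Z"},
        {"id": "repo-c", "language": "Java", "added_at": "2023-03-20T00:00:00Z"},
    ]
    spec = DatasetSpec("2023-06-01T00:00:00Z", {"language": "Java"}, "0.1")
    print(fingerprint(spec))
    print(extract(archive, spec))
```

Because the toy archive only ever grows and the spec pins a timestamp, records appended after that timestamp cannot alter the result of re-running `extract` with the same spec. This is the property that makes an append-only source attractive for reproducible extraction.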

Keywords

dataset reproducibility empirical studies open science

Domains

Computer Science [cs]

Main file

Fingerprinting_and_Building_Large_Reproducible_Datasets.pdf (784.79 KB)

Origin: Files produced by the author(s)

Contributor: Romain Lefeuvre

https://hal.science/hal-04132604

Submitted on: Monday, June 19, 2023, 11:42:10

Last modified on: Thursday, February 1, 2024, 03:28:13

Long-term archiving on: Wednesday, September 20, 2023, 18:23:25

Dates and versions

hal-04132604, version 1 (19-06-2023)

Identifiers

HAL Id: hal-04132604, version 1
DOI: 10.5281/zenodo.7989955

Cite

Romain Lefeuvre, Jessie Galasso, Benoit Combemale, Houari Sahraoui, Stefano Zacchiroli. Fingerprinting and Building Large Reproducible Datasets. ACM REP '23: Proceedings of the 2023 ACM Conference on Reproducibility and Replicability, 2023, ⟨10.5281/zenodo.7989955⟩. ⟨hal-04132604⟩

Collections

INSTITUT-TELECOM UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA PARISTECH LIRMM CENTRALESUPELEC INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC MIPS UNIV-MONTPELLIER UNIV-RENNES LTCI INFRES ACES DIG IP_PARIS UR1-MATH-NUM
