Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code

Yiming Sun; Daniel M. German; Stefano Zacchiroli

doi:10.1007/s10664-023-10317-8

Journal article, Empirical Software Engineering, Year: 2023


Yiming Sun

Role: Author
PersonId : 1255672

University of Victoria [Canada]

Daniel M. German

Role: Author
PersonId : 982549

University of Victoria [Canada]

Stefano Zacchiroli

Role: Author
PersonId : 15184
IdHAL : stefano-zacchiroli
ORCID : 0000-0002-4576-136X
IdRef : 176946942

Autonomic and Critical Embedded Systems

Département Informatique et Réseaux

Abstract

We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to this problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are highly distinctive. Across PyPI's 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases. We validate the approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. The validation uses at most five trials, where each trial takes three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, ranks the results using a popularity index, and requires inspecting only the top result. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
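The trial procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`global_identifiers`, `build_index`, `find_origin`), the `frequent_threshold` cutoff for "non-frequent" identifiers, and the `popularity` mapping are all hypothetical, and the sketch approximates global identifiers as every class and function name found by Python's `ast` module, including nested ones.

```python
import ast
import random
from collections import defaultdict


def global_identifiers(source: str) -> set[str]:
    """Collect class and function/method names defined in a Python source string.

    Assumption: any ClassDef/FunctionDef counts, including nested definitions.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
    return names


def build_index(packages: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each identifier to the set of package names that define it."""
    index = defaultdict(set)
    for package, sources in packages.items():
        for source in sources:
            for name in global_identifiers(source):
                index[name].add(package)
    return dict(index)


def find_origin(files, index, popularity, k=3, max_trials=5,
                frequent_threshold=100, rng=None):
    """Run up to max_trials trials and return the top-ranked candidate, or None.

    Each trial picks a random file, samples k non-frequent identifiers from it,
    intersects the packages that define them, and ranks by popularity.
    frequent_threshold is a hypothetical cutoff, not a value from the paper.
    """
    rng = rng or random.Random()
    for _ in range(max_trials):
        source = rng.choice(files)
        # Keep identifiers known to the index but defined in few packages.
        rare = sorted(
            name for name in global_identifiers(source)
            if 0 < len(index.get(name, ())) <= frequent_threshold
        )
        if len(rare) < k:
            continue  # not enough usable identifiers in this file
        sample = rng.sample(rare, k)
        candidates = set.intersection(*(index[name] for name in sample))
        if candidates:
            # Only the most popular candidate is inspected.
            return max(candidates, key=lambda p: popularity.get(p, 0))
    return None
```

In this sketch the index is an in-memory dictionary; at the scale reported in the abstract (11.2 M identifiers across 244 K packages) a real system would presumably need a persistent store, which is outside the scope of this illustration.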

Keywords

software provenance; source code tracking; identifiers; open source software; python

Domains

Software engineering [cs.SE]


https://hal.science/hal-04101937

Submitted on: Sunday, May 21, 2023, 20:47:31

Last modified on: Wednesday, October 23, 2024, 10:30:04

Dates and versions

hal-04101937, version 1 (21-05-2023)

Identifiers

HAL Id: hal-04101937, version 1
arXiv: 2305.14837
DOI: 10.1007/s10664-023-10317-8

Cite

Yiming Sun, Daniel M. German, Stefano Zacchiroli. Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code. Empirical Software Engineering, in press, ⟨10.1007/s10664-023-10317-8⟩. ⟨hal-04101937⟩
