Efficient, robust and effective rank aggregation for massive biological datasets

Massive biological datasets are available in various sources. To answer a biological question (e.g., "which genes are involved in a given disease?"), life scientists query and mine such datasets using various techniques. Each technique provides a list of results, usually ranked by importance (e.g., a ranked list of genes). Combining the results obtained by various techniques, that is, combining ranked lists of elements into a single list, is of paramount importance to help life scientists make the most of these results and prioritize further investigations. Rank aggregation techniques are particularly well suited to this context, as they take a set of rankings as input and provide a consensus, that is, a single ranking that is the "closest" to the input rankings. However, (i) the problem of rank aggregation is NP-hard in most cases (using an exact algorithm is currently not possible for more than a few dozen elements) and (ii) several (possibly very different) exact solutions can be obtained. In answer to (i), many heuristics and approximation algorithms have been proposed. However, heuristics cannot guarantee how far from an exact solution the consensus ranking will be, and the approximation ratio of approximation algorithms dedicated to the problem is fairly high (not less than 3/2). No solution has yet been proposed to help real users deal with the problem raised in point (ii). In this paper, we present a complete system able to perform rank aggregation of massive biological datasets. Our solution is efficient, as it is based on an original partitioning method that makes it possible to compute a high-quality consensus using an exact algorithm in a large number of cases. Our solution is robust, as it clearly identifies for the user groups of elements whose relative order is the same in any optimal solution. These features provide answers to points (i) and (ii) and rest on mathematical foundations offering guarantees on the computed result. Also, our solution is effective, as it has been implemented in a real tool, ConquR-BioV2, which is used by the life science community, and evaluated at large scale using a very large number of datasets.
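
To make points (i) and (ii) concrete, here is a minimal Python sketch of exact rank aggregation on a toy input. It is not the partitioning method of the paper nor the ConquR-BioV2 implementation. It assumes rankings are strict permutations of the same elements (no ties) and uses the classical Kendall-tau distance as the notion of "closeness"; the function names (kendall_tau, kemeny_score, optimal_consensuses, stable_pairs) are illustrative only. The brute-force search enumerates every permutation, which is exactly why exact computation does not scale beyond a handful of elements without further ideas.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Count the element pairs that the two rankings order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
        for a, b in combinations(r1, 2)
    )


def kemeny_score(candidate, rankings):
    """Sum of the distances from the candidate to every input ranking."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def optimal_consensuses(rankings):
    """Return all minimum-score permutations and their score.

    Brute force over n! permutations: usable only for very small n.
    """
    elements = sorted(rankings[0])
    best, best_score = [], None
    for perm in permutations(elements):
        score = kemeny_score(perm, rankings)
        if best_score is None or score < best_score:
            best, best_score = [perm], score
        elif score == best_score:
            best.append(perm)
    return best, best_score


def stable_pairs(consensuses):
    """Pairs (a, b) such that a precedes b in every optimal consensus."""
    elements = consensuses[0]
    return {
        (a, b)
        for a in elements
        for b in elements
        if a != b and all(c.index(a) < c.index(b) for c in consensuses)
    }


# Toy input (hypothetical data): two techniques disagree only on A versus B.
rankings = [("A", "B", "C", "D"), ("B", "A", "C", "D")]
consensuses, score = optimal_consensuses(rankings)
stable = stable_pairs(consensuses)

print(score)        # minimum total distance
print(consensuses)  # every optimal consensus ranking
print(sorted(stable))
```

On this toy input there are two optimal consensus rankings, which differ only in the order of A and B, while every other pair keeps the same relative order in both. Reporting such stable pairs is one simple way to picture the robustness property described above; how the paper actually derives its groups of elements is not shown here.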

Keywords

Rank aggregation, Consensus ranking, Massive biological datasets, Kemeny rule

Domains

Computational Complexity [cs.CC], Bioinformatics [q-bio.QM], Data Structures and Algorithms [cs.DS]

Main file

FGCS2021_rankaggregation.pdf (1.16 MB)

Origin	Files produced by the author(s)

Contributor: Laurent Bulteau

https://hal.science/hal-03388443

Submitted on: Monday, December 13, 2021, 16:15:17

Last modified on: Monday, July 1, 2024, 15:00:35

Long-term archiving on: Monday, March 14, 2022, 19:21:29

Dates and versions

hal-03388443, version 1 (13-12-2021)

Identifiers

HAL Id: hal-03388443, version 1
DOI: 10.1016/j.future.2021.06.013

Cite

Pierre Andrieu, Bryan Brancotte, Laurent Bulteau, Sarah Cohen-Boulakia, Alain Denise, et al. Efficient, robust and effective rank aggregation for massive biological datasets. Future Generation Computer Systems, 2021, 124, pp.406-421. ⟨10.1016/j.future.2021.06.013⟩. ⟨hal-03388443⟩


Collections

PASTEUR CEA ENPC CNRS INRIA PARISTECH LIGM LIGM_MOA CENTRALESUPELEC CEA-UPSAY I2BC UNIV-PARIS-SACLAY JOLIOT CEA-DRF LISN GS-ENGINEERING GS-COMPUTER-SCIENCE GS-BIOSPHERA GS-LIFE-SCIENCES-HEALTH GS-HEALTH-DRUG-SCIENCES INSTITUT-SCIENCES-LUMIERE LISN-BIOINFO UNIV-EIFFEL LIGM_ADA FRANCE-GENOMIQUE BIOINFO_BIOSTAT_HUB
