Tabular and Deep Learning of Whittle Index

The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABP). In this paper we present QWI and QWINN, two algorithms capable of learning the Whittle indices for the total discounted criterion. The key feature is the use of two timescales: a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the true Whittle indices. We then present QWINN, an adaptation of the QWI algorithm that uses neural networks to compute the Q-values on the faster timescale; it is able to extrapolate information from one state to another and scales naturally to environments with large state spaces. Numerical computations show that QWI and QWINN converge much faster than the standard Q-learning algorithm, neural-network-based approximate Q-learning, and other state-of-the-art algorithms.
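The two-timescale idea described in the abstract can be illustrated with a minimal tabular sketch for a single arm. This is not the authors' implementation: the function name `qwi_sketch`, the step-size schedules, the uniform exploration, and the toy arm (`P_toy`, `R_toy`) are all assumptions chosen for illustration. The sketch keeps one Q-table per reference state x, treats the index estimate as a subsidy paid for the passive action, updates the Q-values with a larger step size, and nudges each index estimate toward the point where the active and passive actions are equally attractive in x, using a smaller step size.

```python
import numpy as np


def qwi_sketch(P, R, gamma=0.9, n_steps=20000, seed=0):
    """Two-timescale tabular sketch for one arm (illustrative only).

    P[a, s, s2]: transition probabilities, R[a, s]: rewards,
    with a = 0 (passive) and a = 1 (active).
    Returns one index estimate per state.
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[1]
    idx = np.arange(n_states)
    # Q[x, s, a]: one Q-table per reference state x.
    Q = np.zeros((n_states, n_states, 2))
    # lam[x]: current index estimate (passive subsidy) for reference state x.
    lam = np.zeros(n_states)
    s = 0
    for n in range(1, n_steps + 1):
        # Assumption: uniform random exploration over the two actions.
        a = int(rng.integers(2))
        s_next = int(rng.choice(n_states, p=P[a, s]))
        # Assumed step sizes; beta / alpha -> 0, so lam moves on the
        # slower timescale.
        alpha = n ** -0.6
        beta = 1.0 / (1.0 + n)
        # The passive action earns the subsidy lam[x], for every x at once.
        reward = R[a, s] + (1 - a) * lam
        target = reward + gamma * Q[:, s_next, :].max(axis=1)
        # Fast timescale: Q-learning update of every reference table.
        Q[:, s, a] += alpha * (target - Q[:, s, a])
        # Slow timescale: move lam[x] toward Q[x, x, 1] == Q[x, x, 0].
        lam += beta * (Q[idx, idx, 1] - Q[idx, idx, 0])
        s = s_next
    return lam


# Hypothetical 3-state arm: passive drifts upward, active resets to state 0.
P_toy = np.array([
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # passive
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # active
])
R_toy = np.array([
    [0.0, 0.0, 0.0],  # passive rewards
    [0.0, 1.0, 2.0],  # active rewards
])
print(qwi_sketch(P_toy, R_toy))
```

The only structural requirement the sketch tries to respect is that the index step size decays faster than the Q-value step size; the specific exponents above are arbitrary, and a different schedule may be needed in practice.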

Keywords

Machine Learning; Reinforcement Learning; Whittle Index; Markov Decision Problem; Multi-armed Restless Bandit

Domains

Machine Learning [stat.ML]; Artificial Intelligence [cs.AI]

Main file

Tabular and Deep Learning of Whittle Index.pdf (6.36 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-03767324

Submitted on: Monday, September 5, 2022, 11:57:15

Last modified on: Tuesday, April 30, 2024, 13:52:41

Long-term archiving on: Tuesday, December 6, 2022, 18:03:57

Dates and versions

hal-03767324, version 1 (05-09-2022)

License

Attribution

Identifiers

HAL Id: hal-03767324, version 1

Cite

Francisco Robledo, Vivek S Borkar, Urtzi Ayesta, Konstantin Avrachenkov. Tabular and Deep Learning of Whittle Index. EWRL 2022 - 15th European Workshop of Reinforcement Learning, Sep 2022, Milan, Italy. ⟨hal-03767324⟩

Collections

UNIV-TLSE2 CNRS INRIA UNIV-PAU LMA-PAU INSMI UT1-CAPITOLE INRIA2 UNIV-COTEDAZUR IRIT IRIT-RMESS IRIT-ASR TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP
