Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions

In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward RL. Our key contribution is a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy-gradient estimation without relying on value-function approximation in this setting. This contrasts with other common policy-gradient algorithms such as actor–critic methods. We first show that a policy-gradient method with SAGE converges locally, including in cases where the objective function is nonconvex and has multiple maximizers, and where the state space of the MDP is not finite. Under appropriate assumptions, such as starting sufficiently close to a maximizer, the policy under stochastic gradient ascent (SGA) with SAGE has an overwhelming probability of converging to the associated optimal policy. Other key assumptions are that a local Lyapunov function exists and that a nondegeneracy property of the Hessian of the objective function holds locally around a maximizer. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor–critic method. We focus on several examples inspired by stochastic networks, queueing systems, and models derived from statistical physics, where parametrizable exponential families are commonplace. Our results demonstrate that a SAGE-based method finds close-to-optimal policies faster than an actor–critic method.
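To make the idea of a value-function-free gradient estimate concrete, the following toy sketch uses a standard exponential-family identity: if the stationary distribution has the form p(x | θ) ∝ ρ(θ)^x, then the score is (x − E[X]) · d log ρ(θ)/dθ, so the gradient of the expected reward E[r(X)] equals d log ρ(θ)/dθ · Cov(X, r(X)). Everything specific here is an assumption made for illustration and is not taken from the paper: the geometric distribution, the sigmoid parametrization of ρ, the reward r(x) = −x, the i.i.d. sampling (a real MDP would supply samples from a trajectory), and the function names.

```python
import math
import random


def sample_geometric(rho, rng):
    """Draw X with P(X = k) = (1 - rho) * rho**k for k = 0, 1, 2, ..."""
    u = 1.0 - rng.random()  # uniform on (0, 1], which avoids log(0)
    return int(math.log(u) / math.log(rho))


def covariance_gradient_estimate(theta, reward, n_samples=200_000, seed=0):
    """Estimate d/dtheta E[reward(X)] as dlog(rho)/dtheta * Cov(X, reward(X)).

    Hypothetical toy setting: rho(theta) = sigmoid(theta) and X is sampled
    i.i.d. from the geometric distribution with parameter rho(theta).
    """
    rng = random.Random(seed)
    rho = 1.0 / (1.0 + math.exp(-theta))
    dlog_rho = 1.0 - rho  # derivative of log(sigmoid(theta))

    xs = [sample_geometric(rho, rng) for _ in range(n_samples)]
    rs = [reward(x) for x in xs]
    mean_x = sum(xs) / n_samples
    mean_r = sum(rs) / n_samples

    # Unbiased sample covariance between the sufficient statistic and reward.
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    cov /= n_samples - 1
    return dlog_rho * cov


def exact_gradient(theta):
    """Closed form for reward(x) = -x: E[-X] = -rho / (1 - rho)."""
    rho = 1.0 / (1.0 + math.exp(-theta))
    return -rho / (1.0 - rho)


if __name__ == "__main__":
    estimate = covariance_gradient_estimate(0.0, lambda x: -x)
    print("estimate:", estimate, "exact:", exact_gradient(0.0))
```

The estimate needs only samples of the state and the reward, with no critic or value-function fit. This is the property the abstract attributes to SAGEs, shown here in a much simpler setting than the one the paper treats.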

Keywords

reinforcement learning; policy-gradient method; exponential families; product-form stationary distribution; stochastic approximation

Domains

Machine Learning [cs.LG]; Performance and Reliability [cs.PF]; Optimization and Control [math.OC]; Probability [math.PR]

Main file

paper.pdf (6.85 MB)

Origin: Files produced by the author(s)

Contributor: Céline Comte

https://hal.science/hal-04329790

Submitted on: Friday, June 14, 2024, 17:12:40

Last modified on: Wednesday, June 19, 2024, 03:21:51

Dates and versions

hal-04329790, version 1 (7 December 2023)

hal-04329790, version 2 (14 June 2024)

Identifiers

HAL Id: hal-04329790, version 2
arXiv: 2312.02804

Cite

Céline Comte, Matthieu Jonckheere, Jaron Sanders, Albert Senen-Cerda. Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions. 2024. ⟨hal-04329790v2⟩

Collections

UNIV-TLSE2 CNRS INSA-TOULOUSE LAAS LAAS-SARA UT1-CAPITOLE LAAS-RESEAUX-ET-COMMUNICATIONS TDS-MACS INSA-GROUPE LAAS-RISC IRIT IRIT-RMESS IRIT-ASR TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP
