Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions

Stochastic networks and queueing systems often lead to Markov decision processes (MDPs) with large state and action spaces as well as nonconvex objective functions, which hinders the convergence of many reinforcement learning (RL) algorithms. Policy-gradient methods perform well on MDPs with large state and action spaces, but they sometimes converge slowly because of the high variance of the gradient estimator. In this paper, we show that some of these difficulties can be circumvented by exploiting the structure of the underlying MDP. We first introduce a new family of gradient estimators called score-aware gradient estimators (SAGEs). When the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters, SAGEs allow us to estimate the policy gradient without relying on value-function estimation, contrary to classical policy-gradient methods like actor-critic. To demonstrate their applicability, we examine two common control problems arising in stochastic networks and queueing systems whose stationary distributions have a product form, a special case of exponential families. As a second contribution, we show that, under appropriate assumptions, the policy under a SAGE-based policy-gradient method has a large probability of converging to an optimal policy, provided that it starts sufficiently close to it, even with a nonconvex objective function and multiple maximizers. Our key assumptions are that, locally around a maximizer, a nondegeneracy property of the Hessian of the objective function holds and a Lyapunov function exists. Finally, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic algorithm. The results show that the SAGE-based method finds close-to-optimal policies more rapidly than the actor-critic algorithm.
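As a toy illustration of why an exponential-family stationary distribution can remove the need for value-function estimation, recall the standard identity for a family p_θ(x) ∝ h(x) exp(η(θ) T(x)): when the reward r does not depend on θ, the gradient of E[r(X)] equals η'(θ) Cov(T(X), r(X)). The sketch below checks this identity numerically on a geometric distribution, which is the stationary queue-length distribution of an M/M/1 queue with load ρ. It is not the estimator defined in the paper: the sigmoid parametrization of ρ, the reward r(x) = -x, and all function names are assumptions made for illustration only.

```python
import math
import random


def sample_geometric(rho, rng):
    """Draw X with P(X = x) = (1 - rho) * rho**x for x = 0, 1, 2, ..."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log(rho))  # inverse-transform sampling


def covariance_gradient_estimate(theta, reward, n_samples, rng):
    """Estimate d/dtheta E[reward(X)] as eta'(theta) * Cov(X, reward(X)).

    Assumed (hypothetical) parametrization: rho = sigmoid(theta), so the
    natural parameter is eta(theta) = log(rho) and T(x) = x.
    """
    rho = 1.0 / (1.0 + math.exp(-theta))
    d_eta = 1.0 - rho  # derivative of log(sigmoid(theta))
    xs = [sample_geometric(rho, rng) for _ in range(n_samples)]
    rs = [reward(x) for x in xs]
    mean_x = sum(xs) / n_samples
    mean_r = sum(rs) / n_samples
    # Unbiased sample covariance between the sufficient statistic and the reward
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs)) / (n_samples - 1)
    return d_eta * cov


def exact_gradient(theta):
    """For reward(x) = -x: E[reward(X)] = -rho / (1 - rho) = -exp(theta)."""
    return -math.exp(theta)


rng = random.Random(0)  # fixed seed for reproducibility
theta = -0.5
estimate = covariance_gradient_estimate(theta, lambda x: -x, 200_000, rng)
exact = exact_gradient(theta)
print(f"covariance-based estimate: {estimate:.4f}, exact gradient: {exact:.4f}")
```

The estimate uses only samples of the state and the reward, with no value function or critic, which is the structural advantage the abstract attributes to score-aware estimators in the exponential-family setting.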

Keywords

reinforcement learning; policy-gradient method; exponential families; product-form stationary distribution; stochastic approximation

Domains

Machine Learning [cs.LG]; Performance [cs.PF]; Optimization and Control [math.OC]; Probability [math.PR]

Main file

paper.pdf (1.8 MB)

Origin: Files produced by the author(s)

Contributor: Céline Comte

https://hal.science/hal-04329790

Submitted on: Thursday, December 7, 2023, 16:40:33

Last modified on: Wednesday, June 19, 2024, 03:21:51

Dates and versions

hal-04329790, version 1 (December 7, 2023)

hal-04329790, version 2 (June 14, 2024)

Identifiers

HAL Id: hal-04329790, version 1
arXiv: 2312.02804

Cite

Céline Comte, Matthieu Jonckheere, Jaron Sanders, Albert Senen-Cerda. Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems. 2023. ⟨hal-04329790v1⟩

