When to checkpoint at the end of a fixed-length reservation? - Joint Laboratory on Extreme Scale Computing
Conference paper, Year: 2023

When to checkpoint at the end of a fixed-length reservation?

Abstract

This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a random variable that obeys some well-known probability distribution. The question is when to take a checkpoint towards the end of the execution so that the expected amount of work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not, to the best of our knowledge, been considered yet. We provide the optimal solution for a variety of probability distributions modeling checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute, at the beginning of the execution, the optimal number of tasks to execute before the application checkpoints. Then, we design a dynamic strategy that decides at the end of each task whether to checkpoint or to continue executing. We instantiate this second scenario with several examples of probability distributions for task durations.
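
To make the first scenario concrete, here is a minimal sketch under assumptions that are not stated in the abstract: the reservation has length R, the checkpoint is started x time units before the end, no earlier checkpoint exists, and all work is lost if the checkpoint does not finish in time. Under these assumptions the expected saved work is (R - x) * F(x), where F is the CDF of the checkpoint duration. The names `expected_work`, `best_margin`, `uniform_cdf`, and `exponential_cdf` are illustrative only, and the grid search stands in for the closed-form analysis of the paper.

```python
import math
from typing import Callable, Tuple

Cdf = Callable[[float], float]


def expected_work(x: float, R: float, cdf: Cdf) -> float:
    """Expected saved work if the checkpoint starts at time R - x.

    Assumption: the work done up to R - x is saved only when the
    checkpoint duration C satisfies C <= x; otherwise nothing is saved.
    """
    return (R - x) * cdf(x)


def best_margin(R: float, cdf: Cdf, steps: int = 100_000) -> Tuple[float, float]:
    """Grid search over x in [0, R] for the margin maximizing expected work."""
    best_x, best_val = 0.0, 0.0
    for i in range(steps + 1):
        x = R * i / steps
        val = expected_work(x, R, cdf)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


def uniform_cdf(a: float, b: float) -> Cdf:
    """CDF of a checkpoint duration uniformly distributed on [a, b]."""
    def cdf(x: float) -> float:
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)
    return cdf


def exponential_cdf(rate: float) -> Cdf:
    """CDF of an exponentially distributed checkpoint duration."""
    def cdf(x: float) -> float:
        return 1.0 - math.exp(-rate * x) if x > 0 else 0.0
    return cdf


if __name__ == "__main__":
    R = 10.0
    for name, cdf in [("uniform[1, 8]", uniform_cdf(1.0, 8.0)),
                      ("exponential(rate=1)", exponential_cdf(1.0))]:
        x, w = best_margin(R, cdf)
        print(f"{name}: start checkpoint at t = {R - x:.3f}, expected work = {w:.3f}")
```

For a uniform distribution on [a, b], setting the derivative of (R - x)(x - a)/(b - a) to zero gives x = (R + a)/2, capped at b, which the grid search reproduces. The trade-off is visible in the formula: a larger margin x raises the probability that the checkpoint completes but reduces the amount of work that is saved.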
Main file
ftxs2023_HAL.pdf (1.4 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04215554, version 1 (22-09-2023)

Identifiers

  • HAL Id: hal-04215554, version 1

Cite

Quentin Barbut, Anne Benoit, Thomas Herault, Yves Robert, Frédéric Vivien. When to checkpoint at the end of a fixed-length reservation? Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop, Nov 2023, Denver, United States. ⟨hal-04215554⟩