Truly No-Regret Learning in Constrained MDPs

Authors: Adrian Müller, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, Niao He

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, we provide numerical evaluations of our algorithm in simple environments. We perform numerical simulations of our algorithms and compare them to their unregularized counterparts (Efroni et al., 2020).
Researcher Affiliation | Academia | 1 EPFL, 2 ETH Zurich, 3 University of Zurich
Pseudocode | Yes | Algorithm 1: Regularized Primal-Dual Algorithm with Optimistic Exploration
Open Source Code | Yes | We provide the code in the supplementary material.
Open Datasets | No | We consider a randomly generated CMDP with deterministic rewards and unknown transitions.
Dataset Splits | No | No specific dataset splits (training, validation, test) were mentioned, as the environment is randomly generated for simulation and interaction.
Hardware Specification | Yes | All simulations were performed on a MacBook Pro 2.8 GHz Quad-Core Intel Core i7.
Software Dependencies | No | No specific software names with version numbers were mentioned.
Experiment Setup | Yes | For the vanilla algorithms, we run for K = 4000 episodes for each step size η ∈ {0.05, 0.075, 0.1, 0.125, 0.15, 0.2}, which we observed to be a reasonable range across CMDPs when fixing the number of episodes. Similarly, for the regularized algorithms, we perform the same parameter search across all pairs of step size η ∈ {0.05, 0.1, 0.2} and regularization parameter τ ∈ {0.01, 0.02}, totaling six hyperparameter configurations as well. We always set λmax = 6, which did not play a role in our simulations as long as it was chosen sufficiently large. We use exploration bonuses 0.08 · n_h(s, a)^{-1/2}.
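
The Pseudocode row refers to a regularized primal-dual scheme. The following is a minimal sketch of the dual-variable update such a scheme typically uses, assuming a tabular CMDP with a single constraint V_c(π) ≥ b and a quadratic regularizer −(τ/2)λ² added to the Lagrangian; the function name and exact update form are illustrative, not the paper's Algorithm 1.

```python
import numpy as np

def regularized_dual_step(lam, v_c, b, eta, tau, lam_max):
    """One projected gradient step on the dual variable lambda.

    Assumes the regularized Lagrangian
        L(pi, lambda) = V_r(pi) + lambda * (V_c(pi) - b) - (tau / 2) * lambda**2,
    so the lambda-gradient is (V_c(pi) - b) - tau * lambda. The dual is
    minimized over lambda, hence the descent step, followed by projection
    onto [0, lam_max].
    """
    grad = (v_c - b) - tau * lam
    lam = lam - eta * grad
    return float(np.clip(lam, 0.0, lam_max))

# Example: a constraint violation (v_c < b) pushes lambda up, capped at lam_max.
lam = 0.0
for _ in range(10):
    lam = regularized_dual_step(lam, v_c=0.3, b=0.5, eta=0.1, tau=0.01, lam_max=6.0)
print(lam)
```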
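
The Experiment Setup row specifies the hyperparameter grids, the dual-variable cap, and the bonus scale. The sketch below organizes those values in code; the constants are taken from the reported setup, while the variable and function names are illustrative.

```python
import itertools
import numpy as np

K = 4000            # episodes per configuration
LAMBDA_MAX = 6.0    # cap on the dual variable (lambda_max)

# Vanilla (unregularized) primal-dual: six step sizes, no regularization.
vanilla_grid = [(eta, 0.0) for eta in (0.05, 0.075, 0.1, 0.125, 0.15, 0.2)]

# Regularized primal-dual: 3 step sizes x 2 regularization strengths = 6 configs.
regularized_grid = list(itertools.product((0.05, 0.1, 0.2), (0.01, 0.02)))

def exploration_bonus(visit_counts):
    """Optimistic bonus 0.08 * n_h(s, a)^(-1/2); unvisited counts are floored at 1."""
    return 0.08 / np.sqrt(np.maximum(visit_counts, 1))
```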