Truly No-Regret Learning in Constrained MDPs
Authors: Adrian Müller, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, Niao He
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we provide numerical evaluations of our algorithm in simple environments. We perform numerical simulations of our algorithms and compare them to their unregularized counterparts (Efroni et al., 2020). |
| Researcher Affiliation | Academia | ¹EPFL, ²ETH Zurich, ³University of Zurich |
| Pseudocode | Yes | Algorithm 1 Regularized Primal-Dual Algorithm with Optimistic Exploration |
| Open Source Code | Yes | We provide the code in the supplementary material. |
| Open Datasets | No | We consider a randomly generated CMDP with deterministic rewards and unknown transitions. |
| Dataset Splits | No | No specific dataset splits (training, validation, test) were mentioned, as the environment is randomly generated for simulation and interaction. |
| Hardware Specification | Yes | All simulations were performed on a MacBook Pro 2.8 GHz Quad-Core Intel Core i7. |
| Software Dependencies | No | No specific software names with version numbers were mentioned. |
| Experiment Setup | Yes | For the vanilla algorithms, we run for K = 4000 episodes for each step size η ∈ {0.05, 0.075, 0.1, 0.125, 0.15, 0.2}, which we observed to be a reasonable range across CMDPs when fixing the number of episodes. Similarly, for the regularized algorithms, we perform the same parameter search across all pairs of step size η ∈ {0.05, 0.1, 0.2} and regularization parameter τ ∈ {0.01, 0.02}, totaling six hyperparameter configurations as well. We always set λ_max = 6, which did not play a role in our simulations as long as it was chosen sufficiently large. We use exploration bonuses 0.08 · n_h(s, a)^{-1/2}. |
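
The grid search described in the experiment setup is small enough to express as a short sweep script. The following is a minimal sketch, not the authors' released code: `run_episodes` is a hypothetical callable standing in for a single training run of Algorithm 1 (or its unregularized counterpart); only the grids, the episode count K = 4000, λ_max = 6, and the bonus scaling 0.08 · n_h(s, a)^{-1/2} are taken from the quoted setup.

```python
import itertools
import math

# Values quoted from the experiment setup above.
K_EPISODES = 4000
VANILLA_ETAS = [0.05, 0.075, 0.1, 0.125, 0.15, 0.2]   # 6 vanilla configurations
REG_ETAS = [0.05, 0.1, 0.2]
REG_TAUS = [0.01, 0.02]                               # 3 x 2 = 6 regularized configurations
LAMBDA_MAX = 6.0                                      # cap on the dual variable


def exploration_bonus(n_visits: int, scale: float = 0.08) -> float:
    """Optimistic bonus of the form scale * n_h(s, a)^(-1/2)."""
    return scale / math.sqrt(max(n_visits, 1))


def sweep(run_episodes):
    """Run all 6 + 6 hyperparameter configurations.

    `run_episodes(eta, tau, k, lambda_max)` is a hypothetical stand-in for one
    training run of the (regularized) primal-dual algorithm; tau = 0.0 denotes
    the unregularized baseline.
    """
    results = {}
    for eta in VANILLA_ETAS:
        results[("vanilla", eta, 0.0)] = run_episodes(eta, 0.0, K_EPISODES, LAMBDA_MAX)
    for eta, tau in itertools.product(REG_ETAS, REG_TAUS):
        results[("regularized", eta, tau)] = run_episodes(eta, tau, K_EPISODES, LAMBDA_MAX)
    return results
```

Under these assumptions, a full sweep amounts to twelve calls to `run_episodes`, matching the six vanilla and six regularized configurations described above.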