Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning
Authors: Alessandro Montenegro, Marco Mussi, Matteo Papini, Alberto Maria Metelli
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines, demonstrating their effectiveness. |
| Researcher Affiliation | Academia | Alessandro Montenegro Politecnico di Milano, Milan, Italy alessandro.montenegro@polimi.it Marco Mussi Politecnico di Milano, Milan, Italy marco.mussi@polimi.it Matteo Papini Politecnico di Milano, Milan, Italy matteo.papini@polimi.it Alberto Maria Metelli Politecnico di Milano, Milan, Italy albertomaria.metelli@polimi.it |
| Pseudocode | Yes | Algorithms. Both algorithms, whose pseudo-codes are deferred to Appendix A, aim at solving the RCOP of Equation (11), finding the best feasible (hyper)policy parameterization. |
| Open Source Code | Yes | The code to run the experiments in this paper is available at https://github.com/MontenegroAlessandro/MagicRL. |
| Open Datasets | Yes | In our experiments, we consider a Cost LQR environment whose main characteristics are reported in Table 5. ... For our experiments on risk minimization, we utilized environments from the MuJoCo control suite (Todorov et al., 2012)... |
| Dataset Splits | No | The paper describes batch sizes (N) for collecting trajectories during learning, and for NPG-PD2 and RPG-PD2, it mentions 'N1 = 500 were used for the inner critic-loop, while N2 = 100 for performance and cost estimations.' However, it does not specify explicit training/validation/test dataset splits in the conventional sense, as is common in supervised learning. |
| Hardware Specification | Yes | All the experiments were run on a 2019 16-inch MacBook Pro. The machine was equipped as follows: CPU Intel Core i7 (6 cores, 2.6 GHz), 16 GB 2667 MHz DDR4 RAM, GPU Intel UHD Graphics 630 (1536 MB). |
| Software Dependencies | No | The paper mentions the use of the 'Adam (Kingma and Ba, 2015) scheduler' and the 'MuJoCo control suite (Todorov et al., 2012)' but does not provide specific version numbers for these or other software dependencies such as programming languages or machine learning frameworks. |
| Experiment Setup | Yes | In particular, for both C-PGAE and NPG-PD, we employed ζθ = 0.01 and ζλ = 0.1, while for RPG-PD we selected ζθ = 0.01 and ζλ = 0.01. For C-PGAE and RPG-PD we used a regularization constant ω = 10⁻⁴. All the details about the experimental setting are summarized in Table 6. |
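The hyperparameters quoted above can be gathered into a single configuration sketch. Only the numeric values (ζθ, ζλ, ω) come from the paper; the dictionary keys, structure, and helper function are illustrative assumptions, not the authors' actual configuration format:

```python
# Hedged sketch of the hyperparameter values reported in the paper
# (Table 6 excerpt). Key names and layout are assumptions for illustration.
HYPERPARAMS = {
    "C-PGAE": {"zeta_theta": 0.01, "zeta_lambda": 0.1, "omega": 1e-4},
    "NPG-PD": {"zeta_theta": 0.01, "zeta_lambda": 0.1},
    "RPG-PD": {"zeta_theta": 0.01, "zeta_lambda": 0.01, "omega": 1e-4},
}


def step_sizes(algorithm: str) -> tuple[float, float]:
    """Return (primal step size, dual step size) for a named algorithm."""
    cfg = HYPERPARAMS[algorithm]
    return cfg["zeta_theta"], cfg["zeta_lambda"]


print(step_sizes("RPG-PD"))  # -> (0.01, 0.01)
```

Such a table makes it easy to check that the regularized variants (C-PGAE, RPG-PD) are the only ones carrying the ω = 10⁻⁴ regularization constant.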