ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints
Authors: Akhil Agnihotri, Rahul Jain, Haipeng Luo
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging OpenAI Gym environments, show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs. |
| Researcher Affiliation | Collaboration | 1University of Southern California, Los Angeles, CA, USA. RJ is also affiliated with Google DeepMind. |
| Pseudocode | Yes | Algorithm 1 Average-Constrained Policy Optimization (ACPO) |
| Open Source Code | No | Code of the ACPO implementation will be made available on GitHub. |
| Open Datasets | Yes | We work with the OpenAI Gym environments to train the various learning agents on the following tasks: Gather, Circle, Grid, and Bottleneck (see Figure 3 in Appendix A.6.1 for more details on the environments). For our experimental evaluation, we use several OpenAI Gym environments from Todorov et al. (2012). |
| Dataset Splits | No | The paper describes training steps and evaluation trajectories but does not explicitly provide percentages or counts for a separate validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using implementations from specific GitHub repositories and OpenAI Gym environments but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 1. Hyperparameter Setup includes: No. of hidden layers, Activation, Initial log std, Batch size, GAE parameter (reward), GAE parameter (cost), Trust region step size δ, Learning rate for policy, Learning rate for reward critic net, Learning rate for cost critic net, Backtracking coeff., Max backtracking iterations, Max conjugate gradient iterations, Recovery regime parameter t. Also, Section 5.1 details neural network sizes. |
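
For reference, below is a minimal sketch of how the hyperparameters enumerated in the paper's Table 1 might be collected into a single configuration object for an ACPO-style implementation. The dictionary keys are hypothetical names chosen here for illustration, and the values are placeholder defaults, not the settings actually reported by the authors.

```python
# Hypothetical hyperparameter configuration mirroring the fields of the
# paper's Table 1. All values below are illustrative placeholders, NOT the
# settings reported in the paper; substitute the published values to reproduce.
acpo_config = {
    "hidden_layer_sizes": (64, 64),   # No. of hidden layers / their widths
    "activation": "tanh",             # Activation function
    "initial_log_std": -0.5,          # Initial log std of the Gaussian policy
    "batch_size": 2048,               # Samples collected per policy update
    "gae_lambda_reward": 0.95,        # GAE parameter (reward)
    "gae_lambda_cost": 0.95,          # GAE parameter (cost)
    "trust_region_delta": 0.01,       # Trust region step size δ
    "lr_policy": 3e-4,                # Learning rate for policy
    "lr_reward_critic": 1e-3,         # Learning rate for reward critic net
    "lr_cost_critic": 1e-3,           # Learning rate for cost critic net
    "backtracking_coeff": 0.8,        # Backtracking coefficient for line search
    "max_backtracking_iters": 10,     # Max backtracking iterations
    "max_cg_iters": 10,               # Max conjugate gradient iterations
    "recovery_t": 0.75,               # Recovery regime parameter t
}
```

A reproduction would pass such a configuration to the training loop alongside the environment-specific settings (network sizes per Section 5.1, training steps, and evaluation trajectories) described in the paper.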