Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning
Authors: Ruida Zhou, Tao Liu, Dileep Kalathil, P. R. Kumar, Chao Tian
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact-gradient and sample-based scenarios. In addition to the theoretical advantages, we provide the results of extensive experimentation in Section 5 and Appendices A and B, which demonstrate that the ARNPG-guided algorithms provide superior performance in exact gradient and sample-based tabular scenarios, as well as actor-critic deep RL scenarios, compared to several existing policy gradient-based approaches. |
| Researcher Affiliation | Academia | Ruida Zhou (Texas A&M University); Tao Liu (Texas A&M University); Dileep Kalathil (Texas A&M University); P. R. Kumar (Texas A&M University); Chao Tian (Texas A&M University) |
| Pseudocode | Yes | Algorithm 1: Inner Loop(r_k, π̄_k, α, η, t_k); Algorithm 2: ARNPG Implicit Mirror Descent (ARNPG-IMD); Algorithm 3: ARNPG with Extra Primal Dual (ARNPG-EPD); Algorithm 4: ARNPG with Optimistic Mirror Descent Ascent (ARNPG-OMDA). (A generic sketch of this kind of anchor-regularized inner-loop update appears below the table.) |
| Open Source Code | Yes | We provide code at https://github.com/tliu1997/ARNPG-MORL. |
| Open Datasets | Yes | To demonstrate the efficacy of ARNPG-EPD on complex tasks, we have conducted experiments on the Acrobot-v1 environment from OpenAI Gym [9]. (A minimal Gym usage sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly specify train/validation/test splits, percentages, or sample counts. While it refers to 'randomly generated CMDP' and 'Acrobot-v1 environment', and mentions following settings from another paper, it does not detail the data splitting methodology within its own text. |
| Hardware Specification | Yes | For the sample-based tabular CMDP and Acrobot-v1 experiments, we used a single NVIDIA GeForce RTX 3090 GPU for each run. |
| Software Dependencies | No | The paper mentions environments like 'OpenAI Gym' and implies the use of a deep learning framework given 'actor-critic deep RL scenarios', but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, Gym). |
| Experiment Setup | Yes | Experimental details on CMDP are postponed to Appendix A, and further experiments on smooth concave scalarization and max-min trade-off are presented in Appendix B. For all algorithms, we choose the learning rate α = 0.001. For ARNPG-EPD, we set the inner loop iterations t_k = 5. For both problems, we set the discount factor γ = 0.99. (These values are collected in the config sketch after the table.) |
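
The four algorithms listed in the Pseudocode row share a common template: an inner loop of natural policy gradient steps regularized by a KL divergence toward an anchor policy that the outer loop periodically changes. Below is a minimal sketch of one such KL-regularized NPG step for a tabular softmax policy. It is an illustration of the general anchor-regularized update, not the authors' implementation: the multiplicative closed form follows standard regularized-NPG analyses, discount-dependent constants are absorbed into the step size `eta`, and all names (`arnpg_inner_step`, `q`, `tau`) are hypothetical.

```python
import numpy as np

def arnpg_inner_step(pi, anchor, q, eta, tau):
    """One KL-regularized natural policy gradient step for a tabular
    softmax policy: improve along the Q-values while staying close
    (in KL divergence) to the current anchor policy.

    pi, anchor : (n_states, n_actions) row-stochastic arrays
    q          : (n_states, n_actions) Q-value estimates
    eta        : NPG step size (assumes eta * tau <= 1)
    tau        : weight of the KL pull toward the anchor
    """
    log_new = (
        (1.0 - eta * tau) * np.log(pi + 1e-12)  # inertia from current policy
        + eta * tau * np.log(anchor + 1e-12)    # pull toward the anchor
        + eta * q                               # improvement direction
    )
    log_new -= log_new.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(log_new)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy usage: a 4-state, 3-action problem with placeholder Q-values.
rng = np.random.default_rng(0)
pi = np.full((4, 3), 1.0 / 3.0)
anchor = pi.copy()
for _ in range(5):                              # t_k inner-loop steps
    q = rng.normal(size=(4, 3))                 # stand-in for policy evaluation
    pi = arnpg_inner_step(pi, anchor, q, eta=0.1, tau=1.0)
```

In the full framework, the outer loop changes the anchor across iterations (hence "anchor-changing") while the inner loop runs a few regularized steps like the one above.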
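
For the Open Datasets row: Acrobot-v1 ships with OpenAI Gym, so there is no dataset to download. A minimal interaction sketch using the classic Gym API (older `gym` releases return four values from `step`; the newer Gymnasium API differs slightly):

```python
import gym

env = gym.make("Acrobot-v1")
obs = env.reset()                       # 6-dim observation (link angles and velocities)
for _ in range(500):
    action = env.action_space.sample()  # Discrete(3) torque choices
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```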
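
Finally, the hyperparameters quoted in the Experiment Setup row, gathered in one place. The three values come from the paper; the dictionary keys are illustrative and not taken from the released code:

```python
# Reported hyperparameters (key names are illustrative).
hparams = {
    "learning_rate_alpha": 1e-3,  # α = 0.001, all algorithms
    "inner_loop_iters_tk": 5,     # t_k = 5, ARNPG-EPD inner loop
    "discount_gamma": 0.99,       # γ = 0.99, both problems
}
```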