Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

Authors: Ruida Zhou, Tao Liu, Dileep Kalathil, P. R. Kumar, Chao Tian

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact gradients and sample-based scenarios. In addition to the theoretical advantages, we provide the results of extensive experimentation in Section 5 and Appendices A and B which demonstrate that the ARNPG-guided algorithms provide superior performance in exact gradient and sample-based tabular scenarios, as well as actor-critic deep RL scenarios, compared to several existing policy gradient-based approaches.
Researcher Affiliation | Academia | Ruida Zhou, Texas A&M University; Tao Liu, Texas A&M University; Dileep Kalathil, Texas A&M University; P. R. Kumar, Texas A&M University; Chao Tian, Texas A&M University
Pseudocode | Yes | Algorithm 1: Inner Loop(rk, k, , , tk); Algorithm 2: ARNPG Implicit Mirror Descent (ARNPG-IMD); Algorithm 3: ARNPG with Extra Primal Dual (ARNPG-EPD); Algorithm 4: ARNPG with Optimistic Mirror Descent Ascent (ARNPG-OMDA)
Open Source Code | Yes | We provide code at https://github.com/tliu1997/ARNPG-MORL.
Open Datasets | Yes | To demonstrate the efficacy of ARNPG-EPD on complex tasks, we have conducted experiments on the Acrobot-v1 environment from OpenAI Gym [9].
Dataset Splits | No | The paper does not explicitly specify train/validation/test splits, percentages, or sample counts. While it refers to a 'randomly generated CMDP' and the 'Acrobot-v1 environment', and mentions following settings from another paper, it does not detail any data-splitting methodology in its own text.
Hardware Specification | Yes | For the sample-based tabular CMDP and Acrobot-v1 experiments, we used a single NVIDIA GeForce RTX 3090 GPU for each run.
Software Dependencies | No | The paper mentions environments like OpenAI Gym and implies the use of a deep learning framework given the 'actor-critic deep RL scenarios', but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, Gym).
Experiment Setup | Yes | Experimental details on CMDP are postponed to Appendix A and further experiments on smooth concave scalarization and max-min trade-off are presented in Appendix B. For all algorithms, we choose the learning rate α = 0.001. For ARNPG-EPD, we set the inner loop iterations tk = 5. For both problems, we set the discount factor γ = 0.99.
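
As a concrete companion to the experiment setup row, the following is a minimal sketch, under our own assumptions, of how the reported Acrobot-v1 configuration could be instantiated. It only creates the environment and wires in the quoted hyperparameters (α = 0.001, t_k = 5, γ = 0.99); it is not the authors' released code (available at the GitHub link above), and it assumes the classic OpenAI Gym reset/step API.

```python
# Minimal sketch (not the authors' released code) of the reported experiment
# configuration: Acrobot-v1 from OpenAI Gym with the quoted hyperparameters.
# Assumes the classic Gym API (reset() returns an observation, step() returns
# a 4-tuple), as used by Gym versions contemporary with the paper.
import gym

LEARNING_RATE = 0.001     # alpha = 0.001, quoted for all algorithms
INNER_LOOP_ITERS = 5      # t_k = 5, quoted for ARNPG-EPD
DISCOUNT_FACTOR = 0.99    # gamma = 0.99, quoted for both problems

env = gym.make("Acrobot-v1")
obs = env.reset()

# Smoke test: roll out one episode with a random policy and report the
# discounted return, just to confirm the environment is wired up.
done, ret, disc = False, 0.0, 1.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    ret += disc * reward
    disc *= DISCOUNT_FACTOR

print(f"Discounted return of a random policy: {ret:.2f}")
```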
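
For readers unfamiliar with the anchor-changing idea behind the algorithms listed in the pseudocode row, here is a generic illustrative sketch; it is our own construction, not the paper's Algorithms 1-4. It performs one KL-regularized mirror-descent (natural policy gradient) step on the action distribution at a single state, where a proximal KL term keeps the new policy near the current iterate and a second KL term pulls it toward an anchor policy that an outer loop may change. The function name anchored_npg_step, the value of eta, and the toy Q-values are all hypothetical choices made for illustration.

```python
# Generic illustration of a KL-regularized policy update toward a changing
# anchor; our own sketch, NOT the paper's exact algorithm. The closed form
# below solves, at one state,
#   max_pi <q, pi> - (1/alpha) * KL(pi || pi_t) - eta * KL(pi || anchor).
import numpy as np

def anchored_npg_step(pi_t, anchor, q_values, alpha, eta):
    """Return the updated action distribution at one state."""
    denom = 1.0 + alpha * eta
    log_pi = (np.log(pi_t) + alpha * eta * np.log(anchor)
              + alpha * q_values) / denom
    log_pi -= log_pi.max()          # normalize in log space for stability
    pi_next = np.exp(log_pi)
    return pi_next / pi_next.sum()

# Toy usage: three actions, uniform anchor, illustrative step sizes
# (alpha here is exaggerated for visibility; the paper's experiments
# use alpha = 0.001).
pi = np.full(3, 1.0 / 3)
anchor = np.full(3, 1.0 / 3)
q = np.array([1.0, 0.0, -1.0])
for _ in range(5):
    pi = anchored_npg_step(pi, anchor, q, alpha=1.0, eta=0.1)
print(pi)  # probability mass shifts toward the highest-value action
```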