Distributional Pareto-Optimal Multi-Objective Reinforcement Learning
Authors: Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, Ashley Llorens
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods. |
| Researcher Affiliation | Collaboration | ¹ The University of Tokyo, Tokyo, Japan; ² Microsoft Research Asia, Beijing, China; ³ RIKEN AIP, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 Utility-based Reinforcement Learning |
| Open Source Code | Yes | The code is available on https://github.com/zpschang/DPMORL. |
| Open Datasets | Yes | We conducted experiments across five environments based on MO-Gymnasium [46] to evaluate the performance of our proposed method, DPMORL. These environments represent a diverse range of tasks, from simple toy problems to more complex continuous control tasks, and cover various aspects of multi-objective reinforcement learning: Deep Sea Treasure: A classic MORL benchmark... Fruit Tree: A multi-objective variant... HalfCheetah, Hopper, Mountain Car: Three continuous control tasks... (see the environment sketch below the table) |
| Dataset Splits | No | The paper mentions training for a certain number of steps and the number of policies, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) as commonly seen in supervised learning contexts. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for its experiments (e.g., GPU models, CPU models, or memory details). |
| Software Dependencies | Yes | For the implementation of our DPMORL algorithm, we used the Proximal Policy Optimization (PPO) algorithm [44] as the basic RL algorithm. ... The utility functions were trained using the Adam optimizer [51]. |
| Experiment Setup | Yes | The learning rate was set to 3 × 10^-4, with a discount factor γ of 0.99. For PPO, we used a clipping parameter of 0.2. The batch size was set to 256 for all environments and algorithms, with updates performed every 2,048 steps. The utility functions were trained using the Adam optimizer [51], with a learning rate of 3 × 10^-4. (see the configuration sketch below the table) |
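
The Open Datasets row names five MO-Gymnasium environments. As a quick orientation, the following minimal sketch (not taken from the paper's released code) shows how such environments could be instantiated and how the per-objective reward vector is exposed; the exact environment IDs are assumptions based on the public MO-Gymnasium registry.

```python
# Minimal sketch, assuming the MO-Gymnasium IDs below match the five environments
# described in the paper; this is illustrative, not the authors' setup script.
import mo_gymnasium as mo_gym

ENV_IDS = [
    "deep-sea-treasure-v0",  # classic MORL benchmark (assumed ID)
    "fruit-tree-v0",         # multi-objective tree task (assumed ID)
    "mo-halfcheetah-v4",     # continuous control (assumed ID)
    "mo-hopper-v4",          # continuous control (assumed ID)
    "mo-mountaincar-v0",     # continuous control (assumed ID)
]

for env_id in ENV_IDS:
    env = mo_gym.make(env_id)
    obs, info = env.reset(seed=0)
    # MO-Gymnasium follows the Gymnasium API but returns a reward *vector*,
    # one entry per objective, instead of a scalar reward.
    obs, vector_reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(env_id, "number of objectives:", len(vector_reward))
    env.close()
```

The Software Dependencies and Experiment Setup rows report PPO as the base RL algorithm with a learning rate of 3 × 10^-4, discount factor 0.99, clipping parameter 0.2, batch size 256, updates every 2,048 steps, and Adam (also at 3 × 10^-4) for the utility functions. The sketch below maps these reported values onto stable-baselines3 hyperparameter names; the library choice, the fixed-weight scalarization wrapper, the environment ID, and the utility network are illustrative assumptions rather than the authors' implementation (their code is at https://github.com/zpschang/DPMORL).
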
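```python
# Hedged configuration sketch: the hyperparameter values come from the paper's
# reported setup; everything else (stable-baselines3, the wrapper, the env ID,
# the utility network) is an assumption for illustration only.
import gymnasium as gym
import numpy as np
import torch
import mo_gymnasium as mo_gym
from stable_baselines3 import PPO


class ScalarizedReward(gym.Wrapper):
    """Collapses the vector reward with fixed weights (a stand-in for DPMORL's learned utility)."""

    def __init__(self, env, weights):
        super().__init__(env)
        self.weights = np.asarray(weights, dtype=np.float32)

    def step(self, action):
        obs, vec_reward, terminated, truncated, info = self.env.step(action)
        return obs, float(np.dot(self.weights, vec_reward)), terminated, truncated, info


# Assumed environment ID; mo-hopper is assumed here to expose 3 objectives.
env = ScalarizedReward(mo_gym.make("mo-hopper-v4"), weights=[1 / 3, 1 / 3, 1 / 3])

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,  # reported learning rate
    gamma=0.99,          # reported discount factor
    clip_range=0.2,      # reported PPO clipping parameter
    batch_size=256,      # reported batch size
    n_steps=2048,        # reported: updates performed every 2,048 steps
)
model.learn(total_timesteps=100_000)

# The utility functions are reported to be trained with Adam at 3e-4; a hypothetical
# two-objective utility network and its optimizer would look like this:
utility_net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
utility_optimizer = torch.optim.Adam(utility_net.parameters(), lr=3e-4)
```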