Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Distributional Pareto-Optimal Multi-Objective Reinforcement Learning
Authors: Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, Ashley Llorens
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods. |
| Researcher Affiliation | Collaboration | 1 The University of Tokyo, Tokyo, Japan 2 Microsoft Research Asia, Beijing, China 3 RIKEN AIP, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 Utility-based Reinforcement Learning |
| Open Source Code | Yes | The code is available on https://github.com/zpschang/DPMORL. |
| Open Datasets | Yes | We conducted experiments across five environments based on MO-Gymnasium [46] to evaluate the performance of our proposed method, DPMORL. These environments represent a diverse range of tasks, from simple toy problems to more complex continuous control tasks, and cover various aspects of multi-objective reinforcement learning: Deep Sea Treasure: A classic MORL benchmark... Fruit Tree: A multi-objective variant... Half Cheetah, Hopper, Mountain Car: Three continuous control tasks... |
| Dataset Splits | No | The paper mentions training for a certain number of steps and the number of policies, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) as commonly seen in supervised learning contexts. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for its experiments (e.g., GPU models, CPU models, or memory details). |
| Software Dependencies | Yes | For the implementation of our DPMORL algorithm, we used the Proximal Policy Optimization (PPO) algorithm [44] as the basic RL algorithm. ... The utility functions were trained using the Adam optimizer [51]. |
| Experiment Setup | Yes | The learning rate was set to 3 × 10^-4, with a discount factor γ of 0.99. For PPO, we used a clipping parameter of 0.2. The batch size was set to 256 for all environments and algorithms, with updates performed every 2,048 steps. The utility functions were trained using the Adam optimizer [51], with a learning rate of 3 × 10^-4. |
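The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. A minimal sketch follows; the variable name `ppo_config` and the key names are illustrative, and the stable-baselines3 mapping mentioned in the comment is an assumption, since the paper only states that PPO and the Adam optimizer were used:

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names follow common PPO conventions and are not taken from the DPMORL code.
ppo_config = {
    "learning_rate": 3e-4,  # used for both PPO and the Adam-trained utility functions
    "gamma": 0.99,          # discount factor
    "clip_range": 0.2,      # PPO clipping parameter
    "batch_size": 256,      # same across all environments and algorithms
    "n_steps": 2048,        # environment steps between updates
}

# With stable-baselines3 (one possible backend; the paper does not name one),
# these keys would map directly onto the PPO constructor:
#   model = PPO("MlpPolicy", env, **ppo_config)

print(ppo_config["learning_rate"])  # 0.0003
```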