Distributional Pareto-Optimal Multi-Objective Reinforcement Learning

Authors: Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, Ashley Llorens

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods.
Researcher Affiliation | Collaboration | (1) The University of Tokyo, Tokyo, Japan; (2) Microsoft Research Asia, Beijing, China; (3) RIKEN AIP, Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Utility-based Reinforcement Learning
Open Source Code | Yes | The code is available on https://github.com/zpschang/DPMORL.
Open Datasets | Yes | We conducted experiments across five environments based on MO-Gymnasium [46] to evaluate the performance of our proposed method, DPMORL. These environments represent a diverse range of tasks, from simple toy problems to more complex continuous control tasks, and cover various aspects of multi-objective reinforcement learning: Deep Sea Treasure: A classic MORL benchmark... Fruit Tree: A multi-objective variant... HalfCheetah, Hopper, Mountain Car: Three continuous control tasks...
Dataset Splits | No | The paper reports the number of training steps and the number of policies, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) of the kind common in supervised learning.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used for its experiments (e.g., GPU models, CPU models, or memory details).
Software Dependencies | Yes | For the implementation of our DPMORL algorithm, we used the Proximal Policy Optimization (PPO) algorithm [44] as the basic RL algorithm. ... The utility functions were trained using the Adam optimizer [51].
Experiment Setup | Yes | The learning rate was set to 3 × 10^-4, with a discount factor γ of 0.99. For PPO, we used a clipping parameter of 0.2. The batch size was set to 256 for all environments and algorithms, with updates performed every 2,048 steps. The utility functions were trained using the Adam optimizer [51], with a learning rate of 3 × 10^-4.
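
The Pseudocode row above points to Algorithm 1, Utility-based Reinforcement Learning. The paper's algorithm is not reproduced here; as a rough orientation only, the sketch below shows one common way to reduce utility-based MORL to ordinary scalar RL: accumulate the vector return over an episode and pay out the utility of that return as a single terminal reward. The `UtilityReward` wrapper and the example utility function are illustrative assumptions, not the paper's implementation, and DPMORL's learned utility functions over return distributions are only loosely approximated by this episodic reduction.

```python
# Hedged sketch (not the paper's Algorithm 1): reduce utility-based MORL to scalar RL
# by emitting U(vector return) as one reward at episode end. The wrapper name and the
# example utility are illustrative assumptions.
import gymnasium as gym
import numpy as np


class UtilityReward(gym.Wrapper):
    """Accumulate the vector reward over an episode and emit utility_fn(return)
    as a single scalar reward at termination; intermediate steps get zero reward."""

    def __init__(self, env, utility_fn):
        super().__init__(env)
        self.utility_fn = utility_fn
        self._vector_return = None

    def reset(self, **kwargs):
        self._vector_return = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, vec_reward, terminated, truncated, info = self.env.step(action)
        vec_reward = np.asarray(vec_reward, dtype=np.float64)
        self._vector_return = (
            vec_reward if self._vector_return is None
            else self._vector_return + vec_reward
        )
        done = terminated or truncated
        scalar_reward = float(self.utility_fn(self._vector_return)) if done else 0.0
        return obs, scalar_reward, terminated, truncated, info


# Example: a nonlinear (concave) utility over two objectives. Any scalar-reward RL
# algorithm (e.g., PPO, as reported in the Experiment Setup row) can be trained on
# the wrapped environment.
utility = lambda g: np.sqrt(np.maximum(g[0], 0.0)) + 0.5 * g[1]
```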
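
The Open Datasets row lists five MO-Gymnasium [46] environments. For readers unfamiliar with the library, a minimal interaction sketch follows; the `deep-sea-treasure-v0` environment id and the random-action policy are assumptions for illustration, and the point is only that `step` returns a reward vector with one component per objective.

```python
# Minimal MO-Gymnasium interaction sketch; the environment id and random policy are
# illustrative assumptions, not taken from the paper's code.
import mo_gymnasium as mo_gym
import numpy as np

env = mo_gym.make("deep-sea-treasure-v0")
obs, info = env.reset(seed=0)

vector_return = None
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, vector_reward, terminated, truncated, info = env.step(action)
    # MO-Gymnasium returns a reward vector, one entry per objective.
    vector_return = (
        np.asarray(vector_reward) if vector_return is None
        else vector_return + np.asarray(vector_reward)
    )
    done = terminated or truncated

print("episode vector return:", vector_return)
```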
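
The Experiment Setup row fixes the reported PPO hyperparameters (learning rate 3 × 10^-4, discount factor 0.99, clipping parameter 0.2, batch size 256, updates every 2,048 steps). The sketch below plugs those values into stable-baselines3's PPO, which is an assumption: the paper cites PPO [44] but does not name a library, and DPMORL optimizes learned utility functions rather than the fixed linear scalarization used here only to keep the example self-contained.

```python
# Hedged sketch of the reported PPO settings, using stable-baselines3 (a library
# choice assumed here, not stated in the paper) on a linearly scalarized
# MO-Gymnasium environment.
import gymnasium as gym
import mo_gymnasium as mo_gym
import numpy as np
from stable_baselines3 import PPO


class LinearScalarization(gym.Wrapper):
    """Collapse the vector reward to a scalar with fixed weights (illustrative only)."""

    def __init__(self, env, weights):
        super().__init__(env)
        self.weights = np.asarray(weights, dtype=np.float64)

    def step(self, action):
        obs, vec_reward, terminated, truncated, info = self.env.step(action)
        scalar = float(self.weights @ np.asarray(vec_reward, dtype=np.float64))
        return obs, scalar, terminated, truncated, info


# Deep Sea Treasure has two objectives (treasure value, time penalty).
env = LinearScalarization(mo_gym.make("deep-sea-treasure-v0"), weights=[0.5, 0.5])

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,  # reported learning rate
    gamma=0.99,          # reported discount factor
    clip_range=0.2,      # reported PPO clipping parameter
    batch_size=256,      # reported batch size
    n_steps=2048,        # reported update interval (steps per rollout)
)
model.learn(total_timesteps=100_000)  # training budget here is illustrative
```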