Distributional Pareto-Optimal Multi-Objective Reinforcement Learning

Authors: Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, Ashley Llorens

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods.
Researcher Affiliation | Collaboration | (1) The University of Tokyo, Tokyo, Japan; (2) Microsoft Research Asia, Beijing, China; (3) RIKEN AIP, Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Utility-based Reinforcement Learning
Open Source Code | Yes | The code is available on https://github.com/zpschang/DPMORL.
Open Datasets | Yes | We conducted experiments across five environments based on MO-Gymnasium [46] to evaluate the performance of our proposed method, DPMORL. These environments represent a diverse range of tasks, from simple toy problems to more complex continuous control tasks, and cover various aspects of multi-objective reinforcement learning: Deep Sea Treasure: A classic MORL benchmark... Fruit Tree: A multi-objective variant... HalfCheetah, Hopper, Mountain Car: Three continuous control tasks...
Dataset Splits | No | The paper reports the number of training steps and the number of policies, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) of the kind common in supervised learning.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used for its experiments (e.g., GPU models, CPU models, or memory details).
Software Dependencies | Yes | For the implementation of our DPMORL algorithm, we used the Proximal Policy Optimization (PPO) algorithm [44] as the basic RL algorithm. ... The utility functions were trained using the Adam optimizer [51].
Experiment Setup | Yes | The learning rate was set to 3 × 10^-4, with a discount factor γ of 0.99. For PPO, we used a clipping parameter of 0.2. The batch size was set to 256 for all environments and algorithms, with updates performed every 2,048 steps. The utility functions were trained using the Adam optimizer [51], with a learning rate of 3 × 10^-4.
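
The Pseudocode row above points to Algorithm 1, Utility-based Reinforcement Learning. The paper's algorithm is not reproduced here; as a rough orientation only, the sketch below shows one common way to reduce utility-based MORL to ordinary scalar RL: accumulate the vector return over an episode and pay out the utility of that return as a single terminal reward. The `UtilityReward` wrapper and the example utility function are illustrative assumptions, not the paper's implementation, and DPMORL's learned utility functions over return distributions are only loosely approximated by this episodic reduction.

```python
# Hedged sketch (not the paper's Algorithm 1): reduce utility-based MORL to scalar RL
# by emitting U(vector return) as one reward at episode end. The wrapper name and the
# example utility are illustrative assumptions.
import gymnasium as gym
import numpy as np


class UtilityReward(gym.Wrapper):
    """Accumulate the vector reward over an episode and emit utility_fn(return)
    as a single scalar reward at termination; intermediate steps get zero reward."""

    def __init__(self, env, utility_fn):
        super().__init__(env)
        self.utility_fn = utility_fn
        self._vector_return = None

    def reset(self, **kwargs):
        self._vector_return = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, vec_reward, terminated, truncated, info = self.env.step(action)
        vec_reward = np.asarray(vec_reward, dtype=np.float64)
        self._vector_return = (
            vec_reward if self._vector_return is None
            else self._vector_return + vec_reward
        )
        done = terminated or truncated
        scalar_reward = float(self.utility_fn(self._vector_return)) if done else 0.0
        return obs, scalar_reward, terminated, truncated, info


# Example: a nonlinear (concave) utility over two objectives. Any scalar-reward RL
# algorithm (e.g., PPO, as reported in the Experiment Setup row) can be trained on
# the wrapped environment.
utility = lambda g: np.sqrt(np.maximum(g[0], 0.0)) + 0.5 * g[1]
```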
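
The Open Datasets row lists five MO-Gymnasium [46] environments. For readers unfamiliar with the library, a minimal interaction sketch follows; the `deep-sea-treasure-v0` environment id and the random-action policy are assumptions for illustration, and the point is only that `step` returns a reward vector with one component per objective.

```python
# Minimal MO-Gymnasium interaction sketch; the environment id and random policy are
# illustrative assumptions, not taken from the paper's code.
import mo_gymnasium as mo_gym
import numpy as np

env = mo_gym.make("deep-sea-treasure-v0")
obs, info = env.reset(seed=0)

vector_return = None
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, vector_reward, terminated, truncated, info = env.step(action)
    # MO-Gymnasium returns a reward vector, one entry per objective.
    vector_return = (
        np.asarray(vector_reward) if vector_return is None
        else vector_return + np.asarray(vector_reward)
    )
    done = terminated or truncated

print("episode vector return:", vector_return)
```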
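
The Experiment Setup row fixes the reported PPO hyperparameters (learning rate 3 × 10^-4, discount factor 0.99, clipping parameter 0.2, batch size 256, updates every 2,048 steps). The sketch below plugs those values into stable-baselines3's PPO, which is an assumption: the paper cites PPO [44] but does not name a library, and DPMORL optimizes learned utility functions rather than the fixed linear scalarization used here only to keep the example self-contained.

```python
# Hedged sketch of the reported PPO settings, using stable-baselines3 (a library
# choice assumed here, not stated in the paper) on a linearly scalarized
# MO-Gymnasium environment.
import gymnasium as gym
import mo_gymnasium as mo_gym
import numpy as np
from stable_baselines3 import PPO


class LinearScalarization(gym.Wrapper):
    """Collapse the vector reward to a scalar with fixed weights (illustrative only)."""

    def __init__(self, env, weights):
        super().__init__(env)
        self.weights = np.asarray(weights, dtype=np.float64)

    def step(self, action):
        obs, vec_reward, terminated, truncated, info = self.env.step(action)
        scalar = float(self.weights @ np.asarray(vec_reward, dtype=np.float64))
        return obs, scalar, terminated, truncated, info


# Deep Sea Treasure has two objectives (treasure value, time penalty).
env = LinearScalarization(mo_gym.make("deep-sea-treasure-v0"), weights=[0.5, 0.5])

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,  # reported learning rate
    gamma=0.99,          # reported discount factor
    clip_range=0.2,      # reported PPO clipping parameter
    batch_size=256,      # reported batch size
    n_steps=2048,        # reported update interval (steps per rollout)
)
model.learn(total_timesteps=100_000)  # training budget here is illustrative
```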