Scaling Pareto-Efficient Decision Making via Offline Multi-Objective RL

Authors: Baiting Zhu, Meihua Dang, Aditya Grover

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.
Researcher Affiliation | Academia | Baiting Zhu, Meihua Dang, Aditya Grover; University of California, Los Angeles, CA, USA; baitingzbt@g.ucla.edu, mhdang@cs.ucla.edu, adityag@cs.ucla.edu
Pseudocode | Yes | Algorithm 1: Data Collection in D4MORL
Open Source Code | Yes | Our code is available at: https://github.com/baitingzbt/PEDA.
Open Datasets | Yes | We introduce Datasets for Multi-Objective Reinforcement Learning (D4MORL), a large-scale benchmark for offline MORL. Our benchmark consists of offline trajectories from 6 multi-objective MuJoCo environments including 5 environments with 2 objectives each (MO-Ant, MO-HalfCheetah, MO-Hopper, MO-Swimmer, MO-Walker2d), and one environment with three objectives (MO-Hopper-3obj). [...] Further details are described in Appendix C.
Dataset Splits | No | The paper describes splitting preferences for evaluation and mentions collecting 50K trajectories for each setting, but it does not specify a reproducible training/validation/test split for the dataset itself. It states: "For every environment in D4MORL, we collect 50K trajectories of length T = 500 for both expert and amateur trajectory distributions under each of the 3 preference distributions."
Hardware Specification | No | The paper does not mention any specific hardware (GPU, CPU, etc.) used for running experiments.
Software Dependencies | No | The paper mentions using 'GPT (Radford et al., 2019)' and 'Scipy (Vasicek, 1976, Virtanen et al., 2020)' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | In this section, we list our hyper-parameters and model details. In specific, we use the same hyperparameters for all algorithms, except for the learning rate scheduler and warm-up steps. [...] Hyperparameters (values given as MODT / MORvS / BC where they differ): Context Length K: 20 / 1 / 20; Batch Size: 64; Hidden Size: 512; Learning Rate: 1e-4; Weight Decay: 1e-3; Dropout: 0.1; n_layer: 3; Optimizer: AdamW; Loss Function: MSE; LR Scheduler: lambda / None / lambda; Warm-up Steps: 10000 / N/A / 4000; Activation: ReLU
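The values in the Experiment Setup row can be collected into a small configuration sketch, which makes it easier to see which settings are shared and which vary by algorithm. This is only an illustration, not code from the PEDA repository: the class name `TrainConfig`, its field names, the `CONFIGS` dictionary, and the helper `differing_fields` are all hypothetical. It assumes that every single-valued entry in the row applies to all three algorithms, and it encodes "N/A" warm-up steps as `None`.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in the table."""

    # Settings that differ between algorithms in the table.
    context_length: int
    lr_scheduler: Optional[str]
    warmup_steps: Optional[int]  # None stands in for "N/A"

    # Settings listed once, assumed to be shared by all algorithms.
    batch_size: int = 64
    hidden_size: int = 512
    learning_rate: float = 1e-4
    weight_decay: float = 1e-3
    dropout: float = 0.1
    n_layer: int = 3
    optimizer: str = "AdamW"
    loss_function: str = "MSE"
    activation: str = "ReLU"


# One entry per algorithm column: MODT, MORvS, BC.
CONFIGS: Dict[str, TrainConfig] = {
    "MODT": TrainConfig(context_length=20, lr_scheduler="lambda", warmup_steps=10000),
    "MORvS": TrainConfig(context_length=1, lr_scheduler=None, warmup_steps=None),
    "BC": TrainConfig(context_length=20, lr_scheduler="lambda", warmup_steps=4000),
}


def differing_fields(configs: Dict[str, TrainConfig]) -> Set[str]:
    """Return the names of fields whose values are not identical across configs."""
    rows = [asdict(cfg) for cfg in configs.values()]
    first = rows[0]
    return {key for key in first if any(row[key] != first[key] for row in rows[1:])}


if __name__ == "__main__":
    print(sorted(differing_fields(CONFIGS)))
```

Running the helper on these values reports three varying fields: the context length in addition to the learning rate scheduler and warm-up steps named in the quoted text.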