Scaling Pareto-Efficient Decision Making via Offline Multi-Objective RL
Authors: Baiting Zhu, Meihua Dang, Aditya Grover
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics. |
| Researcher Affiliation | Academia | Baiting Zhu, Meihua Dang, Aditya Grover University of California, Los Angeles, CA, USA baitingzbt@g.ucla.edu, mhdang@cs.ucla.edu, adityag@cs.ucla.edu |
| Pseudocode | Yes | Algorithm 1 Data Collection in D4MORL |
| Open Source Code | Yes | Our code is available at: https://github.com/baitingzbt/PEDA. |
| Open Datasets | Yes | We introduce Datasets for Multi-Objective Reinforcement Learning (D4MORL), a large-scale benchmark for offline MORL. Our benchmark consists of offline trajectories from 6 multi-objective MuJoCo environments, including 5 environments with 2 objectives each (MO-Ant, MO-HalfCheetah, MO-Hopper, MO-Swimmer, MO-Walker2d) and one environment with three objectives (MO-Hopper-3obj). [...] Further details are described in Appendix C. |
| Dataset Splits | No | The paper describes splitting preferences for evaluation and mentions collecting 50K trajectories for each setting, but does not specify a reproducible training/validation/test split for the dataset itself. It states: "For every environment in D4MORL, we collect 50K trajectories of length T = 500 for both expert and amateur trajectory distributions under each of the 3 preference distributions." |
| Hardware Specification | No | The paper does not mention any specific hardware (GPU, CPU, etc.) used for running experiments. |
| Software Dependencies | No | The paper mentions using 'GPT (Radford et al., 2019)' and 'Scipy (Vasicek, 1976, Virtanen et al., 2020)' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In this section, we list our hyper-parameters and model details. Specifically, we use the same hyperparameters for all algorithms, except for the learning rate scheduler and warm-up steps. [...] Hyperparameters (MODT / MORvS / BC): Context Length K = 20 / 1 / 20; Batch Size 64; Hidden Size 512; Learning Rate 1e-4; Weight Decay 1e-3; Dropout 0.1; n_layer 3; Optimizer AdamW; Loss Function MSE; LR Scheduler lambda / None / lambda; Warm-up Steps 10000 / N/A / 4000; Activation ReLU |
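The per-method hyperparameters reported in the Experiment Setup row can be collected into plain Python dicts for reference. This is a sketch of the reported settings, not the authors' actual config file, and the `warmup_lr` helper is a hypothetical reading of the "lambda" scheduler as linear warm-up:

```python
# Shared hyperparameters reported for all three methods (MODT, MORvS, BC).
COMMON = dict(batch_size=64, hidden_size=512, lr=1e-4, weight_decay=1e-3,
              dropout=0.1, n_layer=3, optimizer="AdamW", loss="MSE",
              activation="ReLU")

# Per-method differences: context length, LR scheduler, and warm-up steps.
PER_METHOD = {
    "MODT":  dict(COMMON, context_len=20, lr_scheduler="lambda", warmup_steps=10_000),
    "MORvS": dict(COMMON, context_len=1,  lr_scheduler=None,     warmup_steps=None),
    "BC":    dict(COMMON, context_len=20, lr_scheduler="lambda", warmup_steps=4_000),
}

def warmup_lr(step, base_lr=1e-4, warmup_steps=10_000):
    """Linear warm-up to base_lr (one common reading of a 'lambda' scheduler);
    hypothetical -- the paper does not spell out the lambda function."""
    return base_lr * min((step + 1) / warmup_steps, 1.0)
```

With these dicts, the only fields that differ across methods are `context_len`, `lr_scheduler`, and `warmup_steps`, matching the table above.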
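The Research Type row notes that Pareto-front quality is measured by hypervolume and sparsity. As a minimal sketch of how these metrics are commonly computed for a two-objective maximization problem (the paper's own implementation may differ in details such as the reference point):

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` (2-objective maximization) above reference `ref`."""
    # Keep points that strictly dominate the reference, sorted by ascending f1.
    pts = sorted(p for p in points if p[0] > ref[0] and p[1] > ref[1])
    # Filter to the non-dominated front: f2 must strictly decrease as f1 grows.
    front = []
    for p in pts:
        while front and front[-1][1] <= p[1]:
            front.pop()  # dominated by p (smaller f1, no better f2)
        front.append(p)
    # Sweep left to right, summing disjoint rectangles above the reference.
    hv, prev_f1 = 0.0, ref[0]
    for f1, f2 in front:
        hv += (f1 - prev_f1) * (f2 - ref[1])
        prev_f1 = f1
    return hv

def sparsity(front):
    """Mean squared gap between consecutive front solutions, summed per objective."""
    if len(front) < 2:
        return 0.0
    s = 0.0
    for j in range(len(front[0])):
        vals = sorted(p[j] for p in front)
        s += sum((b - a) ** 2 for a, b in zip(vals, vals[1:]))
    return s / (len(front) - 1)
```

For the toy front {(1, 3), (2, 2), (3, 1)} with reference (0, 0), `hypervolume_2d` returns 6.0 and `sparsity` returns 2.0; lower sparsity indicates a denser approximation of the Pareto front.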