Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scaling Pareto-Efficient Decision Making via Offline Multi-Objective RL
Authors: Baiting Zhu, Meihua Dang, Aditya Grover
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics. |
| Researcher Affiliation | Academia | Baiting Zhu, Meihua Dang, Aditya Grover University of California, Los Angeles, CA, USA EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Data Collection in D4MORL |
| Open Source Code | Yes | Our code is available at: https://github.com/baitingzbt/PEDA. |
| Open Datasets | Yes | We introduce Datasets for Multi-Objective Reinforcement Learning (D4MORL), a large-scale benchmark for offline MORL. Our benchmark consists of offline trajectories from 6 multi-objective MuJoCo environments including 5 environments with 2 objectives each (MO-Ant, MO-HalfCheetah, MO-Hopper, MO-Swimmer, MO-Walker2d), and one environment with three objectives (MO-Hopper-3obj). [...] Further details are described in Appendix C. |
| Dataset Splits | No | The paper describes splitting preferences for evaluation and mentions collecting 50K trajectories for each setting, but it does not specify a reproducible training/validation/test split for the dataset itself. It states: "For every environment in D4MORL, we collect 50K trajectories of length T = 500 for both expert and amateur trajectory distributions under each of the 3 preference distributions." |
| Hardware Specification | No | The paper does not mention any specific hardware (GPU, CPU, etc.) used for running experiments. |
| Software Dependencies | No | The paper mentions using 'GPT (Radford et al., 2019)' and 'Scipy (Vasicek, 1976, Virtanen et al., 2020)' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In this section, we list our hyper-parameters and model details. In specific, we use the same hyperparameters for all algorithms, except for the learning rate scheduler and warm-up steps. [...] Hyperparameters (given as MODT / MORvS / BC where values differ): Context Length K: 20 / 1 / 20; Batch Size: 64; Hidden Size: 512; Learning Rate: 1e-4; Weight Decay: 1e-3; Dropout: 0.1; n_layer: 3; Optimizer: AdamW; Loss Function: MSE; LR Scheduler: lambda / None / lambda; Warm-up Steps: 10000 / N/A / 4000; Activation: ReLU |
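
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the names `TrainConfig`, `CONFIGS`, and `warmup_factor` are hypothetical, the mapping of the three-valued entries to MODT, MORvS, and BC assumes the column order of the quoted table, and the linear warm-up formula is an assumption, since the source only says the scheduler is a "lambda".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    # Values that differ between algorithms (no defaults).
    context_length: int
    lr_scheduler: Optional[str]   # "lambda" or None, as quoted
    warmup_steps: Optional[int]   # None stands for the quoted "N/A"
    # Values quoted once, assumed shared by all three algorithms.
    batch_size: int = 64
    hidden_size: int = 512
    learning_rate: float = 1e-4
    weight_decay: float = 1e-3
    dropout: float = 0.1
    n_layer: int = 3
    optimizer: str = "AdamW"
    loss: str = "MSE"
    activation: str = "ReLU"


# Assumed column order from the quoted table: MODT, MORvS, BC.
CONFIGS = {
    "MODT": TrainConfig(context_length=20, lr_scheduler="lambda", warmup_steps=10000),
    "MORvS": TrainConfig(context_length=1, lr_scheduler=None, warmup_steps=None),
    "BC": TrainConfig(context_length=20, lr_scheduler="lambda", warmup_steps=4000),
}


def warmup_factor(step: int, warmup_steps: Optional[int]) -> float:
    """Learning-rate multiplier at a given step.

    ASSUMPTION: a linear ramp from 1/warmup_steps up to 1.0; the source
    does not state the exact lambda used by the paper.
    """
    if not warmup_steps:
        return 1.0  # no scheduler: constant learning rate
    return min((step + 1) / warmup_steps, 1.0)


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        lr_at_1000 = cfg.learning_rate * warmup_factor(1000, cfg.warmup_steps)
        print(name, cfg.context_length, lr_at_1000)
```

Keeping the shared values as defaults and the per-algorithm values as required fields makes the differences between the three setups explicit, which is the distinction the quoted passage draws.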