Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A distributional view on multi-objective policy optimization
Authors: Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Francis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, Martin Riedmiller
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach on challenging high-dimensional real and simulated robotics tasks, and show that setting different preferences in our framework allows us to trace out the space of nondominated solutions. |
| Researcher Affiliation | Industry | DeepMind. Correspondence to: Abbas Abdolmaleki <EMAIL>, Sandy H. Huang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MO-MPO: One policy improvement step |
| Open Source Code | Yes | Code for MO-MPO will be made available online. |
| Open Datasets | No | No explicit public dataset links, DOIs, or repository names are provided. The paper mentions using "motion capture reference data" from "Hasenclever et al. (2020)" and "treasure values in Yang et al. (2019)" but does not provide direct access information for these data sources. |
| Dataset Splits | No | The paper does not specify training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | No specific GPU or CPU models, or other detailed hardware specifications for the computing resources used for experiments, are provided. |
| Software Dependencies | Yes | We use CVXOPT (Andersen et al., 2020) as our convex optimization solver. |
| Experiment Setup | Yes | We set ϵ = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges... For MO-V-MPO, we set all ϵ_k = 0.01. Also, for each objective, we set ϵ_k = 0.001 and set all others to 0.01. ...for MO-MPO we set ϵ_task = 0.1 and ϵ_force = 0.05, and for scalarized MPO we try [w_task, w_force] = [0.95, 0.05] and [0.8, 0.2]. |
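
For reference, the constraint and weight settings quoted in the Experiment Setup row can be collected into a small configuration. The Python sketch below is purely illustrative: the dictionary keys and grouping are assumptions made here for readability, and only the numeric values come from the quoted text.

```python
# Illustrative only: keys and structure are assumptions; the numeric values
# are taken from the Experiment Setup evidence quoted in the table above.
experiment_setup = {
    "scalarized_mpo": {"epsilon": 0.01},
    "uniform_policy_mpo": {"beta": 0.001},  # beta setting quoted for MPO started from a uniform policy
    "mo_v_mpo": {
        "all_objectives": {"epsilon_k": 0.01},  # same constraint on every objective
        "per_objective_sweep": {                # one objective at a time gets 0.001, the rest stay at 0.01
            "varied_objective": 0.001,
            "other_objectives": 0.01,
        },
    },
    "task_vs_force_experiment": {
        "mo_mpo": {"epsilon_task": 0.1, "epsilon_force": 0.05},
        "scalarized_mpo_weights": [  # [w_task, w_force] settings tried
            [0.95, 0.05],
            [0.8, 0.2],
        ],
    },
}

if __name__ == "__main__":
    from pprint import pprint
    pprint(experiment_setup)
```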