A distributional view on multi-objective policy optimization

Authors: Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Francis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, Martin Riedmiller

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the effectiveness of our approach on challenging high-dimensional real and simulated robotics tasks, and show that setting different preferences in our framework allows us to trace out the space of nondominated solutions."
Researcher Affiliation | Industry | "DeepMind. Correspondence to: Abbas Abdolmaleki <aabdolmaleki@google.com>, Sandy H. Huang <shhuang@google.com>."
Pseudocode | Yes | "Algorithm 1 MO-MPO: One policy improvement step"
Open Source Code | Yes | "Code for MO-MPO will be made available online."
Open Datasets | No | No explicit public dataset links, DOIs, or repository names are provided. The paper mentions using "motion capture reference data" from "Hasenclever et al. (2020)" and "treasure values in Yang et al. (2019)" but does not provide direct access information for these data sources.
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | No specific GPU or CPU models, or other detailed hardware specifications for the computing resources used for experiments, are provided.
Software Dependencies | Yes | "We use CVXOPT (Andersen et al., 2020) as our convex optimization solver."
Experiment Setup | Yes | "We set ϵ = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges... For MO-V-MPO, we set all ϵ_k = 0.01. Also, for each objective, we set ϵ_k = 0.001 and set all others to 0.01. ...for MO-MPO we set ϵ_task = 0.1 and ϵ_force = 0.05, and for scalarized MPO we try [w_task, w_force] = [0.95, 0.05] and [0.8, 0.2]."
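The Experiment Setup row collects several KL-bound (ϵ) settings quoted from the paper. As a small illustration, the snippet below gathers those quoted values into one place; the dictionary layout and key names are hypothetical bookkeeping, and only the numeric values and the task/force naming come from the paper.

```python
# Hypothetical bookkeeping of the epsilon (KL bound) settings quoted in the
# Experiment Setup row. Only the numbers come from the paper; the dictionary
# layout and key names are illustrative.

EPSILON_SETTINGS = {
    # Scalarized MPO baseline: a single KL bound.
    "scalarized_mpo": {"epsilon": 0.01},

    # MO-V-MPO: one KL bound per objective, all set to the same value.
    "mo_v_mpo": {"epsilon_k": 0.01},

    # Per-objective sweep described in the quote: one objective gets a
    # tighter bound (0.001) while all other objectives keep 0.01.
    "mo_mpo_per_objective_sweep": {
        "tight_epsilon_k": 0.001,
        "default_epsilon_k": 0.01,
    },

    # Task/force trade-off experiment.
    "mo_mpo_task_force": {"epsilon_task": 0.1, "epsilon_force": 0.05},
    "scalarized_mpo_task_force": {
        # [w_task, w_force] weightings tried for the scalarized baseline.
        "weights_tried": [[0.95, 0.05], [0.8, 0.2]],
    },
}

if __name__ == "__main__":
    for setting, params in EPSILON_SETTINGS.items():
        print(f"{setting}: {params}")
```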
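The Pseudocode and Software Dependencies rows point to Algorithm 1 (one MO-MPO policy improvement step) and to CVXOPT as the paper's convex optimization solver. For readers unfamiliar with MPO-style updates, the sketch below shows the general shape of the per-objective convex subproblem: minimizing a temperature dual under a KL bound ϵ_k and turning Q-values into per-action improvement weights. This is not the authors' implementation; the function names and array shapes are assumptions, and scipy's scalar minimizer stands in for CVXOPT purely for illustration.

```python
# A minimal sketch (not the authors' code) of an MPO-style per-objective
# temperature optimization and the resulting improvement weights.
# Assumed shapes: q_values is [num_states, num_action_samples], with actions
# sampled from the current policy.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp


def solve_temperature(q_values: np.ndarray, epsilon: float) -> float:
    """Minimize the temperature dual g(eta) for one objective.

    epsilon is that objective's KL bound, e.g. 0.01 or 0.001 in the paper.
    """
    num_actions = q_values.shape[1]

    def dual(eta: float) -> float:
        # g(eta) = eta * epsilon + eta * E_s[ log E_a[ exp(Q(s, a) / eta) ] ]
        log_mean_exp = logsumexp(q_values / eta, axis=1) - np.log(num_actions)
        return eta * epsilon + eta * float(np.mean(log_mean_exp))

    result = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded")
    return float(result.x)


def improvement_weights(q_values: np.ndarray, eta: float) -> np.ndarray:
    """Per-action weights proportional to exp(Q(s, a) / eta), normalized per state."""
    logits = q_values / eta
    return np.exp(logits - logsumexp(logits, axis=1, keepdims=True))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy Q-values for two objectives over 4 states and 8 sampled actions.
    q_task = rng.normal(size=(4, 8))
    q_force = -np.abs(rng.normal(size=(4, 8)))

    # A tighter epsilon forces a larger temperature, flattening the weights and
    # reducing that objective's influence on the policy update.
    for name, q, eps in [("task", q_task, 0.1), ("force", q_force, 0.05)]:
        eta = solve_temperature(q, eps)
        weights = improvement_weights(q, eta)
        print(f"{name}: eta={eta:.3f}, weight row sums={weights.sum(axis=1)}")
```

In this simplified view, shrinking ϵ_k pushes the temperature up and flattens that objective's weights, which is consistent with the paper's idea of encoding preferences across objectives through the ϵ_k rather than through scalarization weights.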