Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality

Authors: Haoye Lu, Daniel Herman, Yaoliang Yu

ICLR 2023

Reproducibility assessment. Each entry lists the variable, its result, and the supporting evidence from the LLM response.
Research Type: Experimental
Evidence: "Our algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference-agnostic setting. Furthermore, we empirically show that, in contrast to other LS-based algorithms, our approach is significantly more stable, achieving similar results across various random seeds. We test our algorithm over a multi-objective version of the MuJoCo environment. Fig. 8 plots the methods' trajectories on four MuJoCo benchmarks. We train each method five times with various random seeds and report the mean and standard deviation."
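A minimal sketch of the mean-and-standard-deviation aggregation implied by the five-seed protocol; the return values below are placeholders, not results from the paper:

    import numpy as np

    # One final evaluation return per random seed (placeholder numbers).
    returns_per_seed = np.array([312.4, 305.1, 318.9, 309.7, 314.2])

    mean = returns_per_seed.mean()
    std = returns_per_seed.std(ddof=1)  # sample standard deviation across seeds
    print(f"return: {mean:.1f} +/- {std:.1f} over {len(returns_per_seed)} seeds")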
Researcher Affiliation: Academia
Evidence: "Haoye Lu, Daniel Herman & Yaoliang Yu; School of Computer Science, University of Waterloo; Vector Institute. {haoye.lu,d2herman,yaoliang.yu}@uwaterloo.ca"
Pseudocode: Yes
Evidence: "Algorithm 1: The CAPQL implementation"
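Algorithm 1 itself is given in the paper; the sketch below only illustrates the general shape of a preference-conditioned, vector-valued critic of the kind SAC-based MORL methods use. All class names, shapes, and the loss are assumptions for illustration, not the authors' implementation:

    import torch
    import torch.nn as nn

    class VectorQ(nn.Module):
        """Q-network with one output per objective, conditioned on the
        state, action, and preference weight w (illustrative design)."""
        def __init__(self, state_dim, action_dim, n_obj, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim + n_obj, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_obj),
            )

        def forward(self, s, a, w):
            return self.net(torch.cat([s, a, w], dim=-1))  # (batch, n_obj)

    def td_loss(q, q_target, s, a, r_vec, s2, a2, w, done, gamma=0.99):
        """One TD step on the vector reward, bootstrapped per objective
        (a generic sketch, not CAPQL's exact update)."""
        with torch.no_grad():
            target = r_vec + gamma * (1.0 - done) * q_target(s2, a2, w)
        return ((q(s, a, w) - target) ** 2).mean()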
Open Source Code: Yes
Evidence: "The source code of our CAPQL implementation is available online: https://github.com/haoyelu/CAPQL.git"
Open Datasets: Yes
Evidence: "We test our algorithm over a multi-objective version of the MuJoCo environment. The reward vector was created by simply exposing the individual components that went into the regular scalar reward: adding them up recovers the default scalar reward. (See Appx. I, Table 4 for details.)"
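The quoted construction can be pictured as a gym wrapper that re-exposes the components already summed into the scalar reward. The info keys below are hypothetical (the real keys vary per MuJoCo task); this is a sketch, not the authors' wrapper:

    import gym
    import numpy as np

    class VectorRewardWrapper(gym.Wrapper):
        """Return the reward as a vector of the components that gym's
        MuJoCo tasks normally sum into one scalar (illustrative keys)."""
        def __init__(self, env, component_keys=("reward_run", "reward_ctrl")):
            super().__init__(env)
            self.component_keys = component_keys

        def step(self, action):
            obs, reward, done, info = self.env.step(action)  # gym-0.21.0 API
            r_vec = np.array([info[k] for k in self.component_keys],
                             dtype=np.float32)
            # Per the paper, summing the components recovers the default
            # scalar reward the environment would otherwise return.
            return obs, r_vec, done, info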
Dataset Splits: No
Evidence: The paper describes training and evaluation within the MuJoCo environments but does not specify dataset splits (e.g., percentages or sample counts) for training, validation, and testing, as one would find with static datasets. Evaluation during training is instead done on "randomly sampled weights".
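Sampling uniformly from the probability simplex is one standard way to realize "randomly sampled weights"; the paper does not pin down the distribution, so the Dirichlet choice below is an assumption:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def sample_preference(n_obj):
        """Draw a weight vector w >= 0 with sum(w) == 1, uniformly on
        the simplex via Dirichlet(1, ..., 1)."""
        return rng.dirichlet(np.ones(n_obj))

    w = sample_preference(2)  # e.g. array([0.37..., 0.62...])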
Hardware Specification: No
Evidence: The paper mentions "Training was done using pytorch-1.12.1 and NVIDIA's CUDA 11.6," which implies NVIDIA GPUs, but does not specify exact GPU models, CPU models, or other hardware details used for the experiments.
Software Dependencies: Yes
Evidence: "Python 3.10.4 was used as the primary programming language. We accessed MuJoCo 210 through gym-0.21.0's wrapper classes. Training was done using pytorch-1.12.1 and NVIDIA's CUDA 11.6."
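A quick sanity check of the reported software stack, using only standard version attributes:

    import sys
    import gym
    import torch

    # Compare the local stack against the versions reported in the paper.
    print(sys.version.split()[0])  # paper used 3.10.4
    print(gym.__version__)         # paper used 0.21.0
    print(torch.__version__)       # paper used 1.12.1
    print(torch.version.cuda)      # paper used 11.6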
Experiment Setup: Yes
Evidence: Table 1 lists the hyperparameters of CAPQL and QEnv-ctn: optimizer Adam; learning rate 3e-4; discount factor γ = 0.99; hidden dimension 256 (all networks); replay buffer size 10^6; minibatch size 256; nonlinearity ReLU; target smoothing coefficient τ = 0.005. Table 2 lists the augmentation strength α of CAPQL per environment.
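For reference, the Table 1 values collected as a plain Python dict, plus a soft-update helper matching the reported smoothing coefficient. The helper is standard Polyak averaging, an assumption about usage rather than the authors' code:

    import torch

    # Hyperparameters reported in Table 1 of the paper.
    HPARAMS = dict(
        lr=3e-4,            # Adam learning rate
        gamma=0.99,         # discount factor
        hidden_dim=256,     # all networks
        buffer_size=10**6,  # replay buffer size
        batch_size=256,     # minibatch size
        tau=0.005,          # target smoothing coefficient
    )

    def make_optimizer(params):
        return torch.optim.Adam(params, lr=HPARAMS["lr"])

    def soft_update(target_net, online_net, tau=HPARAMS["tau"]):
        """Polyak-average the target parameters: theta' <- (1 - tau) * theta' + tau * theta."""
        with torch.no_grad():
            for pt, p in zip(target_net.parameters(), online_net.parameters()):
                pt.mul_(1.0 - tau).add_(tau * p)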