PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning

Authors: Tianmeng Hu, Biao Luo

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.
Researcher Affiliation | Academia | Tianmeng Hu, Biao Luo; School of Automation, Central South University, Changsha 410083, China
Pseudocode | Yes | Algorithm 1: PA2D-MORL
Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the methodology described is publicly available.
Open Datasets | Yes | The proposed method is evaluated in seven multi-objective continuous robot control environments. These environments are based on MuJoCo (Todorov, Erez, and Tassa 2012), which is a widely used complex Deep RL benchmark. The original tasks were modified into multi-objective problems with multiple conflicting objectives (Xu et al. 2020).
Dataset Splits | No | The paper does not provide explicit training, validation, or test dataset splits with percentages or sample counts. It discusses multi-objective robot control environments where agents interact directly with the environment, rather than using fixed dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments.
Software Dependencies | No | The paper mentions 'MuJoCo' and the 'PPO algorithm', but does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | All baselines, as well as our methods, update 8 policies in parallel within the evolutionary framework, i.e., p = 8 in Algorithm 1. The policy parameter updates are performed using the PPO algorithm. Also, all methods contain a warmup phase of m_w iterations, which generates the first generation of the policy population. In the experiments above, we use M_ft = (1/3)M, meaning that PA-FT is involved at one-third of the training.
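To make the reported settings concrete, here is a minimal sketch of a training configuration holding the values quoted in the Experiment Setup row (p = 8 parallel policies, PPO parameter updates, a warmup phase of m_w iterations, and M_ft = M/3). The class and field names are assumptions made for illustration; the paper's own code is not public, so this is not its implementation.

```python
from dataclasses import dataclass


@dataclass
class PA2DMORLConfig:
    """Hypothetical container for the settings reported in the paper.

    Field names are illustrative; only the values quoted in the
    Experiment Setup row above come from the paper.
    """
    num_policies: int = 8          # p = 8 policies updated in parallel (Algorithm 1)
    policy_optimizer: str = "PPO"  # policy parameter updates use the PPO algorithm
    total_iterations: int = 0      # M: total training iterations (value not reported here)
    warmup_iterations: int = 0     # m_w: warmup phase that builds the first policy population

    @property
    def finetune_start(self) -> int:
        # M_ft = M / 3: the paper states PA-FT is involved at one-third of the training.
        return self.total_iterations // 3
```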
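Relating to the Open Datasets row, the benchmark environments are the multi-objective MuJoCo tasks of Xu et al. (2020). Those exact environments are not reproduced here; the sketch below only illustrates, under assumed Gymnasium info-dict key names, how a standard MuJoCo task can be recast as a two-objective problem with conflicting rewards (forward progress vs. energy cost).

```python
import numpy as np
import gymnasium as gym


class TwoObjectiveWrapper(gym.Wrapper):
    """Return a 2-dimensional reward vector instead of a scalar reward.

    Objective 0: forward progress, read from the underlying env's info dict
    (the "x_velocity" key is an assumption about the installed Gymnasium
    MuJoCo version). Objective 1: negative energy cost, -||action||^2, so
    both objectives are maximized and conflict with each other.
    """

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        forward = float(info.get("x_velocity", 0.0))
        energy = -float(np.square(action).sum())
        vector_reward = np.array([forward, energy], dtype=np.float32)
        return obs, vector_reward, terminated, truncated, info


# Hypothetical usage: wrap a standard MuJoCo task.
# env = TwoObjectiveWrapper(gym.make("HalfCheetah-v4"))
```

Because the wrapper returns a vector reward, it can only be used with an algorithm that handles multi-objective returns, such as the paper's PA2D-MORL or other MORL baselines.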