PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning
Authors: Tianmeng Hu, Biao Luo
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes. |
| Researcher Affiliation | Academia | Tianmeng Hu, Biao Luo; School of Automation, Central South University, Changsha 410083, China |
| Pseudocode | Yes | Algorithm 1: PA2D-MORL |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | The proposed method is evaluated in seven multi-objective continuous robot control environments. These environments are based on MuJoCo (Todorov, Erez, and Tassa 2012), which is a widely used complex Deep RL benchmark. The original tasks were modified into multi-objective problems with multiple conflicting objectives (Xu et al. 2020). |
| Dataset Splits | No | The paper does not provide explicit training, validation, or test dataset splits with percentages or sample counts. It discusses multi-objective robot control environments where agents interact directly with the environment, rather than using fixed dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions 'MuJoCo' and the 'PPO algorithm', but does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | All baselines, as well as our methods, update 8 policies in parallel within the evolutionary framework, i.e., p = 8 in Algorithm 1. The policy parameter updates are performed using the PPO algorithm. Also, all methods contain a warmup phase of m_w iterations, which generates the first generation of the policy population. In the experiments above, we use M_ft = (1/3)M, meaning that PA-FT is involved at one-third of the training. (A minimal illustrative sketch of this setup follows the table.) |
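
To make the quoted Experiment Setup concrete, below is a minimal, hypothetical Python sketch of the training schedule it describes: a population of p = 8 policies updated in parallel with PPO, a warmup phase of m_w iterations that produces the first generation, and a fine-tuning stage governed by M_ft = (1/3)M. The paper's source code is not public, so every function name, the placeholder update rules, the total and warmup iteration counts, and the exact point at which PA-FT starts are assumptions; the real update rules are those given in Algorithm 1 (PA2D-MORL) of the paper.

```python
import numpy as np

# Schedule constants taken from the Experiment Setup row; TOTAL_ITERATIONS and
# WARMUP_ITERATIONS are illustrative values, not reported in the excerpt above.
POPULATION_SIZE = 8                            # p = 8 policies updated in parallel
TOTAL_ITERATIONS = 300                         # M (illustrative)
WARMUP_ITERATIONS = 30                         # m_w (illustrative)
FINE_TUNE_ITERATIONS = TOTAL_ITERATIONS // 3   # M_ft = (1/3)M


def ppo_update(params: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Placeholder for a PPO update of one policy's parameters."""
    return params + 0.01 * rng.standard_normal(params.shape)


def pareto_ascent_update(population, rng):
    """Placeholder for the Pareto-ascent directional decomposition step (Algorithm 1)."""
    return [ppo_update(p, rng) for p in population]


def pareto_front_fine_tune(population, rng):
    """Placeholder for the PA-FT (Pareto front fine-tuning) step."""
    return [ppo_update(p, rng) for p in population]


def train() -> list:
    rng = np.random.default_rng(0)

    # Warmup phase of m_w iterations: generates the first generation of the
    # policy population.
    population = [rng.standard_normal(16) for _ in range(POPULATION_SIZE)]
    for _ in range(WARMUP_ITERATIONS):
        population = [ppo_update(p, rng) for p in population]

    # Main evolutionary loop: PA-FT is applied during the final M_ft iterations
    # (reading "involved at one-third of the training" as the last third; this
    # scheduling interpretation is an assumption).
    for it in range(TOTAL_ITERATIONS):
        population = pareto_ascent_update(population, rng)
        if it >= TOTAL_ITERATIONS - FINE_TUNE_ITERATIONS:
            population = pareto_front_fine_tune(population, rng)
    return population


if __name__ == "__main__":
    final_population = train()
    print(f"Trained a population of {len(final_population)} policies.")
```

Note that the stubs above only reproduce the reported schedule; in the actual method the per-policy PPO updates are driven by Pareto-ascent directions computed from the multi-objective returns, as specified in Algorithm 1 of the paper.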