PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning
Authors: Tianmeng Hu, Biao Luo
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes. |
| Researcher Affiliation | Academia | Tianmeng Hu, Biao Luo; School of Automation, Central South University, Changsha 410083, China |
| Pseudocode | Yes | Algorithm 1: PA2D-MORL |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | The proposed method is evaluated in seven multi-objective continuous robot control environments. These environments are based on MuJoCo (Todorov, Erez, and Tassa 2012), which is a widely used complex Deep RL benchmark. The original tasks were modified into multi-objective problems with multiple conflicting objectives (Xu et al. 2020). |
| Dataset Splits | No | The paper does not provide explicit training, validation, or test dataset splits with percentages or sample counts. It discusses multi-objective robot control environments where agents interact directly with the environment, rather than using fixed dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions 'MuJoCo' and the 'PPO algorithm', but does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | All baselines, as well as our methods, update 8 policies in parallel within the evolutionary framework, i.e., p = 8 in Algorithm 1. The policy parameter updates are performed using the PPO algorithm. Also, all methods contain a warmup phase of m_w iterations, which generates the first generation of the policy population. In the experiments above, we use M_ft = (1/3)M, meaning that PA-FT is involved at one-third of the training. (A minimal illustrative sketch of this setup follows the table.) |
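
To make the quoted Experiment Setup concrete, below is a minimal, hypothetical Python sketch of the training schedule it describes: a population of p = 8 policies updated in parallel with PPO, a warmup phase of m_w iterations that produces the first generation, and a fine-tuning stage governed by M_ft = (1/3)M. The paper's source code is not public, so every function name, the placeholder update rules, the total and warmup iteration counts, and the exact point at which PA-FT starts are assumptions; the real update rules are those given in Algorithm 1 (PA2D-MORL) of the paper.

```python
import numpy as np

# Schedule constants taken from the Experiment Setup row; TOTAL_ITERATIONS and
# WARMUP_ITERATIONS are illustrative values, not reported in the excerpt above.
POPULATION_SIZE = 8                            # p = 8 policies updated in parallel
TOTAL_ITERATIONS = 300                         # M (illustrative)
WARMUP_ITERATIONS = 30                         # m_w (illustrative)
FINE_TUNE_ITERATIONS = TOTAL_ITERATIONS // 3   # M_ft = (1/3)M


def ppo_update(params: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Placeholder for a PPO update of one policy's parameters."""
    return params + 0.01 * rng.standard_normal(params.shape)


def pareto_ascent_update(population, rng):
    """Placeholder for the Pareto-ascent directional decomposition step (Algorithm 1)."""
    return [ppo_update(p, rng) for p in population]


def pareto_front_fine_tune(population, rng):
    """Placeholder for the PA-FT (Pareto front fine-tuning) step."""
    return [ppo_update(p, rng) for p in population]


def train() -> list:
    rng = np.random.default_rng(0)

    # Warmup phase of m_w iterations: generates the first generation of the
    # policy population.
    population = [rng.standard_normal(16) for _ in range(POPULATION_SIZE)]
    for _ in range(WARMUP_ITERATIONS):
        population = [ppo_update(p, rng) for p in population]

    # Main evolutionary loop: PA-FT is applied during the final M_ft iterations
    # (reading "involved at one-third of the training" as the last third; this
    # scheduling interpretation is an assumption).
    for it in range(TOTAL_ITERATIONS):
        population = pareto_ascent_update(population, rng)
        if it >= TOTAL_ITERATIONS - FINE_TUNE_ITERATIONS:
            population = pareto_front_fine_tune(population, rng)
    return population


if __name__ == "__main__":
    final_population = train()
    print(f"Trained a population of {len(final_population)} policies.")
```

Note that the stubs above only reproduce the reported schedule; in the actual method the per-policy PPO updates are driven by Pareto-ascent directions computed from the multi-objective returns, as specified in Algorithm 1 of the paper.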