Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal
Authors: Lihe Li, Ruotong Chen, Ziqian Zhang, Zhichao Wu, Yi-Chen Li, Cong Guan, Yang Yu, Lei Yuan
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on four CMORL benchmarks showcase that CORE3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORE3 to handle situations with evolving objectives. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University 2School of Artificial Intelligence, Nanjing University 3Polixir Technologies |
| Pseudocode | Yes | Algorithm 1 CORE3 |
| Open Source Code | No | The paper does not contain any explicit statement about making its source code publicly available or providing a link to a code repository. |
| Open Datasets | Yes | The first benchmark is Fruit Tree Navigation (FTN) [Yang et al., 2019]... Another MuJoCo benchmark is Hopper, where the five objectives of the hopper robot include saving energy, moving forward, moving backward, jumping high, and staying low. In the CMORL setting, the agent learns a sequence of five MOMDPs. Each MOMDP is trained for 100k steps in FTN and 500k steps otherwise, with two objectives drawn from the objectives mentioned above. |
| Dataset Splits | No | The paper describes training durations ('100k steps in FTN and 500k steps otherwise') but does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined validation sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like GRU, MLP, MuJoCo, and PPO, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The feature extractor first transforms the varying-length preference ω into a fixed-length embedding zω using a GRU [Chung et al., 2014]. Then, a multilayer perceptron (MLP) takes the state, action, and embedding zω as input and outputs feature e = E(s, a, ω; ξ). Then, each MLP head h(e; ψi) takes e as input, and outputs the Q value of objective ki... More details about the network architecture and the optimization process are provided in Appendix C.1. |
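
To make the architecture described in the Experiment Setup row concrete, below is a minimal PyTorch sketch of a preference-conditioned multi-head Q network: a GRU encodes the varying-length preference ω into a fixed-length embedding zω, a shared MLP produces the feature e = E(s, a, ω; ξ), and one head h(e; ψi) per objective outputs that objective's Q value. The class name, layer sizes, single-layer GRU, and activation choices are assumptions for illustration only; the paper's actual architecture and optimization details are in its Appendix C.1.

```python
# Hypothetical sketch of the preference-conditioned multi-head Q network
# described above. Layer sizes and names are assumptions, not the paper's
# exact configuration.
import torch
import torch.nn as nn


class MultiHeadQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, num_objectives,
                 embed_dim=32, hidden_dim=128):
        super().__init__()
        # GRU encodes the varying-length preference vector ω (one weight per
        # encountered objective) into a fixed-length embedding z_ω.
        self.pref_encoder = nn.GRU(input_size=1, hidden_size=embed_dim,
                                   batch_first=True)
        # Shared feature extractor E(s, a, ω; ξ).
        self.feature = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # One MLP head h(e; ψ_i) per objective k_i, each outputting a Q value.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_objectives)]
        )

    def forward(self, state, action, preference):
        # preference: (batch, num_objectives_seen, 1); its length can grow
        # as new objectives arrive, which is why a GRU is used to encode it.
        _, h_n = self.pref_encoder(preference)   # h_n: (1, batch, embed_dim)
        z_omega = h_n.squeeze(0)                 # fixed-length embedding z_ω
        e = self.feature(torch.cat([state, action, z_omega], dim=-1))
        # Stack per-objective Q values: (batch, num_objectives)
        return torch.cat([head(e) for head in self.heads], dim=-1)
```

The GRU is what lets the same network accept preferences of different lengths as objectives accumulate across tasks, while the per-objective heads keep the Q estimates for each objective separate.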