Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal

Authors: Lihe Li, Ruotong Chen, Ziqian Zhang, Zhichao Wu, Yi-Chen Li, Cong Guan, Yang Yu, Lei Yuan

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four CMORL benchmarks showcase that CORE3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORE3 to handle situations with evolving objectives.
Researcher Affiliation | Collaboration | National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Polixir Technologies
Pseudocode | Yes | Algorithm 1: CORE3
Open Source Code | No | The paper does not contain any explicit statement about making its source code publicly available or providing a link to a code repository.
Open Datasets | Yes | The first benchmark is Fruit Tree Navigation (FTN) [Yang et al., 2019]... Another MuJoCo benchmark is Hopper, where the five objectives of the hopper robot include saving energy, moving forward, moving backward, jumping high, and staying low. In the CMORL setting, the agent learns a sequence of five MOMDPs. Each MOMDP is trained for 100k steps in FTN and 500k steps otherwise, with two objectives drawn from the objectives mentioned above. (A hedged sketch of this task-sequence setup appears after the table.)
Dataset Splits | No | The paper describes training durations ("100k steps in FTN and 500k steps otherwise") but does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined validation sets).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software components like GRU, MLP, MuJoCo, and PPO, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The feature extractor first transforms the varying-length preference ω into a fixed-length embedding zω using a GRU [Chung et al., 2014]. Then, a multilayer perceptron (MLP) takes the state, action, and embedding zω as input and outputs feature e = E(s, a, ω; ξ). Each MLP head h(e; ψi) then takes e as input and outputs the Q value of objective ki... More details about the network architecture and the optimization process are provided in Appendix C.1. (A hedged architecture sketch follows the table.)
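The Open Datasets row describes a sequence of five MOMDPs, each pairing two of the listed objectives and trained for 100k steps (FTN) or 500k steps (MuJoCo). The snippet below is a minimal, hypothetical sketch of how such a task sequence could be configured; the objective names, the `build_task_sequence` helper, and the sampling scheme are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the CMORL task-sequence configuration described above.
# Objective names and helper functions are illustrative, not the authors' code.
import itertools
import random

HOPPER_OBJECTIVES = [
    "save_energy", "move_forward", "move_backward", "jump_high", "stay_low",
]

def build_task_sequence(num_tasks=5, steps_per_task=500_000, seed=0):
    """Draw two objectives per MOMDP to form a sequence of five tasks."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(HOPPER_OBJECTIVES, 2))
    rng.shuffle(pairs)
    return [
        {"task_id": i, "objectives": pairs[i], "train_steps": steps_per_task}
        for i in range(num_tasks)
    ]

if __name__ == "__main__":
    for task in build_task_sequence():
        print(task)
```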
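The Experiment Setup row describes a preference-conditioned, multi-head Q-network: a GRU encodes the varying-length preference ω into an embedding zω, an MLP feature extractor E(s, a, ω; ξ) maps the state, action, and zω to a feature e, and one MLP head h(e; ψi) per objective outputs that objective's Q value. The PyTorch sketch below illustrates this structure under assumed layer sizes and module names; it is not the authors' implementation (Appendix C.1 of the paper gives the actual details).

```python
# Hedged sketch of the preference-conditioned Q-network described above.
# Layer sizes, dimensions, and module names are assumptions.
import torch
import torch.nn as nn

class PreferenceConditionedQNet(nn.Module):
    def __init__(self, state_dim, action_dim, num_objectives,
                 pref_embed_dim=32, feature_dim=128, hidden_dim=128):
        super().__init__()
        # GRU over the preference vector treated as a length-k sequence of scalars,
        # so preferences over different numbers of objectives share one encoder.
        self.pref_gru = nn.GRU(input_size=1, hidden_size=pref_embed_dim,
                               batch_first=True)
        # Feature extractor E(s, a, ω; ξ): MLP over [state, action, zω].
        self.extractor = nn.Sequential(
            nn.Linear(state_dim + action_dim + pref_embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feature_dim),
            nn.ReLU(),
        )
        # One head h(e; ψi) per objective, each returning a scalar Q value.
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, 1) for _ in range(num_objectives)]
        )

    def forward(self, state, action, preference):
        # preference: (batch, k), with k varying across tasks; encode as a sequence.
        _, z = self.pref_gru(preference.unsqueeze(-1))  # z: (1, batch, embed_dim)
        z = z.squeeze(0)
        e = self.extractor(torch.cat([state, action, z], dim=-1))
        # Stack per-objective Q values: (batch, num_objectives).
        return torch.cat([head(e) for head in self.heads], dim=-1)

# Example usage with made-up dimensions.
if __name__ == "__main__":
    net = PreferenceConditionedQNet(state_dim=11, action_dim=3, num_objectives=2)
    s = torch.randn(4, 11)
    a = torch.randn(4, 3)
    w = torch.softmax(torch.randn(4, 2), dim=-1)
    print(net(s, a, w).shape)  # torch.Size([4, 2])
```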