Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal

Authors: Lihe Li, Ruotong Chen, Ziqian Zhang, Zhichao Wu, Yi-Chen Li, Cong Guan, Yang Yu, Lei Yuan

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four CMORL benchmarks showcase that CORE3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORE3 to handle situations with evolving objectives.
Researcher Affiliation | Collaboration | National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Polixir Technologies
Pseudocode | Yes | Algorithm 1: CORE3
Open Source Code | No | The paper does not contain any explicit statement about making its source code publicly available or providing a link to a code repository.
Open Datasets | Yes | The first benchmark is Fruit Tree Navigation (FTN) [Yang et al., 2019]... Another MuJoCo benchmark is Hopper, where the five objectives of the hopper robot include saving energy, moving forward, moving backward, jumping high, and staying low. In the CMORL setting, the agent learns a sequence of five MOMDPs. Each MOMDP is trained for 100k steps in FTN and 500k steps otherwise, with two objectives drawn from the objectives mentioned above. (A hedged sketch of this task-sequence setup appears after the table.)
Dataset Splits | No | The paper describes training durations ("100k steps in FTN and 500k steps otherwise") but does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined validation sets).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software components like GRU, MLP, MuJoCo, and PPO, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The feature extractor first transforms the varying-length preference ω into a fixed-length embedding zω using a GRU [Chung et al., 2014]. Then, a multilayer perceptron (MLP) takes the state, action, and embedding zω as input and outputs feature e = E(s, a, ω; ξ). Each MLP head h(e; ψi) then takes e as input and outputs the Q value of objective ki... More details about the network architecture and the optimization process are provided in Appendix C.1. (A hedged architecture sketch follows the table.)
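The Open Datasets row describes a sequence of five MOMDPs, each pairing two of the listed objectives and trained for 100k steps (FTN) or 500k steps (MuJoCo). The snippet below is a minimal, hypothetical sketch of how such a task sequence could be configured; the objective names, the `build_task_sequence` helper, and the sampling scheme are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the CMORL task-sequence configuration described above.
# Objective names and helper functions are illustrative, not the authors' code.
import itertools
import random

HOPPER_OBJECTIVES = [
    "save_energy", "move_forward", "move_backward", "jump_high", "stay_low",
]

def build_task_sequence(num_tasks=5, steps_per_task=500_000, seed=0):
    """Draw two objectives per MOMDP to form a sequence of five tasks."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(HOPPER_OBJECTIVES, 2))
    rng.shuffle(pairs)
    return [
        {"task_id": i, "objectives": pairs[i], "train_steps": steps_per_task}
        for i in range(num_tasks)
    ]

if __name__ == "__main__":
    for task in build_task_sequence():
        print(task)
```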
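The Experiment Setup row describes a preference-conditioned, multi-head Q-network: a GRU encodes the varying-length preference ω into an embedding zω, an MLP feature extractor E(s, a, ω; ξ) maps the state, action, and zω to a feature e, and one MLP head h(e; ψi) per objective outputs that objective's Q value. The PyTorch sketch below illustrates this structure under assumed layer sizes and module names; it is not the authors' implementation (Appendix C.1 of the paper gives the actual details).

```python
# Hedged sketch of the preference-conditioned Q-network described above.
# Layer sizes, dimensions, and module names are assumptions.
import torch
import torch.nn as nn

class PreferenceConditionedQNet(nn.Module):
    def __init__(self, state_dim, action_dim, num_objectives,
                 pref_embed_dim=32, feature_dim=128, hidden_dim=128):
        super().__init__()
        # GRU over the preference vector treated as a length-k sequence of scalars,
        # so preferences over different numbers of objectives share one encoder.
        self.pref_gru = nn.GRU(input_size=1, hidden_size=pref_embed_dim,
                               batch_first=True)
        # Feature extractor E(s, a, ω; ξ): MLP over [state, action, zω].
        self.extractor = nn.Sequential(
            nn.Linear(state_dim + action_dim + pref_embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feature_dim),
            nn.ReLU(),
        )
        # One head h(e; ψi) per objective, each returning a scalar Q value.
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, 1) for _ in range(num_objectives)]
        )

    def forward(self, state, action, preference):
        # preference: (batch, k), with k varying across tasks; encode as a sequence.
        _, z = self.pref_gru(preference.unsqueeze(-1))  # z: (1, batch, embed_dim)
        z = z.squeeze(0)
        e = self.extractor(torch.cat([state, action, z], dim=-1))
        # Stack per-objective Q values: (batch, num_objectives).
        return torch.cat([head(e) for head in self.heads], dim=-1)

# Example usage with made-up dimensions.
if __name__ == "__main__":
    net = PreferenceConditionedQNet(state_dim=11, action_dim=3, num_objectives=2)
    s = torch.randn(4, 11)
    a = torch.randn(4, 3)
    w = torch.softmax(torch.randn(4, 2), dim=-1)
    print(net(s, a, w).shape)  # torch.Size([4, 2])
```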