Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal
Authors: Lihe Li, Ruotong Chen, Ziqian Zhang, Zhichao Wu, Yi-Chen Li, Cong Guan, Yang Yu, Lei Yuan
IJCAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on four CMORL benchmarks showcase that CORE3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORE3 to handle situations with evolving objectives. |
| Researcher Affiliation | Collaboration | ¹National Key Laboratory for Novel Software Technology, Nanjing University; ²School of Artificial Intelligence, Nanjing University; ³Polixir Technologies |
| Pseudocode | Yes | Algorithm 1 CORE3 |
| Open Source Code | No | The paper does not contain any explicit statement about making its source code publicly available or providing a link to a code repository. |
| Open Datasets | Yes | The first benchmark is Fruit Tree Navigation (FTN) [Yang et al., 2019]... Another MuJoCo benchmark is Hopper, where the five objectives of the hopper robot include saving energy, moving forward, moving backward, jumping high, and staying low. In the CMORL setting, the agent learns a sequence of five MOMDPs. Each MOMDP is trained for 100k steps in FTN and 500k steps otherwise, with two objectives drawn from the objectives mentioned above. |
| Dataset Splits | No | The paper describes training durations ('100k steps in FTN and 500k steps otherwise') but does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined validation sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like GRU, MLP, MuJoCo, and PPO, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The feature extractor first transforms the varying-length preference ω into a fixed-length embedding z_ω using a GRU [Chung et al., 2014]. Then, a multilayer perceptron (MLP) takes the state, action, and embedding z_ω as input and outputs feature e = E(s, a, ω; ξ). Then, each MLP head h(e; ψ_i) takes e as input, and outputs the Q value of objective k_i... More details about the network architecture and the optimization process are provided in Appendix C.1. |
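The Open Datasets row above implies a continual task sequence: five MOMDPs, each pairing two objectives and trained for a fixed step budget (100k for FTN, 500k otherwise). The snippet below is a minimal, hypothetical sketch of how such a sequence could be enumerated; `make_task_sequence`, `STEPS_PER_TASK`, the objective names, and the seeding are our assumptions, not the authors' code.

```python
# Hypothetical sketch of the CMORL task sequence described in the quoted
# setup: five MOMDPs, each drawing two objectives from a fixed pool.
import random

# Objective names for Hopper as quoted above; identifiers are our invention.
HOPPER_OBJECTIVES = ["save_energy", "move_forward", "move_backward",
                     "jump_high", "stay_low"]
# Per-task training budgets quoted from the paper.
STEPS_PER_TASK = {"FTN": 100_000, "Hopper": 500_000}

def make_task_sequence(objectives, n_tasks=5, seed=0):
    """Draw two objectives (without replacement) for each of n_tasks MOMDPs."""
    rng = random.Random(seed)
    return [tuple(rng.sample(objectives, 2)) for _ in range(n_tasks)]

for i, objs in enumerate(make_task_sequence(HOPPER_OBJECTIVES)):
    print(f"MOMDP {i}: objectives={objs}, budget={STEPS_PER_TASK['Hopper']}")
```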
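The Experiment Setup row describes a preference-conditioned Q-network: a GRU encodes the varying-length preference ω into a fixed embedding z_ω, an MLP extracts a feature e = E(s, a, ω; ξ), and per-objective heads h(e; ψ_i) emit Q values. The PyTorch sketch below illustrates that structure under stated assumptions; it is not the authors' implementation. Class names, layer widths, and the reading of ω as a sequence of scalar weights (one per objective, so its length can grow as objectives accrue) are hypothetical; the paper's actual details are in its Appendix C.1.

```python
# Minimal sketch (not the authors' code) of the preference-conditioned
# multi-head Q-network quoted above, assuming PyTorch.
import torch
import torch.nn as nn

class PreferenceQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, pref_dim=1,
                 embed_dim=32, feature_dim=128, n_objectives=5):
        super().__init__()
        # GRU encodes the varying-length preference sequence into a
        # fixed-length embedding z_omega.
        self.gru = nn.GRU(input_size=pref_dim, hidden_size=embed_dim,
                          batch_first=True)
        # Feature extractor E(s, a, omega; xi): an MLP over the state,
        # action, and preference embedding.
        self.extractor = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, feature_dim),
            nn.ReLU(),
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(),
        )
        # One MLP head h(e; psi_i) per objective k_i, each emitting a
        # scalar Q value (head depth/width are assumptions).
        self.heads = nn.ModuleList(
            nn.Linear(feature_dim, 1) for _ in range(n_objectives)
        )

    def forward(self, state, action, preference):
        # preference: (batch, seq_len, pref_dim); seq_len may vary per task.
        _, h_n = self.gru(preference)      # h_n: (1, batch, embed_dim)
        z_omega = h_n.squeeze(0)           # fixed-length embedding z_omega
        e = self.extractor(torch.cat([state, action, z_omega], dim=-1))
        # Stack per-objective Q values: (batch, n_objectives)
        return torch.cat([head(e) for head in self.heads], dim=-1)
```

A forward pass with a batch of states, actions, and a length-k preference sequence returns a `(batch, n_objectives)` tensor of Q values, one column per objective head, matching the per-objective outputs the quoted description attributes to h(e; ψ_i).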