Self-Composing Policies for Scalable Continual Reinforcement Learning

Authors: Mikel Malagon, Josu Ceberio, Jose A. Lozano

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments conducted in benchmark continuous control and visual problems reveal that the proposed approach achieves greater knowledge transfer and performance than alternative methods."
Researcher Affiliation | Academia | "(1) Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia San Sebastian, Spain; (2) Basque Center for Applied Mathematics (BCAM), Bilbao, Spain."
Pseudocode | No | "The paper includes a diagram (Figure 2) illustrating the architecture but does not provide any pseudocode or algorithm blocks." (A hedged sketch of the general idea follows the table.)
Open Source Code | Yes | "Code available at https://github.com/mikelma/componet."
Open Datasets | Yes | "The first sequence includes 20 robotic arm manipulation tasks (a sequence of 10 different tasks repeated twice) from Meta-World (Yu et al., 2020b)... The other two task sequences are selected from the Arcade Learning Environment (Machado et al., 2018). In this case, actions are discrete, and states consist of RGB images of 210 × 160 pixels. Thus, we employ a CNN encoder as described in Section 4.1 to encode images into a lower dimensional space (see Appendix E.1)." (An illustrative encoder sketch follows the table.)
Dataset Splits | No | "The paper describes success rate and episodic return as evaluation metrics for task performance over time and for comparing methods, but it does not specify a separate 'validation' dataset split for hyperparameter tuning or early stopping criteria." (An evaluation sketch follows the table.)
Hardware Specification | Yes | "The experimentation has been conducted on two cluster nodes, one containing eight RTX3090 GPUs, an Intel Xeon Silver 4210R CPU, and 345GB of RAM, while the second comprises eight Nvidia A5000 GPUs with an AMD EPYC 7252 CPU and 377GB of RAM."
Software Dependencies | No | "The paper mentions basing implementations on Huang et al. (2022) and using SAC and PPO algorithms, but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow)." (A version-logging sketch follows the table.)
Experiment Setup | Yes | "In the Meta-World sequence, where SAC is used to optimize all methods, the complete list of hyperparameters is provided in Table E.1. Table E.2 shows the hyperparameters shared by every method in the case of the Space Invaders and Freeway sequences under the PPO algorithm." (A config sketch follows the table.)
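
Since the paper conveys its architecture only through a diagram (Figure 2), the following is a minimal sketch of the general idea behind self-composing policies as the abstract describes it: a trainable module for the new task that can reuse the outputs of frozen modules learned on previous tasks. All class names, dimensions, and the soft-gating mechanism here are assumptions for illustration; this is not the paper's CompoNet architecture.

```python
# Hypothetical sketch of a self-composing policy (NOT the paper's exact
# architecture): a trainable head for the current task plus a learned,
# state-conditioned weighting over the outputs of frozen previous modules.
import torch
import torch.nn as nn


class ComposedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, prev_modules: list[nn.Module]):
        super().__init__()
        # Freeze all previously learned task modules so only the new parts train.
        self.prev_modules = nn.ModuleList(prev_modules)
        for m in self.prev_modules:
            for p in m.parameters():
                p.requires_grad_(False)
        # Fresh head for the current task.
        self.head = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )
        # One gating weight per previous module, plus one for the new head.
        self.gate = nn.Linear(obs_dim, len(prev_modules) + 1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Stack candidate outputs: (batch, num_modules + 1, act_dim).
        outputs = [m(obs) for m in self.prev_modules] + [self.head(obs)]
        stacked = torch.stack(outputs, dim=1)
        # Soft weighting over candidates, conditioned on the current state.
        weights = torch.softmax(self.gate(obs), dim=-1).unsqueeze(-1)
        return (weights * stacked).sum(dim=1)
```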
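
The Open Datasets passage notes that 210 × 160 RGB frames are encoded into a lower-dimensional space, with details deferred to the paper's Appendix E.1 (not reproduced here). As a hedged illustration only, a standard Nature-CNN-style Atari encoder over frames preprocessed to 84 × 84 could look like the following; the preprocessing, channel counts, and embedding size are assumptions, not the paper's settings.

```python
# Illustrative Nature-CNN-style image encoder (assumed, not the paper's
# Appendix E.1 encoder): maps a stack of preprocessed frames to a
# low-dimensional embedding suitable for a policy head.
import torch
import torch.nn as nn


class CNNEncoder(nn.Module):
    def __init__(self, in_channels: int = 4, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embed_dim), nn.ReLU(),  # 7x7 feature map at 84x84 input
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, in_channels, 84, 84), pixel values scaled to [0, 1].
        return self.net(frames)
```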
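
Since evaluation rests on success rate and episodic return rather than a validation split, here is a minimal sketch of measuring average episodic return, assuming a Gymnasium-style environment; `agent.act` is a hypothetical method name, not an API from the paper's code.

```python
# Hedged sketch: average episodic return over a fixed number of episodes.
# Assumes the Gymnasium step/reset API; `agent.act` is hypothetical.
import gymnasium as gym


def evaluate(agent, env: gym.Env, num_episodes: int = 10) -> float:
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(agent.act(obs))
            episode_return += reward
            done = terminated or truncated
        returns.append(episode_return)
    # Mean episodic return across evaluation episodes.
    return sum(returns) / len(returns)
```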
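
Because no library versions are pinned in the paper, anyone reproducing the experiments would have to record their own environment. A generic sketch of doing so at run time (not tied to the paper's codebase; the package names are the usual ones for a CleanRL-style setup and are assumptions here):

```python
# Generic sketch: log installed versions of key dependencies for
# reproducibility, since the paper itself does not pin any.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "gymnasium", "numpy"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```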
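
The actual hyperparameter values live in the paper's Tables E.1 and E.2 and are not reproduced here. The sketch below only illustrates one way to organize such a shared PPO configuration; every value is a common default, not a number taken from the paper.

```python
# Illustrative container for PPO hyperparameters shared across methods.
# Values are common defaults, NOT the paper's Table E.2 settings.
from dataclasses import dataclass


@dataclass
class PPOConfig:
    learning_rate: float = 2.5e-4
    num_steps: int = 128        # rollout length per environment
    gamma: float = 0.99         # discount factor
    gae_lambda: float = 0.95    # GAE smoothing
    clip_coef: float = 0.1      # PPO clipping range
    ent_coef: float = 0.01     # entropy bonus weight
    update_epochs: int = 4      # optimization epochs per rollout


config = PPOConfig()
print(config)
```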