Self-Composing Policies for Scalable Continual Reinforcement Learning
Authors: Mikel Malagon, Josu Ceberio, Jose A. Lozano
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted in benchmark continuous control and visual problems reveal that the proposed approach achieves greater knowledge transfer and performance than alternative methods. |
| Researcher Affiliation | Academia | (1) Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia San Sebastian, Spain; (2) Basque Center for Applied Mathematics (BCAM), Bilbao, Spain. |
| Pseudocode | No | The paper includes a diagram (Figure 2) illustrating the architecture but does not provide any pseudocode or algorithm blocks. (A hedged sketch of one possible module design follows the table.) |
| Open Source Code | Yes | Code available at https://github.com/mikelma/componet. |
| Open Datasets | Yes | The first sequence includes 20 robotic arm manipulation tasks (a sequence of 10 different tasks repeated twice) from Meta-World (Yu et al., 2020b)... The other two task sequences are selected from the Arcade Learning Environment (Machado et al., 2018). In this case, actions are discrete, and states consist of RGB images of 210 × 160 pixels. Thus, we employ a CNN encoder as described in Section 4.1 to encode images into a lower dimensional space (see Appendix E.1). (An illustrative encoder sketch follows the table.) |
| Dataset Splits | No | The paper describes success rate and episodic return as evaluation metrics for task performance over time and for comparing methods, but it does not specify a separate 'validation' dataset split for hyperparameter tuning or early stopping criteria. |
| Hardware Specification | Yes | The experimentation has been conducted on two cluster nodes, one containing eight RTX3090 GPUs, an Intel Xeon Silver 4210R CPU, and 345GB of RAM, while the second comprises eight Nvidia A5000 GPUs with an AMD EPYC 7252 CPU and 377GB of RAM. |
| Software Dependencies | No | The paper mentions basing implementations on Huang et al. (2022) and using SAC and PPO algorithms, but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow). (A version-logging snippet follows the table.) |
| Experiment Setup | Yes | In the Meta-World sequence, where SAC is used to optimize all methods, the complete list of hyperparameters is provided in Table E.1. Table E.2 shows the hyperparameters shared by every method in the case of the Space Invaders and Freeway sequences under the PPO algorithm. (An illustrative PPO configuration sketch follows the table.) |
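
Since the paper provides no pseudocode, the following is a minimal sketch of how a self-composing policy module could be organized, assuming, per the paper's title and its Figure 2 architecture diagram, that each new task adds a trainable module that can attend over the outputs of modules trained on earlier tasks. The names (`SelfComposingModule`, `prev_outputs`) and the attention-based composition are hypothetical illustrations, not taken from the authors' code at https://github.com/mikelma/componet.

```python
import torch
import torch.nn as nn


class SelfComposingModule(nn.Module):
    """Hypothetical sketch: a per-task policy head that can reuse the
    action outputs of modules trained on earlier tasks."""

    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.internal_head = nn.Linear(feat_dim, num_actions)  # fresh policy for the new task
        self.query = nn.Linear(feat_dim, num_actions)          # scores each previous module

    def forward(self, feats: torch.Tensor, prev_outputs: list[torch.Tensor]) -> torch.Tensor:
        # feats: (batch, feat_dim); each previous output: (batch, num_actions)
        own = self.internal_head(feats)
        if not prev_outputs:
            return own  # first task: nothing to compose with
        prev = torch.stack(prev_outputs, dim=1)                # (batch, n_prev, num_actions)
        # Attend over previous modules' outputs, conditioned on the state.
        q = self.query(feats).unsqueeze(1)                     # (batch, 1, num_actions)
        scores = (q * prev).sum(-1) / prev.size(-1) ** 0.5     # (batch, n_prev)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (batch, n_prev, 1)
        composed = (weights * prev).sum(dim=1)                 # (batch, num_actions)
        return own + composed                                  # combine old knowledge with the new head
```

In such a setup, modules from earlier tasks would typically be frozen (`requires_grad_(False)`) so that only the newest module trains, which is consistent with the knowledge-transfer and scalability claims quoted above.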
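
The Open Datasets row notes that Atari states are 210 × 160 RGB images encoded by a CNN into a lower-dimensional space (Section 4.1 and Appendix E.1 of the paper, which this page does not reproduce). Below is a minimal sketch assuming a standard Nature-DQN-style convolutional stack applied to the full-resolution frames; the layer sizes are illustrative, not those of Appendix E.1.

```python
import torch
import torch.nn as nn


class AtariEncoder(nn.Module):
    """Illustrative CNN encoder: 210x160 RGB frame -> latent vector."""

    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy forward pass instead of
        # hand-computing the post-convolution spatial dimensions.
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 3, 210, 160)).shape[1]
        self.fc = nn.Linear(flat, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(self.conv(x / 255.0)))  # scale pixels to [0, 1]


latent = AtariEncoder()(torch.zeros(2, 3, 210, 160))  # -> shape (2, 512)
```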
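
The Software Dependencies row flags that no library versions are pinned. A small, general-purpose snippet like the one below (not from the authors' repository) is one way to record the exact versions used in a run; the package list is an assumption based on the paper's PyTorch/CleanRL-style setup.

```python
import platform
from importlib.metadata import version, PackageNotFoundError

# Packages assumed relevant to this codebase; adjust to the actual environment.
PACKAGES = ["torch", "gymnasium", "numpy"]


def log_environment(path: str = "environment.txt") -> None:
    """Write the Python version and installed package versions to a file."""
    lines = [f"python=={platform.python_version()}"]
    for pkg in PACKAGES:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            lines.append(f"{pkg}: not installed")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


log_environment()
```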
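
The Experiment Setup row points to Tables E.1 and E.2 for the SAC and PPO hyperparameters, which this page does not reproduce. As a sketch of how such a shared-hyperparameter setup is typically structured, the dataclass below uses the Atari PPO defaults from CleanRL (Huang et al., 2022), on which the paper says its implementations are based; the specific values are illustrative and should be checked against Table E.2.

```python
from dataclasses import dataclass


@dataclass
class PPOConfig:
    """Illustrative PPO hyperparameters (CleanRL Atari defaults, not Table E.2)."""

    learning_rate: float = 2.5e-4
    num_steps: int = 128      # rollout length per environment
    gamma: float = 0.99       # discount factor
    gae_lambda: float = 0.95  # GAE smoothing
    clip_coef: float = 0.1    # PPO clipping range
    ent_coef: float = 0.01    # entropy bonus weight
    vf_coef: float = 0.5      # value loss weight
    max_grad_norm: float = 0.5


cfg = PPOConfig()  # one config shared across the Space Invaders and Freeway sequences
```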