Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Continual Knowledge Adaptation for Reinforcement Learning

Authors: Jinwu Hu, ZiHao Lian, Zhiquan Wen, Chenghao Li, Guohao Chen, Xutao Wen, Bin Xiao, Mingkui Tan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on three benchmarks demonstrate that the proposed CKA-RL outperforms state-of-the-art methods, achieving an improvement of 4.20% in overall performance and 8.02% in forward transfer.
Researcher Affiliation | Academia | 1 South China University of Technology, 2 Pazhou Laboratory, 3 Chongqing University of Posts and Telecommunications, 4 Key Laboratory of Big Data and Intelligent Robot, Ministry of Education. Equal contribution. Email: EMAIL, EMAIL, EMAIL. Corresponding author. Email: EMAIL, EMAIL
Pseudocode | Yes | The pseudo-code of CKA-RL is summarized in Algorithm 1.
Open Source Code | Yes | The source code is available at https://github.com/Fhujinwu/CKA-RL.
Open Datasets | Yes | We follow the experimental settings established in prior work [32] and compare CKA-RL with SOTA CRL methods across three distinct dynamic task sequences, including 1) Meta-World [57], 2) Freeway [31], and 3) Space Invaders [31].
Dataset Splits | No | The paper defines task sequences and game modes (e.g., a 20-task sequence of 10 distinct manipulation tasks repeated twice for Meta-World; ten strategically selected game modes for Space Invaders; eight distinct game modes for Freeway). However, it does not provide explicit numerical train/test/validation splits for a static dataset, as is typical in supervised learning. Instead, the experimental setup involves training agents within these environments for a specified number of steps, and evaluation metrics are reported on the tasks themselves.
Hardware Specification | No | The paper discusses the use of a CNN encoder for high-dimensional Atari inputs and mentions memory usage analysis, but it does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies | No | We follow the prior work [32], employing SAC [16] for Meta-World and PPO [42] for Freeway and Space Invaders. For high-dimensional Atari inputs (210 × 160 RGB), a CNN encoder maps images to compact latent features. All tasks are trained for 1M steps. We use Adam (momentum 0.9, second moment 0.999)...
Experiment Setup | Yes | Implementation Details. We follow the prior work [32], employing SAC [16] for Meta-World and PPO [42] for Freeway and Space Invaders. For high-dimensional Atari inputs (210 × 160 RGB), a CNN encoder maps images to compact latent features. All tasks are trained for 1M steps. We use Adam (momentum 0.9, second moment 0.999), with batch sizes 1024/128 and learning rates 2.5 × 10^-4/1 × 10^-3 for PPO/SAC. The discount factor is γ = 0.99. For SAC, the action standard deviation is constrained to [e^-20, e^2], with target smoothing coefficient 5 × 10^-3, auto-tuned entropy coefficient 0.2, and action noise clipped to 0.5. Learning starts after 5 × 10^3 steps using 10^4 random actions for exploration. Policy and target networks are updated every 2 and 1 steps, respectively, using 3-layer MLPs with 256 hidden units. For PPO, we apply GAE with λ = 0.95 across 8 parallel environments, gradient clipping at 0.5, PPO clip of 0.2, entropy coefficient 0.01, and 128 rollout steps. The agent uses a 2-layer MLP with 512 units, and advantage normalization is employed.
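
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only and is not taken from the CKA-RL repository: the class names, field names, and the helpers `clamp_log_std` and `gae_advantages` are hypothetical, the "x/y for PPO/SAC" pairs are read in that order, and the GAE helper assumes a single rollout with no episode boundaries. Only the numeric values come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedConfig:
    # Values shared by both algorithms, as quoted in the setup row.
    total_steps: int = 1_000_000  # "1M steps" per task
    gamma: float = 0.99           # discount factor
    adam_beta1: float = 0.9       # Adam "momentum"
    adam_beta2: float = 0.999     # Adam "second moment"


@dataclass(frozen=True)
class SACConfig(SharedConfig):
    # Used for Meta-World. Field names are hypothetical.
    batch_size: int = 128
    learning_rate: float = 1e-3
    log_std_min: float = -20.0    # std constrained to [e^-20, e^2]
    log_std_max: float = 2.0
    tau: float = 5e-3             # target smoothing coefficient
    initial_alpha: float = 0.2    # entropy coefficient (auto-tuned)
    noise_clip: float = 0.5
    learning_starts: int = 5_000
    random_action_steps: int = 10_000
    policy_update_every: int = 2
    target_update_every: int = 1
    mlp_layers: int = 3
    hidden_units: int = 256


@dataclass(frozen=True)
class PPOConfig(SharedConfig):
    # Used for Freeway and Space Invaders. Field names are hypothetical.
    batch_size: int = 1024
    learning_rate: float = 2.5e-4
    gae_lambda: float = 0.95
    num_envs: int = 8
    rollout_steps: int = 128
    grad_clip: float = 0.5
    clip_coef: float = 0.2
    entropy_coef: float = 0.01
    mlp_layers: int = 2
    hidden_units: int = 512
    normalize_advantages: bool = True


def clamp_log_std(log_std: float, cfg: SACConfig) -> float:
    """Clamp a log standard deviation so that std stays in [e^-20, e^2]."""
    return max(cfg.log_std_min, min(cfg.log_std_max, log_std))


def gae_advantages(rewards, values, last_value, cfg: PPOConfig):
    """Generalized Advantage Estimation over one rollout.

    Assumes no episode terminates inside the rollout (no done flags).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        # One-step TD error, then the exponentially weighted backward sum.
        delta = rewards[t] + cfg.gamma * next_value - values[t]
        running = delta + cfg.gamma * cfg.gae_lambda * running
        advantages[t] = running
    return advantages
```

As a consistency check on the quoted numbers, 8 parallel environments times 128 rollout steps gives 1024 transitions per update, which equals the stated PPO batch size. Whether the authors' code treats the batch that way is not stated in the quoted text.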