Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
From End-to-end to Step-by-step: Learning to Abstract via Abductive Reinforcement Learning
Authors: Zilong Wang, Jiongda Wang, Xiaoyong Chen, Meng Wang, Ming Ma, ZhiPeng Wang, Zhenyu Zhou, Tianming Yang, Wang-Zhou Dai
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that A2RL can mitigate the delayed reward problem and improve the generalization capability compared to traditional end-to-end RL methods. We conducted experiments on two sets of benchmark tasks, and the results showed that A2RL effectively learned the abstract structure, improving performance and easing the challenge of learning from delayed feedback. |
| Researcher Affiliation | Academia | The affiliations listed are: National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China; Institute of Neuroscience, State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, 200031, China; University of Chinese Academy of Sciences, School of Future Technology, Beijing, 100049, China. All these are academic or public research institutions, and the email domains (`.edu.cn`, `.ac.cn`) confirm an academic setting. |
| Pseudocode | No | The paper describes the A2RL framework textually and with conceptual diagrams (e.g., Figure 3), but it does not include a structured pseudocode block or formal algorithm listing. |
| Open Source Code | Yes | The code is available at https://github.com/sporeking/A2RL. |
| Open Datasets | Yes | Experiments are conducted in two types of benchmark environments with delayed reward feedback: (1) Minigrid [Chevalier-Boisvert et al., 2023]: Minigrid, implemented in Gymnasium [Towers et al., 2024], is a suite of easily configurable grid-world environments specifically designed for RL research. (2) Taxi [Dietterich, 2000]: In Taxi, the taxi must pick up the passenger, drive to the destination, and drop them off to end the episode. |
| Dataset Splits | Yes | We replicated the setup in Section 5.2 but trained agents on procedurally generated random maps throughout the entire curriculum, and subsequently evaluated and finetuned on 50 unseen maps of comparable difficulty after training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for conducting the experiments. |
| Software Dependencies | No | The paper mentions software such as Minigrid (implemented in Gymnasium), PPO2, and D3QN, but it does not specify version numbers for these dependencies, which are needed for reproducibility. |
| Experiment Setup | No | The paper states total training steps (e.g., 300k, 600k, 900k) and describes the use of PPO2 and D3QN algorithms, but it does not provide specific hyperparameter values such as learning rates, batch sizes, or optimizer settings, which are crucial for reproducing the experiments. |