Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning
Authors: Yuexiang Zhai, Christina Baek, Zhengyuan Zhou, Jiantao Jiao, Yi Ma
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and some popular deep RL algorithms. |
| Researcher Affiliation | Academia | Yuexiang Zhai, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Berkeley, CA 94720, USA; Christina Baek, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Berkeley, CA 94720, USA; Zhengyuan Zhou, New York University, Stern School of Business, New York, NY 10012, USA; Jiantao Jiao, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences and Department of Statistics, Berkeley, CA 94720, USA; Yi Ma, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Berkeley, CA 94720, USA |
| Pseudocode | No | The paper describes mathematical formulations and theoretical propositions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | See https://github.com/kebaek/minigrid for the code to run all presented experiments. |
| Open Datasets | Yes | We experimentally verify in several OpenAI Gym MiniGrid environments that the agent is able to learn a successful trajectory more quickly in OWSP and OWMP intermediate reward settings than the sparse reward setting. |
| Dataset Splits | No | The paper conducts experiments in reinforcement learning environments (MiniGrid) using training episodes and trials (e.g., 'For 0.8 ε-greedy Q-learning, we train 100 independent models and evaluate each model once, for a total of 100 trials.'). It does not specify traditional training/test/validation splits for a fixed dataset in the sense of supervised learning. |
| Hardware Specification | Yes | Each training session for deep RL algorithms was run using a GeForce RTX 2080 GPU. |
| Software Dependencies | No | The paper mentions deep RL algorithms like DQN, A2C, and PPO, and describes a 'Network architecture' (Table 7) using components like Conv2D and ReLU, but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed to reproduce the experiments. |
| Experiment Setup | Yes | Shared parameters are listed in Table 8, and parameters specific to each algorithm are provided in Table 9. For DQN, like asynchronous Q-learning, we use a 0.8 ε-greedy exploration strategy. (Table 8 shows Learning Rate 0.001. Table 9 shows Discount Factor (γ) 0.90, batch size 128, entropy coeff. 0.01, etc.) |
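To make the quoted setup concrete, the sketch below runs tabular Q-learning with 0.8 ε-greedy exploration and the learning rate (0.001) and discount factor (0.90) quoted from Tables 8 and 9. The toy chain environment, the goal state, and all helper names (`step`, `train`) are illustrative assumptions standing in for MiniGrid, not the paper's actual code.

```python
import random

# Hypothetical stand-in for a sparse-reward MiniGrid task: a 1-D chain where
# reward 1 is given only on reaching the goal state. Hyperparameters follow
# the quoted Tables 8-9; everything else is an illustrative assumption.
ALPHA, GAMMA, EPSILON = 0.001, 0.90, 0.8
N_STATES, GOAL = 6, 5          # states 0..5, sparse reward at the goal only
ACTIONS = (-1, +1)             # move left / move right

def step(state, action):
    """One environment transition; episode ends at the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: state x action
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # 0.8 epsilon-greedy: explore a random action with prob. EPSILON
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
            nxt, r, done = step(s, ACTIONS[a])
            # standard Q-learning update with the quoted alpha and gamma
            q[s][a] += ALPHA * (r + GAMMA * max(q[nxt]) - q[s][a])
            s = nxt
    return q

q = train()
# Greedy policy extracted from the learned Q-table (1 = move right)
policy = [max(range(2), key=lambda i: q[s][i]) for s in range(N_STATES)]
```

With the sparse reward, the value signal must propagate back from the goal one discounted step at a time, which is the slow regime the paper contrasts against intermediate (OWSP/OWMP) rewards.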