Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning
Authors: Yuexiang Zhai, Christina Baek, Zhengyuan Zhou, Jiantao Jiao, Yi Ma
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and some popular deep RL algorithms. |
| Researcher Affiliation | Academia | Yuexiang Zhai, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Berkeley, CA 94720, USA; Christina Baek, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Berkeley, CA 94720, USA; Zhengyuan Zhou, New York University, Stern School of Business, New York, NY 10012, USA; Jiantao Jiao, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences and Department of Statistics, Berkeley, CA 94720, USA; Yi Ma, University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Berkeley, CA 94720, USA |
| Pseudocode | No | The paper describes mathematical formulations and theoretical propositions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | See https://github.com/kebaek/minigrid for the code to run all presented experiments. |
| Open Datasets | Yes | We experimentally verify in several OpenAI Gym MiniGrid environments that the agent is able to learn a successful trajectory more quickly in OWSP and OWMP intermediate reward settings than the sparse reward setting. |
| Dataset Splits | No | The paper conducts experiments in reinforcement learning environments (MiniGrid) using training episodes and trials (e.g., 'For 0.8 ε-greedy Q-learning, we train 100 independent models and evaluate each model once, for a total of 100 trials.'). It does not specify traditional training/test/validation splits for a fixed dataset in the sense of supervised learning. |
| Hardware Specification | Yes | Each training session for deep RL algorithms was run using a GeForce RTX 2080 GPU. |
| Software Dependencies | No | The paper mentions deep RL algorithms like DQN, A2C, and PPO, and describes a 'Network architecture' (Table 7) using components like Conv2D and ReLU, but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed to reproduce the experiments. |
| Experiment Setup | Yes | Shared parameters are listed in Table 8, and parameters specific to each algorithm are provided in Table 9. For DQN, like asynchronous Q-learning, we use a 0.8 ε-greedy exploration strategy. (Table 8 shows Learning Rate 0.001. Table 9 shows Discount Factor (γ) 0.90, batch size 128, entropy coeff. 0.01, etc.) |
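To make the quoted setup concrete, the sketch below runs tabular Q-learning with 0.8 ε-greedy exploration and the learning rate (0.001) and discount factor (0.90) quoted from Tables 8 and 9. The toy chain environment, the goal state, and all helper names (`step`, `train`) are illustrative assumptions standing in for MiniGrid, not the paper's actual code.

```python
import random

# Hypothetical stand-in for a sparse-reward MiniGrid task: a 1-D chain where
# reward 1 is given only on reaching the goal state. Hyperparameters follow
# the quoted Tables 8-9; everything else is an illustrative assumption.
ALPHA, GAMMA, EPSILON = 0.001, 0.90, 0.8
N_STATES, GOAL = 6, 5          # states 0..5, sparse reward at the goal only
ACTIONS = (-1, +1)             # move left / move right

def step(state, action):
    """One environment transition; episode ends at the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: state x action
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # 0.8 epsilon-greedy: explore a random action with prob. EPSILON
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
            nxt, r, done = step(s, ACTIONS[a])
            # standard Q-learning update with the quoted alpha and gamma
            q[s][a] += ALPHA * (r + GAMMA * max(q[nxt]) - q[s][a])
            s = nxt
    return q

q = train()
# Greedy policy extracted from the learned Q-table (1 = move right)
policy = [max(range(2), key=lambda i: q[s][i]) for s in range(N_STATES)]
```

With the sparse reward, the value signal must propagate back from the goal one discounted step at a time, which is the slow regime the paper contrasts against intermediate (OWSP/OWMP) rewards.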