Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Harnessing Structures for Value-Based Planning and Reinforcement Learning
Authors: Yuzhe Yang, Guo Zhang, Zhi Xu, Dina Katabi
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on control tasks and Atari games confirm the efficacy of our approach. |
| Researcher Affiliation | Academia | Yuzhe Yang, Guo Zhang, Zhi Xu, Dina Katabi; Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology |
| Pseudocode | Yes | In Appendix A, we provide the pseudo-code and additionally, a short discussion on the technical difficulty for theoretical analysis. |
| Open Source Code | Yes | Code is available at: https://github.com/YyzHarry/SV-RL |
| Open Datasets | No | The paper mentions using … |
| Dataset Splits | No | No specific percentages or counts for training/validation/test splits were found. The paper mentions … |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, etc.) were mentioned for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma & Ba, 2014)' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | In all experiments, we set the hyper-parameters as follows: learning rate α = 1e-5, discount coefficient γ = 0.99, and a minibatch size of 32. The number of steps between target network updates is set to 10,000. We use a simple exploration policy as the ϵ-greedy policy with the ϵ decreasing linearly from 1 to 0.01 over 3e5 steps. |
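The Experiment Setup row lists DQN-style training hyper-parameters. As an illustration only, the sketch below assumes a generic value-based training loop (it is not taken from the authors' SV-RL repository) and shows how these reported values might be collected in one config object and how the linearly annealed ϵ-greedy policy behaves. All names (`DQNConfig`, `linear_epsilon`, `epsilon_greedy`, `should_update_target`) are hypothetical.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class DQNConfig:
    """Hyper-parameters from the Experiment Setup row (field names are hypothetical)."""
    learning_rate: float = 1e-5       # alpha
    gamma: float = 0.99               # discount coefficient
    batch_size: int = 32              # minibatch size
    target_update_steps: int = 10_000 # steps between target network updates
    eps_start: float = 1.0            # initial epsilon
    eps_end: float = 0.01             # final epsilon
    eps_decay_steps: int = 300_000    # 3e5 steps of linear decay


def linear_epsilon(step: int, cfg: DQNConfig) -> float:
    """Anneal epsilon linearly from eps_start to eps_end, then hold it at eps_end."""
    frac = min(max(step, 0) / cfg.eps_decay_steps, 1.0)
    return cfg.eps_start + frac * (cfg.eps_end - cfg.eps_start)


def epsilon_greedy(q_values, step: int, cfg: DQNConfig, rng=random) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy (argmax) one."""
    if rng.random() < linear_epsilon(step, cfg):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def should_update_target(step: int, cfg: DQNConfig) -> bool:
    """True every target_update_steps steps (hypothetical hard-copy schedule)."""
    return step > 0 and step % cfg.target_update_steps == 0


if __name__ == "__main__":
    cfg = DQNConfig()
    for s in (0, 150_000, 300_000, 1_000_000):
        print(s, round(linear_epsilon(s, cfg), 4))
```

With these values, ϵ starts at 1.0, is roughly 0.505 halfway through the decay (step 150,000), and stays at 0.01 from step 300,000 onward. Clamping the fraction to [0, 1] is one simple way to keep ϵ fixed after the annealing window; the paper may handle this differently.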