Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Multi-step Greedy Reinforcement Learning Algorithms
Authors: Manan Tomar, Yonathan Efroni, Mohammad Ghavamzadeh
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of κ, our algorithms outperform DQN and TRPO. |
| Researcher Affiliation | Collaboration | 1Facebook AI Research, Menlo Park, USA 2Technion, Haifa, Israel 3Google Research, Mountain View, USA. |
| Pseudocode | Yes | Algorithm 1 κ-Policy Iteration; Algorithm 2 κ-Value Iteration; Algorithm 3 κ-PI-DQN; Algorithm 4 κ-PI-TRPO |
| Open Source Code | No | The paper cites external codebases like OpenAI Baselines but does not provide concrete access to its own source code. |
| Open Datasets | Yes | We choose to test our κ-DQN and κ-TRPO algorithms on the Atari and MuJoCo benchmarks, respectively. |
| Dataset Splits | No | The paper describes total sample counts for training and iterations but does not provide specific train/validation/test dataset splits (percentages or counts) in the conventional sense. |
| Hardware Specification | No | The paper mentions using 'standard setups' but does not provide specific hardware details (e.g., exact GPU/CPU models or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimization algorithms like 'Adam optimizer' and components like 'target Q value networks' but does not list specific software libraries or solvers with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Both of these algorithms use standard setups, including the use of the Adam optimizer for performing gradient descent, a discount factor of 0.99 across all tasks, target Q value networks in the case of κ-DQN and an entropy regularizer with a coefficient of 0.01 in the case of κ-TRPO. ... we set the total number of iterations to 2000, with each iteration consisting 1000 samples. ... C_FA is set to 0.05 for all our experiments with other Atari domains. ... we set C_FA = 0.2 in our experiments with other MuJoCo domains. |
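
The values quoted in the Experiment Setup row can be gathered into one configuration object for anyone attempting a re-run. The following is a minimal sketch, not the authors' code: the class name `ReportedSetup` and all field names are hypothetical, and only the numeric values come from the quoted excerpt. The excerpt does not say which algorithm the iteration and sample counts apply to, so they are recorded here without that assignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""

    optimizer: str = "Adam"             # used for gradient descent in both algorithms
    discount_factor: float = 0.99       # reported as shared across all tasks
    entropy_coefficient: float = 0.01   # entropy regularizer, kappa-TRPO only
    total_iterations: int = 2000        # algorithm not identified in the excerpt
    samples_per_iteration: int = 1000   # algorithm not identified in the excerpt
    c_fa_atari: float = 0.05            # C_FA for the "other" Atari domains
    c_fa_mujoco: float = 0.2            # C_FA for the "other" MuJoCo domains

    def total_samples(self) -> int:
        # Total sample budget implied by the two quoted counts.
        return self.total_iterations * self.samples_per_iteration


setup = ReportedSetup()
print(setup.total_samples())
```

The word "other" in the quoted C_FA sentences suggests that at least one domain per benchmark used a different value, which the excerpt does not report; a re-run would need the paper itself for those cases.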