Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Beyond the One-Step Greedy Approach in Reinforcement Learning
Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
ICML 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 7. Experimental Results: In this section we empirically test the h- and κ-PI algorithms on a toy grid-world problem. |
| Researcher Affiliation | Academia | 1Technion, Israel Institute of Technology 2INRIA, Villers-lès Nancy, F-54600, France. |
| Pseudocode | Yes | Algorithm 1 h-PI |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | No | We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. ... In each experiment, we randomly chose a single state and placed a reward r_g = 1. In all other states the reward was drawn uniformly from [−0.1 r_g, 0.1 r_g]. In the considered problem there is no terminal state. |
| Dataset Splits | No | The paper describes running simulations on a 'toy grid-world problem' and tracking 'total queries to simulator' but does not specify any explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper describes a simulation setup and performance metrics ('total queries to simulator') but does not specify any hardware details such as CPU or GPU models used for the experiments. |
| Software Dependencies | No | The paper mentions implementing parts via the 'VI algorithm' but does not provide specific version numbers for any software, libraries, or solvers used. |
| Experiment Setup | Yes | Here, we implement the h- and κ-greedy step via the VI algorithm. In the former case, we simply do h steps, while in the latter case, we stop VI when the value change in max norm is less than ϵ = 10⁻⁵... We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. The actions set is {up, down, right, left, stay}. In each experiment, we randomly chose a single state and placed a reward r_g = 1. In all other states the reward was drawn uniformly from [−0.1 r_g, 0.1 r_g]. In the considered problem there is no terminal state. Also, the entries of the initial value function are drawn from N(0, r_g²). |
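
The experiment setup quoted in the last row can be sketched roughly as follows. This is a minimal illustration, not the authors' code (none was released): it builds the N × N deterministic grid world and runs plain value iteration with the quoted max-norm stopping rule. It does not implement h-PI or κ-PI themselves. The function names, the choice that moves off the grid leave the agent in place, and the choice that reward depends on the current state only are assumptions made here for illustration.

```python
import numpy as np

# Row/column offsets for the five actions: up, down, right, left, stay.
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1), (0, 0)]


def make_gridworld(n, rng, r_goal=1.0):
    """Build transition and reward tables for an n x n deterministic grid."""
    n_states = n * n
    next_state = np.empty((n_states, len(ACTIONS)), dtype=int)
    for s in range(n_states):
        row, col = divmod(s, n)
        for a, (d_row, d_col) in enumerate(ACTIONS):
            # Assumption: a move that would leave the grid keeps the agent in place.
            new_row = min(max(row + d_row, 0), n - 1)
            new_col = min(max(col + d_col, 0), n - 1)
            next_state[s, a] = new_row * n + new_col
    # Assumption: reward is a function of the current state only.
    reward = rng.uniform(-0.1 * r_goal, 0.1 * r_goal, size=n_states)
    reward[rng.integers(n_states)] = r_goal  # one randomly chosen goal state
    return next_state, reward


def bellman_backup(v, next_state, reward, gamma):
    """One application of the optimal Bellman operator."""
    return reward + gamma * v[next_state].max(axis=1)


def value_iteration(v0, next_state, reward, gamma=0.97, eps=1e-5, max_iters=100_000):
    """Iterate the backup until the max-norm change drops below eps."""
    v = v0.copy()
    for i in range(max_iters):
        v_new = bellman_backup(v, next_state, reward, gamma)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new, i + 1
        v = v_new
    return v, max_iters


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r_goal = 1.0
    next_state, reward = make_gridworld(5, rng, r_goal)
    # Initial value entries drawn from N(0, r_goal^2), as in the quoted setup.
    v0 = rng.normal(0.0, r_goal, size=reward.shape)
    v, iters = value_iteration(v0, next_state, reward)
    print(f"stopped after {iters} backups, max value {v.max():.3f}")
```

Because there is no terminal state and "stay" is an available action, the goal state's optimal value under these assumptions is r_g / (1 − γ), which gives a quick sanity check on the converged values.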