Beyond the One-Step Greedy Approach in Reinforcement Learning

Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 7. Experimental Results: In this section we empirically test the h- and κ-PI algorithms on a toy grid-world problem.
Researcher Affiliation | Academia | 1 Technion, Israel Institute of Technology; 2 INRIA, Villers-lès-Nancy, F-54600, France.
Pseudocode | Yes | Algorithm 1 h-PI (a sketch of the h-PI loop follows the table)
Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | No | We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. ... In each experiment, we randomly chose a single state and placed a reward rg = 1. In all other states the reward was drawn uniformly from [−0.1rg, 0.1rg]. In the considered problem there is no terminal state.
Dataset Splits | No | The paper describes running simulations on a 'toy grid-world problem' and tracking 'total queries to simulator' but does not specify any explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper describes a simulation setup and performance metrics ('total queries to simulator') but does not specify any hardware details such as CPU or GPU models used for the experiments.
Software Dependencies | No | The paper mentions implementing parts via the 'VI algorithm' but does not provide specific version numbers for any software, libraries, or solvers used.
Experiment Setup | Yes | Here, we implement the h- and κ-greedy steps via the VI algorithm. In the former case, we simply do h steps, while in the latter case, we stop VI when the value change in max norm is less than ϵ = 10⁻⁵ ... We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. The actions set is {up, down, right, left, stay}. In each experiment, we randomly chose a single state and placed a reward rg = 1. In all other states the reward was drawn uniformly from [−0.1rg, 0.1rg]. In the considered problem there is no terminal state. Also, the entries of the initial value function are drawn from N(0, rg²). (see the sketches below the table)
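
The quoted experiment setup pins the environment down fairly completely: an N × N deterministic grid-world with γ = 0.97, five actions, a single randomly placed goal reward rg = 1, small uniform noise rewards elsewhere, no terminal state, and an initial value function drawn from N(0, rg²). Below is a minimal sketch of that construction; the function name make_gridworld, the NumPy-based tabular representation, and the seed handling are assumptions, not details taken from the paper.

```python
# Minimal sketch of the grid-world quoted in the Experiment Setup row.
# make_gridworld and its return format are illustrative assumptions.
import numpy as np

ACTIONS = ["up", "down", "right", "left", "stay"]
MOVES = {"up": (-1, 0), "down": (1, 0), "right": (0, 1),
         "left": (0, -1), "stay": (0, 0)}

def make_gridworld(N, rg=1.0, seed=0):
    """Deterministic N x N grid: returns next-state table P[s, a],
    state rewards r[s], and a random initial value function v0."""
    rng = np.random.default_rng(seed)
    n_states = N * N
    P = np.zeros((n_states, len(ACTIONS)), dtype=int)
    for s in range(n_states):
        row, col = divmod(s, N)
        for a, name in enumerate(ACTIONS):
            dr, dc = MOVES[name]
            nr = min(max(row + dr, 0), N - 1)  # clamp at the grid border
            nc = min(max(col + dc, 0), N - 1)
            P[s, a] = nr * N + nc
    # Reward rg at one randomly chosen state; elsewhere uniform in [-0.1*rg, 0.1*rg].
    r = rng.uniform(-0.1 * rg, 0.1 * rg, size=n_states)
    r[rng.integers(n_states)] = rg
    # Initial value function drawn from N(0, rg^2), i.e. standard deviation rg.
    v0 = rng.normal(0.0, rg, size=n_states)
    return P, r, v0
```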
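The Pseudocode row refers to Algorithm 1 (h-PI), and the setup quote states that the h-greedy step is implemented by simply running h steps of VI. The sketch below illustrates that scheme on the tabular representation above: h Bellman backups applied to the current value yield the h-greedy policy, which is then evaluated exactly. This is a hedged reconstruction, not the authors' code; the helper names, the direct linear-solve policy evaluation, and the defaults h = 3 and iters = 50 are illustrative choices.

```python
# Hedged sketch of an h-PI loop in the spirit of Algorithm 1 (h-PI).
import numpy as np

def bellman_backup(P, r, v, gamma):
    """One VI step on a deterministic MDP with state rewards:
    Q[s, a] = r[s] + gamma * v[P[s, a]]."""
    q = r[:, None] + gamma * v[P]
    return q.max(axis=1), q.argmax(axis=1)

def h_greedy_policy(P, r, v, gamma, h):
    """h-greedy step: run h Bellman backups from v (h >= 1) and
    return the greedy policy from the final backup."""
    for _ in range(h):
        v, pi = bellman_backup(P, r, v, gamma)
    return pi

def evaluate_policy(P, r, pi, gamma):
    """Exact evaluation of a deterministic stationary policy via a linear solve."""
    n = len(r)
    P_pi = np.zeros((n, n))
    P_pi[np.arange(n), P[np.arange(n), pi]] = 1.0
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

def h_pi(P, r, v0, gamma=0.97, h=3, iters=50):
    """h-PI: alternate the h-greedy improvement step with exact policy evaluation."""
    v = v0
    for _ in range(iters):
        pi = h_greedy_policy(P, r, v, gamma, h)
        v = evaluate_policy(P, r, pi, gamma)
    return pi, v
```

With the environment sketch above, a run would look like `pi, v = h_pi(*make_gridworld(N=25))`; N = 25 is an arbitrary illustrative grid size, not one reported in the paper's quoted text.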
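For the κ-greedy step, the setup quote only says that VI is stopped 'when the value change in max norm is less than ϵ = 10⁻⁵'; the surrogate problem that VI is applied to is part of the paper's method and is not reconstructed here. The snippet below sketches just the quoted stopping rule, applied generically to Bellman backups on the same tabular representation; the function name and the max_iters cap are assumptions.

```python
# Generic sketch of the quoted stopping rule: run VI until the
# max-norm value change drops below eps (the paper uses eps = 1e-5).
import numpy as np

def vi_until_converged(P, r, v, gamma, eps=1e-5, max_iters=10_000):
    """Bellman backups until max-norm convergence; returns value and greedy policy."""
    for _ in range(max_iters):
        q = r[:, None] + gamma * v[P]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new, q.argmax(axis=1)
        v = v_new
    # Fall back to the last iterate if the cap is hit before convergence.
    return v, (r[:, None] + gamma * v[P]).argmax(axis=1)
```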