Beyond the One-Step Greedy Approach in Reinforcement Learning

Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 7. Experimental Results: In this section we empirically test the h- and κ-PI algorithms on a toy grid-world problem.
Researcher Affiliation | Academia | 1 Technion, Israel Institute of Technology; 2 INRIA, Villers-lès-Nancy, F-54600, France.
Pseudocode | Yes | Algorithm 1 h-PI (a sketch of the h-PI loop follows the table)
Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | No | We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. ... In each experiment, we randomly chose a single state and placed a reward rg = 1. In all other states the reward was drawn uniformly from [−0.1rg, 0.1rg]. In the considered problem there is no terminal state.
Dataset Splits | No | The paper describes running simulations on a 'toy grid-world problem' and tracking 'total queries to simulator' but does not specify any explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper describes a simulation setup and performance metrics ('total queries to simulator') but does not specify any hardware details such as CPU or GPU models used for the experiments.
Software Dependencies | No | The paper mentions implementing parts via the 'VI algorithm' but does not provide specific version numbers for any software, libraries, or solvers used.
Experiment Setup | Yes | Here, we implement the h- and κ-greedy steps via the VI algorithm. In the former case, we simply do h steps, while in the latter case, we stop VI when the value change in max norm is less than ϵ = 10⁻⁵ ... We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. The actions set is {up, down, right, left, stay}. In each experiment, we randomly chose a single state and placed a reward rg = 1. In all other states the reward was drawn uniformly from [−0.1rg, 0.1rg]. In the considered problem there is no terminal state. Also, the entries of the initial value function are drawn from N(0, rg²). (see the sketches below the table)
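
The quoted experiment setup pins the environment down fairly completely: an N × N deterministic grid-world with γ = 0.97, five actions, a single randomly placed goal reward rg = 1, small uniform noise rewards elsewhere, no terminal state, and an initial value function drawn from N(0, rg²). Below is a minimal sketch of that construction; the function name make_gridworld, the NumPy-based tabular representation, and the seed handling are assumptions, not details taken from the paper.

```python
# Minimal sketch of the grid-world quoted in the Experiment Setup row.
# make_gridworld and its return format are illustrative assumptions.
import numpy as np

ACTIONS = ["up", "down", "right", "left", "stay"]
MOVES = {"up": (-1, 0), "down": (1, 0), "right": (0, 1),
         "left": (0, -1), "stay": (0, 0)}

def make_gridworld(N, rg=1.0, seed=0):
    """Deterministic N x N grid: returns next-state table P[s, a],
    state rewards r[s], and a random initial value function v0."""
    rng = np.random.default_rng(seed)
    n_states = N * N
    P = np.zeros((n_states, len(ACTIONS)), dtype=int)
    for s in range(n_states):
        row, col = divmod(s, N)
        for a, name in enumerate(ACTIONS):
            dr, dc = MOVES[name]
            nr = min(max(row + dr, 0), N - 1)  # clamp at the grid border
            nc = min(max(col + dc, 0), N - 1)
            P[s, a] = nr * N + nc
    # Reward rg at one randomly chosen state; elsewhere uniform in [-0.1*rg, 0.1*rg].
    r = rng.uniform(-0.1 * rg, 0.1 * rg, size=n_states)
    r[rng.integers(n_states)] = rg
    # Initial value function drawn from N(0, rg^2), i.e. standard deviation rg.
    v0 = rng.normal(0.0, rg, size=n_states)
    return P, r, v0
```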
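The Pseudocode row refers to Algorithm 1 (h-PI), and the setup quote states that the h-greedy step is implemented by simply running h steps of VI. The sketch below illustrates that scheme on the tabular representation above: h Bellman backups applied to the current value yield the h-greedy policy, which is then evaluated exactly. This is a hedged reconstruction, not the authors' code; the helper names, the direct linear-solve policy evaluation, and the defaults h = 3 and iters = 50 are illustrative choices.

```python
# Hedged sketch of an h-PI loop in the spirit of Algorithm 1 (h-PI).
import numpy as np

def bellman_backup(P, r, v, gamma):
    """One VI step on a deterministic MDP with state rewards:
    Q[s, a] = r[s] + gamma * v[P[s, a]]."""
    q = r[:, None] + gamma * v[P]
    return q.max(axis=1), q.argmax(axis=1)

def h_greedy_policy(P, r, v, gamma, h):
    """h-greedy step: run h Bellman backups from v (h >= 1) and
    return the greedy policy from the final backup."""
    for _ in range(h):
        v, pi = bellman_backup(P, r, v, gamma)
    return pi

def evaluate_policy(P, r, pi, gamma):
    """Exact evaluation of a deterministic stationary policy via a linear solve."""
    n = len(r)
    P_pi = np.zeros((n, n))
    P_pi[np.arange(n), P[np.arange(n), pi]] = 1.0
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

def h_pi(P, r, v0, gamma=0.97, h=3, iters=50):
    """h-PI: alternate the h-greedy improvement step with exact policy evaluation."""
    v = v0
    for _ in range(iters):
        pi = h_greedy_policy(P, r, v, gamma, h)
        v = evaluate_policy(P, r, pi, gamma)
    return pi, v
```

With the environment sketch above, a run would look like `pi, v = h_pi(*make_gridworld(N=25))`; N = 25 is an arbitrary illustrative grid size, not one reported in the paper's quoted text.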
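For the κ-greedy step, the setup quote only says that VI is stopped 'when the value change in max norm is less than ϵ = 10⁻⁵'; the surrogate problem that VI is applied to is part of the paper's method and is not reconstructed here. The snippet below sketches just the quoted stopping rule, applied generically to Bellman backups on the same tabular representation; the function name and the max_iters cap are assumptions.

```python
# Generic sketch of the quoted stopping rule: run VI until the
# max-norm value change drops below eps (the paper uses eps = 1e-5).
import numpy as np

def vi_until_converged(P, r, v, gamma, eps=1e-5, max_iters=10_000):
    """Bellman backups until max-norm convergence; returns value and greedy policy."""
    for _ in range(max_iters):
        q = r[:, None] + gamma * v[P]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new, q.argmax(axis=1)
        v = v_new
    # Fall back to the last iterate if the cap is hit before convergence.
    return v, (r[:, None] + gamma * v[P]).argmax(axis=1)
```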