Beyond the One-Step Greedy Approach in Reinforcement Learning
Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 7 (Experimental Results): In this section we empirically test the h- and κ-PI algorithms on a toy grid-world problem. |
| Researcher Affiliation | Academia | ¹Technion, Israel Institute of Technology; ²INRIA, Villers-lès-Nancy, F-54600, France. |
| Pseudocode | Yes | Algorithm 1: h-PI |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | No | We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. ... In each experiment, we randomly chose a single state and placed a reward r_g = 1. In all other states the reward was drawn uniformly from [−0.1r_g, 0.1r_g]. In the considered problem there is no terminal state. |
| Dataset Splits | No | The paper describes running simulations on a 'toy grid-world problem' and tracking 'total queries to simulator' but does not specify any explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper describes a simulation setup and performance metrics ('total queries to simulator') but does not specify any hardware details such as CPU or GPU models used for the experiments. |
| Software Dependencies | No | The paper mentions implementing parts via the 'VI algorithm' but does not provide specific version numbers for any software, libraries, or solvers used. |
| Experiment Setup | Yes | Here, we implement the h- and κ-greedy step via the VI algorithm. In the former case, we simply do h steps, while in the latter case, we stop VI when the value change in max norm is less than ε = 10⁻⁵... We conduct our simulations on a simple N × N deterministic grid-world problem with γ = 0.97. The action set is {up, down, right, left, stay}. In each experiment, we randomly chose a single state and placed a reward r_g = 1. In all other states the reward was drawn uniformly from [−0.1r_g, 0.1r_g]. In the considered problem there is no terminal state. Also, the entries of the initial value function are drawn from N(0, r_g²). |
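
For anyone attempting to reproduce the setup quoted in the last row, below is a minimal Python sketch of the described environment and the h-greedy step implemented via value iteration. Since the paper releases no code, every name here, the grid size N = 10, the random seed, and the convention that rewards depend only on the current state are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the paper's described grid-world setup. The authors released
# no code, so names, structure, and defaults here are illustrative assumptions.

N = 10        # grid side length; the paper leaves N unspecified, 10 is an assumption
GAMMA = 0.97  # discount factor, as stated in the paper
# up, down, right, left, stay -- the action set quoted in the paper
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1), (0, 0)]

rng = np.random.default_rng(0)  # fixed seed is an assumption, for reproducibility

def build_mdp(n, r_g=1.0):
    """Deterministic N x N grid-world: one randomly chosen goal state with reward
    r_g = 1, all other states rewarded uniformly in [-0.1*r_g, 0.1*r_g], and no
    terminal state. Rewards are modeled as state-dependent (an assumption; the
    paper does not spell out the reward convention)."""
    n_states = n * n
    rewards = rng.uniform(-0.1 * r_g, 0.1 * r_g, size=n_states)
    rewards[rng.integers(n_states)] = r_g
    # next_state[s, a]: deterministic successor of state s under action a,
    # with moves off the grid clamped to the boundary
    next_state = np.empty((n_states, len(ACTIONS)), dtype=int)
    for s in range(n_states):
        row, col = divmod(s, n)
        for a, (dr, dc) in enumerate(ACTIONS):
            nr = min(max(row + dr, 0), n - 1)
            nc = min(max(col + dc, 0), n - 1)
            next_state[s, a] = nr * n + nc
    return rewards, next_state

def h_greedy_step(v, rewards, next_state, h):
    """h-greedy improvement via h sweeps of value iteration, matching the
    paper's statement that 'we simply do h steps' of VI. Returns the policy
    greedy with respect to the h-step lookahead, plus the swept values."""
    for _ in range(h):
        q = rewards[:, None] + GAMMA * v[next_state]  # Q(s, a) for all s, a
        policy = q.argmax(axis=1)
        v = q.max(axis=1)
    return policy, v

rewards, next_state = build_mdp(N)
v0 = rng.normal(0.0, 1.0, size=N * N)  # initial values ~ N(0, r_g^2) with r_g = 1
policy, v = h_greedy_step(v0, rewards, next_state, h=3)
```

The κ-greedy variant described in the same passage would instead run VI until the max-norm change in the value function drops below ε = 10⁻⁵, rather than for a fixed number of sweeps.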