How to Combine Tree-Search Methods in Reinforcement Learning

Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

AAAI 2019

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | The contribution of this work is primarily theoretical, but in Section 8 we also present experimental results on a toy domain. In this section, we empirically study NC-hm-PI (Section 5) and hm-PI (Section 6) in the exact and approximate cases.
Researcher Affiliation | Academia | Yonathan Efroni (Technion, Israel); Gal Dalal (Technion, Israel); Bruno Scherrer (INRIA, Villers les Nancy, France); Shie Mannor (Technion, Israel)
Pseudocode | Yes | Algorithm 1 (h-PI), Algorithm 2 (hm-PI), Algorithm 3 (hλ-PI); a generic sketch of the h-PI idea is given after the table.
Open Source Code | No | The paper does not provide any links to, or explicit statements about, the availability of source code for the described methodology.
Open Datasets | No | The paper states 'We conducted our simulations on a simple N × N deterministic grid-world problem with γ = 0.97, as was done in (Efroni et al. 2018a).' This is a custom experimental setup, and no access information (link, DOI, or a specific citation to a dataset resource) is provided for this 'toy domain' dataset.
Dataset Splits | No | The paper describes experiments in a grid-world reinforcement learning setup, but it does not specify training, validation, or test splits in terms of percentages or sample counts.
Hardware Specification | No | The paper describes the experimental setup in Section 8 but does not provide details of the hardware used, such as CPU or GPU models or memory specifications.
Software Dependencies | No | The paper discusses various algorithms and theoretical concepts but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries) needed to reproduce the experiments.
Experiment Setup | Yes | We conducted our simulations on a simple N × N deterministic grid-world problem with γ = 0.97, as was done in (Efroni et al. 2018a). The action set is {up, down, right, left, stay}. In each experiment, a reward r_g = 1 was placed in a random state, while in all other states the reward was drawn uniformly from [−0.1, 0.1]. In the considered problem there is no terminal state. Also, the entries of the initial value function are drawn from N(0, 1). We counted the total number of queries to the simulator until convergence, defined as ||v* − v_k||∞ ≤ 10^−7. A hedged code reconstruction of this setup is sketched below the table.
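The grid-world quoted in the Experiment Setup row can be reconstructed roughly as follows. This is a minimal sketch under stated assumptions: the grid size N, the state indexing, the handling of moves that would leave the grid (clipped here so the agent stays in place), and the reward depending only on the state are not specified in the excerpt and are hypothetical choices; the function name build_gridworld is mine.

```python
import numpy as np

def build_gridworld(N=10, gamma=0.97, seed=0):
    """Deterministic N x N grid world with gamma = 0.97, actions
    {up, down, right, left, stay}, reward 1 in one random state and
    U[-0.1, 0.1] elsewhere, and no terminal state (assumed details noted above)."""
    rng = np.random.default_rng(seed)
    S, A = N * N, 5
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1), (0, 0)]  # up, down, right, left, stay
    P = np.zeros((A, S, S))  # one deterministic transition kernel per action
    for s in range(S):
        row, col = divmod(s, N)
        for a, (dr, dc) in enumerate(moves):
            nr = min(max(row + dr, 0), N - 1)  # assumption: walls clip movement
            nc = min(max(col + dc, 0), N - 1)
            P[a, s, nr * N + nc] = 1.0
    reward = rng.uniform(-0.1, 0.1, size=S)   # rewards drawn uniformly from [-0.1, 0.1]
    reward[rng.integers(S)] = 1.0             # goal reward r_g = 1 in a random state
    R = np.tile(reward, (A, 1))               # assumption: reward depends on the state only
    v0 = rng.standard_normal(S)               # initial value function entries ~ N(0, 1)
    return P, R, gamma, v0
```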
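For the Pseudocode row, the following is a generic tabular sketch of the h-step-lookahead ("h-greedy") policy-iteration idea behind Algorithm 1 (h-PI); it is not the authors' exact pseudocode. It stops when successive value iterates agree rather than on the paper's criterion ||v* − v_k||∞ ≤ 10^−7 (which needs v* to be computed separately), and it does not count simulator queries, since those details depend on the tree-search implementation.

```python
import numpy as np

def bellman_optimal(P, R, gamma, v):
    """One optimal Bellman backup: (T v)(s) = max_a [R(a, s) + gamma * sum_s' P(a, s, s') v(s')]."""
    return (R + gamma * P @ v).max(axis=0)

def h_greedy_policy(P, R, gamma, v, h):
    """First action of an optimal h-horizon policy whose terminal value is v."""
    for _ in range(h - 1):                 # back up the terminal value h-1 times
        v = bellman_optimal(P, R, gamma, v)
    return (R + gamma * P @ v).argmax(axis=0)

def evaluate(P, R, gamma, pi):
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    S = P.shape[1]
    P_pi = P[pi, np.arange(S), :]
    r_pi = R[pi, np.arange(S)]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def h_pi(P, R, gamma, v0, h, tol=1e-7, max_iter=10_000):
    """Alternate h-greedy improvement with full policy evaluation until the values stabilize."""
    v = v0.copy()
    for _ in range(max_iter):
        pi = h_greedy_policy(P, R, gamma, v, h)
        v_new = evaluate(P, R, gamma, pi)
        if np.max(np.abs(v_new - v)) < tol:
            return pi, v_new
        v = v_new
    return pi, v
```

Combined with the builder above, P, R, gamma, v0 = build_gridworld() followed by pi, v = h_pi(P, R, gamma, v0, h=3) would run the method end to end; a larger h makes each improvement step a deeper lookahead at the cost of more backups (and, with a simulator, more queries) per iteration.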