Tree Search-Based Policy Optimization under Stochastic Execution Delay

Authors: David Valensi, Esther Derman, Shie Mannor, Gal Dalal

ICLR 2024

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response.

Research Type: Experimental
LLM Response: "Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. (...) 2. We prove that if the realizations of the delay process are observed by the agent, then it suffices to restrict policy search to the set of Markov policies to attain optimal performance."

Researcher Affiliation: Collaboration
LLM Response: "David Valensi, Technion, davidvalensi@campus.technion.ac.il; Esther Derman, Technion, estherderman@campus.technion.ac.il; Shie Mannor, Technion & Nvidia Research, shie@ee.technion.ac.il; Gal Dalal, Nvidia Research, gdalal@nvidia.com"

Pseudocode: Yes
LLM Response: "The pseudo-code in Algo. 1 depicts how self-play samples episodes in the stochastic delayed environment. (...) Algorithm 1 DEZ: acting in environments with stochastic delays."
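
For context, here is a minimal sketch of the acting scheme Algorithm 1 describes, under the assumption that DEZ rolls the latest observed state forward through the queue of pending (decided but not yet executed) actions with its learned dynamics model before planning. The interfaces `model.initial_state`, `model.next_state`, and `mcts.plan` are hypothetical placeholders, not the repository's actual API.

```python
from collections import deque

def act_under_delay(model, mcts, last_obs, pending_actions):
    """One acting step under execution delay, in the spirit of Algorithm 1.

    The agent cannot influence the next len(pending_actions) transitions,
    so it imagines them with the learned dynamics model and plans from the
    predicted future latent state.
    """
    state = model.initial_state(last_obs)    # encode the last observed frame
    for a in pending_actions:                # pending actions, oldest first
        state = model.next_state(state, a)   # imagined model-based step
    action = mcts.plan(state)                # tree search from the future state
    pending_actions.append(action)           # executed once its delay elapses
    return action

# Example bookkeeping: a queue of actions decided but not yet executed.
pending_actions = deque()
```

Because every pending action is already committed, planning from the imagined future state is what lets the tree search remain useful despite the delay.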

Open Source Code: Yes
LLM Response: "The code is available at https://github.com/davidva1/Delayed-EZ."

Open Datasets: Yes
LLM Response: "Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. (...) EfficientZero sampled 100K transitions, aligning with the Atari 100K benchmark."
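
As an aside, the delayed-execution setting on Atari is straightforward to emulate for the constant-delay case. The sketch below is illustrative only and is not from the authors' repository; the paper's stochastic setting instead draws each delay from a bounded random process, which this wrapper does not model.

```python
from collections import deque
import gymnasium as gym

class ConstantDelayWrapper(gym.Wrapper):
    """Executes each action m steps after it is chosen, via a FIFO buffer
    pre-filled with no-ops. Illustrative only: the paper's stochastic
    setting resamples the delay over time instead of fixing it."""

    def __init__(self, env, m, noop_action=0):
        super().__init__(env)
        self.m = m
        self.noop_action = noop_action
        self.buffer = deque()

    def reset(self, **kwargs):
        # Until m real actions have been submitted, no-ops are executed.
        self.buffer = deque([self.noop_action] * self.m)
        return self.env.reset(**kwargs)

    def step(self, action):
        self.buffer.append(action)        # decided now ...
        executed = self.buffer.popleft()  # ... but an action from m steps ago runs
        return self.env.step(executed)
```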

Dataset Splits: No
LLM Response: "No explicit mention of training, validation, and test dataset splits (e.g., percentages, sample counts, or specific predefined splits) was found. The paper mentions training interactions and test episodes but not validation splits."

Hardware Specification: Yes
LLM Response: "Our experimental setup included two RTX 2080 TI GPUs."

Software Dependencies: No
LLM Response: "No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were explicitly listed."

Experiment Setup: Yes
LLM Response: "In the context of DEZ, each training run comprised 130,000 environment interactions and 150,000 training steps. (...) For M = 5, the training duration exhibited fluctuations over a period of 20 hours. For M = 15, the training duration exhibited fluctuations over a period of 22 hours. For M = 25, the training duration exhibited fluctuations over a period of 25 hours. (...) Our repository is a fork of EfficientZero (Ye et al., 2021) with the default parameters taken from the original paper."
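
For convenience, the reported setup can be collected into a single reference block. The field names below are ours; the values are taken from the excerpts quoted above.

```python
# Reference block assembled from the reported experiment setup above.
# Field names are ours; the values come from the quoted excerpts.
DEZ_REPORTED_SETUP = {
    "env_interactions_per_run": 130_000,
    "training_steps_per_run": 150_000,
    "benchmark": "Atari 100K (100K sampled transitions, as in EfficientZero)",
    "max_delay_M_tested": [5, 15, 25],
    "approx_run_duration_hours": {5: 20, 15: 22, 25: 25},
    "gpus": "2x RTX 2080 Ti",
    "codebase": "fork of EfficientZero (Ye et al., 2021), default hyperparameters",
}
```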