Tree Search-Based Policy Optimization under Stochastic Execution Delay

Authors: David Valensi, Esther Derman, Shie Mannor, Gal Dalal

ICLR 2024

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response.

Research Type: Experimental
LLM Response: "Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. (...) 2. We prove that if the realizations of the delay process are observed by the agent, then it suffices to restrict policy search to the set of Markov policies to attain optimal performance."

Researcher Affiliation: Collaboration
LLM Response: "David Valensi, Technion, davidvalensi@campus.technion.ac.il; Esther Derman, Technion, estherderman@campus.technion.ac.il; Shie Mannor, Technion & Nvidia Research, shie@ee.technion.ac.il; Gal Dalal, Nvidia Research, gdalal@nvidia.com"

Pseudocode: Yes
LLM Response: "The pseudo-code in Algo. 1 depicts how self-play samples episodes in the stochastic delayed environment. (...) Algorithm 1 DEZ: acting in environments with stochastic delays."
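
For context, here is a minimal sketch of the acting scheme Algorithm 1 describes, under the assumption that DEZ rolls the latest observed state forward through the queue of pending (decided but not yet executed) actions with its learned dynamics model before planning. The interfaces `model.initial_state`, `model.next_state`, and `mcts.plan` are hypothetical placeholders, not the repository's actual API.

```python
from collections import deque

def act_under_delay(model, mcts, last_obs, pending_actions):
    """One acting step under execution delay, in the spirit of Algorithm 1.

    The agent cannot influence the next len(pending_actions) transitions,
    so it imagines them with the learned dynamics model and plans from the
    predicted future latent state.
    """
    state = model.initial_state(last_obs)    # encode the last observed frame
    for a in pending_actions:                # pending actions, oldest first
        state = model.next_state(state, a)   # imagined model-based step
    action = mcts.plan(state)                # tree search from the future state
    pending_actions.append(action)           # executed once its delay elapses
    return action

# Example bookkeeping: a queue of actions decided but not yet executed.
pending_actions = deque()
```

Because every pending action is already committed, planning from the imagined future state is what lets the tree search remain useful despite the delay.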

Open Source Code: Yes
LLM Response: "The code is available at https://github.com/davidva1/Delayed-EZ."

Open Datasets: Yes
LLM Response: "Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. (...) EfficientZero sampled 100K transitions, aligning with the Atari 100K benchmark."
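
As an aside, the delayed-execution setting on Atari is straightforward to emulate for the constant-delay case. The sketch below is illustrative only and is not from the authors' repository; the paper's stochastic setting instead draws each delay from a bounded random process, which this wrapper does not model.

```python
from collections import deque
import gymnasium as gym

class ConstantDelayWrapper(gym.Wrapper):
    """Executes each action m steps after it is chosen, via a FIFO buffer
    pre-filled with no-ops. Illustrative only: the paper's stochastic
    setting resamples the delay over time instead of fixing it."""

    def __init__(self, env, m, noop_action=0):
        super().__init__(env)
        self.m = m
        self.noop_action = noop_action
        self.buffer = deque()

    def reset(self, **kwargs):
        # Until m real actions have been submitted, no-ops are executed.
        self.buffer = deque([self.noop_action] * self.m)
        return self.env.reset(**kwargs)

    def step(self, action):
        self.buffer.append(action)        # decided now ...
        executed = self.buffer.popleft()  # ... but an action from m steps ago runs
        return self.env.step(executed)
```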

Dataset Splits: No
LLM Response: "No explicit mention of training, validation, and test dataset splits (e.g., percentages, sample counts, or specific predefined splits) was found. The paper mentions training interactions and test episodes but not validation splits."

Hardware Specification: Yes
LLM Response: "Our experimental setup included two RTX 2080 TI GPUs."

Software Dependencies: No
LLM Response: "No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were explicitly listed."

Experiment Setup: Yes
LLM Response: "In the context of DEZ, each training run comprised 130,000 environment interactions and 150,000 training steps. (...) For M = 5, the training duration exhibited fluctuations over a period of 20 hours. For M = 15, the training duration exhibited fluctuations over a period of 22 hours. For M = 25, the training duration exhibited fluctuations over a period of 25 hours. (...) Our repository is a fork of EfficientZero (Ye et al., 2021) with the default parameters taken from the original paper."
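
For convenience, the reported setup can be collected into a single reference block. The field names below are ours; the values are taken from the excerpts quoted above.

```python
# Reference block assembled from the reported experiment setup above.
# Field names are ours; the values come from the quoted excerpts.
DEZ_REPORTED_SETUP = {
    "env_interactions_per_run": 130_000,
    "training_steps_per_run": 150_000,
    "benchmark": "Atari 100K (100K sampled transitions, as in EfficientZero)",
    "max_delay_M_tested": [5, 15, 25],
    "approx_run_duration_hours": {5: 20, 15: 22, 25: 25},
    "gpus": "2x RTX 2080 Ti",
    "codebase": "fork of EfficientZero (Ye et al., 2021), default hyperparameters",
}
```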