Tree Search-Based Policy Optimization under Stochastic Execution Delay
Authors: David Valensi, Esther Derman, Shie Mannor, Gal Dalal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. (...) 2. We prove that if the realizations of the delay process are observed by the agent, then it suffices to restrict policy search to the set of Markov policies to attain optimal performance. |
| Researcher Affiliation | Collaboration | David Valensi Technion davidvalensi@campus.technion.ac.il Esther Derman Technion estherderman@campus.technion.ac.il Shie Mannor Technion & Nvidia Research shie@ee.technion.ac.il Gal Dalal Nvidia Research gdalal@nvidia.com |
| Pseudocode | Yes | The pseudo-code in Algo. 1 depicts how self-play samples episodes in the stochastic delayed environment. (...) Algorithm 1 DEZ: acting in environments with stochastic delays. |
| Open Source Code | Yes | The code is available at https://github.com/davidva1/Delayed-EZ. |
| Open Datasets | Yes | Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. (...) Efficient Zero sampled 100K transitions, aligning with the Atari 100K benchmark. |
| Dataset Splits | No | No explicit mention of training, validation, and test dataset splits (e.g., percentages, sample counts, or specific predefined splits) was found. The paper mentions training interactions and test episodes but not validation splits. |
| Hardware Specification | Yes | Our experimental setup included two RTX 2080 TI GPUs. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were explicitly listed. |
| Experiment Setup | Yes | In the context of DEZ, each training run comprised 130,000 environment interactions and 150,000 training steps. (...) For M = 5, the training duration exhibited fluctuations over a period of 20 hours. For M = 15, the training duration exhibited fluctuations over a period of 22 hours. For M = 25, the training duration exhibited fluctuations over a period of 25 hours. (...) Our repository is a fork of Efficient Zero (Ye et al., 2021) with the default parameters taken from the original paper. |
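The pseudocode row above refers to Algorithm 1 (DEZ), which acts in environments where issued actions take effect only after a stochastic delay. Below is a minimal illustrative sketch, not the authors' implementation: an agent keeps a queue of pending (issued but not yet executed) actions and, before planning, rolls a model forward through that queue to estimate the state in which its next action will actually apply. The toy `env_step`, `model`, and `policy` callables, and the assumption that delays preserve issue order, are simplifications for illustration.

```python
from collections import deque


def predict_execution_state(state, pending_actions, model):
    # Forward-simulate the (learned or known) model through every action
    # that has been issued but not yet applied by the environment.
    for a in pending_actions:
        state = model(state, a)
    return state


def run_delayed_episode(env_step, model, policy, state, horizon, delay_seq):
    """Act for `horizon` steps; the action issued at step t executes at
    t + delay_seq[t]. Assumes delays never reorder actions (a simplifying
    assumption of this sketch)."""
    pending = deque()            # (execute_at, action), in issue order
    trajectory = [state]
    for t in range(horizon):
        # Plan on the *predicted* execution state, not the current one.
        future = predict_execution_state(
            state, [a for _, a in pending], model)
        action = policy(future)
        pending.append((t + delay_seq[t], action))
        # Apply every pending action whose execution time has arrived.
        while pending and pending[0][0] <= t:
            _, a = pending.popleft()
            state = env_step(state, a)
        trajectory.append(state)
    return trajectory
```

With a zero-delay sequence this reduces to ordinary undelayed acting, while a constant or random `delay_seq` shifts when each action lands, which is the mismatch the planner compensates for by predicting forward through the pending queue.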