Learning Adversarial Markov Decision Processes with Delayed Feedback
Authors: Tal Lancewicki, Aviv Rosenberg, Yishay Mansour (pp. 7281–7289)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation. Synthetic experiments compare Delayed OPPO to two generic approaches for handling delays: Parallel-OPPO, which runs dmax online algorithms in parallel (as described in Section 3), and Pipeline-OPPO, a simple approach that turns a non-delayed algorithm into a delay-handling one by waiting for the first dmax episodes and then always feeding the feedback with delay dmax. The environment is a 10×10 grid world in which the agent starts in one corner and must reach the goal state in the opposite corner. The cost is 1 in all states except the goal state, where it is 0. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays. Fig. 1 shows that Delayed OPPO significantly outperforms the other approaches, highlighting the importance of handling variable delays rather than simply considering the worst-case delay dmax. |
| Researcher Affiliation | Collaboration | Tal Lancewicki*1, Aviv Rosenberg*1, Yishay Mansour1,2 1 Tel Aviv University, Israel 2 Google Research, Israel |
| Pseudocode | Yes | Algorithm 1: Delayed OPPO |
| Open Source Code | No | The paper describes algorithms and experiments but does not provide any explicit statement about releasing the source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The experiments use a simple 10×10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays. No external dataset is used or released. |
| Dataset Splits | No | The paper describes experiments in a simulated grid-world environment and discusses episodes and delays, but does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions various algorithms and theoretical concepts but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their specific versions) used for implementation or experiments. |
| Experiment Setup | Yes | A simple 10×10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays. |
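The experiment setup described above (10×10 grid world, unit costs outside the goal, i.i.d. geometric delays with mean 10, dmax taken over the realized delays) can be sketched as follows. This is a minimal illustration, not the authors' code: the geometric parameterisation on support {1, 2, ...} with p = 1/mean is an assumption, since the paper does not state which convention it uses.

```python
import random

GRID = 10                      # 10x10 grid world from the paper's experiment
START = (0, 0)                 # agent starts in one corner
GOAL = (GRID - 1, GRID - 1)    # goal state is the opposite corner

def cost(state):
    """Cost is 1 in every state except the goal state, which has cost 0."""
    return 0 if state == GOAL else 1

def sample_delay(mean=10.0, rng=random):
    """Draw one delay i.i.d. from a geometric distribution with the given mean.

    Assumes the support-{1, 2, ...} parameterisation, where mean = 1/p.
    """
    p = 1.0 / mean
    d = 1
    while rng.random() > p:   # repeat until the first "success"
        d += 1
    return d

def realized_dmax(num_episodes, mean=10.0, seed=0):
    """dmax is computed on the sequence of realized delays, as in the paper."""
    rng = random.Random(seed)
    delays = [sample_delay(mean, rng) for _ in range(num_episodes)]
    return max(delays), delays
```

For example, `realized_dmax(10000)` yields a sequence whose empirical mean is close to 10, with a maximum realized delay well above the mean, which is exactly why an algorithm calibrated to dmax (like Parallel-OPPO or Pipeline-OPPO) pays for the worst case while Delayed OPPO adapts to the actual delays.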