Learning Adversarial Markov Decision Processes with Delayed Feedback

Authors: Tal Lancewicki, Aviv Rosenberg, Yishay Mansour (pp. 7281-7289)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluation. We used synthetic experiments to compare the performance of Delayed OPPO to two other generic approaches for handling delays: Parallel-OPPO, which runs dmax online algorithms in parallel (as described in Section 3), and Pipeline-OPPO, another simple approach that turns a non-delayed algorithm into one that handles delays by waiting for the first dmax episodes and then always feeding the feedback with delay dmax. We used a simple 10x10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays. Fig. 1 shows that Delayed OPPO significantly outperforms the other approaches, highlighting the importance of handling variable delays rather than simply assuming the worst-case delay dmax.
Researcher Affiliation | Collaboration | Tal Lancewicki*[1], Aviv Rosenberg*[1], Yishay Mansour[1,2] — [1] Tel Aviv University, Israel; [2] Google Research, Israel
Pseudocode | Yes | Algorithm 1: Delayed OPPO
Open Source Code | No | The paper describes algorithms and experiments but does not provide any explicit statement about releasing the source code, nor a link to a code repository for the methodology described.
Open Datasets | No | We used a simple 10x10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays.
Dataset Splits | No | The paper describes experiments in a simulated grid world environment and discusses 'episodes' and 'delays' but does not specify any training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions various algorithms and theoretical concepts but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their specific versions) used for implementation or experiments.
Experiment Setup | Yes | We used a simple 10x10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays (a sketch of this delay mechanism follows the table).
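Since the delay mechanism is the crux of the comparison the table describes, a minimal sketch may help make it concrete. This is an assumption-laden illustration, not the authors' code: it assumes NumPy, draws geometric delays with mean 10 (so p = 1/10), computes dmax from the realized delays as the paper states, and uses hypothetical helper names (`arrival_times`, `feedback_consumed_at`) to contrast when each scheme consumes feedback.

```python
import numpy as np

# Sketch of the paper's synthetic delay setup (our reconstruction, not the
# authors' code): K episodes, i.i.d. geometric delays with mean 10, and
# dmax computed on the realized delay sequence.
rng = np.random.default_rng(0)
K = 1000  # number of episodes

# Geometric delays on {1, 2, ...} with mean 1/p = 10.
delays = rng.geometric(p=1.0 / 10.0, size=K)

# As in the paper: dmax is the maximum of the *realized* delays.
d_max = int(delays.max())


def arrival_times(delays):
    """Episode k's feedback becomes available at episode k + d_k."""
    return {k: k + int(d) for k, d in enumerate(delays)}


arrivals = arrival_times(delays)


def feedback_consumed_at(k, scheme):
    """When episode k's feedback is actually used, per scheme (our reading
    of the descriptions in the table above)."""
    if scheme == "delayed":
        # Delayed OPPO uses feedback as soon as it arrives: delay d_k.
        return arrivals[k]
    if scheme == "pipeline":
        # Pipeline-OPPO waits for the first d_max episodes, then always
        # feeds feedback with the worst-case delay d_max.
        return k + d_max
    if scheme == "parallel":
        # Parallel-OPPO runs d_max learners; learner (k mod d_max) plays
        # episode k and next acts at episode k + d_max, by which time its
        # feedback has arrived (since d_k <= d_max).
        return k + d_max
    raise ValueError(f"unknown scheme: {scheme}")
```

Even in this toy schedule, the point the table's experiment makes is visible: under the two baseline schemes every episode effectively pays the worst-case delay dmax, while Delayed OPPO pays only the realized delay d_k.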