Learning Adversarial Markov Decision Processes with Delayed Feedback

Authors: Tal Lancewicki, Aviv Rosenberg, Yishay Mansour (pp. 7281-7289)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluation. We used synthetic experiments to compare the performance of Delayed OPPO to two other generic approaches for handling delays: Parallel-OPPO, which runs dmax online algorithms in parallel (as described in Section 3), and Pipeline-OPPO, another simple approach that turns a non-delayed algorithm into one that handles delays by waiting for the first dmax episodes and then always feeding the feedback with delay dmax. We used a simple 10x10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays. Fig. 1 shows that Delayed OPPO significantly outperforms the other approaches, highlighting the importance of handling variable delays rather than simply assuming the worst-case delay dmax.
Researcher Affiliation | Collaboration | Tal Lancewicki*[1], Aviv Rosenberg*[1], Yishay Mansour[1,2] — [1] Tel Aviv University, Israel; [2] Google Research, Israel
Pseudocode | Yes | Algorithm 1: Delayed OPPO
Open Source Code | No | The paper describes algorithms and experiments but does not provide any explicit statement about releasing the source code, nor a link to a code repository for the methodology described.
Open Datasets | No | We used a simple 10x10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays.
Dataset Splits | No | The paper describes experiments in a simulated grid world environment and discusses 'episodes' and 'delays' but does not specify any training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions various algorithms and theoretical concepts but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their specific versions) used for implementation or experiments.
Experiment Setup | Yes | We used a simple 10x10 grid world where the agent starts in one corner and needs to reach the opposite corner, which is the goal state. The cost is 1 in all states except for 0 cost in the goal state. Delays are drawn i.i.d. from a geometric distribution with mean 10, and the maximum delay dmax is computed on the sequence of realized delays (a sketch of this delay mechanism follows the table).
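Since the delay mechanism is the crux of the comparison the table describes, a minimal sketch may help make it concrete. This is an assumption-laden illustration, not the authors' code: it assumes NumPy, draws geometric delays with mean 10 (so p = 1/10), computes dmax from the realized delays as the paper states, and uses hypothetical helper names (`arrival_times`, `feedback_consumed_at`) to contrast when each scheme consumes feedback.

```python
import numpy as np

# Sketch of the paper's synthetic delay setup (our reconstruction, not the
# authors' code): K episodes, i.i.d. geometric delays with mean 10, and
# dmax computed on the realized delay sequence.
rng = np.random.default_rng(0)
K = 1000  # number of episodes

# Geometric delays on {1, 2, ...} with mean 1/p = 10.
delays = rng.geometric(p=1.0 / 10.0, size=K)

# As in the paper: dmax is the maximum of the *realized* delays.
d_max = int(delays.max())


def arrival_times(delays):
    """Episode k's feedback becomes available at episode k + d_k."""
    return {k: k + int(d) for k, d in enumerate(delays)}


arrivals = arrival_times(delays)


def feedback_consumed_at(k, scheme):
    """When episode k's feedback is actually used, per scheme (our reading
    of the descriptions in the table above)."""
    if scheme == "delayed":
        # Delayed OPPO uses feedback as soon as it arrives: delay d_k.
        return arrivals[k]
    if scheme == "pipeline":
        # Pipeline-OPPO waits for the first d_max episodes, then always
        # feeds feedback with the worst-case delay d_max.
        return k + d_max
    if scheme == "parallel":
        # Parallel-OPPO runs d_max learners; learner (k mod d_max) plays
        # episode k and next acts at episode k + d_max, by which time its
        # feedback has arrived (since d_k <= d_max).
        return k + d_max
    raise ValueError(f"unknown scheme: {scheme}")
```

Even in this toy schedule, the point the table's experiment makes is visible: under the two baseline schemes every episode effectively pays the worst-case delay dmax, while Delayed OPPO pays only the realized delay d_k.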