Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward
Authors: Washim Uddin Mondal, Vaneet Aggarwal
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. ... We propose an algorithm named DUCRL2 to obtain a near-optimal policy for this setting and show that it achieves a regret bound of Õ(DS√(AT) + d(SA)³) where S and A are the sizes of the state and action spaces, respectively, D is the diameter of the MDP, d is a parameter upper bounded by the maximum reward delay, and T denotes the time horizon. ... Our article provides a partial resolution to this problem. In particular, we demonstrate that if the rewards are delayed, composite, and partially anonymous, then an algorithm can be designed to achieve near-optimal regret. ... Our primary innovation lies in demonstrating how an accurate reward function estimate can be obtained using the partially anonymous feedback. DUCRL2 yields a regret bound of Õ(DS√(AT) + d(SA)³) where S, A denote the sizes of the state and action spaces respectively, T is the time horizon, D denotes the diameter of the MDP, and the parameter d is bounded by the maximum delay in the reward generation process. The obtained result matches a well-known lower bound in T. ... Theorem 1: Let D ≥ D(M). With probability at least 1 − δ, δ ∈ (0, 1), for an arbitrary initial state s, the regret accumulated by the algorithm DUCRL2 over T > 1 steps can be bounded above as follows. |
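To make the quoted bound concrete, the sketch below evaluates the order of the regret bound Õ(DS√(AT) + d(SA)³) numerically. The constant factors `c1`, `c2` are illustrative placeholders (the paper's bound hides constants and logarithmic factors), not values taken from the paper.

```python
import math

def ducrl2_regret_bound(D, S, A, T, d, c1=1.0, c2=1.0):
    """Order-of-magnitude evaluation of the DUCRL2 regret bound,
    O(D*S*sqrt(A*T) + d*(S*A)**3), up to unspecified constants and
    log factors (c1, c2 are illustrative placeholders)."""
    sqrt_term = c1 * D * S * math.sqrt(A * T)   # grows as sqrt(T)
    delay_term = c2 * d * (S * A) ** 3          # constant in T
    return sqrt_term + delay_term

# For large T the sqrt(T) term dominates, which is why the bound
# matches the known Omega(sqrt(DSAT)) lower bound in its T-dependence.
example = ducrl2_regret_bound(D=5, S=10, A=4, T=10**6, d=3)
```

The delay-dependent term d(SA)³ is additive and independent of T, so delayed and partially anonymous feedback only shifts the regret curve by a constant rather than degrading its asymptotic rate.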
| Researcher Affiliation | Academia | Washim Uddin Mondal, School of IE and CE, Purdue University; Vaneet Aggarwal, School of IE and ECE, Purdue University |
| Pseudocode | Yes | Algorithm 1 DUCRL2 Algorithm 1: Input: δ ∈ (0, 1), d, S, A 2: Initialization: Observe the initial state s1 ∈ S and set t ← 1, t0 ← 0. 3: for episodes k ∈ {1, 2, …} do Computing empirical estimates ... Algorithm 2 Extended Value Iteration 1: Input: {d(s, a), P̂(s, a), r̂(s, a), P(s, a)}(s,a)∈S×A, ϵ > 0 2: Initialization: u0 = {u0(s)}s∈S ← 0, i ← 0, error ← 2ϵ ... |
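For readers unfamiliar with the Algorithm 2 subroutine, the following is a minimal sketch of generic UCRL2-style extended value iteration: value iteration over an optimistic set of MDPs defined by empirical estimates plus confidence radii. It is an assumption-laden illustration of the standard technique, not the paper's exact Algorithm 2 (which additionally handles delayed, composite, and partially anonymous rewards); all names and shapes are illustrative.

```python
import numpy as np

def extended_value_iteration(P_hat, r_hat, conf_p, conf_r,
                             eps=1e-6, max_iter=10_000):
    """Sketch of extended value iteration over an optimistic MDP set.

    P_hat:  (S, A, S) empirical transition probabilities
    r_hat:  (S, A)    empirical mean rewards
    conf_p: (S, A)    L1 confidence radii on transitions
    conf_r: (S, A)    confidence radii on rewards
    Stops when the span of the value increments falls below eps.
    """
    S, A = r_hat.shape
    u = np.zeros(S)
    for _ in range(max_iter):
        r_opt = r_hat + conf_r              # optimistic rewards
        q = np.empty((S, A))
        order = np.argsort(-u)              # states by decreasing value
        for s in range(S):
            for a in range(A):
                # Optimistic transitions: move up to conf_p/2 mass
                # toward the best-valued state, removing the excess
                # from the worst-valued states.
                p = P_hat[s, a].copy()
                p[order[0]] = min(1.0, p[order[0]] + conf_p[s, a] / 2)
                i = S - 1
                while p.sum() > 1.0 and i >= 0:
                    p[order[i]] = max(0.0, p[order[i]] - (p.sum() - 1.0))
                    i -= 1
                q[s, a] = r_opt[s, a] + p @ u
        u_new = q.max(axis=1)
        diff = u_new - u
        u = u_new
        if diff.max() - diff.min() < eps:   # span stopping rule
            break
    return u, q.argmax(axis=1)              # values, greedy policy
```

With zero confidence radii this reduces to ordinary average-reward value iteration on the empirical MDP, which is a convenient sanity check.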
| Open Source Code | No | The paper does not contain any explicit statements about providing open-source code, nor does it include links to code repositories or mention code in supplementary materials. |
| Open Datasets | No | The paper introduces a theoretical framework for Reinforcement Learning with delayed, composite, and partially anonymous rewards. It uses hypothetical scenarios (online advertisement, medical trials, road networks) and an example of an MDP with partially anonymous rewards to illustrate the problem setting, but it does not utilize or provide access to any specific datasets for empirical evaluation. |
| Dataset Splits | No | The paper is theoretical and does not perform experiments using datasets. Therefore, there is no mention of dataset splits. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and regret bounds. It does not include any experimental evaluation that would require specifying hardware. |
| Software Dependencies | No | The paper presents a theoretical algorithm (DUCRL2) and its regret analysis. It does not mention any specific software dependencies or versions used for implementation or experimentation. |
| Experiment Setup | No | The paper proposes a theoretical algorithm and provides regret bounds. It does not describe any experimental setup, hyperparameters, or training configurations. |