Reinforcement Learning in Reward-Mixing MDPs
Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We provide the first polynomial-time algorithm that finds an ϵ-optimal policy after exploring O(poly(H, ϵ⁻¹) S²A²) episodes... We also show that Ω(S²A²/ϵ²) episodes are necessary, and thus our result is tight in S, A. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space. Main Theoretical Results: Upper Bound. We first prove the correctness of our main algorithm (Algorithm 1) and compute an upper bound on its sample complexity. Our upper bound has a quadratic dependence on S·A. This is well expected because we must estimate quantities related to the augmented second-order MDP. Our lower bound shows that this dependence is in fact tight. |
| Researcher Affiliation | Collaboration | Jeongyeol Kwon The University of Texas at Austin kwonchungli@utexas.edu; Yonathan Efroni Microsoft Research, NYC jonathan.efroni@gmail.com; Constantine Caramanis The University of Texas at Austin constantine@utexas.edu; Shie Mannor Technion, NVIDIA shie@ee.technion.ac.il, smannor@nvidia.com |
| Pseudocode | Yes | Algorithm 1 Learning Two Reward-Mixture MDPs; Algorithm 2 Pure Exploration of Second-Order Correlations; Algorithm 3 Reward Model Recovery from Second-Order Correlations |
| Open Source Code | No | The paper is theoretical and does not mention releasing any source code or providing a link to a repository for the methodology described. |
| Open Datasets | No | The paper is theoretical and does not involve the use of datasets for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not involve the use of datasets or their splits for validation or training. |
| Hardware Specification | No | The paper is theoretical and does not describe any hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup including hyperparameters or training configurations. |
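
The matching upper and lower bounds quoted in the Research Type row can be written out explicitly. This is a sketch based only on the excerpt above; polylogarithmic factors and problem-dependent constants are omitted, as they are in the quoted text:

```latex
% Upper bound: sample complexity of the paper's Algorithm 1
N_{\mathrm{upper}} = O\!\left(\mathrm{poly}(H,\, \epsilon^{-1})\; S^2 A^2\right)

% Lower bound: episodes necessary for any algorithm on this problem class
N_{\mathrm{lower}} = \Omega\!\left(\frac{S^2 A^2}{\epsilon^2}\right)
```

Since both bounds scale as S²A², the quadratic dependence on the state-action count is tight, which is the sense in which the paper's result is "tight in S, A."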