Reinforcement Learning in Reward-Mixing MDPs

Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "We provide the first polynomial-time algorithm that finds an ϵ-optimal policy after exploring O(poly(H, ϵ⁻¹)·S²A²) episodes... We also show that Ω(S²A²/ϵ²) episodes are necessary, and thus our result is tight in S and A. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space." From the main theoretical results (upper bound): "We first prove the correctness of our main algorithm (Algorithm 1) and compute an upper bound on its sample complexity. Our upper bound has a quadratic dependence on SA. This is well expected because we must estimate quantities related to the augmented second-order MDP. Our lower bound shows that this dependence is in fact tight." (A hypothetical sketch of the reward-mixing setting appears after the table.)
Researcher Affiliation | Collaboration | Jeongyeol Kwon, The University of Texas at Austin (kwonchungli@utexas.edu); Yonathan Efroni, Microsoft Research, NYC (jonathan.efroni@gmail.com); Constantine Caramanis, The University of Texas at Austin (constantine@utexas.edu); Shie Mannor, Technion and NVIDIA (shie@ee.technion.ac.il, smannor@nvidia.com)
Pseudocode | Yes | Algorithm 1: Learning Two Reward-Mixture MDPs; Algorithm 2: Pure Exploration of Second-Order Correlations; Algorithm 3: Reward Model Recovery from Second-Order Correlations. (The second sketch after the table illustrates the moment-recovery idea behind Algorithms 2 and 3.)
Open Source Code | No | The paper is theoretical and does not mention releasing any source code or providing a link to a repository for the methodology described.
Open Datasets | No | The paper is theoretical and does not involve the use of datasets for training or evaluation.
Dataset Splits | No | The paper is theoretical and does not involve the use of datasets or their splits for validation or training.
Hardware Specification | No | The paper is theoretical and does not describe any hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup including hyperparameters or training configurations.
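
A note on the setting quoted in the Research Type row: in a reward-mixing MDP the transition kernel is shared, but a latent context is drawn at the start of each episode, fixes which of two reward models is active, and is never observed by the learner. The following is a minimal, hypothetical Python sketch of that setting, assuming a uniform mixture over two Bernoulli reward models; the names (S, A, H, P, R, sample_episode) are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5                           # states, actions, horizon (toy sizes)
P = rng.dirichlet(np.ones(S), size=(S, A))  # shared transition kernel: P[s, a] is a dist over S
R = rng.uniform(size=(2, S, A))             # two latent Bernoulli reward means: R[m, s, a]

def sample_episode(policy):
    """Roll out one episode; the latent context m is drawn once and never revealed."""
    m = rng.integers(2)                     # hidden context chooses the active reward model
    s, traj = 0, []
    for _ in range(H):
        a = policy(s)
        r = float(rng.random() < R[m, s, a])  # Bernoulli reward from the active model
        traj.append((s, a, r))
        s = int(rng.choice(S, p=P[s, a]))
    return traj                             # note: m itself is not part of the output

# Example: trajectories from a uniformly random policy look identical across contexts
# except through their rewards, which is why single-reward statistics are not enough.
episode = sample_episode(lambda s: int(rng.integers(A)))
```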
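The Pseudocode row points at the two-step structure of the method: estimate second-order reward correlations by pure exploration (Algorithm 2), then recover the two reward models from those correlations (Algorithm 3). The sketch below shows the moment identity this rests on in the exact-moments case, assuming two equally likely contexts: since both rewards in an episode come from the same latent model, C(x, x') = E[r(x)·r(x')] = (μ₁(x)μ₁(x') + μ₂(x)μ₂(x'))/2, while the first moment is m(x) = (μ₁(x) + μ₂(x))/2, so C(x, x') − m(x)m(x') = d(x)d(x')/4 with d(x) = μ₁(x) − μ₂(x). Everything here is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np

def recover_two_models(m, C):
    """Recover {mu1, mu2} per state-action pair x from exact population moments.

    m: (X,) vector of first moments; C: (X, X) second-order correlation matrix.
    Returns the two reward-mean vectors, up to a global swap of the two models.
    """
    D = C - np.outer(m, m)                  # D[x, x'] = d(x) d(x') / 4, a rank-one matrix
    gaps = 2.0 * np.sqrt(np.maximum(np.diag(D), 0.0))  # |d(x)| from the diagonal
    anchor = int(np.argmax(gaps))           # any x with a nonzero gap pins down relative signs
    signs = np.sign(D[anchor])
    signs[anchor] = 1.0                     # global sign choice: d(anchor) > 0
    d = signs * gaps                        # d(x), up to one global flip (= swapping models)
    return m + d / 2.0, m - d / 2.0         # mu1 = m + d/2, mu2 = m - d/2

# Tiny check on exact moments for three state-action pairs:
mu1 = np.array([0.9, 0.2, 0.5])
mu2 = np.array([0.1, 0.8, 0.5])
m = (mu1 + mu2) / 2
C = (np.outer(mu1, mu1) + np.outer(mu2, mu2)) / 2
r1, r2 = recover_two_models(m, C)
print(r1, r2)  # recovers mu1 and mu2 (possibly swapped)
```

With finite data, C must instead be estimated from products of rewards observed at two points of the same episode, which is consistent with the quadratic S²A² dependence the authors attribute to estimating quantities of the augmented second-order MDP.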