Reward-Mixing MDPs with Few Latent Contexts are Learnable

Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary M ≥ 2 and provide a sample-efficient algorithm EM² that outputs an ϵ-optimal policy using Õ(ϵ^{-2} · S^d A^d · poly(H, Z)^d) episodes... We also provide a (SA)^{Ω(√M)}/ϵ^2 lower bound, supporting that super-polynomial sample complexity in M is necessary.
Researcher Affiliation | Collaboration | ¹Wisconsin Institute for Discovery, University of Wisconsin-Madison, USA; ²Meta, New York; ³Department of Electrical and Computer Engineering, University of Texas at Austin, USA; ⁴Technion, Israel; ⁵NVIDIA Research
Pseudocode | Yes | Algorithm 1: Estimate and Match Moments (EM²) ... Algorithm 2
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | The paper is theoretical and does not use or provide access to any specific datasets for training.
Dataset Splits | No | The paper is theoretical and does not describe experimental validation procedures with dataset splits.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not mention specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not include details about an experimental setup with hyperparameters or training configurations.
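
As a rough, non-authoritative illustration of the upper bound quoted in the Research Type row, the sketch below evaluates the stated episode scaling Õ(ϵ^{-2} · S^d A^d · poly(H, Z)^d), taking d = min(2M−1, H) as the paper defines it. The function name em2_episode_scaling, the H·Z stand-in for the unspecified poly(H, Z) factor, and all parameter values are assumptions made purely for illustration.

```python
# Hypothetical sketch of the sample-complexity scaling quoted above:
# O~(eps^{-2} * S^d * A^d * poly(H, Z)^d) episodes, with d = min(2M - 1, H).
# The poly(H, Z) term is unspecified in the quoted bound, so H * Z is used
# here as a placeholder; all names and numbers are illustrative only.

def em2_episode_scaling(S, A, H, Z, M, eps):
    """Order-of-magnitude episode count implied by the stated upper bound."""
    d = min(2 * M - 1, H)   # effective degree: grows with the number of contexts M
    poly_hz = H * Z          # placeholder for the unspecified poly(H, Z) factor
    return (S * A * poly_hz) ** d / eps ** 2

if __name__ == "__main__":
    # Toy instance: 10 states, 5 actions, horizon 20, reward support size 2.
    for M in (2, 3, 5):
        n = em2_episode_scaling(S=10, A=5, H=20, Z=2, M=M, eps=0.1)
        print(f"M={M}: d={min(2 * M - 1, 20)}, ~{n:.2e} episodes")
```

Running the toy loop makes the super-polynomial dependence on M concrete: each extra latent context raises the exponent d by two, multiplying the episode count by several orders of magnitude, which is consistent with the (SA)^{Ω(√M)}/ϵ^2 lower bound quoted above.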