Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP
Authors: Orin Levy, Yishay Mansour
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains regret bound of $\widetilde{O}\big((H + 1/p_{\min})\, H \lvert S\rvert^{3/2} \sqrt{\lvert A\rvert\, T \log(\max\{\lvert G\rvert, \lvert P\rvert\}/\delta)}\big)$ with probability $1 - \delta$, where $P$ and $G$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. |
| Researcher Affiliation | Collaboration | Orin Levy¹, Yishay Mansour¹,²; ¹ Tel Aviv University; ² Google Research, Tel Aviv |
| Pseudocode | Yes | Algorithm 1: Regret Minimization for CMDP with Known Dynamics (RM-KD) |
| Open Source Code | No | The paper is theoretical and does not mention releasing open-source code for the described algorithms. |
| Open Datasets | No | The paper is theoretical and does not mention the use of any datasets for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not mention any training/validation/test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not mention any specific hardware specifications used for experiments. |
| Software Dependencies | No | The paper is theoretical and describes algorithms conceptually, but does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not include details about an experimental setup, hyperparameters, or system-level training settings. |
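
To give a feel for how the regret bound quoted in the Research Type row scales, here is a minimal sketch that evaluates only its leading term. The function name `regret_bound_scale` and all numeric inputs are hypothetical and chosen for illustration. The constants and polylogarithmic factors hidden by the $\widetilde{O}$ notation are dropped, so the returned numbers are meaningful only for comparing settings, not as actual regret values.

```python
import math


def regret_bound_scale(H, p_min, n_states, n_actions, T, n_G, n_P, delta):
    """Leading term (H + 1/p_min) * H * |S|^{3/2} * sqrt(|A| * T * log(max(|G|, |P|) / delta)).

    Hypothetical helper: hidden constants and polylog factors of the
    O-tilde bound are omitted, so only relative comparisons make sense.
    """
    if not 0 < p_min <= 1:
        raise ValueError("p_min must lie in (0, 1]")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    if min(H, n_states, n_actions, T, n_G, n_P) < 1:
        raise ValueError("sizes, horizon and episode count must be at least 1")
    log_term = math.log(max(n_G, n_P) / delta)
    return (H + 1.0 / p_min) * H * n_states ** 1.5 * math.sqrt(n_actions * T * log_term)


# Illustrative (made-up) problem sizes.
params = dict(H=5, p_min=0.1, n_states=10, n_actions=4, n_G=100, n_P=100, delta=0.05)

base = regret_bound_scale(T=10_000, **params)
quad = regret_bound_scale(T=40_000, **params)

# Quadrupling T should roughly double the leading term (sqrt(T) dependence).
print("ratio when T is quadrupled:", quad / base)

# The per-episode value of the leading term shrinks as T grows.
print("per-episode term:", base / 10_000, "->", quad / 40_000)
```

Because the term grows as $\sqrt{T}$, the per-episode value decreases with more episodes, which is the sense in which the bound is sublinear in $T$. Note also the $1/p_{\min}$ factor: smaller minimum reachability makes the leading term larger.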