Eluder-based Regret for Stochastic Contextual MDPs

Authors: Orin Levy, Asaf Cassel, Alon Cohen, Yishay Mansour

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We present the E-UC3RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of e O(H3p T|S||A|d E(P) log(|F||P|/δ))) , with T being the number of episodes, S the state space, A the action space, H the horizon, P and F are finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and d E(P) is the Eluder dimension of P w.r.t the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics which may be of independent interest.
Researcher Affiliation Collaboration 1Balavatnick school of Computer Science, Tel Aviv University, Tel Aviv, Israel 2School of Electrical Engeneering, Tel Aviv University, Tel Aviv, Israel 3Google Research, Tel Aviv, Israel.
Pseudocode Yes Algorithm 1 Eluder Upper Counterfactual Confidence for Contextual RL (E-UC3RL)
Open Source Code No The paper does not provide any specific links to open-source code or explicitly state that the code for the methodology is being released.
Open Datasets No The paper is theoretical and focuses on algorithm design and regret bounds. It does not mention using any specific datasets (publicly available or otherwise) for training, validation, or testing.
Dataset Splits No The paper is theoretical and does not involve empirical experiments with data splits for training, validation, or testing.
Hardware Specification No The paper is theoretical and focuses on algorithmic efficiency ('efficient offline regression oracles', 'runtime complexity of our algorithm is in poly(T, |S|, |A|, H)'). It does not describe any specific hardware used for running experiments.
Software Dependencies No The paper is theoretical and does not specify any software dependencies with version numbers (e.g., specific libraries or frameworks) required for implementation.
Experiment Setup No The paper is theoretical and focuses on algorithm design and analysis. It does not describe an experimental setup with specific hyperparameters or system-level training settings.