Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

Authors: Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose an efficient algorithm that achieves O(L|X|√(|A|T)) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure O(√T) regret in this challenging setting; in fact, it achieves the same regret as Rosenberg & Mansour (2019a), who consider the easier full-information setting. Our key contributions are twofold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
Researcher Affiliation | Academia | Princeton University; University of Southern California; Massachusetts Institute of Technology. Correspondence to: Tiancheng Jin <tiancheng.jin@usc.edu>, Tiancheng Yu <yutc@mit.edu>.
Pseudocode | Yes | Algorithm 2: Upper Occupancy Bound Relative Entropy Policy Search (UOB-REPS); Algorithm 3: COMP-UOB.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. It is a theoretical paper and does not mention a code release.
Open Datasets | No | The paper is a theoretical work focusing on algorithm design and regret analysis for MDPs. It does not conduct empirical experiments or use datasets for training, validation, or testing.
Dataset Splits | No | The paper is a theoretical work and does not conduct empirical experiments; therefore, it does not mention training/validation/test dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any experimental setup, including hardware specifications.
Software Dependencies | No | The paper is theoretical and does not list any software dependencies with version numbers for experimental reproducibility.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design and analysis. It does not include details about an experimental setup, hyperparameters, or training configurations.
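The two contributions summarized above can be illustrated in code. The sketch below is a reconstruction for illustration only, not the authors' implementation: the confidence-width constants follow the standard empirical-Bernstein form the paper builds on (consult the paper for the exact expression), and all variable names and numeric values in the example are made up.

```python
import numpy as np

def confidence_width(p_hat, n_visits, log_term):
    """Entry-wise width of a Bernstein-style transition confidence set,
    2*sqrt(P_hat * L / max(1, N)) + 14*L / (3*max(1, N)),
    where L = ln(T|X||A|/delta) is passed in as `log_term`.
    (Illustrative; check the paper for the exact constants.)"""
    n = np.maximum(1, n_visits)
    return 2.0 * np.sqrt(p_hat * log_term / n) + 14.0 * log_term / (3.0 * n)

def optimistic_loss_estimate(loss, visited, upper_occupancy, gamma):
    """hat_ell(x, a) = ell(x, a) * 1{(x, a) visited} / (u(x, a) + gamma):
    observed losses are inversely weighted by the *upper* occupancy bound u
    (the largest occupancy measure consistent with the confidence set),
    plus an implicit-exploration term gamma > 0."""
    return loss * visited.astype(float) / (upper_occupancy + gamma)

# Tiny 2-state, 2-action example with made-up numbers.
loss = np.array([[0.5, 0.2], [0.9, 0.1]])           # observed losses
visited = np.array([[True, False], [False, True]])  # pairs visited this episode
u = np.array([[0.6, 0.4], [0.3, 0.7]])              # upper occupancy bounds
est = optimistic_loss_estimate(loss, visited, u, gamma=0.1)
# Unvisited pairs get estimate 0; visited ones are scaled by 1/(u + gamma).
```

Because u upper-bounds the true occupancy measure, the estimator is biased optimistically (downward), which is what allows the regret analysis to go through despite the unknown transition function.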