Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
Authors: Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose an efficient algorithm that achieves O(L|X| p |A|T) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure O( T) regret in this challenging setting; in fact it achieves the same regret as (Rosenberg & Mansour, 2019a) who consider the easier setting with full-information. Our key contributions are two-fold: a tighter confidence set for the transition function; and an optimistic loss estimator that is inversely weighted by an upper occupancy bound. |
| Researcher Affiliation | Academia | 1Princeton University 2University of Southern California 3Massachusetts Institute of Technology. Correspondence to: Tiancheng Jin <tiancheng.jin@usc.edu>, Tiancheng Yu <yutc@mit.edu>. |
| Pseudocode | Yes | Algorithm 2 Upper Occupancy Bound Relative Entropy Policy Search (UOB-REPS); Algorithm 3 COMP-UOB |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. It is a theoretical paper and does not mention code release. |
| Open Datasets | No | The paper is a theoretical work focusing on algorithm design and regret analysis for MDPs. It does not conduct empirical experiments or use datasets for training, validation, or testing. |
| Dataset Splits | No | The paper is a theoretical work and does not conduct empirical experiments. Therefore, it does not mention training/test/validation dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup, including hardware specifications. |
| Software Dependencies | No | The paper is theoretical and does not describe any specific software dependencies with version numbers for experimental reproducibility. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and analysis. It does not include details about an experimental setup, hyperparameters, or training configurations. |