reproducibilityindex.ai

Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

Authors: Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	We propose an efficient algorithm that achieves O(L\|X\| p \|A\|T) regret with high probability, where L is the horizon, \|X\| the number of states, \|A\| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure O( T) regret in this challenging setting; in fact it achieves the same regret as (Rosenberg & Mansour, 2019a) who consider the easier setting with full-information. Our key contributions are two-fold: a tighter conﬁdence set for the transition function; and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
Researcher Affiliation	Academia	1Princeton University 2University of Southern California 3Massachusetts Institute of Technology. Correspondence to: Tiancheng Jin <tiancheng.jin@usc.edu>, Tiancheng Yu <yutc@mit.edu>.
Pseudocode	Yes	Algorithm 2 Upper Occupancy Bound Relative Entropy Policy Search (UOB-REPS); Algorithm 3 COMP-UOB
Open Source Code	No	The paper does not provide any concrete access to source code for the methodology described. It is a theoretical paper and does not mention code release.
Open Datasets	No	The paper is a theoretical work focusing on algorithm design and regret analysis for MDPs. It does not conduct empirical experiments or use datasets for training, validation, or testing.
Dataset Splits	No	The paper is a theoretical work and does not conduct empirical experiments. Therefore, it does not mention training/test/validation dataset splits.
Hardware Specification	No	The paper is theoretical and does not describe any experimental setup, including hardware specifications.
Software Dependencies	No	The paper is theoretical and does not describe any specific software dependencies with version numbers for experimental reproducibility.
Experiment Setup	No	The paper is theoretical and focuses on algorithm design and analysis. It does not include details about an experimental setup, hyperparameters, or training configurations.