Online Convex Optimization in Adversarial Markov Decision Processes

Authors: Aviv Rosenberg, Yishay Mansour

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We show an O(L|X|√(|A|T)) regret bound, where T is the number of episodes, X is the state space, A is the action space, and L is the length of each episode. Our online algorithm is implemented using an entropic regularization methodology, which allows us to extend the original adversarial MDP model to handle convex performance criteria (different ways to aggregate the losses of a single episode), as well as to improve previous regret bounds.
Researcher Affiliation | Collaboration | Tel Aviv University, Israel; Google Research, Tel Aviv, Israel. Correspondence to: Aviv Rosenberg <avivros007@gmail.com>, Yishay Mansour <mansour.yishay@gmail.com>.
Pseudocode | Yes | Algorithm 1 (Learner-Environment Interaction), Algorithm 2 (UC-O-REPS), Algorithm 3 (Comp-Policy procedure).
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | No | The paper is theoretical and does not describe empirical experiments with datasets, so no information on public datasets for training is provided.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments with datasets, so no information on dataset splits for validation is provided.
Hardware Specification | No | The paper is theoretical and does not describe empirical experiments, so no hardware specifications for running experiments are provided.
Software Dependencies | No | The paper is theoretical and discusses its algorithms conceptually, not in terms of specific software implementations with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe empirical experiments, so no experimental setup details such as hyperparameters or training configurations are provided.
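
For context on the regret bound and the entropic regularization mentioned in the Research Type row, the LaTeX sketch below writes out the stated bound and a generic relative-entropy-regularized online mirror descent update over occupancy measures, the template that O-REPS-style algorithms such as UC-O-REPS build on. This is a minimal sketch under that assumption: the symbols R_T, q_t, ell_t, eta, and Delta(M) are illustrative notation introduced here, and the update is the general technique rather than a verbatim reproduction of the paper's Algorithm 2.

\documentclass{article}
\usepackage{amsmath,amssymb}
\DeclareMathOperator*{\argmin}{arg\,min}
\begin{document}

% Regret bound quoted in the Research Type row: T episodes, state space X,
% action space A, episode length L; \ell_t(\pi) denotes the expected loss of
% policy \pi in episode t (illustrative notation).
\[
  R_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi} \sum_{t=1}^{T} \ell_t(\pi)
  \;=\; O\!\left( L\,|X|\sqrt{|A|\,T} \right).
\]

% Generic entropic-regularization update (relative-entropy-regularized online
% mirror descent) over occupancy measures q: \eta is a step size, \ell_t the
% loss vector of episode t, and \Delta(M) the set of valid occupancy measures
% of the MDP M (assumed notation, not the paper's).
\[
  q_{t+1} \;=\; \argmin_{q \,\in\, \Delta(M)}
  \;\eta \,\langle q, \ell_t \rangle \;+\; D_{\mathrm{KL}}\!\left( q \,\Vert\, q_t \right),
\]
% with the unnormalized relative entropy over state-action pairs:
\[
  D_{\mathrm{KL}}\!\left( q \,\Vert\, q' \right)
  \;=\; \sum_{x,a} \Bigl( q(x,a)\,\log\frac{q(x,a)}{q'(x,a)} \;-\; q(x,a) \;+\; q'(x,a) \Bigr).
\]

\end{document}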