Online Convex Optimization in Adversarial Markov Decision Processes
Authors: Aviv Rosenberg, Yishay Mansour
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show an O(L|X|√(|A|T)) regret bound, where T is the number of episodes, X is the state space, A is the action space, and L is the length of each episode. Our online algorithm is implemented using entropic regularization methodology, which allows us to extend the original adversarial MDP model to handle convex performance criteria (different ways to aggregate the losses of a single episode), as well as improve previous regret bounds. |
| Researcher Affiliation | Collaboration | ¹Tel Aviv University, Israel; ²Google Research, Tel Aviv, Israel. Correspondence to: Aviv Rosenberg <avivros007@gmail.com>, Yishay Mansour <mansour.yishay@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: Learner-Environment Interaction; Algorithm 2: UC-O-REPS; Algorithm 3: Comp-Policy Procedure |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | No | The paper is theoretical and does not describe empirical experiments with datasets. Therefore, no information on public datasets for training is provided. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments with datasets. Therefore, no information on dataset splits for validation is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe empirical experiments. Therefore, no hardware specifications for running experiments are provided. |
| Software Dependencies | No | The paper is theoretical and discusses algorithms conceptually, not in terms of specific software implementations with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe empirical experiments. Therefore, no specific experimental setup details like hyperparameters or training configurations are provided. |
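
Since no source code accompanies the paper, the snippet below is a minimal, illustrative sketch (not the authors' implementation) of the entropic-regularization step referenced in the Research Type row: an exponential-weights update on the occupancy measure followed by a KL projection onto the occupancy-measure polytope, in the style of O-REPS-type methods. It assumes a loop-free layered MDP with known transitions; the paper's UC-O-REPS algorithm instead maintains confidence sets over the unknown transition function, which is omitted here. All names (`oreps_update`, `layers`, etc.) are hypothetical.

```python
# Illustrative sketch only (the paper releases no code). Assumes a loop-free
# layered MDP with KNOWN transitions; UC-O-REPS additionally keeps confidence
# sets over the unknown transition function, which is not modeled here.
import numpy as np
import cvxpy as cp

def oreps_update(q, loss, P, layers, eta):
    """One entropic (KL) mirror-descent step over occupancy measures.

    q      : (S, A) current occupancy measure
    loss   : (S, A) adversarial loss observed in the current episode
    P      : (S, A, S) known transition kernel, P[x, a, y] = Pr(y | x, a)
    layers : list of lists of state indices; layers[0] holds the initial state
    eta    : learning-rate parameter
    """
    # Unconstrained multiplicative (exponential-weights) step.
    q_tilde = q * np.exp(-eta * loss)

    # KL-project q_tilde back onto the polytope of valid occupancy measures.
    q_var = cp.Variable(q.shape, nonneg=True)
    objective = cp.Minimize(cp.sum(cp.kl_div(q_var, q_tilde)))

    constraints = [cp.sum(q_var[layers[0], :]) == 1]  # initial layer has mass 1
    for k in range(1, len(layers)):
        for x in layers[k]:
            # Flow conservation: mass leaving state x equals mass entering it.
            inflow = cp.sum(cp.multiply(P[:, :, x], q_var))
            constraints.append(cp.sum(q_var[x, :]) == inflow)

    cp.Problem(objective, constraints).solve(solver=cp.SCS)  # exp-cone solver
    return np.asarray(q_var.value)
```

The episode's policy is then read off by normalizing the occupancy measure per state, π(a|x) = q(x, a) / Σ_a' q(x, a').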