Optimistic Policy Optimization with Bandit Feedback

Authors: Lior Shani, Yonathan Efroni, Aviv Rosenberg, Shie Mannor

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "For this setting, we propose an optimistic policy optimization algorithm for which we establish Õ(√(S²AH⁴K)) regret for stochastic rewards. Furthermore, we prove Õ(√(S²AH⁴)·K^(2/3)) regret for adversarial rewards."
Researcher Affiliation | Academia | Technion - Israel Institute of Technology, Haifa, Israel; Tel Aviv University, Tel Aviv, Israel.
Pseudocode | Yes | Algorithm 1: POMD with Known Model; Algorithm 2: Optimistic POMD for Stochastic MDPs; Algorithm 3: Optimistic POMD for Adversarial MDPs.
Open Source Code | No | The paper provides no statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical, focusing on algorithm design and proofs rather than empirical experiments, so no public datasets are used.
Dataset Splits | No | The paper involves no empirical experiments or datasets, so there are no dataset splits to report.
Hardware Specification | No | The paper reports no empirical experiments, so no hardware specifications are mentioned.
Software Dependencies | No | The paper mentions no specific software dependencies or version numbers.
Experiment Setup | No | The paper provides no experimental setup details such as hyperparameters or training configurations.
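The POMD algorithms listed in the Pseudocode row share a common core: a mirror-descent (multiplicative-weights) policy update of the form π_{k+1}(a|s) ∝ π_k(a|s)·exp(η·Q_k(s,a)), applied with the Q-estimates available in each setting. Since the authors release no code, the following is only an illustrative sketch of that update step; the function name, array shapes, and toy Q-values are assumptions, not the paper's implementation.

```python
import numpy as np

def pomd_update(policy, q_values, eta):
    """One mirror-descent policy update per state (illustrative sketch).

    policy:   (S, A) array, each row a probability distribution over actions
    q_values: (S, A) array of Q estimates (optimistic in the bandit setting)
    eta:      step-size parameter
    """
    # Multiplicative-weights form of KL-regularized mirror descent:
    # reweight each action by exp(eta * Q), then renormalize per state.
    unnormalized = policy * np.exp(eta * q_values)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Toy example (hypothetical numbers): 2 states, 3 actions, uniform start
policy = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
new_policy = pomd_update(policy, q, eta=0.5)
print(new_policy.sum(axis=1))  # each row remains a valid distribution
```

The update shifts probability mass toward actions with higher estimated Q-values while the KL regularization keeps each step close to the previous policy, which is what drives the regret analysis.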