Optimistic Policy Optimization with Bandit Feedback
Authors: Lior Shani, Yonathan Efroni, Aviv Rosenberg, Shie Mannor
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | For this setting, we propose an optimistic policy optimization algorithm for which we establish Õ(√(S²AH⁴K)) regret for stochastic rewards. Furthermore, we prove Õ(S²AH⁴K^(2/3)) regret for adversarial rewards. |
| Researcher Affiliation | Academia | ¹Technion – Israel Institute of Technology, Haifa, Israel; ²Tel Aviv University, Tel Aviv, Israel. |
| Pseudocode | Yes | Algorithm 1 POMD with Known Model; Algorithm 2 Optimistic POMD for Stochastic MDPs; Algorithm 3 Optimistic POMD for Adversarial MDPs |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | This paper is theoretical and focuses on algorithm design and proofs, rather than conducting empirical experiments on datasets. Therefore, no information about public datasets is provided. |
| Dataset Splits | No | This paper is theoretical and does not involve empirical experiments or dataset usage, so there are no dataset split details for validation. |
| Hardware Specification | No | This paper is theoretical and does not report on empirical experiments; therefore, no hardware specifications are mentioned. |
| Software Dependencies | No | This paper is theoretical and focuses on algorithm design and proofs, without mentioning any specific software dependencies or version numbers. |
| Experiment Setup | No | This paper is theoretical and does not report on empirical experiments; therefore, no experimental setup details such as hyperparameters or training configurations are provided. |