Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies
Authors: Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, Shie Mannor
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present an empirical evaluation of both UCRL2 and EULER and compare their performance to that of the proposed variants with greedy policy updates, UCRL2-GP and EULER-GP, respectively. The simulation results can be found in Figure 1 and clearly indicate that using greedy planning leads to negligible degradation in performance. |
| Researcher Affiliation | Collaboration | Yonathan Efroni (Technion, Israel); Nadav Merlis (Technion, Israel); Mohammad Ghavamzadeh (Facebook AI Research); Shie Mannor (Technion, Israel) |
| Pseudocode | Yes | Algorithm 1: Real-Time Dynamic Programming; Algorithm 2: Model-based RL with Greedy Policies; Algorithm 3: UCRL2 with Greedy Policies (UCRL2-GP). A minimal sketch of the greedy-update idea appears after the table. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing the source code or a link to a code repository. |
| Open Datasets | Yes | We evaluated the algorithms on two environments. (i) Chain environment [Osband and Van Roy, 2017]: In this MDP, there are N states, which are connected in a chain... (ii) 2D chain: A generalization of the chain environment... An illustrative sketch of the chain environment appears after the table. |
| Dataset Splits | No | The paper describes simulation environments and averages results over random seeds, but it does not specify explicit training/validation/test dataset splits as commonly found in supervised learning. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or solvers). |
| Experiment Setup | No | The paper describes the environment parameters (e.g., N states, H horizon) and mentions averaging over 5 random seeds, but it does not provide specific experimental setup details such as hyperparameters (learning rate, batch size, epochs, optimizer settings) for the algorithms tested. |
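
The chain environment quoted in the Open Datasets row is only partially described here (the excerpts are truncated), but a minimal illustrative reconstruction is possible. The sketch below assumes a standard formulation: the agent starts at one end of a chain of N states, one action moves it toward the far, rewarding end, the other moves it back, and moves are occasionally reversed. The `slip` probability, reward placement, and horizon are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

class ChainMDP:
    """Minimal sketch of an N-state chain MDP (after Osband & Van Roy, 2017).

    The exact transition noise and reward placement used in the paper may
    differ; this is an illustrative reconstruction, not the authors' code.
    """

    def __init__(self, n_states=10, horizon=None, slip=0.1, seed=0):
        self.n = n_states
        self.horizon = horizon if horizon is not None else n_states  # episode length H (assumed)
        self.slip = slip  # probability an action is reversed (assumed)
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right; with probability `slip` the move flips
        move = 1 if action == 1 else -1
        if self.rng.random() < self.slip:
            move = -move
        self.state = int(np.clip(self.state + move, 0, self.n - 1))
        self.t += 1
        reward = 1.0 if self.state == self.n - 1 else 0.0  # reward at far end (assumed)
        done = self.t >= self.horizon
        return self.state, reward, done
```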
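The paper's central algorithmic idea, per the Pseudocode row, is to replace full optimistic planning before each episode with greedy, one-step RTDP-style policy updates at the visited states. The sketch below is a hypothetical illustration of that idea for a finite-horizon tabular MDP: `Q`, `P_hat`, `r_hat`, and `bonus` are placeholder names for the value table, empirical transition model, empirical rewards, and exploration bonus, and the update shown is not the authors' exact Algorithm 2.

```python
import numpy as np

def greedy_episode(Q, P_hat, r_hat, bonus, env, H):
    """One episode of model-based RL with greedy (one-step RTDP) updates.

    Instead of solving the full optimistic planning problem before the
    episode, each visited state receives a single optimistic Bellman
    backup, and the action is chosen greedily from the updated Q-values.

    Q      : value table of shape (H + 1, S, A), with Q[H] == 0 (terminal)
    P_hat  : empirical transition model, shape (S, A, S)
    r_hat  : empirical mean rewards, shape (S, A)
    bonus  : optimism bonus per (s, a), shape (S, A)
    """
    s = env.reset()
    for h in range(H):
        # One-step optimistic backup at the visited state only.
        V_next = Q[h + 1].max(axis=1)                      # shape (S,)
        Q[h, s] = r_hat[s] + bonus[s] + P_hat[s] @ V_next  # shape (A,)
        a = int(np.argmax(Q[h, s]))                        # act greedily
        s, _, done = env.step(a)
        if done:
            break
    return Q
```

Plugged into the `ChainMDP` sketch above, repeated calls to `greedy_episode` (with `P_hat`, `r_hat`, and `bonus` re-estimated from visit counts between episodes) would give a toy analogue of the greedy variants; the choice of exploration bonus is where UCRL2-GP and EULER-GP would differ.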