Safe Policy Improvement by Minimizing Robust Baseline Regret
Authors: Mohammad Ghavamzadeh, Marek Petrik, Yinlam Chow
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches. In this section, we experimentally evaluate the benefits of minimizing the robust baseline regret. |
| Researcher Affiliation | Collaboration | Marek Petrik, University of New Hampshire (mpetrik@cs.unh.edu); Mohammad Ghavamzadeh, Adobe Research & INRIA Lille (ghavamza@adobe.com); Yinlam Chow, Stanford University (ychow@stanford.edu) |
| Pseudocode | Yes | Algorithm 1: Approximate Robust Baseline Regret Minimization Algorithm (an illustrative sketch follows the table) |
| Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository. |
| Open Datasets | No | We use a uniform random policy to gather samples. The problem is based on the domain from [Petrik and Wu, 2015], whose description is detailed in Appendix I.2. |
| Dataset Splits | No | No specific training, validation, or test dataset splits are described in terms of percentages or sample counts. The paper mentions 'mean of 40 runs' and 'averaged over 5 runs' for experiments, but not data splitting. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) are mentioned for running experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned. |
| Experiment Setup | No | The paper describes how the model error function is constructed from samples and details of the problem domains (Grid Problem, Energy Arbitrage), including aspects like "uniform random policy to gather samples" and "number of transition samples used in constructing the uncertain model." However, it does not provide specific algorithmic hyperparameters (e.g., learning rate, batch size, number of epochs) or other detailed training configurations for the proposed algorithms. |
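
For context on the "Research Type" row above: the objective the quoted passage refers to is, as we read the paper, the robust baseline regret, i.e. the worst-case performance gap between a candidate policy and the baseline over the model uncertainty set. A sketch of the formulation, with $\rho(\pi,\xi)$ the return of policy $\pi$ under transition model $\xi$, $\pi_B$ the baseline policy, and $\Xi$ the uncertainty set:

```latex
\pi^{\star} \;\in\; \arg\max_{\pi \in \Pi} \; \min_{\xi \in \Xi} \;
  \bigl( \rho(\pi, \xi) \;-\; \rho(\pi_B, \xi) \bigr)
```

Because the inner minimum ranges over all plausible models, any policy achieving a positive objective value is guaranteed to improve on the baseline no matter which model in $\Xi$ is the true one.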
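To make the "Pseudocode" row concrete, here is a minimal sketch of the underlying idea, not a reproduction of the paper's Algorithm 1: it replaces the uncertainty set $\Xi$ with a finite list of sampled transition models and brute-forces an explicit candidate set, falling back to the baseline when no candidate certifies a positive worst-case improvement. All names (`policy_return`, `robust_regret_policy`) and the toy numbers are hypothetical.

```python
import numpy as np

def policy_return(P, R, policy, gamma=0.95, p0=None):
    """Expected discounted return of a deterministic policy in a finite MDP.

    P: (S, A, S) transition tensor, R: (S, A) reward matrix,
    policy: (S,) array of action indices, p0: initial state distribution.
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]                        # (S, S) under the policy
    R_pi = R[np.arange(S), policy]                        # (S,) under the policy
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # Bellman linear system
    p0 = np.full(S, 1.0 / S) if p0 is None else p0
    return p0 @ v

def robust_regret_policy(models, R, candidates, baseline, gamma=0.95):
    """Return the candidate maximizing worst-case improvement over the baseline.

    `models` is a finite stand-in for the uncertainty set Xi; if no candidate
    certifies a positive worst-case gap, the baseline itself is returned,
    which is what makes the improvement "safe".
    """
    best_pi, best_gap = baseline, 0.0
    for pi in candidates:
        gap = min(
            policy_return(P, R, pi, gamma) - policy_return(P, R, baseline, gamma)
            for P in models
        )
        if gap > best_gap:
            best_pi, best_gap = pi, gap
    return best_pi, best_gap

if __name__ == "__main__":
    # Toy 2-state, 2-action MDP with two sampled models (hypothetical numbers).
    rng = np.random.default_rng(0)
    S, A = 2, 2
    models = []
    for _ in range(2):
        P = rng.random((S, A, S))
        P /= P.sum(axis=-1, keepdims=True)                # normalize to distributions
        models.append(P)
    R = rng.random((S, A))
    baseline = np.zeros(S, dtype=int)                     # always take action 0
    candidates = [np.array(c) for c in np.ndindex(A, A)]  # all deterministic policies
    pi, gap = robust_regret_policy(models, R, candidates, baseline)
    print("policy:", pi, "certified worst-case gap:", gap)
```

The certified gap is a lower bound on the true improvement whenever the true model lies in the sampled set. The paper itself works with a continuous uncertainty set, shows that the exact optimization is NP-hard, and derives the approximate algorithm summarized in the table.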