Safe Policy Improvement by Minimizing Robust Baseline Regret

Authors: Mohammad Ghavamzadeh, Marek Petrik, Yinlam Chow

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches. In this section, we experimentally evaluate the benefits of minimizing the robust baseline regret."
Researcher Affiliation | Collaboration | Marek Petrik (University of New Hampshire, mpetrik@cs.unh.edu); Mohammad Ghavamzadeh (Adobe Research & INRIA Lille, ghavamza@adobe.com); Yinlam Chow (Stanford University, ychow@stanford.edu)
Pseudocode | Yes | Algorithm 1: Approximate Robust Baseline Regret Minimization Algorithm (see the sketch after this table)
Open Source Code | No | No explicit statement about providing open-source code or a link to a code repository.
Open Datasets | No | "We use a uniform random policy to gather samples." The problem is based on the domain from [Petrik and Wu, 2015], whose description is detailed in Appendix I.2.
Dataset Splits | No | No training, validation, or test splits are described in terms of percentages or sample counts. The paper mentions a "mean of 40 runs" and results "averaged over 5 runs", but no data splitting.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) are mentioned for running the experiments.
Software Dependencies | No | No specific software dependencies with version numbers are mentioned.
Experiment Setup | No | The paper describes how the model error function is constructed from samples and gives details of the problem domains (Grid Problem, Energy Arbitrage), such as the "uniform random policy to gather samples" and the "number of transition samples used in constructing the uncertain model" (a hedged sketch of such an error function appears below). However, it does not provide algorithmic hyperparameters (e.g., learning rate, batch size, number of epochs) or other detailed training configurations for the proposed algorithms.
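For readers who want a concrete picture of the kind of procedure the Pseudocode row points to, the sketch below is a minimal illustration, not the paper's implementation. It assumes an s,a-rectangular L1 uncertainty set around a nominal tabular model and a decision rule of the form "deploy the robust-optimal candidate only if its worst-case return beats the baseline's worst-case return, otherwise keep the baseline." The function names (worst_case_distribution, robust_value_iteration, safe_policy_improvement), the discount factor, and all constants are illustrative assumptions.

```python
import numpy as np


def worst_case_distribution(p_nominal, values, radius):
    """Worst-case transition distribution within an L1 ball of the given radius
    around the nominal distribution: shift up to radius/2 of probability mass
    from the highest-value next states onto the lowest-value next state."""
    p = p_nominal.copy()
    worst = int(np.argmin(values))
    budget = radius / 2.0
    for s in np.argsort(values)[::-1]:  # highest-value next states first
        if s == worst or budget <= 0:
            continue
        shift = min(budget, p[s])
        p[s] -= shift
        p[worst] += shift
        budget -= shift
    return p


def robust_value_iteration(P, R, error, gamma=0.95, iters=1000, policy=None):
    """Robust value iteration for an s,a-rectangular L1 uncertainty model.
    If `policy` is given, evaluates its worst-case value; otherwise optimizes
    over actions to obtain a robust-optimal policy."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        q = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                p = worst_case_distribution(P[s, a], v, error[s, a])
                q[s, a] = R[s, a] + gamma * p @ v
        v_new = q[np.arange(n_states), policy] if policy is not None else q.max(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-8:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=1)


def safe_policy_improvement(P, R, error, baseline, initial_dist, gamma=0.95):
    """Return the robust-optimal candidate policy only if its worst-case return
    improves on the baseline's worst-case return; otherwise keep the baseline."""
    v_candidate, candidate = robust_value_iteration(P, R, error, gamma)
    v_baseline, _ = robust_value_iteration(P, R, error, gamma, policy=baseline)
    if initial_dist @ v_candidate > initial_dist @ v_baseline:
        return candidate
    return baseline
```

In this reading, the fallback to the baseline is what provides the safety property: the returned policy is never worse than the baseline under the worst model in the uncertainty set.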
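The model error function mentioned in the Experiment Setup row is likewise not spelled out here. The sketch below shows one standard way such an error function is built from transition counts in the robust-MDP literature (a Weissman-style L1 concentration bound), together with the uniform-random sample gathering mentioned in the Open Datasets row. The bound's constants, the assumed step(s, a) simulator interface, and the treatment of unvisited state-action pairs are illustrative assumptions and may differ from the paper's appendix.

```python
import numpy as np


def gather_uniform_samples(step, n_states, n_actions, n_samples, seed=0):
    """Collect (s, a, s') transition samples by choosing states and actions
    uniformly at random; `step(s, a)` is an assumed simulator returning s'."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        s = int(rng.integers(n_states))
        a = int(rng.integers(n_actions))
        samples.append((s, a, step(s, a)))
    return samples


def empirical_model_and_error(samples, n_states, n_actions, delta=0.05):
    """Nominal transition model and per-(s, a) L1 error radius from samples.
    Uses a Weissman-style bound, a common choice for small state spaces;
    the paper's exact constants may differ."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1.0
    n_sa = counts.sum(axis=2)
    # Maximum-likelihood model; fall back to uniform where a pair was never visited.
    P_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa[..., None], 1.0),
                     1.0 / n_states)
    # L1 radius shrinking as 1/sqrt(n); unvisited pairs get the maximal radius 2.
    log_term = n_states * np.log(2.0) - np.log(delta)
    error = np.sqrt(2.0 * log_term / np.maximum(n_sa, 1.0))
    error = np.where(n_sa > 0, np.minimum(error, 2.0), 2.0)
    return P_hat, error
```

With these two pieces, the safe_policy_improvement sketch above can be run end to end on a small tabular domain by passing P_hat and error as the nominal model and uncertainty radii.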