Near Optimal Policy Optimization via REPS
Authors: Aldo Pacchiano, Jonathan N Lee, Peter Bartlett, Ofir Nachum
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the setting of stochastic gradients and introduce a technique that uses generative access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy. |
| Researcher Affiliation | Collaboration | Aldo Pacchiano, Microsoft Research (apacchiano@microsoft.com); Jonathan Lee, Stanford University (jnl@stanford.edu); Peter L. Bartlett, UC Berkeley (peter@berkeley.edu); Ofir Nachum, Google (ofirnachum@google.com) |
| Pseudocode | Yes | Algorithm 1 Relative Entropy Policy Search [Sketch]. Input: Initial iterate v0, accuracy level ε > 0, gradient optimization algorithm O. ... Algorithm 2 Biased Gradient Estimator. Input: Number of samples t. ... Algorithm 3 Biased Stochastic Gradient Descent. Input: Desired accuracy ε, learning rates {η_t}_{t=1}^∞, and number-of-samples function n : ℕ → ℕ. |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] |
| Open Datasets | No | The paper is theoretical and does not mention using any specific datasets for training or provide access information for any dataset. |
| Dataset Splits | No | The paper is theoretical and does not mention specific dataset splits for training, validation, or testing. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe a concrete experimental setup with hyperparameters or system-level training settings. |
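Since the paper provides no code, the structure of Algorithm 3 (Biased Stochastic Gradient Descent) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `grad_oracle` stands in for Algorithm 2's biased gradient estimator built from generative-model access, the quadratic toy objective is an assumption for demonstration, and the stopping rule at accuracy ε is omitted. The schedules `learning_rate` and `num_samples` play the roles of {η_t} and n : ℕ → ℕ from the paper's input list.

```python
import random


def biased_sgd(grad_oracle, v0, learning_rate, num_samples, steps):
    """Sketch of a biased SGD loop in the style of Algorithm 3.

    grad_oracle(v, n): returns an n-sample (possibly biased) gradient
    estimate at v, playing the role of Algorithm 2.
    learning_rate(t), num_samples(t): the schedules {eta_t} and n(t).
    """
    v = v0
    for t in range(1, steps + 1):
        g = grad_oracle(v, num_samples(t))
        v = v - learning_rate(t) * g  # standard SGD update with the biased estimate
    return v


def noisy_grad(v, n):
    # Toy stand-in for generative-model access: average n noisy samples
    # of the gradient of f(v) = (v - 3)^2 / 2, whose true gradient is v - 3.
    samples = [(v - 3.0) + random.gauss(0.0, 1.0) for _ in range(n)]
    return sum(samples) / n


random.seed(0)
v_final = biased_sgd(
    noisy_grad,
    v0=0.0,
    learning_rate=lambda t: 1.0 / t,   # diminishing step sizes
    num_samples=lambda t: t,           # growing batches shrink estimator noise
    steps=200,
)
```

With these schedules the iterate concentrates around the minimizer v = 3, mirroring how the paper's growing number-of-samples function controls the bias and variance of the gradient estimates.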