An Analytical Update Rule for General Policy Optimization
Authors: Hepeng Li, Nicholas Clavette, Haibo He
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we present an analytical policy update rule that is independent of parametric function approximators. We prove that the update rule has a monotonic improvement guarantee and is suitable for optimizing general stochastic policies with continuous or discrete actions. The update rule provides a new theoretical foundation for policy-based RL, which traditionally restricts the policy search to a family of parametric functions, such as policy gradient (Sutton et al., 1999), deterministic policy gradient (Silver et al., 2014; Lillicrap et al., 2016), actor-critic (Konda & Tsitsiklis, 1999; Degris et al., 2012), soft actor-critic (SAC) (Haarnoja et al., 2018a;b), and so on. Our update rule is derived from a closed-form solution to a trust region method using the calculus of variations. (A hedged sketch of this kind of closed-form update follows the table.) |
| Researcher Affiliation | Academia | Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, South Kingstown, RI, USA. |
| Pseudocode | No | The paper focuses on theoretical derivations and proofs rather than providing pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | No | The paper is theoretical and does not mention the use of any datasets for training experiments. |
| Dataset Splits | No | The paper is theoretical and does not discuss validation datasets or splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe any experimental setup details such as hyperparameters or training configurations. |
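
The paper provides no pseudocode, but the closed-form solution it describes is of the kind familiar from KL-constrained trust-region methods, where maximizing expected advantage subject to a divergence constraint yields a nonparametric reweighting of the old policy by exponentiated advantages. Below is a minimal sketch for the discrete-action case, not the paper's actual algorithm: the temperature `eta`, the function name `analytical_policy_update`, and the array shapes are illustrative assumptions.

```python
import numpy as np

def analytical_policy_update(pi_old, advantages, eta=1.0):
    """Sketch of a closed-form, nonparametric trust-region policy update.

    pi_old:     (num_states, num_actions) old action probabilities; rows sum to 1.
    advantages: (num_states, num_actions) advantage estimates A(s, a).
    eta:        temperature controlling how far the new policy may move
                (assumed fixed here; a KL constraint would determine it).

    Returns pi_new with pi_new(a|s) proportional to pi_old(a|s) * exp(A(s, a) / eta).
    """
    logits = advantages / eta
    # Subtract the per-state max before exponentiating for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    unnormalized = pi_old * np.exp(logits)
    # Renormalize each row so pi_new(.|s) is a valid distribution.
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Tiny usage example: two states, three actions.
pi_old = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.6, 0.3]])
advantages = np.array([[1.0, -0.5, 0.0],
                       [0.2, 0.2, -1.0]])
pi_new = analytical_policy_update(pi_old, advantages, eta=0.5)
print(pi_new.sum(axis=1))  # each row sums to 1
```

Because the update acts directly on the policy distribution rather than on parameters, it is consistent with the paper's claim of being independent of parametric function approximators; for continuous actions, the per-row normalizing sum would become an integral over the action space.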