An Analytical Update Rule for General Policy Optimization

Authors: Hepeng Li, Nicholas Clavette, Haibo He

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we present an analytical policy update rule that is independent of parametric function approximators. We prove that the update rule has a monotonic improvement guarantee and is suitable for optimizing general stochastic policies with continuous or discrete actions. The update rule provides a new theoretical foundation for policy-based RL, which traditionally restricts the policy search to a family of parametric functions, such as policy gradient (Sutton et al., 1999), deterministic policy gradient (Silver et al., 2014; Lillicrap et al., 2016), actor-critic (Konda & Tsitsiklis, 1999; Degris et al., 2012), soft actor-critic (SAC) (Haarnoja et al., 2018a;b), and so on. Our update rule is derived from a closed-form solution to a trust region method using the calculus of variations. (A generic sketch of this kind of closed-form derivation appears below the table.)
Researcher Affiliation | Academia | Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, South Kingstown, RI, USA.
Pseudocode | No | The paper focuses on theoretical derivations and proofs rather than providing pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | No | The paper is theoretical and does not mention the use of any datasets for training experiments.
Dataset Splits | No | The paper is theoretical and does not discuss validation datasets or splits.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any experimental setup details such as hyperparameters or training configurations.
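
As a reading aid for the Research Type row above, the following is a minimal sketch of the generic closed-form solution obtained for a KL-constrained, nonparametric (per-state) policy improvement step via the calculus of variations. The symbols A^{\pi_k} (advantage under the current policy), \delta (trust-region radius), \lambda (KL dual variable), and Z_\lambda (normalizer) are conventional choices assumed here for illustration; the paper's exact update rule and monotonic-improvement proof should be read from the paper itself.

\[
\pi_{k+1}(\cdot \mid s) \in \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\big[A^{\pi_k}(s,a)\big]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\big) \le \delta,
\qquad \int \pi(a \mid s)\,\mathrm{d}a = 1.
\]

Setting the functional derivative of the Lagrangian with respect to \(\pi(a \mid s)\) to zero (the calculus-of-variations step) yields the exponentiated-advantage form

\[
\pi_{k+1}(a \mid s) = \frac{\pi_k(a \mid s)\,\exp\!\big(A^{\pi_k}(s,a)/\lambda\big)}{Z_{\lambda}(s)},
\qquad
Z_{\lambda}(s) = \int \pi_k(a' \mid s)\,\exp\!\big(A^{\pi_k}(s,a')/\lambda\big)\,\mathrm{d}a',
\]

where \(\lambda \ge 0\) is chosen so that the KL constraint is met with budget \(\delta\); for discrete actions the integrals become sums. Because the normalizer \(Z_{\lambda}(s)\) is computed per state, the update does not depend on any parametric policy class, which matches the nonparametric framing in the abstract summary.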