A Parametric Class of Approximate Gradient Updates for Policy Optimization

Authors: Ramki Gummadi, Saurabh Kumar, Junfeng Wen, Dale Schuurmans

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.
Researcher Affiliation | Collaboration | Google Research, Brain Team; Stanford University; Layer 6 AI; University of Alberta.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using the 'open source ACME framework' but does not state that the code for its own methodology is open source or otherwise available.
Open Datasets | Yes | Section 6 conducts an experimental evaluation of novel update rules in the expanded family, in settings ranging from a synthetic 2D bandit benchmark (Section 6.1) and a tabular environment (Section 6.2) to the MuJoCo continuous control benchmark against a PPO baseline (Section 6.3).
Dataset Splits | No | The paper does not explicitly provide train/validation/test dataset splits with percentages, absolute counts, or a splitting methodology.
Hardware Specification | No | The paper does not specify any hardware details such as CPU/GPU models or memory used for the experiments.
Software Dependencies | No | The paper mentions software such as the 'ACME framework' and 'MuJoCo' but does not provide specific version numbers for these or other dependencies.
Experiment Setup | Yes | The learning rate for PG is 0.1 for both the actor and the critic, while the learning rate for QL is 0.01. (Appendix C.1) Table 2: Optimal hyper-parameter configurations for PPO MLA on the MuJoCo tasks. (Appendix C.2)
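
The reported learning rates can be captured in a small configuration sketch. This is a hypothetical layout assuming a simple Python dictionary of hyperparameters; only the numeric values (PG actor/critic learning rate 0.1, QL learning rate 0.01) come from Appendix C.1 of the paper, while the key names and lookup helper are illustrative.

```python
# Hypothetical hyperparameter configuration reflecting the values reported in
# Appendix C.1; the dictionary structure and key names are illustrative only.
HPARAMS = {
    "pg": {            # policy-gradient style update
        "actor_lr": 0.1,
        "critic_lr": 0.1,
    },
    "ql": {            # Q-learning style update
        "lr": 0.01,
    },
}


def get_learning_rate(update_rule: str, role: str = "actor") -> float:
    """Look up the reported learning rate for a given update rule and role."""
    cfg = HPARAMS[update_rule]
    # Fall back to the single shared learning rate when no per-role value exists.
    return cfg.get(f"{role}_lr", cfg.get("lr"))


if __name__ == "__main__":
    print(get_learning_rate("pg", "actor"))   # 0.1
    print(get_learning_rate("pg", "critic"))  # 0.1
    print(get_learning_rate("ql"))            # 0.01
```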