A Parametric Class of Approximate Gradient Updates for Policy Optimization
Authors: Ramki Gummadi, Saurabh Kumar, Junfeng Wen, Dale Schuurmans
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks. |
| Researcher Affiliation | Collaboration | Google Research, Brain Team; Stanford University; Layer 6 AI; University of Alberta. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the 'open source ACME framework' but does not state that the code for their own described methodology is open source or available. |
| Open Datasets | Yes | Section 6 conducts an experimental evaluation of novel update rules in the expanded family, in settings ranging from a synthetic 2D bandit benchmark (Section 6.1) and a tabular environment (Section 6.2) to the MuJoCo continuous control benchmark versus the PPO baseline (Section 6.3). |
| Dataset Splits | No | The paper does not explicitly provide train/validation/test dataset splits with percentages, absolute counts, or explicit splitting methodology. |
| Hardware Specification | No | The paper does not specify any hardware details such as CPU/GPU models or memory used for the experiments. |
| Software Dependencies | No | The paper mentions software like the 'ACME framework' and 'MuJoCo' but does not provide specific version numbers for these or other dependencies. |
| Experiment Setup | Yes | The learning rate for PG is 0.1 for both the actor and the critic, while the learning rate for QL is 0.01. (Appendix C.1) Table 2: Optimal hyper-parameter configurations for PPO MLA on the MuJoCo Tasks. (Appendix C.2) |
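
For context, the learning rates quoted above from Appendix C.1 could be collected in a configuration like the minimal sketch below. The dictionary structure and key names are illustrative assumptions (the paper releases no code); only the numeric values come from the paper.

```python
# Minimal sketch of the tabular-experiment hyperparameters reported in Appendix C.1.
# Key names and nesting are assumptions for illustration; only the learning rates
# (0.1 for the PG actor and critic, 0.01 for QL) are taken from the paper.
HYPERPARAMS = {
    "pg": {             # policy-gradient update
        "actor_lr": 0.1,
        "critic_lr": 0.1,
    },
    "ql": {             # Q-learning update
        "lr": 0.01,
    },
}
```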