Optimistic Meta-Gradients
Authors: Sebastian Flennerhag, Tom Zahavy, Brendan O'Donoghue, Hado P. van Hasselt, András György, Satinder Singh
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We consider the problem of minimizing an ill-conditioned convex quadratic and compare standard momentum to a version with meta-learned step-size, i.e. ϕ : (x, w) ↦ w ⊙ ∇f(x), where ⊙ is the Hadamard product. We find that introducing a non-linearity ϕ leads to a sizeable improvement in the rate of convergence. See Section 7.1 for further details. (An illustrative sketch of this setup follows the table.) |
| Researcher Affiliation | Industry | Sebastian Flennerhag Google DeepMind flennerhag@google.com Tom Zahavy Google DeepMind Brendan O'Donoghue Google DeepMind Hado van Hasselt Google DeepMind András György Google DeepMind Satinder Singh Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Meta-learning in practice. ... Algorithm 2: Meta-learning in the convex setting. ... Algorithm 3: BMG in practice. ... Algorithm 4: Convex optimistic meta-learning. |
| Open Source Code | No | The paper does not provide any explicit statement or link for the release of its source code. |
| Open Datasets | Yes | We train a 50-layer ResNet following a standard protocol (Appendix C) with SGD as the baseline optimiser. ... Figure 1: ImageNet. We compare training a 50-layer ResNet using SGD against variants that tune an element-wise learning rate online using standard meta-learning or optimistic meta-learning. |
| Dataset Splits | No | The paper mentions training steps and test accuracy but does not specify train/validation/test splits by percentage or sample count, nor does it refer to predefined splits with citations. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | For each Q and each algorithm, we sweep over the learning rate, decay rate, and the initialization of w (see Table 2 for values) and report results for the best performing hyper-parameters. ... We sweep over the learning rate (for SGD) or meta-learning rate and report results for the best hyper-parameter over three independent runs. |
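
The "Research Type" entry above quotes the paper's ill-conditioned quadratic experiment with a meta-learned element-wise step size. The snippet below is a minimal sketch of that mechanic only: it applies a plain one-step meta-gradient update to the step-size vector `w`, not the paper's momentum-based or optimistic variants, and every constant (dimension, eigenvalues, step counts, learning rates) is an assumed value chosen for illustration.

```python
import numpy as np

# Illustrative sketch (not the paper's code): one-step meta-gradient adaptation
# of an element-wise step size w for gradient descent on an ill-conditioned
# quadratic f(x) = 0.5 * x^T Q x. All constants below are assumed values.

eigs = np.array([1.0, 3.0, 10.0, 30.0])    # spectrum of Q (assumed)
Q = np.diag(eigs)

def f(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

steps = 500
x0 = np.ones(len(eigs))
w0 = 1e-3                                   # deliberately conservative initial step size

# Baseline: gradient descent with the fixed (untuned) step size w0.
x = x0.copy()
for _ in range(steps):
    x = x - w0 * grad(x)
baseline_loss = f(x)

# Meta-learned element-wise step size: inner update x_new = x - w ⊙ ∇f(x),
# with w adjusted by the one-step meta-gradient d f(x_new) / d w.
x = x0.copy()
w = np.full(len(eigs), w0)
meta_lr = 3e-5                              # assumed meta learning rate
for _ in range(steps):
    g = grad(x)
    x_new = x - w * g                       # inner update
    meta_grad = -grad(x_new) * g            # d f(x_new) / d w (element-wise)
    w = w - meta_lr * meta_grad             # outer (meta) update
    x = x_new

print(f"fixed step size : loss = {baseline_loss:.2e}")
print(f"meta-learned w  : loss = {f(x):.2e}")
```

Because the inner update is linear in `w`, the one-step meta-gradient reduces to `-grad(x_new) * grad(x)` element-wise, so coordinates whose gradients stay aligned across the update receive larger step sizes, while coordinates that overshoot receive smaller ones.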