On the Iteration Complexity of Hypergradient Computation
Authors: Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, Saverio Salzo
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We present an extensive experimental comparison among the methods which confirms the theoretical findings." (Section 3, Experiments) |
| Researcher Affiliation | Academia | (1) Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, Genoa, Italy; (2) Department of Computer Science, University College London, London, UK. |
| Pseudocode | Yes | Algorithm 1 (Iterative Differentiation, ITD) and Algorithm 2 (Approximate Implicit Differentiation, AID); a minimal ITD sketch is given after this table. |
| Open Source Code | Yes | "The algorithms have been implemented in PyTorch (Paszke et al., 2019). In the following, we shorthand AID-FP and AID-CG with FP and CG, respectively." The code is freely available at https://github.com/prolearner/hypertorch |
| Open Datasets | Yes | UCI Parkinson dataset (Little et al., 2008); 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), which contains 18,000 news articles divided into 20 topics with features given by 101,631 tf-idf sparse vectors; and the Fashion MNIST dataset (Xiao et al., 2017). |
| Dataset Splits | Yes | We split the data randomly into three equal parts to make the train, validation and test sets. |
| Hardware Specification | No | "In the case of 20 Newsgroups, for some t between 50 and 100, this cost exceeded the 11 GB of memory on the GPU." This mentions a GPU and its memory capacity but no specific model or other detailed hardware specifications. |
| Software Dependencies | No | "The algorithms have been implemented in PyTorch (Paszke et al., 2019)." This names PyTorch as the software used but does not provide a specific version number required for reproduction. |
| Experiment Setup | Yes | "We set h = 200 and use t = 20 fixed-point iterations to solve the lower-level problem in all the experiments." "We solve each problem using (hyper)gradient descent with a fixed step size selected via grid search (additional details are provided in Appendix C.2)." "We used t = k = 20 for all methods and Nesterov momentum for optimizing λ." A sketch of this outer loop is given below. |
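
The Pseudocode row above refers to Algorithm 1 (ITD), which computes an approximate hypergradient by differentiating through the unrolled inner iterations. The snippet below is a minimal, hypothetical PyTorch sketch of that idea on a toy problem: the inner map, outer objective, and all constants are illustrative assumptions, not the authors' implementation (their code lives at https://github.com/prolearner/hypertorch).

```python
import torch


def inner_map(w, lam):
    # Hypothetical contraction map: one gradient step on the toy inner
    # objective 0.5*||w - 1||^2 + 0.5*exp(lam)*||w||^2 (illustrative only).
    lr = 0.1
    grad = (w - 1.0) + torch.exp(lam) * w
    return w - lr * grad


def outer_objective(w):
    # Hypothetical validation loss evaluated at the approximate inner solution.
    return 0.5 * ((w - 0.5) ** 2).sum()


def itd_hypergradient(lam, t=20):
    """ITD sketch: unroll t fixed-point iterations, then backpropagate."""
    w = torch.zeros(5)  # inner initialization, kept inside the autograd graph
    for _ in range(t):
        w = inner_map(w, lam)
    return torch.autograd.grad(outer_objective(w), lam)[0]


lam = torch.zeros(1, requires_grad=True)
print(itd_hypergradient(lam, t=20))  # approximate hypergradient at lam = 0
```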
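The Experiment Setup row describes the outer optimization of λ. Below is a hedged sketch of such an outer loop, assuming the `itd_hypergradient` helper above: t = 20 unrolled iterations per hypergradient and SGD with Nesterov momentum on λ. The learning rate, momentum value, and iteration count are placeholders, not the grid-searched settings from Appendix C.2.

```python
import torch

lam = torch.zeros(1, requires_grad=True)
outer_opt = torch.optim.SGD([lam], lr=0.05, momentum=0.9, nesterov=True)

for step in range(100):
    outer_opt.zero_grad()
    # Approximate hypergradient from t = 20 unrolled inner iterations.
    lam.grad = itd_hypergradient(lam, t=20)
    outer_opt.step()
```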