On the Iteration Complexity of Hypergradient Computation

Authors: Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, Saverio Salzo

ICML 2020

Reproducibility assessment: each variable lists the assessed result and the supporting excerpt (LLM response) from the paper.
Research Type: Experimental
Evidence: "We present an extensive experimental comparison among the methods which confirm the theoretical findings." (Section 3, Experiments)
Researcher Affiliation: Academia
Evidence: "1Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, Genoa, Italy; 2Department of Computer Science, University College London, London, UK."
Pseudocode: Yes
Evidence: Algorithm 1, Iterative Differentiation (ITD), and Algorithm 2, Approximate Implicit Differentiation (AID).
Open Source Code: Yes
Evidence: "The algorithms have been implemented in PyTorch (Paszke et al., 2019)." The code is freely available at https://github.com/prolearner/hypertorch. (The paper shorthands AID-FP and AID-CG as FP and CG, respectively.)
Open Datasets: Yes
Evidence: the UCI Parkinson dataset (Little et al., 2008); 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), which contains 18,000 newsgroup posts divided into 20 topics, with features consisting of 101,631 tf-idf sparse vectors; and the Fashion-MNIST dataset (Xiao et al., 2017).
Dataset Splits: Yes
Evidence: "We split the data randomly into three equal parts to make the train, validation and test sets."
Hardware Specification: No
Evidence: "in the case of 20 newsgroup for some t between 50 and 100, this cost exceeded the 11GB on the GPU." This mentions a GPU and its memory capacity, but no specific model or other detailed hardware specifications.
Software Dependencies: No
Evidence: "The algorithms have been implemented in PyTorch (Paszke et al., 2019)." PyTorch is named, but no specific version number required for reproduction is provided.
Experiment Setup: Yes
Evidence: "We set h = 200 and use t = 20 fixed-point iterations to solve the lower-level problem in all the experiments."; "We solve each problem using (hyper)gradient descent with fixed step size selected via grid search (additional details are provided in Appendix C.2)."; "We used t = k = 20 for all methods and Nesterov momentum for optimizing λ."
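To make the two algorithm families assessed above concrete, here is a minimal, dependency-free sketch of the hypergradient computations the paper studies: ITD (differentiating through t unrolled lower-level iterations) and AID with a fixed-point linear solver (AID-FP). The toy scalar bilevel problem, the step sizes, and all function names below are illustrative assumptions, not the paper's actual implementation (the authors' PyTorch code at https://github.com/prolearner/hypertorch uses reverse-mode autograd on vector problems):

```python
def solve_lower(lam, x, t=20, lr=0.5):
    """Run t fixed-point (gradient) iterations on a toy lower-level
    problem min_w 0.5*(w - x)**2 + 0.5*lam*w**2 (scalar, illustrative)."""
    w = 0.0
    for _ in range(t):
        w -= lr * ((1 + lam) * w - x)  # inner gradient step
    return w

def itd_hypergrad(lam, x, y, t=20, lr=0.5):
    """ITD flavour (cf. Algorithm 1): differentiate the upper-level
    objective f(lam) = 0.5*(w_t(lam) - y)**2 through the unrolled
    trajectory, here via forward-mode propagation of dw/dlam."""
    w, dw = 0.0, 0.0                    # iterate and its derivative in lam
    for _ in range(t):
        grad_w = (1 + lam) * w - x      # inner gradient in w
        dgrad = w + (1 + lam) * dw      # d(grad_w)/d(lam), forward mode
        w, dw = w - lr * grad_w, dw - lr * dgrad
    return (w - y) * dw                 # chain rule through w_t(lam)

def aid_fp_hypergrad(lam, x, y, t=20, k=20, lr=0.5):
    """AID flavour with a fixed-point solver (cf. Algorithm 2, AID-FP):
    approximately solve H v = outer_grad, H the inner Hessian, then
    assemble the implicit-function hypergradient."""
    w = solve_lower(lam, x, t, lr)
    H = 1 + lam                         # d^2 g / dw^2 (scalar Hessian)
    cross = w                           # d^2 g / (dw dlam)
    v = 0.0
    for _ in range(k):                  # k fixed-point iterations
        v -= lr * (H * v - (w - y))
    return -cross * v
```

For this quadratic toy problem the exact lower-level solution is w*(lam) = x/(1+lam), so both approximations can be checked against the closed-form hypergradient (x/(1+lam) - y) * (-x/(1+lam)**2); with enough iterations the two methods agree, mirroring the paper's comparison of ITD and AID at matched iteration budgets t = k.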