Amortized Implicit Differentiation for Stochastic Bilevel Optimization

Authors: Michael Arbel, Julien Mairal

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run three sets of experiments described in Sections 5.1 to 5.3. In all cases, we consider AmIGO with either gradient descent (AmIGO-GD) or conjugate gradient (AmIGO-CG) for algorithm Bk. We compare AmIGO with AID methods without warm-start for Bk, which we refer to as (AID-GD) and (AID-CG), and with (AID-CG-WS), which uses warm-start for Bk but not for Ak. We also consider other variants using either a fixed-point algorithm (AID-FP) (Grazzi et al., 2020) or Neumann series expansion (AID-N) (Lorraine et al., 2020) for Bk. Finally, we consider two algorithms based on iterative differentiation, which we refer to as (ITD) (Grazzi et al., 2020) and (Reverse) (Franceschi et al., 2017). For all methods except (AID-CG-WS), we use warm-start in algorithm Ak; however, only AmIGO, AmIGO-CG, and AID-CG-WS exploit warm-start in Bk, the other AID-based methods initializing Bk with z0 = 0. In Sections 5.2 and 5.3, we also compare with the BSA algorithm (Ghadimi and Wang, 2018), the TTSA algorithm (Hong et al., 2020a), and stocBiO (Ji et al., 2021). An implementation of AmIGO is available in https://github.com/MichaelArbel/AmIGO. Figure 1 caption: top row, performance on the synthetic task; bottom row, performance on the hyper-parameter optimization task. (A hedged code sketch of this algorithmic structure is given after the table.)
Researcher Affiliation | Academia | Michael Arbel & Julien Mairal, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
Pseudocode | Yes | Algorithm 1 AmIGO, Algorithm 2 Ak(x, y0), Algorithm 3 Bk(x, y, v, z0).
Open Source Code | Yes | An implementation of AmIGO is available in https://github.com/MichaelArbel/AmIGO.
Open Datasets | Yes | We consider a classification task on the 20Newsgroup dataset. Figure 5 of Appendix F.3 shows the training loss (outer loss), the training and test accuracies of a model trained on MNIST by dataset distillation.
Dataset Splits | No | The outer-level cost functions for such task take the following form: $f(x, y) = \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{\xi \in \mathcal{D}_{\mathrm{val}}} \mathcal{L}(y, \xi)$, $g(x, y) = \frac{1}{|\mathcal{D}_{\mathrm{tr}}|} \sum_{\xi \in \mathcal{D}_{\mathrm{tr}}} \mathcal{L}(y, \xi)$... and optimized using an unregularized regression loss over the validation set while the model is learned using the training set. (Mentions use of training and validation sets, but does not provide specific percentages, counts, or explicit instructions for dataset splits needed for reproducibility; see the second sketch after the table.)
Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or cloud instance types) used for running the experiments are mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are mentioned in the paper.
Experiment Setup | Yes | For the default setting, we use the well-chosen parameters reported in Grazzi et al. (2020); Ji et al. (2021) where αk = γk = 100, βk = 0.5, and T = N = 10. For the grid-search setting, we select the best-performing parameters T, M, and βk from a grid {10, 20} × {5, 10} × {0.5, 10}, while the batch size (chosen to be the same for all steps of the algorithms) varies in 10 × {0.1, 1, 2, 4}. (See the grid-search sketch after the table.)
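
To make the structure quoted in the Research Type and Pseudocode rows concrete, below is a minimal PyTorch sketch of the three components: an inner algorithm Ak that runs gradient steps on g, a warm-started linear solver Bk for the implicit-gradient system, and the resulting hypergradient. The function names (A_k, B_k, hypergrad, hvp), the step sizes, and the plain gradient-descent solver are illustrative assumptions, not the authors' code (which lives at the GitHub link above).

```python
# Minimal sketch of the AmIGO structure (Algorithms 1-3), assuming a generic
# differentiable inner loss g(x, y) and outer loss f(x, y).
import torch

def hvp(loss, y, v):
    """Hessian-vector product (d^2 loss / dy^2) @ v via double backward."""
    grad_y = torch.autograd.grad(loss, y, create_graph=True)[0]
    return torch.autograd.grad(grad_y, y, grad_outputs=v)[0]

def A_k(x, y, g, T=10, gamma=0.1):
    """Inner algorithm: T gradient steps on g(x, .), warm-started at the previous y."""
    y = y.detach().clone().requires_grad_(True)
    for _ in range(T):
        grad_y = torch.autograd.grad(g(x, y), y)[0]
        y = (y - gamma * grad_y).detach().requires_grad_(True)
    return y

def B_k(x, y, z, f, g, M=10, beta=0.1):
    """Linear solver: M gradient steps on (d^2_yy g) z = d_y f, warm-started at the
    previous z (the AmIGO-GD choice; conjugate gradient would replace this loop,
    and non-warm-start AID variants would instead reset z to zero)."""
    b = torch.autograd.grad(f(x, y), y)[0].detach()
    z = z.detach().clone()
    for _ in range(M):
        z = z - beta * (hvp(g(x, y), y, z) - b)
    return z

def hypergrad(x, y, z, f, g):
    """Implicit gradient estimate  d_x f(x, y) - (d^2_xy g(x, y)) z."""
    f_x = torch.autograd.grad(f(x, y), x, allow_unused=True)[0]
    g_y = torch.autograd.grad(g(x, y), y, create_graph=True)[0]
    cross = torch.autograd.grad(g_y, x, grad_outputs=z)[0]
    return (f_x if f_x is not None else torch.zeros_like(x)) - cross
```

The AmIGO-GD vs. AmIGO-CG distinction quoted above is only about which solver runs inside B_k; warm-starting means passing the previous y and z back into A_k and B_k rather than re-initializing them (e.g., with z0 = 0 as in the plain AID baselines).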
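The Dataset Splits row quotes the outer and inner objectives as empirical averages of a loss over a validation set and a training set. The sketch below instantiates them for a toy regularized regression problem and runs the loop using the functions from the previous sketch; the 80/20 split, the exp(x) parameterization of the regularization weights, and all step sizes are illustrative assumptions, since the paper does not specify a split.

```python
# Toy instantiation of the quoted objectives: f averages an unregularized loss
# over D_val, g averages the same loss over D_tr plus an x-dependent ridge term.
import torch

def mean_squared_loss(y, data):
    X, t = data
    return ((X @ y - t) ** 2).mean()   # (1/|D|) * sum over D of L(y, xi)

torch.manual_seed(0)
X, t = torch.randn(100, 5), torch.randn(100)
D_tr, D_val = (X[:80], t[:80]), (X[80:], t[80:])   # assumed 80/20 split

f = lambda x, y: mean_squared_loss(y, D_val)                               # outer loss
g = lambda x, y: mean_squared_loss(y, D_tr) + (torch.exp(x) * y**2).sum()  # inner loss

# Outer loop tying together A_k, B_k, and hypergrad from the previous sketch.
x = torch.zeros(5, requires_grad=True)   # log regularization weights
y = torch.zeros(5, requires_grad=True)   # model parameters
z = torch.zeros(5)                       # linear-system iterate
for k in range(50):
    y = A_k(x, y, g, T=10, gamma=0.1)        # warm-started inner solve
    z = B_k(x, y, z, f, g, M=10, beta=0.1)   # warm-started linear solve
    x = (x - 0.1 * hypergrad(x, y, z, f, g)).detach().requires_grad_(True)
```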
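The grid quoted in the Experiment Setup row can be enumerated directly; the sketch below only builds the candidate configurations. The "best performing" selection criterion (e.g., lowest outer loss) is our reading, not a quoted detail.

```python
# Enumerate the quoted grid-search configurations: (T, M, beta_k) from
# {10, 20} x {5, 10} x {0.5, 10}, with batch sizes 10 x {0.1, 1, 2, 4}.
from itertools import product

T_grid, M_grid, beta_grid = (10, 20), (5, 10), (0.5, 10)
batch_grid = [int(10 * s) for s in (0.1, 1, 2, 4)]   # -> [1, 10, 20, 40]

configs = [dict(T=T, M=M, beta_k=beta, batch_size=b)
           for T, M, beta, b in product(T_grid, M_grid, beta_grid, batch_grid)]
print(len(configs))   # 32 candidate settings
# One would then run each setting and keep the best-performing one,
# e.g. by lowest outer (validation) loss.
```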