On Implicit Bias in Overparameterized Bilevel Optimization
Authors: Paul Vicol, Jonathan P Lorraine, Fabian Pedregosa, David Duvenaud, Roger B Grosse
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 3, we characterize solution concepts that capture the behavior of the cold- and warm-start BLO algorithms and show that, for quadratic inner and outer objectives, cold-start BLO yields minimum-norm outer parameters. In Section 4, we show that warm-start BLO induces an implicit bias on the inner parameter iterates, regularizing the updates to maintain proximity to previous solutions. In Section 4, we introduce simple tasks based on dataset distillation, designed to provide insights into the phenomena at play. First, we show that when the inner problem is overparameterized, the inner parameters w can retain information associated with different settings of the outer parameters over the course of joint optimization. Then, we show that when the outer problem is overparameterized, the choice of hypergradient approximation can affect which outer solution is found. Experimental details and extended results are provided in Appendix G. Table 3 (columns: Dataset Distillation MNIST, CIFAR-10, then Data Aug. Net MNIST, CIFAR-10): Cold-Start 30.7 / 89.1, 17.6 / 46.9, 84.86, 45.38; Warm-Start 90.8 / 97.5, 50.3 / 59.8, 92.81, 59.30; Warm-Start + Re-Train 12.9 / 17.1, 10.2 / 8.9, 9.15, 11.12. Caption: Cols 1-3: Accuracy on original data with 1 or 10 synthetic samples. Cols 4-6: Learning a data augmentation network. Extended table in App. G. (A toy JAX sketch contrasting cold- and warm-start BLO follows this table.) |
| Researcher Affiliation | Collaboration | ¹University of Toronto, ²Vector Institute, ³Google Research. |
| Pseudocode | Yes | Algorithm 1 Cold-Start BLO ... Algorithm 2 Warm-start BLO ... Algorithm 3 Neumann Hypergrad ... Algorithm 4 Optimize(u, w, T) ... Algorithm 5 BRJ-Approx ... Algorithm 6 Iterative Diff Hypergrad (All located in Appendix H, titled "Algorithms") |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described in the paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We ran a dataset distillation task on MNIST using a linear classifier. We obtained similar results on MNIST, Fashion MNIST, and CIFAR-10, with 1 or 10 synthetic datapoints (see Table 3). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits (e.g., exact percentages, sample counts, or explicit splitting methodologies). It mentions datasets like MNIST, CIFAR-10, and Fashion MNIST, which have standard splits, but it does not explicitly state the splits used for its experiments. |
| Hardware Specification | Yes | Compute Environment. All experiments were implemented using JAX (Bradbury et al., 2018), and were run on NVIDIA P100 GPUs. Each instance of the dataset distillation and antidistillation task took approximately 5 minutes of compute on a single GPU. |
| Software Dependencies | Yes | All experiments were implemented using JAX (Bradbury et al., 2018). |
| Experiment Setup | Yes | For our dataset distillation experiments, we trained a 4-layer MLP with 200 hidden units per layer and ReLU activations. For warm-start joint optimization, we computed hypergradients by differentiating through K = 1 steps of unrolling, and updated the hyperparameters (learned datapoints) and MLP parameters using alternating gradient descent, with one step on each. We used SGD with learning rate 0.001 for the inner optimization and Adam with learning rate 0.01 for the outer optimization. For the exponential-amplitude Fourier basis used in Section 4.2, we used SGD with learning rates 1e-8 and 1e-2 for the inner and outer parameters, respectively; for the 1/n amplitude Fourier basis (discussed below, and used for Figure 12), we used SGD with learning rates 1e-3 and 1e-2 for the inner and outer parameters, respectively. |
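The Research Type row above quotes the paper's central contrast between cold-start and warm-start BLO. As a reading aid, here is a minimal JAX sketch of that structural difference on a toy quadratic bilevel problem; the objectives, step sizes, and function names are illustrative assumptions on our part, not the authors' code.

```python
# Illustrative sketch (not the authors' code): cold-start vs. warm-start BLO
# on a toy quadratic bilevel problem. All objectives and step sizes are assumed.
import jax
import jax.numpy as jnp

def inner_loss(w, u):
    # Toy quadratic inner objective in the inner parameters w.
    return jnp.sum((w - u) ** 2)

def outer_loss(w):
    # Toy quadratic outer objective evaluated at the inner solution.
    return jnp.sum((w - 1.0) ** 2)

def optimize_inner(u, w, T, lr=0.1):
    # T steps of gradient descent on the inner problem (cf. Algorithm 4, Optimize).
    for _ in range(T):
        w = w - lr * jax.grad(inner_loss)(w, u)
    return w

def cold_start_step(u, w0, T=50, outer_lr=0.01):
    # Cold-start: each outer step re-solves the inner problem from the same
    # initialization w0 and differentiates through the full unroll.
    hypergrad = jax.grad(lambda u_: outer_loss(optimize_inner(u_, w0, T)))(u)
    u = u - outer_lr * hypergrad
    return u, optimize_inner(u, w0, T)

def warm_start_step(u, w, K=1, outer_lr=0.01, inner_lr=0.1):
    # Warm-start: the inner iterate w is carried across outer steps (never
    # re-initialized), and the hypergradient differentiates through only K
    # unrolled inner steps.
    hypergrad = jax.grad(lambda u_: outer_loss(optimize_inner(u_, w, K, inner_lr)))(u)
    u = u - outer_lr * hypergrad
    w = optimize_inner(u, w, 1, inner_lr)  # one alternating inner step
    return u, w
```

The only structural difference is whether the inner parameters are reset to `w0` or carried forward, which is the mechanism the paper identifies as keeping warm-start iterates close to previous inner solutions.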
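The Pseudocode row lists a Neumann hypergradient routine (Algorithm 3). Below is a hedged sketch of the standard truncated-Neumann approximation to the implicit-function-theorem hypergradient, under the assumption that the outer loss depends on the outer parameters u only through the inner parameters w; the function name and the K and alpha defaults are our illustrative choices, and this is a reconstruction rather than the paper's implementation.

```python
# Hedged sketch of a truncated Neumann-series hypergradient (in the spirit of
# Algorithm 3). Assumes outer_loss depends on u only through w.
import jax
import jax.numpy as jnp

def neumann_hypergrad(inner_loss, outer_loss, u, w, K=20, alpha=0.01):
    # Direct gradient of the outer objective w.r.t. the inner parameters.
    v = jax.grad(outer_loss)(w)
    p = v
    # Hessian-vector product with the inner Hessian at (w, u).
    hvp = lambda vec: jax.grad(
        lambda w_: jnp.vdot(jax.grad(inner_loss)(w_, u), vec))(w)
    # Truncated Neumann series: inv(H) v is approximated by
    # alpha * sum_{j=0..K} (I - alpha H)^j v.
    for _ in range(K):
        v = v - alpha * hvp(v)
        p = p + v
    # Mixed second-derivative term: d/du <grad_w inner_loss(w, u), alpha * p>.
    indirect = jax.grad(
        lambda u_: jnp.vdot(jax.grad(inner_loss)(w, u_), alpha * p))(u)
    # IFT hypergradient; add dF/du here if the outer loss also depends on u directly.
    return -indirect
```

Larger K and a step size alpha matched to the inner curvature tighten the approximation; with the toy quadratics from the previous sketch, `neumann_hypergrad(inner_loss, outer_loss, u, w)` can be called directly on array-valued `u` and `w`.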
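The Experiment Setup row reports a 4-layer ReLU MLP with 200 hidden units, SGD (learning rate 0.001) on the inner parameters, Adam (learning rate 0.01) on the learned datapoints, and hypergradients obtained by differentiating through K = 1 unrolled inner step with alternating updates. The sketch below is one possible translation of that description into JAX; the use of optax, the layer sizes, and every function and variable name are assumptions, since no code is released with the paper.

```python
# One possible reading of the reported setup (not the released code):
# 4 weight layers with 200 ReLU units, SGD(1e-3) inner, Adam(1e-2) outer,
# hypergradient through K = 1 unrolled inner step, alternating updates.
import jax
import jax.numpy as jnp
import optax

def init_mlp(key, sizes=(784, 200, 200, 200, 10)):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) * jnp.sqrt(2.0 / m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = jax.nn.relu(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def inner_loss(params, distilled_x, distilled_y):
    logits = mlp(params, distilled_x)
    return optax.softmax_cross_entropy_with_integer_labels(logits, distilled_y).mean()

def outer_loss(params, real_x, real_y):
    logits = mlp(params, real_x)
    return optax.softmax_cross_entropy_with_integer_labels(logits, real_y).mean()

inner_opt = optax.sgd(1e-3)   # inner optimizer, as reported
outer_opt = optax.adam(1e-2)  # outer optimizer, as reported

def joint_step(params, distilled_x, distilled_y, real_x, real_y,
               inner_state, outer_state, inner_lr=1e-3):
    # Hypergradient w.r.t. the learned datapoints through K = 1 unrolled SGD step.
    def unrolled_outer(dx):
        g = jax.grad(inner_loss)(params, dx, distilled_y)
        one_step = jax.tree_util.tree_map(lambda p, gi: p - inner_lr * gi, params, g)
        return outer_loss(one_step, real_x, real_y)

    hyper_g = jax.grad(unrolled_outer)(distilled_x)
    updates, outer_state = outer_opt.update(hyper_g, outer_state, distilled_x)
    distilled_x = optax.apply_updates(distilled_x, updates)

    # Warm-start alternation: one SGD step on the inner parameters.
    g = jax.grad(inner_loss)(params, distilled_x, distilled_y)
    updates, inner_state = inner_opt.update(g, inner_state, params)
    params = optax.apply_updates(params, updates)
    return params, distilled_x, inner_state, outer_state
```

A training loop would initialize `inner_state = inner_opt.init(params)` and `outer_state = outer_opt.init(distilled_x)` and then call `joint_step` once per iteration, alternating one inner and one outer update as described in the quoted setup above.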