BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach

Authors: Bo Liu, Mao Ye, Stephen Wright, Peter Stone, Qiang Liu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide a non-asymptotic convergence analysis of the proposed method to stationary points for non-convex objectives and present empirical results that show its superior practical performance."
Researcher Affiliation | Collaboration | Bo Liu¹, Mao Ye¹, Stephen Wright², Peter Stone¹,³, Qiang Liu¹; ¹The University of Texas at Austin, ²University of Wisconsin-Madison, ³Sony AI
Pseudocode | Yes | "Algorithm 1: Bilevel Optimization Made Easy (BOME!)" (a first-order sketch of this update appears after the table)
Open Source Code | Yes | Checklist item 3(a): "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]"
Open Datasets | Yes | "For the dataset, we use MNIST [9] (Fashion MNIST [50])."
Dataset Splits | Yes | "The stepsizes of all methods are set by a grid search from the set {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000}. All toy problems adopt vanilla gradient descent (GD) and applications on hyperparameter optimization adopt GD with a momentum of 0.9."
Hardware Specification | No | The paper answers checklist item 3(d), "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?", with [No]; no specific hardware details are mentioned in the main text.
Software Dependencies | No | The paper mentions using "Adam [26]" as an optimizer, but does not provide specific version numbers for any software dependencies, programming languages, or libraries such as PyTorch, TensorFlow, Python, or CUDA.
Experiment Setup | Yes | "Unless otherwise specified, BOME strictly follows Algorithm 1 with φ_k = η‖∇q̂(v_k, θ_k)‖², η = 0.5, and T = 10. The inner stepsize is set to be the same as the outer stepsize. The stepsizes of all methods are set by a grid search from the set {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000}. All toy problems adopt vanilla gradient descent (GD) and applications on hyperparameter optimization adopt GD with a momentum of 0.9." (both the update rule and the stepsize search are sketched after the table)
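
To make the quoted Pseudocode and Experiment Setup rows concrete, here is a minimal NumPy sketch of the single-loop BOME update as we read it from the paper: T inner gradient steps produce θ_k, the value-function surrogate q̂(x, y) = g(x, y) − g(x, θ_k) supplies a constraint gradient, and one joint first-order step combines it with ∇f through a multiplier λ_k. Only η = 0.5, T = 10, and φ_k = η‖∇q̂‖² come from the quoted setup; the toy objectives, the λ_k formula, and the stepsizes alpha and xi are our illustrative assumptions, not the authors' code.

import numpy as np

# Toy bilevel problem (illustrative only, not from the paper):
#   outer objective  f(x, y) = ||y - 1||^2 + ||x||^2
#   inner objective  g(x, y) = ||y - x||^2, minimized at y*(x) = x
def grad_f(x, y):
    return 2 * x, 2 * (y - 1.0)

def grad_g(x, y):
    return -2 * (y - x), 2 * (y - x)  # (d/dx, d/dy)

def bome_step(x, y, T=10, eta=0.5, alpha=0.5, xi=0.05):
    """One BOME-style update; eta and T follow the quoted setup,
    alpha (inner) and xi (outer) stepsizes are illustrative."""
    # 1) Approximate the inner solution with T gradient-descent steps on g(x, .).
    theta = y.copy()
    for _ in range(T):
        theta = theta - alpha * grad_g(x, theta)[1]

    # 2) Gradient of the surrogate q_hat(x, y) = g(x, y) - g(x, theta), theta held fixed.
    gx_y, gy_y = grad_g(x, y)
    gx_t, _ = grad_g(x, theta)
    dq = np.concatenate([gx_y - gx_t, gy_y])

    # 3) Multiplier: phi_k = eta * ||grad q_hat||^2 as quoted; the lambda_k
    #    formula below is our reading of the paper, not a verbatim copy.
    fx, fy = grad_f(x, y)
    df = np.concatenate([fx, fy])
    sq = float(np.dot(dq, dq))
    phi = eta * sq
    lam = max(phi - float(np.dot(df, dq)), 0.0) / (sq + 1e-12)

    # 4) One joint first-order step on (x, y).
    step = df + lam * dq
    n = x.size
    return x - xi * step[:n], y - xi * step[n:]

x, y = np.array([2.0]), np.array([-1.0])
for _ in range(300):
    x, y = bome_step(x, y)
print(x, y)  # approaches the toy problem's bilevel solution x = y = 0.5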
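
The Experiment Setup row also describes how stepsizes were chosen. Below is a minimal sketch of that kind of grid search, assuming a hypothetical run_trial(stepsize, momentum) routine that performs one training run (GD with momentum 0.9 for the hyperparameter-optimization tasks) and returns a validation loss; only the grid values and the momentum come from the quoted setup.

# Sketch of the quoted stepsize grid search. `run_trial` is a hypothetical
# stand-in for one full training run that returns a validation loss.
STEPSIZE_GRID = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]

def select_stepsize(run_trial, momentum=0.9):
    """Return (best stepsize, all results) after trying every grid value."""
    results = {lr: run_trial(stepsize=lr, momentum=momentum) for lr in STEPSIZE_GRID}
    best = min(results, key=results.get)
    return best, results

# Example usage with a dummy run_trial that prefers stepsize 0.5:
best, _ = select_stepsize(lambda stepsize, momentum: abs(stepsize - 0.5))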