Min-Max Multi-objective Bilevel Optimization with Applications in Robust Machine Learning

Authors: Alex Gu, Songtao Lu, Parikshit Ram, Tsui-Wei Weng

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on robust representation learning and robust hyperparameter optimization showcase (i) the advantages of considering the min-max multi-objective setup, and (ii) convergence properties of the proposed MORBiT.
Researcher Affiliation | Collaboration | Alex Gu, Songtao Lu, Parikshit Ram, Tsui-Wei Weng* — MIT CSAIL, IBM Research, *UCSD — gua@mit.edu, {songtao, parikshit.ram}@ibm.com, *lweng@ucsd.edu
Pseudocode | Yes | Algorithm 1: MORBiT with learning rates α, β and γ for x, y, λ respectively
Open Source Code | Yes | Our code is at https://github.com/minimario/MORBiT.
Open Datasets | Yes | We first consider a multi-task setup with n = 10 binary classification tasks from the Fashion-MNIST dataset (Xiao et al., 2017). ... We also consider a bilevel extension of the robust meta-learning application (Collins et al., 2020) for a sinusoid regression task, a common meta-learning application introduced by Finn et al. (2017) ... We generate n = 16 binary classification tasks from the Letter dataset (Frey & Slate, 1991)
Dataset Splits | Yes | each of the 16 learning tasks (and hence, objective pairs) has a training set size of around 900 samples (for the LL loss), with 300 samples each for the UL loss and for computing the generalization loss.
Hardware Specification | Yes | We perform our experiments in Python 3.7.10 and PyTorch 1.8.1 with Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz.
Software Dependencies | Yes | We perform our experiments in Python 3.7.10 and PyTorch 1.8.1
Experiment Setup | Yes | We use PyTorch (Paszke et al., 2019), and implementation details are in Appendix C. All results are aggregated over 10 trials. ... For the Task-Robust version of the algorithm, we use α = 0.007, β = 0.005, γ = 0.003. For the standard version of the algorithm, we use α = 0.007, β = 0.011, γ = 0.003. ... For our data, we had x ∈ R^{784×100} and y ∈ R^{100×2}. We used step sizes α = 0.01, β = 0.01, and γ = 0.3. We used batch sizes of 8 and 128 to compute g_i for each inner step and f_i for each outer iteration, respectively. In addition, we included ℓ2-regularization of y with regularization penalty 0.0005. We used vanilla SGD with a learning rate scheduler (ReduceLROnPlateau), invoked every 100 outer iterations, with patience of 10. Each optimization was executed for 10000 outer iterations. ... In this application, we use learning rates α = 0.0001, β = 0.001, γ = 0.001 and 20000 outer iterations. We use a batch size of 8 for both the inner and outer steps for each i ∈ [16] for the initial experiment in Figure 2a. The optimizer was vanilla SGD with a learning rate scheduler (ReduceLROnPlateau), invoked every 100 outer iterations, with patience of 30.
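The setup quoted above uses three step sizes α, β, γ for the upper-level variable x, the lower-level variables y_i, and the simplex weights λ, matching the single-loop structure of Algorithm 1. The following is a minimal sketch of that update pattern on an assumed toy problem: the quadratic objectives, dimensions, and the omission of MORBiT's implicit-gradient correction are our simplifications for illustration, not the paper's actual losses or code.

```python
def project_simplex(v):
    """Euclidean projection onto the probability simplex {lam >= 0, sum(lam) = 1}."""
    u = sorted(v, reverse=True)
    css = 0.0
    rho, rho_css = 1, u[0]
    for j, uj in enumerate(u, start=1):
        css += uj
        if uj + (1.0 - css) / j > 0:
            rho, rho_css = j, css
    theta = (1.0 - rho_css) / rho
    return [max(vi + theta, 0.0) for vi in v]

def morbit_toy(alpha=0.01, beta=0.01, gamma=0.3, iters=2000):
    # Toy objectives (assumed, for illustration only):
    #   LL: y_i*(x) minimizes (y_i - x)^2;  UL: f_i(x, y_i) = (x - c_i)^2 + y_i^2
    c = [0.0, 1.0, 2.0]          # per-task targets
    n = len(c)
    x = 5.0                      # upper-level variable
    y = [0.0] * n                # lower-level variables, one per task
    lam = [1.0 / n] * n          # simplex weights over tasks
    for _ in range(iters):
        # Lower-level descent on each y_i with step size beta
        y = [yi - beta * 2.0 * (yi - x) for yi in y]
        # Upper-level descent on x with step size alpha, weighted by lambda
        # (the implicit gradient through y_i*(x) is omitted in this sketch)
        gx = sum(li * 2.0 * (x - ci) for li, ci in zip(lam, c))
        x -= alpha * gx
        # Ascent on lambda with step size gamma, then projection onto the simplex
        f = [(x - ci) ** 2 + yi ** 2 for ci, yi in zip(c, y)]
        lam = project_simplex([li + gamma * fi for li, fi in zip(lam, f)])
    return x, y, lam
```

The projection step after the λ-ascent keeps the task weights on the probability simplex, so the outer max concentrates weight on the currently worst-performing tasks; the three independent step sizes mirror the (α, β, γ) triples reported in the rows above.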