Min-Max Multi-objective Bilevel Optimization with Applications in Robust Machine Learning
Authors: Alex Gu, Songtao Lu, Parikshit Ram, Tsui-Wei Weng
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on robust representation learning and robust hyperparameter optimization showcase (i) the advantages of considering the min-max multi-objective setup, and (ii) convergence properties of the proposed MORBiT. |
| Researcher Affiliation | Collaboration | Alex Gu, Songtao Lu, Parikshit Ram, Tsui-Wei Weng*; MIT CSAIL, IBM Research, *UCSD; gua@mit.edu, {songtao, parikshit.ram}@ibm.com, *lweng@ucsd.edu |
| Pseudocode | Yes | Algorithm 1: MORBiT with learning rates α, β and γ for x, y, λ respectively |
| Open Source Code | Yes | Our code is at https://github.com/minimario/MORBiT. |
| Open Datasets | Yes | We first consider a multi-task setup with n = 10 binary classification tasks from the Fashion MNIST dataset (Xiao et al., 2017). ... We also consider a bilevel extension of the robust meta-learning application (Collins et al., 2020) for a sinusoid regression task, a common meta-learning application introduced by Finn et al. (2017) ... We generate n = 16 binary classification tasks from the Letter dataset (Frey & Slate, 1991) |
| Dataset Splits | Yes | each of the 16 learning tasks (and hence, objective pairs) has a training set size of around 900 samples (for the LL loss), with 300 samples each for the UL loss and for computing the generalization loss. |
| Hardware Specification | Yes | We perform our experiments in Python 3.7.10 and PyTorch 1.8.1 with Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz. |
| Software Dependencies | Yes | We perform our experiments in Python 3.7.10 and PyTorch 1.8.1 |
| Experiment Setup | Yes | We use PyTorch (Paszke et al., 2019), and implementation details are in Appendix C. All results are aggregated over 10 trials. ... For the Task-Robust version of the algorithm, we use α = 0.007, β = 0.005, γ = 0.003. For the standard version of the algorithm, we use α = 0.007, β = 0.011, γ = 0.003. ... For our data, we had x ∈ ℝ^(784×100) and y ∈ ℝ^(100×2). We used step sizes α = 0.01, β = 0.01, and γ = 0.3. We used batch sizes of 8 and 128 to compute g_i for each inner step and f_i for each outer iteration, respectively. In addition, we included ℓ2-regularization of y with regularization penalty 0.0005. We used vanilla SGD with a learning rate scheduler (ReduceLROnPlateau), invoked every 100 outer iterations, with patience of 10. Each optimization was executed for 10000 outer iterations. ... In this application, we use learning rates α = 0.0001, β = 0.001, γ = 0.001 and 20000 outer iterations. We use a batch size of 8 for both the inner and outer steps for each i ∈ [16] for the initial experiment in Figure 2a. The optimizer was vanilla SGD with a learning rate scheduler (ReduceLROnPlateau), invoked every 100 outer iterations, with patience of 30. |
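The setup rows above all revolve around the same three coupled step sizes from Algorithm 1: α for the upper-level variable x, β for the lower-level variables y_i, and γ for the simplex weights λ that drive the min-max (task-robust) part. As a rough illustration of how those three updates interact, the sketch below runs a single-loop iteration on toy quadratic objectives. It is a minimal sketch under stated assumptions, not the paper's MORBiT implementation: the objectives, step sizes, and iteration count are invented for illustration, and the hypergradient of f_i through y_i*(x) is simplified to a plain partial gradient.

```python
def project_simplex(v):
    """Euclidean projection of a vector (as a list) onto the probability
    simplex, via the standard sort-and-threshold method."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for k, uk in enumerate(u):
        css += uk
        t = (css - 1.0) / (k + 1)
        if uk > t:        # keep the last threshold that is still valid
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

# Toy instance (an assumption, not the paper's setup):
#   upper-level losses  f_i(x, y_i) = 0.5*(x - a_i)^2 + 0.5*y_i^2
#   lower-level losses  g_i(x, y_i) = 0.5*(y_i - x)^2, so y_i*(x) = x.
n = 3
a = [0.0, 1.0, 2.0]
alpha, beta, gamma = 0.05, 0.1, 0.05   # step sizes for x, y, lambda
x = 0.0
y = [0.0] * n
lam = [1.0 / n] * n                    # uniform weights on the simplex

for _ in range(2000):
    # (1) inner descent step on each lower-level loss g_i in y_i
    y = [yi - beta * (yi - x) for yi in y]
    # (2) descent step on the lambda-weighted upper-level loss in x
    #     (full hypergradient through y_i*(x) omitted in this toy)
    grad_x = sum(li * (x - ai) for li, ai in zip(lam, a))
    x = x - alpha * grad_x
    # (3) projected ascent step on the simplex weights lambda,
    #     shifting mass toward the currently worst objectives
    f_vals = [0.5 * (x - ai) ** 2 + 0.5 * yi ** 2 for ai, yi in zip(a, y)]
    lam = project_simplex([li + gamma * fi for li, fi in zip(lam, f_vals)])
```

On this instance the ascent on λ concentrates weight on the two extreme tasks (a = 0 and a = 2), so x settles near the point that equalizes their losses rather than the unweighted average, which is the qualitative behavior the task-robust (min-max) formulation is meant to produce.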