Multi-block Min-max Bilevel Optimization with Applications in Multi-task Deep AUC Maximization

Authors: Quanqi Hu, Yongjian Zhong, Tianbao Yang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results validate our theory and demonstrate the effectiveness of our method on problems with hundreds of tasks. (Section 4: Experiments)
Researcher Affiliation | Academia | Quanqi Hu, Department of Computer Science, Texas A&M University, College Station, TX 77843, quanqi-hu@tamu.edu; Yongjian Zhong, Department of Computer Science, University of Iowa, Iowa City, IA 52242, yongjian-zhong@uiowa.edu; Tianbao Yang, Department of Computer Science, Texas A&M University, College Station, TX 77843, tianbao-yang@tamu.edu
Pseudocode | Yes | Algorithm 1: A Stochastic Algorithm for Multi-block Min-max Bilevel Optimization (v1); Algorithm 2: A Stochastic Algorithm for Multi-block Min-max Bilevel Optimization (v2); Algorithm 3 in Appendix C.
Open Source Code | Yes | NeurIPS checklist 3(a): "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?" [Yes]
Open Datasets | Yes | We use four datasets, namely CIFAR-100, CheXpert, CelebA and ogbg-molpcba.
Dataset Splits | Yes | We follow a 45,000/5,000/10,000 split to construct training/validation/testing datasets (CIFAR-100). We use the recommended training/validation/testing split of 162,770/19,866/19,961 (CelebA). We take the official validation set as the testing data, and take the last 1000 images in the training dataset for validation. (See the split-construction sketch below the table.)
Hardware Specification | No | NeurIPS checklist 3(d): "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?" [N/A]
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | For both methods, the learning rates η2, η1, η0 are set to be the same and tuned in {0.01, 0.03, 0.05, 0.07, 0.1}. The learning rates decay by a factor of 10 at the 4th and 30th epoch for CheXpert and CelebA, respectively; no learning rate decay is applied for CIFAR-100 and ogbg-molpcba. The moving average parameter β0 and η in the lower-level problem of mAUC-CT (ours) are tuned in {0.1, 0.5, 0.9}. Regarding task sampling, for CIFAR-100 and ogbg-molpcba, 10 tasks are sampled to be updated in each iteration, and for each sampled task we independently sample a data batch of size 128; for the other two datasets with fewer tasks, CheXpert and CelebA, we sample one task at each iteration, with a data batch size of 32 for CheXpert and 128 for CelebA. We run both methods for the same number of epochs, which varies across datasets: 2000 epochs for CIFAR-100, 6 for CheXpert, 40 for CelebA and 100 for ogbg-molpcba. For all methods, the learning rate is tuned in {0.0001, 0.0005, 0.001, 0.005, 0.01}. The hyperparameters of MMB-pAUC are: η1 and η2 in {0.5, 0.1, 0.01}, β1 in {0.99, 0.9, 0.5, 0.1, 0.01}, and β0 in {0.9, 0.99}. The momentum parameters of SOPA-s are tuned in the same range, and its λ parameter in {0.1, 1, 10} as in [42]. The margin parameter in the surrogate loss (e.g., c) is set to 1. Regarding task sampling, we sample one task at each iteration for ogbg-molpcba and CheXpert, 10 tasks for CIFAR-100, and 4 tasks for CelebA. The data batch size is 32 for CheXpert and 64 for the others. For the smaller datasets (CIFAR-100 and ogbg-molpcba) we run 100 epochs each and decay the learning rate by a factor of 10 at the 50th epoch; for the larger datasets (CelebA and CheXpert) we run 50 and 5 epochs, respectively. (A minimal training-loop sketch based on these settings follows the table.)
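
As a concrete illustration of the splits quoted in the Dataset Splits row, below is a minimal sketch, assuming PyTorch/torchvision, of how the reported 45,000/5,000/10,000 CIFAR-100 split could be constructed. The transform and the choice of which 5,000 training images are held out for validation are assumptions, not details stated in the paper.

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the paper's preprocessing is not specified here

full_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
test_set   = datasets.CIFAR100(root="./data", train=False, download=True, transform=transform)

# 45,000 / 5,000 split of the 50,000 CIFAR-100 training images. Whether the paper holds out
# the first or the last 5,000 images is not stated, so this ordering is an assumption.
train_set = Subset(full_train, range(45_000))
val_set   = Subset(full_train, range(45_000, 50_000))

print(len(train_set), len(val_set), len(test_set))  # 45000 5000 10000
```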
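
The following is a minimal training-loop sketch, assuming PyTorch, of the CIFAR-100 schedule from the second setup quoted in the Experiment Setup row: 10 tasks sampled per iteration, batch size 64 per task, 100 epochs, a 10x learning-rate decay at the 50th epoch, and a learning rate taken from the reported tuning grid. The model, feature dimension, iteration count, and per-task loss are placeholders; the paper's actual objective is a min-max bilevel AUC formulation (mAUC-CT / MMB-pAUC), which this sketch does not implement.

```python
import random
import torch
import torch.nn.functional as F

NUM_TASKS       = 100   # e.g., one task per CIFAR-100 class (assumption)
TASKS_PER_ITER  = 10    # 10 tasks sampled per iteration (from the table)
BATCH_SIZE      = 64    # per-task batch size for CIFAR-100 in this setup (from the table)
EPOCHS          = 100   # from the table
FEAT_DIM        = 512   # assumed feature dimension
ITERS_PER_EPOCH = 100   # assumed; depends on dataset size

model     = torch.nn.Linear(FEAT_DIM, NUM_TASKS)           # placeholder multi-head model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # lr from the reported tuning grid
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)

for epoch in range(EPOCHS):
    for _ in range(ITERS_PER_EPOCH):
        optimizer.zero_grad()
        # Sample a subset of tasks, then an independent data batch per sampled task.
        for t in random.sample(range(NUM_TASKS), TASKS_PER_ITER):
            x = torch.randn(BATCH_SIZE, FEAT_DIM)           # stand-in batch for task t
            y = torch.randint(0, 2, (BATCH_SIZE,)).float()  # stand-in binary labels
            # Placeholder per-task loss; the paper optimizes an AUC objective through a
            # min-max bilevel formulation, which is not reproduced here.
            loss = F.binary_cross_entropy_with_logits(model(x)[:, t], y)
            loss.backward()
        optimizer.step()
    scheduler.step()
```

The MultiStepLR scheduler encodes the "decay by a factor of 10 at the 50th epoch" rule; replacing the placeholder loss with the paper's compositional AUC objective would require the bilevel updates of Algorithms 1-3, which are not reproduced in this sketch.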