When Do Flat Minima Optimizers Work?
Authors: Jean Kaddour, Linqing Liu, Ricardo Silva, Matt J. Kusner
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We empirically compare the optimizers with a rigorous model selection procedure on a broad range of tasks across different domains (CV, NLP, and GRL), model types (MLPs, CNNs, Transformers) and tasks (classification, self-supervised learning, open-domain question answering, natural language understanding, and node/graph/link property prediction). |
| Researcher Affiliation | Academia | Jean Kaddour, Centre for Artificial Intelligence, University College London; Linqing Liu, Centre for Artificial Intelligence, University College London; Ricardo Silva, Department of Statistical Science, University College London; Matt J. Kusner, Centre for Artificial Intelligence, University College London |
| Pseudocode | Yes | Algorithm 1: Stochastic Weight Averaging [48]; Algorithm 2: Sharpness-Aware Minimization [22] (a minimal sketch of both updates appears after the table). |
| Open Source Code | Yes | To assist future work, we open-source the code for all pipelines and hyper-parameters to reproduce the results. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Supplemental material |
| Open Datasets | Yes | WRN on CIFAR-100 [63]; WRN-28-10... on CIFAR-{10, 100} [63]; SSL: CIFAR-10 and Imagenette. We use a subset of the Open Graph Benchmark (OGB) datasets [45]. The tasks are node property prediction (NPP), graph property prediction (GPP), and link property prediction (LPP). Natural Questions (NQ) [64] and TriviaQA [50]; GLUE benchmark [91]. |
| Dataset Splits | Yes | We select hyper-parameters using a grid search over a held-out validation set. Appendix B contains the values of all hyper-parameters and additional training details (including public model checkpoints, hardware infrastructure, software libraries, etc.) to ensure full reproducibility alongside open-sourcing our code. |
| Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix. (Appendix B.1) All experiments are conducted on NVIDIA A100-SXM4-40GB GPUs. |
| Software Dependencies | No | Appendix B contains the values of all hyper-parameters and additional training details (including public model checkpoints, hardware infrastructure, software libraries, etc.) to ensure full reproducibility alongside open-sourcing our code. However, the appendix (B.1) does not specify version numbers for the mentioned software libraries (e.g., PyTorch, Hugging Face Transformers, PyTorch Geometric), which are required for reproducibility. |
| Experiment Setup | Yes | For all architectures and datasets, we set hyperparameters shared by all methods (e.g., learning rate) mostly to values cited in prior work. ... We select hyper-parameters using a grid search over a held-out validation set. Specifically, for SWA we follow Izmailov et al. [48] and hold the update frequency ν constant to once per epoch and tune the start time E ∈ {0.5T, 0.6T, 0.75T, 0.9T} (T is the number of baseline training epochs). ... For SAM, we tune its neighborhood size ρ ∈ {0.01, 0.02, 0.05, 0.1, 0.2}, as in previous work [22, 4]. Appendix B contains the values of all hyper-parameters and additional training details (including public model checkpoints, hardware infrastructure, software libraries, etc.) to ensure full reproducibility alongside open-sourcing our code. (A sketch of this grid-search selection appears after the table.) |
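
The pseudocode row refers to the paper's Algorithm 1 (SWA) and Algorithm 2 (SAM). Below is a minimal PyTorch sketch of the two updates, assuming a generic model, loss, and base optimizer; it illustrates only the core steps and is not the authors' released implementation.

```python
# Minimal sketch of the two flat-minima updates the paper benchmarks:
# SAM (Foret et al.) and SWA (Izmailov et al.). Model, data, and
# hyper-parameter values here are illustrative placeholders.
import torch
import torch.nn as nn


def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One SAM update: perturb weights to the approximate worst point in an
    L2 ball of radius rho, take the gradient there, then step the base optimizer."""
    # First forward/backward pass: gradient at the current weights.
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        scale = rho / (grad_norm + 1e-12)
        # Ascend to w + epsilon, remembering epsilon so it can be undone.
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = p.grad * scale
            p.add_(e)
            eps.append(e)

    # Second pass: gradient at the perturbed weights.
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    with torch.no_grad():
        # Restore the original weights before the base optimizer steps.
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()


@torch.no_grad()
def swa_update(avg_params, model, n_averaged):
    """Running average of the weights, applied e.g. once per epoch after the
    SWA start epoch (the paper tunes the start time over {0.5T, ..., 0.9T})."""
    if avg_params is None:
        return [p.detach().clone() for p in model.parameters()], 1
    for avg, p in zip(avg_params, model.parameters()):
        avg += (p.detach() - avg) / (n_averaged + 1)
    return avg_params, n_averaged + 1
```

In practice the SWA average can also be maintained with `torch.optim.swa_utils.AveragedModel`, and batch-normalization statistics should be recomputed for the averaged weights before evaluation.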
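
The model-selection procedure quoted in the Experiment Setup row reduces to a small per-method grid search on a held-out validation set. A sketch under the stated grids, where `train_and_validate` is a hypothetical callable standing in for the paper's actual training pipelines:

```python
T = 200  # total number of baseline training epochs (illustrative value)

swa_start_grid = [int(f * T) for f in (0.5, 0.6, 0.75, 0.9)]  # SWA start epoch E
sam_rho_grid = [0.01, 0.02, 0.05, 0.1, 0.2]                   # SAM neighborhood size rho


def select_hyperparameters(train_and_validate):
    """Return the (method, config) pair with the best validation score.

    `train_and_validate(method, **cfg)` is a hypothetical callable that trains
    one configuration and returns its held-out validation metric.
    """
    candidates = [("swa", {"start_epoch": e}) for e in swa_start_grid]
    candidates += [("sam", {"rho": r}) for r in sam_rho_grid]
    return max(candidates, key=lambda mc: train_and_validate(mc[0], **mc[1]))
```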