Variational Learning is Effective for Large Deep Networks

Authors: Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Bazan Clement Emile Marcel Raoul, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, Thomas Möllenhoff

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data.
Researcher Affiliation | Academia | (1) Technical University of Munich & Munich Center for Machine Learning, Munich, Germany; (2) UKP Lab, Technical University of Darmstadt & hessian.AI, Darmstadt, Germany; (3) Tokyo Institute of Technology, Tokyo, Japan; (4) RIKEN Center for AI Project, Tokyo, Japan.
Pseudocode | Yes | Algorithm 1: Improved Variational Online Newton (IVON). Hyperparameter setting is described in App. A. (A rough sketch of the update appears after this table.)
Open Source Code | Yes | Code is available at https://github.com/team-approx-bayes/ivon. (A usage sketch appears after this table.)
Open Datasets | Yes | GPT-2 on OpenWebText, ResNet-50 on ImageNet. For training GPT-2 (773M parameters) from scratch, IVON gives a 0.4 reduction in validation perplexity over AdamW and, for ResNet-50 (25.6M parameters) on ImageNet, it gives around 2% more accurate predictions that are also better calibrated.
Dataset Splits | Yes | Fig. 1(a) shows some examples where, for training GPT-2 (773M parameters) from scratch, IVON gives a 0.4 reduction in validation perplexity over AdamW and, for ResNet-50 (25.6M parameters) on ImageNet, it gives around 2% more accurate predictions that are also better calibrated. For IVON, we set them by grid search on a smaller model. We pretrain from scratch three models with parameter sizes of 125M, 355M (GPT-2-medium), and 773M (GPT-2-large), respectively. We use gradient clipping to stabilize the training. Details are in App. C.1.
Hardware Specification | Yes | We train on 8 NVIDIA A100 GPUs with 40GB GPU memory each for up to three days. Some of the experiments were carried out with the TSUBAME3.0 supercomputer at the Tokyo Institute of Technology.
Software Dependencies | No | The paper mentions "PyTorch" but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | Hyperparameter setting is described in App. A. For IVON, we use an initial learning rate of 0.3 for the 125M parameter checkpoint, 0.2 for the 355M parameter checkpoint, and 0.15 for the 773M parameter checkpoint. Note that we do not rescale by h0 and δ in this case, because element-wise clipping is used. We use β1 = 0.9, β2 = 1 − 10⁻⁵, h0 = 0.001 and a weight decay factor of 10⁻⁶, as well as element-wise clipping of 10⁻³. (These values are collected in a snippet after this table.)
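
The Pseudocode row above only names Algorithm 1, so here is a minimal NumPy sketch of a variational online Newton update in the spirit of IVON: weights are sampled from a diagonal Gaussian posterior, a reparameterization-based Hessian estimate is accumulated with momentum, and the mean is updated with a Newton-like step. The exact form of the Hessian estimate, debiasing, and correction term is our reading of the method, not a transcription of the paper's Algorithm 1, and `ivon_style_step`, `grad_fn`, and all numeric values are illustrative.

```python
import numpy as np

def ivon_style_step(m, h, g_mom, grad_fn, *, step, lr, ess, wd, beta1, beta2, rng):
    """One IVON-style update on a flat parameter vector (illustrative sketch)."""
    sigma2 = 1.0 / (ess * (h + wd))                              # diagonal posterior variance
    theta = m + np.sqrt(sigma2) * rng.standard_normal(m.shape)   # sample weights from the posterior
    g_hat = grad_fn(theta)                                       # minibatch gradient at the sample
    h_hat = g_hat * (theta - m) / sigma2                         # reparameterization-based Hessian estimate
    g_mom = beta1 * g_mom + (1.0 - beta1) * g_hat                # gradient momentum
    # Hessian momentum with a small second-order correction term (our assumption)
    h = beta2 * h + (1.0 - beta2) * h_hat \
        + 0.5 * (1.0 - beta2) ** 2 * (h - h_hat) ** 2 / (h + wd)
    g_bar = g_mom / (1.0 - beta1 ** step)                        # bias-corrected momentum
    m = m - lr * (g_bar + wd * m) / (h + wd)                     # Newton-like mean update
    return m, h, g_mom

# Toy usage on the quadratic loss 0.5 * ||w - 1||^2 (gradient w - 1); values are illustrative.
rng = np.random.default_rng(0)
d = 10
m, h, g_mom = np.zeros(d), np.full(d, 0.1), np.zeros(d)
for t in range(1, 2001):
    m, h, g_mom = ivon_style_step(m, h, g_mom, lambda w: w - 1.0, step=t, lr=1e-3,
                                  ess=10_000, wd=1e-6, beta1=0.9, beta2=1 - 1e-5, rng=rng)
print(m.round(2))  # close to the minimizer at 1
```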
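
The Open Source Code row points to the released `ivon` repository. Below is a short PyTorch sketch of how we understand the optimizer is used: training gradients are computed at weights sampled from the variational posterior under a `sampled_params(train=True)` context, and predictive uncertainty comes from averaging probabilities over several posterior samples. The constructor arguments, the toy model, and all numeric values here are assumptions for illustration; the repository README documents the exact interface.

```python
import torch
import torch.nn.functional as F
import ivon  # assumed to be installable as "ivon-opt" and importable as "ivon"

model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))  # toy data
loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X, y), batch_size=64)

# lr and ess (effective sample size) values are illustrative, not the paper's settings.
optimizer = ivon.IVON(model.parameters(), lr=0.1, ess=len(X))

for epoch in range(3):
    for xb, yb in loader:
        # Gradients are taken at weights sampled from the variational posterior.
        with optimizer.sampled_params(train=True):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(xb), yb)
            loss.backward()
        optimizer.step()

# Predictive uncertainty: average softmax probabilities over several posterior samples.
with torch.no_grad():
    probs = []
    for _ in range(8):
        with optimizer.sampled_params():
            probs.append(F.softmax(model(X), dim=-1))
    mean_probs = torch.stack(probs).mean(0)
```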
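
For readability, the GPT-2 pretraining values from the Experiment Setup row are collected below as plain Python numbers. The dictionary keys are descriptive labels only, not the optimizer's actual argument names.

```python
# IVON hyperparameters for GPT-2 pretraining, as reported in the Experiment Setup row above.
ivon_gpt2_pretraining = {
    "initial_lr": {"gpt2-125m": 0.3, "gpt2-355m": 0.2, "gpt2-773m": 0.15},
    "beta1": 0.9,
    "beta2": 1 - 1e-5,         # = 0.99999
    "hess_init_h0": 1e-3,
    "weight_decay": 1e-6,
    "elementwise_clip": 1e-3,  # element-wise clipping threshold
}
```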