Global Convergence in Training Large-Scale Transformers

Authors: Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Klusowski, Jianqing Fan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. (A schematic form of this mean-field objective and its PDE is sketched below the table.)
Researcher Affiliation | Academia | Cheng Gao¹, Yuan Cao², Zihao Li¹, Yihan He¹, Mengdi Wang¹, Han Liu³, Jason M. Klusowski¹, Jianqing Fan¹; ¹Princeton University, ²The University of Hong Kong, ³Northwestern University
Pseudocode | No | The paper describes methods and processes using mathematical equations and textual explanations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No]
Open Datasets | Yes | In this section, we run simple experiments on training Vision Transformers (ViT) [24] on the CIFAR-10 dataset to demonstrate global convergence in practical applications.
Dataset Splits | No | The paper mentions 'training loss and training accuracy' but does not specify exact training, validation, or test dataset splits. It only indicates that CIFAR-10 images are split into patches.
Hardware Specification | No | We note that all these experiments are conducted on a standard GPU card.
Software Dependencies | No | We train the ViT models using Adam for 200 epochs with a mini-batch size 512. ... The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and GeLU activation.
Experiment Setup | Yes | We train Vision Transformers with different numbers of heads and layers. In all our experiments, we split each CIFAR-10 image into four patches and then pass the patches into Vision Transformer models. We keep the dimension of each attention head to be 128. The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and GeLU activation. Both the self-attention and feedforward components include skip connections. We implement dropout in the self-attention layers as well as the feedforward layers with a dropout probability of 0.1. The model is attached to a linear classifier. ... We train the ViT models using Adam for 200 epochs with a mini-batch size 512. We set the initial learning rate to be 1e-4, and implement a cosine annealing learning rate schedule. (A PyTorch sketch of this configuration appears at the end of the section.)
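
To make the Research Type row concrete, here is a minimal schematic of a mean-field objective with L2 weight decay and its Wasserstein gradient flow, written in the standard continuity-equation form. This is a sketch under generic assumptions; the paper's exact functional, loss, parameterization of the distribution, and regularizer may differ.

```latex
% Schematic only: \rho is a distribution over (attention/feedforward) parameters \theta,
% f_\rho the mean-field Transformer predictor, \lambda the weight-decay strength.
\[
  \mathcal{R}_\lambda(\rho)
  = \mathbb{E}_{(x,y)}\!\left[\ell\bigl(f_\rho(x), y\bigr)\right]
  + \lambda \int \|\theta\|_2^2 \,\mathrm{d}\rho(\theta).
\]
% The Wasserstein gradient flow of \mathcal{R}_\lambda is a continuity-equation PDE
% in the parameter distribution \rho_t:
\[
  \partial_t \rho_t
  = \nabla_\theta \cdot \Bigl( \rho_t \, \nabla_\theta
      \frac{\delta \mathcal{R}_\lambda}{\delta \rho}(\rho_t)(\theta) \Bigr),
\]
% which finite-width, finite-depth gradient flow approximates as the model
% width and depth go to infinity.
```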
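
The Experiment Setup row fixes the hyperparameters (four patches per CIFAR-10 image, head dimension 128, a 128-unit GeLU feedforward block, skip connections, dropout 0.1, a linear classifier, Adam with initial learning rate 1e-4, cosine annealing, 200 epochs, mini-batch size 512). The following PyTorch sketch assembles those pieces; it is not the authors' code. The names PatchViT and main, the linear patch embedding, the mean pooling before the classifier, the choice of 4 heads and 4 layers, and the use of nn.TransformerEncoderLayer (which also applies layer normalization, something the quoted text does not mention) are assumptions.

```python
# Minimal sketch of the described ViT-on-CIFAR-10 setup (assumptions noted above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

HEAD_DIM = 128            # "dimension of each attention head" (from the quoted setup)
N_HEADS, N_LAYERS = 4, 4  # the paper varies these; 4 heads / 4 layers is an arbitrary example


class PatchViT(nn.Module):
    def __init__(self, n_heads=N_HEADS, n_layers=N_LAYERS, n_classes=10):
        super().__init__()
        d_model = HEAD_DIM * n_heads
        # Each 3x32x32 CIFAR-10 image is split into four 3x16x16 patches,
        # flattened, and linearly embedded (the embedding choice is an assumption).
        self.embed = nn.Linear(3 * 16 * 16, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128,
            dropout=0.1, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)  # linear classifier head

    def forward(self, x):  # x: (B, 3, 32, 32)
        # Cut the image into a 2x2 grid of 16x16 patches -> (B, 4, 768) tokens.
        patches = x.unfold(2, 16, 16).unfold(3, 16, 16)           # (B, 3, 2, 2, 16, 16)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(x.size(0), 4, -1)
        tokens = self.encoder(self.embed(patches))                # (B, 4, d_model)
        return self.classifier(tokens.mean(dim=1))                # mean-pool, then classify


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_set = datasets.CIFAR10("data", train=True, download=True,
                                 transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=512, shuffle=True)

    model = PatchViT().to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)           # initial learning rate 1e-4
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(200):                                      # 200 epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()                                              # cosine annealing per epoch


if __name__ == "__main__":
    main()
```

Note that nn.TransformerEncoderLayer already includes the residual (skip) connections around both the self-attention and feedforward sub-blocks and applies dropout in both, which matches the skip-connection and dropout description in the quoted setup.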