Global Convergence in Training Large-Scale Transformers
Authors: Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Klusowski, Jianqing Fan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. (A generic form of such a Wasserstein gradient-flow PDE is sketched after the table.) |
| Researcher Affiliation | Academia | Cheng Gao¹, Yuan Cao², Zihao Li¹, Yihan He¹, Mengdi Wang¹, Han Liu³, Jason M. Klusowski¹, Jianqing Fan¹ (¹Princeton University, ²The University of Hong Kong, ³Northwestern University) |
| Pseudocode | No | The paper describes methods and processes using mathematical equations and textual explanations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] |
| Open Datasets | Yes | In this section, we run simple experiments on training Vision Transformers (ViT) [24] on the CIFAR-10 dataset to demonstrate global convergence in practical applications. |
| Dataset Splits | No | The paper mentions 'training loss and training accuracy' but does not specify exact training, validation, or test dataset splits. It only indicates that CIFAR-10 images are split into patches. |
| Hardware Specification | No | We note that all these experiments are conducted on a standard GPU card. |
| Software Dependencies | No | We train the ViT models using Adam for 200 epochs with a mini-batch size 512. ... The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and GeLU activation. |
| Experiment Setup | Yes | We train Vision Transformers with different numbers of heads and layers. In all our experiments, we split each CIFAR-10 image into four patches and then pass the patches into Vision Transformer models. We keep the dimension of each attention head to be 128. The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and GeLU activation. Both the self-attention and feedforward components include skip connections. We implement dropout in the self-attention layers as well as the feedforward layers with a dropout probability of 0.1. The model is attached to a linear classifier. ... We train the ViT models using Adam for 200 epochs with a mini-batch size 512. We set the initial learning rate to be 1e-4, and implement a cosine annealing learning rate schedule. (A minimal training sketch based on this setup follows the table.) |
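The "Research Type" row refers to a Wasserstein gradient flow represented by a partial differential equation. As context only, the generic continuity-equation form of a Wasserstein gradient flow for an energy functional $F$ over parameter distributions $\rho_t$ is shown below; the paper's functional additionally involves the Transformer loss and the weight-decay term, so this template is not the exact PDE derived in the paper.

$$
\partial_t \rho_t \;=\; \nabla \cdot \Big( \rho_t \, \nabla \frac{\delta F}{\delta \rho}(\rho_t) \Big),
$$

where $\frac{\delta F}{\delta \rho}$ denotes the first variation (functional derivative) of $F$.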
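For concreteness, here is a minimal PyTorch sketch of the training setup quoted in the "Experiment Setup" row: four patches per CIFAR-10 image, 128-dimensional attention heads, a single-hidden-layer GeLU feedforward block with 128 neurons, skip connections, dropout 0.1, a linear classifier, and Adam with learning rate 1e-4, batch size 512, 200 epochs, and cosine annealing. The paper releases no code, so the patch embedding, the mean pooling before the classifier, and the choice of 4 heads and 4 layers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the reported ViT/CIFAR-10 setup (assumptions noted in comments).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

NUM_HEADS, NUM_LAYERS = 4, 4          # the paper varies these; 4/4 is an illustrative choice
HEAD_DIM = 128                        # dimension of each attention head (stated in the paper)
EMBED_DIM = HEAD_DIM * NUM_HEADS      # assumption: model width = heads * head dimension
PATCHES, PATCH_DIM = 4, 16 * 16 * 3   # each 32x32x3 image split into four 16x16 patches

class Block(nn.Module):
    """Self-attention plus a single-hidden-layer GeLU feedforward, both with skip connections."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(EMBED_DIM, NUM_HEADS, dropout=0.1, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(EMBED_DIM, 128), nn.GELU(), nn.Dropout(0.1), nn.Linear(128, EMBED_DIM)
        )

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]  # skip connection around attention
        return x + self.ffn(x)                             # skip connection around feedforward

class TinyViT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(PATCH_DIM, EMBED_DIM)   # assumed linear patch embedding
        self.blocks = nn.Sequential(*[Block() for _ in range(NUM_LAYERS)])
        self.head = nn.Linear(EMBED_DIM, 10)           # linear classifier

    def forward(self, images):                         # images: (B, 3, 32, 32)
        patches = images.unfold(2, 16, 16).unfold(3, 16, 16)        # 2x2 grid of 16x16 patches
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, PATCHES, PATCH_DIM)
        x = self.blocks(self.embed(patches))
        return self.head(x.mean(dim=1))                # assumed mean pooling over patch tokens

def train():
    data = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=T.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=512, shuffle=True)
    model, loss_fn = TinyViT(), nn.CrossEntropyLoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
    for epoch in range(200):
        for images, labels in loader:
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
        sched.step()                                   # cosine annealing stepped once per epoch
```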