Global Convergence in Training Large-Scale Transformers

Authors: Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Klusowski, Jianqing Fan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. (A schematic form of this mean-field objective and its PDE is sketched below the table.)
Researcher Affiliation | Academia | Cheng Gao¹, Yuan Cao², Zihao Li¹, Yihan He¹, Mengdi Wang¹, Han Liu³, Jason M. Klusowski¹, Jianqing Fan¹; ¹Princeton University, ²The University of Hong Kong, ³Northwestern University
Pseudocode | No | The paper describes methods and processes using mathematical equations and textual explanations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No]
Open Datasets | Yes | In this section, we run simple experiments on training Vision Transformers (ViT) [24] on the CIFAR-10 dataset to demonstrate global convergence in practical applications.
Dataset Splits | No | The paper mentions 'training loss and training accuracy' but does not specify exact training, validation, or test dataset splits. It only indicates that CIFAR-10 images are split into patches.
Hardware Specification | No | We note that all these experiments are conducted on a standard GPU card.
Software Dependencies | No | We train the ViT models using Adam for 200 epochs with a mini-batch size 512. ... The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and GeLU activation.
Experiment Setup | Yes | We train Vision Transformers with different numbers of heads and layers. In all our experiments, we split each CIFAR-10 image into four patches and then pass the patches into Vision Transformer models. We keep the dimension of each attention head to be 128. The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and GeLU activation. Both the self-attention and feedforward components include skip connections. We implement dropout in the self-attention layers as well as the feedforward layers with a dropout probability of 0.1. The model is attached to a linear classifier. ... We train the ViT models using Adam for 200 epochs with a mini-batch size 512. We set the initial learning rate to be 1e-4, and implement a cosine annealing learning rate schedule. (A PyTorch sketch of this configuration appears at the end of the section.)
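
To make the Research Type row concrete, here is a minimal schematic of a mean-field objective with L2 weight decay and its Wasserstein gradient flow, written in the standard continuity-equation form. This is a sketch under generic assumptions; the paper's exact functional, loss, parameterization of the distribution, and regularizer may differ.

```latex
% Schematic only: \rho is a distribution over (attention/feedforward) parameters \theta,
% f_\rho the mean-field Transformer predictor, \lambda the weight-decay strength.
\[
  \mathcal{R}_\lambda(\rho)
  = \mathbb{E}_{(x,y)}\!\left[\ell\bigl(f_\rho(x), y\bigr)\right]
  + \lambda \int \|\theta\|_2^2 \,\mathrm{d}\rho(\theta).
\]
% The Wasserstein gradient flow of \mathcal{R}_\lambda is a continuity-equation PDE
% in the parameter distribution \rho_t:
\[
  \partial_t \rho_t
  = \nabla_\theta \cdot \Bigl( \rho_t \, \nabla_\theta
      \frac{\delta \mathcal{R}_\lambda}{\delta \rho}(\rho_t)(\theta) \Bigr),
\]
% which finite-width, finite-depth gradient flow approximates as the model
% width and depth go to infinity.
```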
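
The Experiment Setup row fixes the hyperparameters (four patches per CIFAR-10 image, head dimension 128, a 128-unit GeLU feedforward block, skip connections, dropout 0.1, a linear classifier, Adam with initial learning rate 1e-4, cosine annealing, 200 epochs, mini-batch size 512). The following PyTorch sketch assembles those pieces; it is not the authors' code. The names PatchViT and main, the linear patch embedding, the mean pooling before the classifier, the choice of 4 heads and 4 layers, and the use of nn.TransformerEncoderLayer (which also applies layer normalization, something the quoted text does not mention) are assumptions.

```python
# Minimal sketch of the described ViT-on-CIFAR-10 setup (assumptions noted above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

HEAD_DIM = 128            # "dimension of each attention head" (from the quoted setup)
N_HEADS, N_LAYERS = 4, 4  # the paper varies these; 4 heads / 4 layers is an arbitrary example


class PatchViT(nn.Module):
    def __init__(self, n_heads=N_HEADS, n_layers=N_LAYERS, n_classes=10):
        super().__init__()
        d_model = HEAD_DIM * n_heads
        # Each 3x32x32 CIFAR-10 image is split into four 3x16x16 patches,
        # flattened, and linearly embedded (the embedding choice is an assumption).
        self.embed = nn.Linear(3 * 16 * 16, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128,
            dropout=0.1, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)  # linear classifier head

    def forward(self, x):  # x: (B, 3, 32, 32)
        # Cut the image into a 2x2 grid of 16x16 patches -> (B, 4, 768) tokens.
        patches = x.unfold(2, 16, 16).unfold(3, 16, 16)           # (B, 3, 2, 2, 16, 16)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(x.size(0), 4, -1)
        tokens = self.encoder(self.embed(patches))                # (B, 4, d_model)
        return self.classifier(tokens.mean(dim=1))                # mean-pool, then classify


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_set = datasets.CIFAR10("data", train=True, download=True,
                                 transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=512, shuffle=True)

    model = PatchViT().to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)           # initial learning rate 1e-4
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(200):                                      # 200 epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()                                              # cosine annealing per epoch


if __name__ == "__main__":
    main()
```

Note that nn.TransformerEncoderLayer already includes the residual (skip) connections around both the self-attention and feedforward sub-blocks and applies dropout in both, which matches the skip-connection and dropout description in the quoted setup.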