Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

Authors: Hongru Yang, Zhangyang Wang, Jason Lee, Yingbin Liang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As a guidance for our theory, we first conduct experiments to observe the training dynamics of the transformer where we train all the weights simultaneously. Our experiment results show a clear stage-wise learning phenomenon where the neuron weights learn before the attention modules. We empirically show the difficulty of generalizing our analysis of the gradient flow dynamics to the case even when the number of mixtures equals three, although the transformer can still successfully learn such distribution.
Researcher Affiliation | Academia | Hongru Yang (The University of Texas at Austin & Princeton University), Zhangyang Wang (The University of Texas at Austin), Jason D. Lee (Princeton University), Yingbin Liang (The Ohio State University)
Pseudocode | Yes | Algorithm 1: Three-stage Training
Open Source Code | No | The paper does not provide any explicit statements about releasing source code for their methodology, nor does it include a link to a code repository.
Open Datasets | Yes | In our experiments, we use MNIST dataset and extract the images with label 1 and label 2 to play the role of classification signals.
Dataset Splits | No | The paper describes how the synthetic K-mixture of linear classification data is generated (Definition 2.1) and mentions specific parameters like K=2 and L=3 for the training dynamics. For the MNIST dataset, it states that images with labels 1 and 2 are extracted. However, it does not provide explicit training, validation, or test split percentages or counts for either the synthetic or real-world datasets, nor does it reference standard splits with citations.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers that would be necessary to replicate the experiments.
Experiment Setup | Yes | In our experiments, all the weights in the transformer are trained simultaneously via gradient descent with learning rate 0.1. Initialize w^(0) = 0, draw the entries of W_K^(0) and W_Q^(0) i.i.d. from N(0, ω²/m), and set b to a sufficiently small positive constant such as 1/2. The attention initialization scale satisfies ω < C < 1 for some small constant C.
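The initialization and simultaneous gradient-descent training quoted in the Experiment Setup row can be sketched as below. Only the learning rate 0.1, the zero initialization of w, the N(0, ω²/m) scale for W_K and W_Q, and b = 1/2 come from the paper; the dimensions, the toy data, and the logistic loss are illustrative assumptions standing in for the transformer and its mixture-of-linear-classification objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 16, 32, 64       # assumed toy dimensions (not from the paper)
omega = 0.1                # attention init scale, satisfying omega < C < 1
lr = 0.1                   # learning rate reported in the paper

w = np.zeros(d)                                     # w^(0) = 0
W_K = rng.normal(0.0, omega / np.sqrt(m), (m, d))   # entries ~ N(0, omega^2/m)
W_Q = rng.normal(0.0, omega / np.sqrt(m), (m, d))
b = 0.5                                             # small positive constant

# Toy binary-classification data standing in for the mixture distribution.
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Plain gradient descent on a logistic loss (illustrative only; the paper
# trains all transformer weights simultaneously on its own objective).
for _ in range(100):
    margin = y * (X @ w)
    sig = 1.0 / (1.0 + np.exp(margin))              # sigmoid(-margin)
    grad_w = -(y[:, None] * X * sig[:, None]).mean(axis=0)
    w -= lr * grad_w                                # simultaneous GD step

final_loss = float(np.mean(np.log1p(np.exp(-y * (X @ w)))))
print(round(final_loss, 4))
```

Since the loss at w = 0 equals log 2 and the objective is convex in w, the printed final loss falls below log 2, mirroring the kind of training-dynamics observation the paper makes at much larger scale.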
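The MNIST preprocessing quoted in the Open Datasets row (keeping only the images with label 1 and label 2) amounts to a simple label filter. The sketch below uses random stand-in arrays rather than the real dataset, and `extract_classes` is a hypothetical helper name, not from the paper.

```python
import numpy as np

def extract_classes(images, labels, keep=(1, 2)):
    """Return only the samples whose label is in `keep`.

    Mirrors the paper's step of extracting MNIST images with
    label 1 and label 2 to play the role of classification signals.
    """
    mask = np.isin(labels, keep)
    return images[mask], labels[mask]

# Random stand-in arrays; the paper uses real 28x28 MNIST images.
images = np.random.rand(6, 28, 28)
labels = np.array([0, 1, 2, 1, 3, 2])

sub_images, sub_labels = extract_classes(images, labels)
print(sub_labels.tolist())  # -> [1, 2, 1, 2]
```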