Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Authors: Ruiquan Huang, Yingbin Liang, Jing Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments further validate our theoretical findings. |
| Researcher Affiliation | Academia | Ruiquan Huang, Penn State University, State College, PA, 16801, rzh5514@psu.edu; Yingbin Liang, Ohio State University, Columbus, OH, 43210, liang.889@osu.edu; Jing Yang, Penn State University, State College, PA, 16801, yangjing@psu.edu |
| Pseudocode | Yes | Algorithm 1 Two-stage Normalized Gradient Descent |
| Open Source Code | Yes | We provide our code in the supplemental. |
| Open Datasets | No | Specifically, we randomly generate a realizable dataset as described in Assumption 1 with $\lvert V \rvert = 20$. ... We do not use open source data. |
| Dataset Splits | No | The paper describes training on a synthetically generated dataset but does not specify explicit training/validation/test splits for reproduction. The experiment verifies theoretical findings rather than evaluating performance on distinct data subsets. |
| Hardware Specification | Yes | All experiments are conducted on a PC equipped with an i5-12400F processor and 16GB of memory. |
| Software Dependencies | No | The paper mentions no specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or libraries). |
| Experiment Setup | Yes | The parameters are chosen as $d = \lvert V \rvert$, $\eta_0 = 0.2/d$, and $\eta = 0.05/d$. In Figure 2, the first three plots show the dynamics of the training stage 1, which indicates the convergence of the loss $L_0(W_{ov}^{(t)})$ to its minimum value, the convergence of $W_{ov}^{(t)}$ in direction to $W_{ov}$, and the linear increase of the norm $\lVert W_{ov}^{(t)} \rVert$, respectively. These results verify Proposition 1. The last three plots show the dynamics of the training stage 2, which indicates the convergence of the loss $L(\theta^{(t)})$, the convergence of $W_{kq}^{(t)}$ in direction to $W_{kq}$, and the linear increase of the norm $\lVert W_{kq}^{(t)} \rVert$. These results verify Theorem 1 and Theorem 2. |
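
The normalized gradient descent update named in the Pseudocode row, and the linear norm growth reported in the Experiment Setup row, can be illustrated with a minimal sketch. This is not the paper's Algorithm 1 or its transformer model: it assumes a toy linearly separable logistic regression problem, and the names `normalized_gd`, `loss`, `grad`, and `w_star` are hypothetical. The dimension `d = 20` and the step size `0.2 / d` follow the setup as it is written in the table above. The paper presumably applies the same kind of update in two stages, first to $W_{ov}$ with step size $\eta_0$ and then to $W_{kq}$ with step size $\eta$.

```python
import numpy as np

def normalized_gd(grad_fn, w0, lr, steps):
    """Iterate w <- w - lr * g / ||g|| and record ||w|| after every step."""
    w = w0.copy()
    norms = []
    for _ in range(steps):
        g = grad_fn(w)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:  # stationary point: nothing left to normalize
            break
        w = w - lr * g / g_norm  # every step has length exactly lr
        norms.append(np.linalg.norm(w))
    return w, norms

# Toy separable data (an assumption for illustration, not the paper's dataset).
rng = np.random.default_rng(0)
d = 20                      # mirrors d = |V| = 20 from the table
n = 200
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)  # hypothetical ground-truth separator
y = np.sign(X @ w_star)

def loss(w):
    # Mean logistic loss log(1 + exp(-y * <x, w>)), computed stably.
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad(w):
    margins = y * (X @ w)
    s = 0.5 * (1.0 - np.tanh(margins / 2.0))  # equals sigmoid(-margins)
    return -(X * (y * s)[:, None]).mean(axis=0)

lr = 0.2 / d                # step size as written in the table
steps = 500
w, norms = normalized_gd(grad, np.zeros(d), lr, steps)

cosine = float(w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star)))
print("initial loss:", loss(np.zeros(d)))
print("final loss:", loss(w))
print("final norm:", norms[-1], "upper bound lr * steps:", lr * steps)
print("cosine with w_star:", cosine)
```

Because each update has length exactly `lr`, the parameter norm after `t` steps is at most `lr * t`. On separable data the gradient never vanishes, so the norm keeps growing while the loss decreases, which is the qualitative behavior the table describes for both training stages.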