On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

Authors: Chenyu Zheng, Wei Huang, Rongzhen Wang, Guoqiang Wu, Jun Zhu, Chongxuan LI

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, under a certain condition on the data distribution, we prove that an autoregressively trained transformer learns W by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned Ŵ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data conditions, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of the data is the sufficient and necessary condition for the learned mesa-optimizer to recover the distribution. Besides, we conduct exploratory analyses beyond the first data condition and prove that, generally, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, our simulation results verify the theoretical results, and the code is available at https://github.com/ML-GSAI/MesaOpt-AR-Transformer. (A minimal sketch of the one-step mesa-optimizer described here appears after this table.)
Researcher Affiliation | Academia | (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Beijing Key Laboratory of Big Data Management and Analysis Methods; (3) RIKEN AIP; (4) School of Software, Shandong University; (5) Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University. {cyzheng,wangrz,chongxuanli}@ruc.edu.cn; wei.huang.vr@riken.jp; guoqiangwu@sdu.edu.cn; dcszj@mail.tsinghua.edu.cn
Pseudocode | No | The paper states lemmas and theorems and outlines proof ideas in Section 5, but it does not present any pseudocode or algorithm blocks.
Open Source Code | Yes | Finally, our simulation results verify the theoretical results, and the code is available at https://github.com/ML-GSAI/MesaOpt-AR-Transformer.
Open Datasets | No | The training and test data are generated synthetically by the authors rather than drawn from a released public dataset: "In terms of the train set, we generate 10k sequences with T_tr = 100 and d = 5. In addition, we generate another test set with 10k sequences of the same shape."
Dataset Splits | No | The paper states, "we generate 10k sequences with T_tr = 100 and d = 5. In addition, we generate another test set with 10k sequences of the same shape." It defines train and test sets but does not explicitly mention a separate validation set or how data was split for validation purposes.
Hardware Specification | Yes | All experiments are done on a single GeForce RTX 3090 GPU in one hour.
Software Dependencies | No | The paper mentions running simulations and provides step sizes in Appendix B.2, but it does not list any specific software dependencies (libraries, frameworks, or solvers) with version numbers.
Experiment Setup | Yes | In terms of the train set, we generate 10k sequences with T_tr = 100 and d = 5. In addition, we generate another test set with 10k sequences of the same shape. We train for 200 epochs with vanilla gradient descent, with different diagonal initializations of (a_0, b_0): (0.1, 0.1), (0.5, 1.5), and (2, 2). The detailed configurations (e.g., step size) and results of different experiments can be found in Appendix B. (A hedged data-generation sketch also follows this table.)
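
As a reading aid for the Research Type row above, the following is a minimal sketch of the mesa-optimization mechanism the paper describes: one step of gradient descent on an in-context OLS objective, with the resulting Ŵ used for next-token prediction. The AR(1) data model, the zero initialization, the noise scale, and the step size eta are illustrative assumptions for this sketch, not the authors' exact construction or code.

```python
# Hedged sketch of the mesa-optimization hypothesis: one gradient-descent step
# on an in-context OLS objective, then next-token prediction with W_hat.
# The AR(1) data model, zero init, noise scale, and step size are assumptions.
import numpy as np

def generate_ar_sequence(W, T, rng, noise_std=0.1):
    """Sample x_1, ..., x_T from x_{t+1} = W x_t + noise (assumed data model)."""
    d = W.shape[0]
    xs = [rng.standard_normal(d)]
    for _ in range(T - 1):
        xs.append(W @ xs[-1] + noise_std * rng.standard_normal(d))
    return np.stack(xs)  # shape (T, d)

def mesa_one_step_prediction(xs, eta=0.01):
    """One GD step from W = 0 on the in-context OLS loss
    L(W) = 0.5 * sum_t ||x_{t+1} - W x_t||^2, then predict the next token."""
    X_prev, X_next = xs[:-1], xs[1:]
    grad_at_zero = -(X_next.T @ X_prev)   # dL/dW evaluated at W = 0
    W_hat = -eta * grad_at_zero           # W_hat = eta * sum_t x_{t+1} x_t^T
    return W_hat @ xs[-1]                 # next-token prediction W_hat x_T

rng = np.random.default_rng(0)
d = 5
W_true = 0.5 * np.eye(d)                  # illustrative transition matrix
xs = generate_ar_sequence(W_true, T=100, rng=rng)
print(mesa_one_step_prediction(xs))
```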
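
Similarly, the data setup quoted in the Experiment Setup row (10k training and 10k test sequences with T_tr = 100 and d = 5) can be sketched as below. The first-order AR sampling, the transition matrix, the noise scale, and the seeds are assumptions made for illustration; the paper's exact configuration (e.g., step sizes) is deferred to its Appendix B.

```python
# Hedged sketch of the quoted data setup: 10k train and 10k test sequences,
# each of length T_tr = 100 with dimension d = 5. The AR(1) sampling, the
# transition matrix, the noise scale, and the seeds are illustrative choices.
import numpy as np

def make_dataset(n_seq=10_000, T=100, d=5, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.5 * np.eye(d)                   # illustrative transition matrix
    data = np.empty((n_seq, T, d))
    data[:, 0] = rng.standard_normal((n_seq, d))
    for t in range(1, T):
        data[:, t] = data[:, t - 1] @ W.T + noise_std * rng.standard_normal((n_seq, d))
    return data

train_set = make_dataset(seed=0)          # shape (10000, 100, 5)
test_set = make_dataset(seed=1)           # same shape, independent draw
print(train_set.shape, test_set.shape)
```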