On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

Authors: Chenyu Zheng, Wei Huang, Rongzhen Wang, Guoqiang Wu, Jun Zhu, Chongxuan LI

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, under a certain condition on the data distribution, we prove that an autoregressively trained transformer learns W by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned Ŵ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data conditions, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of the data is the sufficient and necessary condition for the learned mesa-optimizer to recover the distribution. Besides, we conduct exploratory analyses beyond the first data condition and prove that, generally, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, our simulation results verify the theoretical results, and the code is available at https://github.com/ML-GSAI/MesaOpt-AR-Transformer. (A minimal sketch of the one-step mesa-optimizer described here appears after this table.)
Researcher Affiliation | Academia | (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Beijing Key Laboratory of Big Data Management and Analysis Methods; (3) RIKEN AIP; (4) School of Software, Shandong University; (5) Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University. {cyzheng,wangrz,chongxuanli}@ruc.edu.cn; wei.huang.vr@riken.jp; guoqiangwu@sdu.edu.cn; dcszj@mail.tsinghua.edu.cn
Pseudocode | No | The paper states lemmas and theorems and outlines proof ideas in Section 5, but it does not present any pseudocode or algorithm blocks.
Open Source Code | Yes | Finally, our simulation results verify the theoretical results, and the code is available at https://github.com/ML-GSAI/MesaOpt-AR-Transformer.
Open Datasets | No | The training and test data are generated synthetically by the authors rather than drawn from a released public dataset: "In terms of the train set, we generate 10k sequences with T_tr = 100 and d = 5. In addition, we generate another test set with 10k sequences of the same shape."
Dataset Splits | No | The paper states, "we generate 10k sequences with T_tr = 100 and d = 5. In addition, we generate another test set with 10k sequences of the same shape." It defines train and test sets but does not explicitly mention a separate validation set or how data was split for validation purposes.
Hardware Specification | Yes | All experiments are done on a single GeForce RTX 3090 GPU in one hour.
Software Dependencies | No | The paper mentions running simulations and provides step sizes in Appendix B.2, but it does not list any specific software dependencies (libraries, frameworks, or solvers) with version numbers.
Experiment Setup | Yes | In terms of the train set, we generate 10k sequences with T_tr = 100 and d = 5. In addition, we generate another test set with 10k sequences of the same shape. We train for 200 epochs with vanilla gradient descent, with different diagonal initializations of (a_0, b_0): (0.1, 0.1), (0.5, 1.5), and (2, 2). The detailed configurations (e.g., step size) and results of different experiments can be found in Appendix B. (A hedged data-generation sketch also follows this table.)
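
As a reading aid for the Research Type row above, the following is a minimal sketch of the mesa-optimization mechanism the paper describes: one step of gradient descent on an in-context OLS objective, with the resulting Ŵ used for next-token prediction. The AR(1) data model, the zero initialization, the noise scale, and the step size eta are illustrative assumptions for this sketch, not the authors' exact construction or code.

```python
# Hedged sketch of the mesa-optimization hypothesis: one gradient-descent step
# on an in-context OLS objective, then next-token prediction with W_hat.
# The AR(1) data model, zero init, noise scale, and step size are assumptions.
import numpy as np

def generate_ar_sequence(W, T, rng, noise_std=0.1):
    """Sample x_1, ..., x_T from x_{t+1} = W x_t + noise (assumed data model)."""
    d = W.shape[0]
    xs = [rng.standard_normal(d)]
    for _ in range(T - 1):
        xs.append(W @ xs[-1] + noise_std * rng.standard_normal(d))
    return np.stack(xs)  # shape (T, d)

def mesa_one_step_prediction(xs, eta=0.01):
    """One GD step from W = 0 on the in-context OLS loss
    L(W) = 0.5 * sum_t ||x_{t+1} - W x_t||^2, then predict the next token."""
    X_prev, X_next = xs[:-1], xs[1:]
    grad_at_zero = -(X_next.T @ X_prev)   # dL/dW evaluated at W = 0
    W_hat = -eta * grad_at_zero           # W_hat = eta * sum_t x_{t+1} x_t^T
    return W_hat @ xs[-1]                 # next-token prediction W_hat x_T

rng = np.random.default_rng(0)
d = 5
W_true = 0.5 * np.eye(d)                  # illustrative transition matrix
xs = generate_ar_sequence(W_true, T=100, rng=rng)
print(mesa_one_step_prediction(xs))
```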
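
Similarly, the data setup quoted in the Experiment Setup row (10k training and 10k test sequences with T_tr = 100 and d = 5) can be sketched as below. The first-order AR sampling, the transition matrix, the noise scale, and the seeds are assumptions made for illustration; the paper's exact configuration (e.g., step sizes) is deferred to its Appendix B.

```python
# Hedged sketch of the quoted data setup: 10k train and 10k test sequences,
# each of length T_tr = 100 with dimension d = 5. The AR(1) sampling, the
# transition matrix, the noise scale, and the seeds are illustrative choices.
import numpy as np

def make_dataset(n_seq=10_000, T=100, d=5, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.5 * np.eye(d)                   # illustrative transition matrix
    data = np.empty((n_seq, T, d))
    data[:, 0] = rng.standard_normal((n_seq, d))
    for t in range(1, T):
        data[:, t] = data[:, t - 1] @ W.T + noise_std * rng.standard_normal((n_seq, d))
    return data

train_set = make_dataset(seed=0)          # shape (10000, 100, 5)
test_set = make_dataset(seed=1)           # same shape, independent draw
print(train_set.shape, test_set.shape)
```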