From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
Authors: Muhammed Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study learning a 1-layer self-attention model from a set of prompts and the associated outputs sampled from the model. We first establish a formal link between the self-attention mechanism and Markov models under suitable conditions: Inputting a prompt to the self-attention model samples the output token according to a context-conditioned Markov chain (CCMC). The CCMC is obtained by weighting the transition matrix of a standard Markov chain according to the sufficient statistics of the prompt/context. Building on this formalism, we develop identifiability/coverage conditions for the data distribution that guarantee consistent estimation of the latent model under a teacher-student setting and establish sample complexity guarantees under IID data. Finally, we study the problem of learning from a single output trajectory generated in response to an initial prompt. We characterize a winner-takes-all phenomenon where the generative process of self-attention evolves to sampling from a small set of winner tokens that dominate the context window. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text. |
| Researcher Affiliation | Collaboration | ¹University of Michigan, Ann Arbor, USA; ²Google Research, NYC, USA. |
| Pseudocode | No | The paper does not contain pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper discusses concepts like 'vocabulary of K tokens' and 'input prompts' and 'outputs sampled from the model' in a theoretical/simulation context but does not specify any publicly available datasets used for training. |
| Dataset Splits | No | The paper does not specify training/validation/test dataset splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models) used for running experiments or simulations. |
| Software Dependencies | No | The paper mentions general models like GPT-2, but does not list specific software dependencies with version numbers needed for reproducibility (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We randomly initialize a one-layer self-attention model and generate a single trajectory of length 500 starting from the initial prompt X₁ = [6]. We track the evolution of token frequency, as shown in the middle panel of Figure 4. As illustrated in the right panel of Figure 4, as generation time increases, the diversity of output tokens is greatly reduced, eventually collapsing to a singleton. This arises from the self-reinforcement of majority tokens within the trajectory. |
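
The CCMC construction quoted in the Research Type row can be made concrete. Below is a minimal Python sketch, assuming the sufficient statistic of the prompt is its vector of token counts and that the base chain's transition row for the last token is reweighted by those counts and renormalized; the function name and setup are illustrative, not the authors' code.

```python
import numpy as np

def ccmc_next_token_distribution(P, prompt):
    """Next-token law of a context-conditioned Markov chain (CCMC).

    Sketch: weight the base chain's transition row for the last prompt
    token by the prompt's empirical token counts (its sufficient
    statistics), then renormalize. Tokens absent from the context thus
    receive zero probability.
    """
    K = P.shape[0]
    counts = np.bincount(prompt, minlength=K)  # sufficient statistics of the prompt
    row = P[prompt[-1]] * counts               # reweight base transitions by context frequency
    return row / row.sum()                     # renormalize to a probability vector

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=5)          # random 5-token base transition matrix
print(ccmc_next_token_distribution(P, prompt=[0, 2, 2, 4]))
```

Restricting the support to context tokens is the mechanism behind the winner-takes-all dynamics quoted in the Experiment Setup row.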
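The single-trajectory experiment in the Experiment Setup row can likewise be approximated. The sketch below assumes one plausible reading of the paper's generative convention: a one-layer self-attention model with the last token as query samples a context position from the softmax attention scores and emits the token at that position. Dimensions, seed, and weight initialization are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 8, 16, 500                       # vocabulary size, embedding dim, trajectory length
E = rng.normal(size=(K, d))                # token embeddings
W = rng.normal(size=(d, d)) / np.sqrt(d)   # random combined attention weights (stand-in for W_K^T W_Q)

def next_token_dist(traj):
    """Attention distribution over context positions, aggregated per token.

    With the last token as query, softmax over attention scores gives a
    distribution over positions; sampling a position outputs the token
    occupying it.
    """
    X = E[traj]                            # (L, d) context embeddings
    scores = X @ W @ E[traj[-1]]           # attention logits for query = last token
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    dist = np.zeros(K)
    np.add.at(dist, traj, probs)           # aggregate position probabilities per token
    return dist

traj = [6]                                 # initial prompt X₁ = [6]
for _ in range(T):
    traj.append(rng.choice(K, p=next_token_dist(traj)))

freq = np.bincount(traj, minlength=K) / len(traj)
print("final token frequencies:", np.round(freq, 3))
```

Tokens that already occupy many context positions receive proportionally more attention mass, so their sampling probability grows with each occurrence; this self-reinforcement is the mechanism behind the reported collapse of the output distribution toward a small set of winner tokens.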