The Expressive Power of Transformers with Chain of Thought
Authors: William Merrill, Ashish Sabharwal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems: the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power. (An illustrative sketch of the regular-language intuition appears after the table.) |
| Researcher Affiliation | Collaboration | William Merrill (New York University, willm@nyu.edu); Ashish Sabharwal (Allen Institute for AI, ashishs@allenai.org) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It presents mathematical definitions, theorems, lemmas, and proofs. |
| Open Source Code | No | The paper does not provide any statements about releasing code or links to source code repositories for the methodology described. |
| Open Datasets | No | This is a theoretical paper; it does not use datasets or run any empirical training. No information on dataset availability is provided. |
| Dataset Splits | No | This is a theoretical paper and does not involve empirical experiments with dataset splits for training, validation, or testing. |
| Hardware Specification | No | This is a theoretical paper focusing on computational power, not practical implementation or empirical experiments. Therefore, no hardware specifications are mentioned. |
| Software Dependencies | No | This is a theoretical paper. No software dependencies or specific version numbers for experimental setup are mentioned. |
| Experiment Setup | No | This is a theoretical paper and does not describe any empirical experiment setup, hyperparameters, or training configurations. |
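
To make the regular-language claim in the abstract concrete, the Python sketch below illustrates the underlying intuition only: with one intermediate token generated per input symbol (a linear number of decoding steps), recognizing a regular language reduces to repeatedly applying a fixed-size transition lookup to the previously generated state token and the next input symbol. This is an assumed toy example (the parity DFA, the token names, and the function names are hypothetical), not the paper's actual transformer construction or proof.

```python
# Minimal sketch (assumed example, not the paper's construction): why a linear
# number of decoding steps suffices to track a finite-state machine. The chain
# of thought writes out the automaton's state after each input symbol, so each
# step conditions only on the previous state token and the next symbol.

# Hypothetical DFA: tracks the parity of 1s in a binary string.
DFA = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START_STATE = "even"
ACCEPTING = {"even"}

def next_state_token(prev_state: str, symbol: str) -> str:
    """One 'decoding step': a constant-size lookup from (previous state, symbol)."""
    return DFA[(prev_state, symbol)]

def recognize_with_chain_of_thought(inp: str):
    """Generate one intermediate state token per input symbol, then answer."""
    chain = [START_STATE]
    for symbol in inp:
        chain.append(next_state_token(chain[-1], symbol))
    answer = "accept" if chain[-1] in ACCEPTING else "reject"
    return chain, answer

if __name__ == "__main__":
    chain, answer = recognize_with_chain_of_thought("10110")
    # Chain: even odd odd even odd odd -> reject (three 1s, odd parity)
    print(" ".join(chain), "->", answer)
```

Without the intermediate tokens, the whole state sequence would have to be resolved in a single pass over the input, which is the regime in which the paper's negative results for standard (no chain of thought) transformers apply.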