The Expressive Power of Transformers with Chain of Thought

Authors: William Merrill, Ashish Sabharwal

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., to generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems, the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
Researcher Affiliation | Collaboration | William Merrill (New York University, willm@nyu.edu); Ashish Sabharwal (Allen Institute for AI, ashishs@allenai.org)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It presents mathematical definitions, theorems, lemmas, and proofs.
Open Source Code | No | The paper does not provide any statements about releasing code or links to source code repositories for the methodology described.
Open Datasets | No | This is a theoretical paper, and thus it does not use datasets or conduct empirical training processes. No information on dataset availability is provided.
Dataset Splits | No | This is a theoretical paper and does not involve empirical experiments with dataset splits for training, validation, or testing.
Hardware Specification | No | This is a theoretical paper focusing on computational power, not practical implementation or empirical experiments. Therefore, no hardware specifications are mentioned.
Software Dependencies | No | This is a theoretical paper. No software dependencies or specific version numbers for an experimental setup are mentioned.
Experiment Setup | No | This is a theoretical paper and does not describe any empirical experiment setup, hyperparameters, or training configurations.
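To make the abstract's central claim concrete, the sketch below illustrates (in ordinary Python, not as anything from the paper) why a linear number of intermediate decoding steps suffices to recognize a regular language: the "chain of thought" is just the sequence of finite-automaton states visited, one per input symbol, so each step depends only on the previous state and the next symbol. The DFA, the parity-of-1s example, and the function names are illustrative assumptions, and this is a conceptual analogy rather than a transformer construction.

```python
# Conceptual sketch (assumed example, not from the paper): a linear-length
# "chain of thought" for recognizing a regular language. The scratchpad is
# the list of DFA states after each input symbol, so every decoding step
# needs only the previous state and the current symbol.

# Regular language: binary strings with an even number of 1s (parity).
DFA = {
    "start": "even",
    "accept": {"even"},
    "delta": {
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    },
}

def chain_of_thought_trace(word: str) -> list[str]:
    """Return the scratchpad: one DFA state per input symbol (linear length)."""
    state = DFA["start"]
    trace = [state]
    for symbol in word:
        state = DFA["delta"][(state, symbol)]
        trace.append(state)  # each "decoding step" writes one intermediate token
    return trace

def accepts(word: str) -> bool:
    """The final answer is read off the last token of the chain of thought."""
    return chain_of_thought_trace(word)[-1] in DFA["accept"]

if __name__ == "__main__":
    print(chain_of_thought_trace("10110"))  # ['even', 'odd', 'odd', 'even', 'odd', 'odd']
    print(accepts("10110"))                 # False (three 1s)
    print(accepts("1001"))                  # True  (two 1s)
```

The point of the sketch is only that the scratchpad grows linearly with the input while each step is locally computable, which is the regime the paper's linear-steps result addresses; answering immediately, without any intermediate tokens, is the setting in which such languages are provably out of reach for standard transformers.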