The Expressive Power of Transformers with Chain of Thought
Authors: William Merrill, Ashish Sabharwal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems: the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power. (An illustrative sketch of the regular-language intuition appears after the table.) |
| Researcher Affiliation | Collaboration | William Merrill (New York University, willm@nyu.edu); Ashish Sabharwal (Allen Institute for AI, ashishs@allenai.org) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It presents mathematical definitions, theorems, lemmas, and proofs. |
| Open Source Code | No | The paper does not provide any statements about releasing code or links to source code repositories for the methodology described. |
| Open Datasets | No | This is a theoretical paper; it does not use datasets or run any empirical training. No information on dataset availability is provided. |
| Dataset Splits | No | This is a theoretical paper and does not involve empirical experiments with dataset splits for training, validation, or testing. |
| Hardware Specification | No | This is a theoretical paper focusing on computational power, not practical implementation or empirical experiments. Therefore, no hardware specifications are mentioned. |
| Software Dependencies | No | This is a theoretical paper. No software dependencies or specific version numbers for experimental setup are mentioned. |
| Experiment Setup | No | This is a theoretical paper and does not describe any empirical experiment setup, hyperparameters, or training configurations. |
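
To make the regular-language claim in the abstract concrete, the Python sketch below illustrates the underlying intuition only: with one intermediate token generated per input symbol (a linear number of decoding steps), recognizing a regular language reduces to repeatedly applying a fixed-size transition lookup to the previously generated state token and the next input symbol. This is an assumed toy example (the parity DFA, the token names, and the function names are hypothetical), not the paper's actual transformer construction or proof.

```python
# Minimal sketch (assumed example, not the paper's construction): why a linear
# number of decoding steps suffices to track a finite-state machine. The chain
# of thought writes out the automaton's state after each input symbol, so each
# step conditions only on the previous state token and the next symbol.

# Hypothetical DFA: tracks the parity of 1s in a binary string.
DFA = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START_STATE = "even"
ACCEPTING = {"even"}

def next_state_token(prev_state: str, symbol: str) -> str:
    """One 'decoding step': a constant-size lookup from (previous state, symbol)."""
    return DFA[(prev_state, symbol)]

def recognize_with_chain_of_thought(inp: str):
    """Generate one intermediate state token per input symbol, then answer."""
    chain = [START_STATE]
    for symbol in inp:
        chain.append(next_state_token(chain[-1], symbol))
    answer = "accept" if chain[-1] in ACCEPTING else "reject"
    return chain, answer

if __name__ == "__main__":
    chain, answer = recognize_with_chain_of_thought("10110")
    # Chain: even odd odd even odd odd -> reject (three 1s, odd parity)
    print(" ".join(chain), "->", answer)
```

Without the intermediate tokens, the whole state sequence would have to be resolved in a single pass over the input, which is the regime in which the paper's negative results for standard (no chain of thought) transformers apply.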