Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Authors: Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers. |
| Researcher Affiliation | Collaboration | Zhiyuan Li (TTIC & Stanford University, zhiyuanli@ttic.edu); Hong Liu (Stanford University, hliu99@stanford.edu); Denny Zhou (Google DeepMind, dennyzhou@google.com); Tengyu Ma (Stanford University, tengyuma@stanford.edu) |
| Pseudocode | Yes | Algorithm 1 Decoder-only Transformer, TFθ and pθ |
| Open Source Code | Yes | We use the nanoGPT codebase (https://github.com/karpathy/nanoGPT) for language modeling. |
| Open Datasets | No | The paper states "We train transformers to solve these tasks with a large amount of synthetic data" and "At each step, we sample a batch of training data from a distribution pn(x)", indicating that the training data is generated synthetically on the fly rather than sourced from a publicly available fixed dataset with documented access information. |
| Dataset Splits | No | Since we train transformers using fresh sampled synthetic data each step, the training accuracy/loss is just the same as validation accuracy/loss. This indicates an online supervised setting without explicit, fixed dataset splits. |
| Hardware Specification | No | The paper mentions "use float16" for training but does not provide specific hardware details such as GPU/CPU models, memory, or computing environments. |
| Software Dependencies | No | The paper mentions using "Adam" and the "nanoGPT codebase" but does not specify version numbers for these software components or other libraries/frameworks. |
| Experiment Setup | Yes | For all settings we use Adam with 10⁻⁵ learning rate, 0 weight decay, β1 = 0.9, β2 = 0.95, and gradient clipping with threshold equal to 1.0. The total training budget is 10⁶ steps and we use a linear warmup in the first 2000 steps starting from 10⁻⁶. For each step, we use a fresh sampled batch of size 64 from population distribution. We turn off dropout and use float16. We vary the depth of the transformer for different settings while the embedding size and the number of attention heads are fixed to be 512 and 8 respectively. |
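The composition-of-permutation-groups task cited in the Research Type row is an inherently serial problem: each prefix product depends on the one before it. The sketch below is a hypothetical Python data generator illustrating how such synthetic examples could be produced, with and without a chain-of-thought target; the group size, token format, and the `make_example` helper are assumptions for illustration, not the authors' released code.

```python
import random

# Hypothetical generator for a permutation-composition example.
# Task: given permutations p1, ..., pk of {0, ..., n-1}, predict their
# composition. With chain of thought, the target spells out every running
# prefix product, so the serial computation unfolds one step per emitted block.

N = 5  # size of the permuted set (assumption; the paper uses small symmetric groups)

def random_perm(n=N):
    p = list(range(n))
    random.shuffle(p)
    return p

def compose(p, q):
    """Return the permutation p∘q, i.e. (p∘q)(i) = p[q[i]]."""
    return [p[q[i]] for i in range(len(p))]

def make_example(k, use_cot=True):
    """Build one (input, target) string pair from k random permutations."""
    perms = [random_perm() for _ in range(k)]
    prompt = " , ".join(" ".join(map(str, p)) for p in perms) + " ="
    acc = perms[0]
    steps = []
    for p in perms[1:]:
        acc = compose(p, acc)          # fold in the next permutation
        steps.append(" ".join(map(str, acc)))
    if use_cot:
        target = " ; ".join(steps)     # emit every intermediate product
    else:
        target = steps[-1] if steps else " ".join(map(str, acc))  # final answer only
    return prompt, target

if __name__ == "__main__":
    x, y = make_example(k=4)
    print("input :", x)
    print("target:", y)
```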
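The Experiment Setup and Dataset Splits rows together describe an online training regime: a fresh synthetic batch every step, Adam at 10⁻⁵ with linear warmup, gradient clipping at 1.0, and float16. The following PyTorch sketch wires those reported numbers into a minimal loop as a reproduction aid; `model` and `sample_batch` are hypothetical placeholders, and the autocast/GradScaler pairing is one common way to realize "use float16", assumed here rather than taken from the authors' implementation.

```python
import torch

# Hyperparameters as reported in the paper's experiment setup.
LR          = 1e-5       # Adam learning rate
WARMUP      = 2_000      # linear warmup steps, starting from 1e-6
TOTAL_STEPS = 1_000_000  # total training budget
BATCH_SIZE  = 64         # fresh batch sampled every step
CLIP_NORM   = 1.0        # gradient clipping threshold

def lr_at(step):
    """Linear warmup from 1e-6 to LR over the first WARMUP steps, then constant."""
    if step < WARMUP:
        return 1e-6 + (LR - 1e-6) * step / WARMUP
    return LR

def train(model, sample_batch, device="cuda"):
    # sample_batch(BATCH_SIZE) is a hypothetical hook returning (inputs, targets)
    # drawn fresh from the synthetic population distribution each step.
    opt = torch.optim.Adam(model.parameters(), lr=LR,
                           betas=(0.9, 0.95), weight_decay=0.0)
    scaler = torch.cuda.amp.GradScaler()  # float16 training (assumed mixed precision)
    for step in range(TOTAL_STEPS):
        for g in opt.param_groups:
            g["lr"] = lr_at(step)
        x, y = sample_batch(BATCH_SIZE)
        x, y = x.to(device), y.to(device)
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(x, y)            # assumed to return the training loss
        opt.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()
        scaler.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        scaler.step(opt)
        scaler.update()
```

Because every batch is freshly sampled, there is no separate validation loop here, matching the paper's note that training and validation accuracy coincide in this online setting.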