Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Authors: Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers. |
| Researcher Affiliation | Collaboration | Zhiyuan Li (TTIC & Stanford University, zhiyuanli@ttic.edu); Hong Liu (Stanford University, hliu99@stanford.edu); Denny Zhou (Google DeepMind, dennyzhou@google.com); Tengyu Ma (Stanford University, tengyuma@stanford.edu) |
| Pseudocode | Yes | Algorithm 1 Decoder-only Transformer, TFθ and pθ |
| Open Source Code | Yes | We use the nanoGPT codebase (https://github.com/karpathy/nanoGPT) for language modeling. |
| Open Datasets | No | The paper states "We train transformers to solve these tasks with a large amount of synthetic data" and "At each step, we sample a batch of training data from a distribution pn(x)", indicating that the training data is generated synthetically on the fly rather than sourced from a publicly available fixed dataset with documented access information. |
| Dataset Splits | No | Since we train transformers using fresh sampled synthetic data each step, the training accuracy/loss is just the same as validation accuracy/loss. This indicates an online supervised setting without explicit, fixed dataset splits. |
| Hardware Specification | No | The paper mentions "use float16" for training but does not provide specific hardware details such as GPU/CPU models, memory, or computing environments. |
| Software Dependencies | No | The paper mentions using "Adam" and the "nanoGPT codebase" but does not specify version numbers for these software components or other libraries/frameworks. |
| Experiment Setup | Yes | For all settings we use Adam with 10⁻⁵ learning rate, 0 weight decay, β1 = 0.9, β2 = 0.95, and gradient clipping with threshold equal to 1.0. The total training budget is 10⁶ steps and we use a linear warmup in the first 2000 steps starting from 10⁻⁶. For each step, we use a fresh sampled batch of size 64 from population distribution. We turn off dropout and use float16. We vary the depth of the transformer for different settings while the embedding size and the number of attention heads are fixed to be 512 and 8 respectively. |
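The composition-of-permutation-groups task cited in the Research Type row is an inherently serial problem: each prefix product depends on the one before it. The sketch below is a hypothetical Python data generator illustrating how such synthetic examples could be produced, with and without a chain-of-thought target; the group size, token format, and the `make_example` helper are assumptions for illustration, not the authors' released code.

```python
import random

# Hypothetical generator for a permutation-composition example.
# Task: given permutations p1, ..., pk of {0, ..., n-1}, predict their
# composition. With chain of thought, the target spells out every running
# prefix product, so the serial computation unfolds one step per emitted block.

N = 5  # size of the permuted set (assumption; the paper uses small symmetric groups)

def random_perm(n=N):
    p = list(range(n))
    random.shuffle(p)
    return p

def compose(p, q):
    """Return the permutation p∘q, i.e. (p∘q)(i) = p[q[i]]."""
    return [p[q[i]] for i in range(len(p))]

def make_example(k, use_cot=True):
    """Build one (input, target) string pair from k random permutations."""
    perms = [random_perm() for _ in range(k)]
    prompt = " , ".join(" ".join(map(str, p)) for p in perms) + " ="
    acc = perms[0]
    steps = []
    for p in perms[1:]:
        acc = compose(p, acc)          # fold in the next permutation
        steps.append(" ".join(map(str, acc)))
    if use_cot:
        target = " ; ".join(steps)     # emit every intermediate product
    else:
        target = steps[-1] if steps else " ".join(map(str, acc))  # final answer only
    return prompt, target

if __name__ == "__main__":
    x, y = make_example(k=4)
    print("input :", x)
    print("target:", y)
```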
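The Experiment Setup and Dataset Splits rows together describe an online training regime: a fresh synthetic batch every step, Adam at 10⁻⁵ with linear warmup, gradient clipping at 1.0, and float16. The following PyTorch sketch wires those reported numbers into a minimal loop as a reproduction aid; `model` and `sample_batch` are hypothetical placeholders, and the autocast/GradScaler pairing is one common way to realize "use float16", assumed here rather than taken from the authors' implementation.

```python
import torch

# Hyperparameters as reported in the paper's experiment setup.
LR          = 1e-5       # Adam learning rate
WARMUP      = 2_000      # linear warmup steps, starting from 1e-6
TOTAL_STEPS = 1_000_000  # total training budget
BATCH_SIZE  = 64         # fresh batch sampled every step
CLIP_NORM   = 1.0        # gradient clipping threshold

def lr_at(step):
    """Linear warmup from 1e-6 to LR over the first WARMUP steps, then constant."""
    if step < WARMUP:
        return 1e-6 + (LR - 1e-6) * step / WARMUP
    return LR

def train(model, sample_batch, device="cuda"):
    # sample_batch(BATCH_SIZE) is a hypothetical hook returning (inputs, targets)
    # drawn fresh from the synthetic population distribution each step.
    opt = torch.optim.Adam(model.parameters(), lr=LR,
                           betas=(0.9, 0.95), weight_decay=0.0)
    scaler = torch.cuda.amp.GradScaler()  # float16 training (assumed mixed precision)
    for step in range(TOTAL_STEPS):
        for g in opt.param_groups:
            g["lr"] = lr_at(step)
        x, y = sample_batch(BATCH_SIZE)
        x, y = x.to(device), y.to(device)
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(x, y)            # assumed to return the training loss
        opt.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()
        scaler.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        scaler.step(opt)
        scaler.update()
```

Because every batch is freshly sampled, there is no separate validation loop here, matching the paper's note that training and validation accuracy coincide in this online setting.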