Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective

Authors: Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, Liwei Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical findings are complemented by an extensive set of experiments. We consider the two aforementioned math tasks plus two celebrated DP problems listed in the Introduction to Algorithms book [17], known as longest increasing subsequence (LIS) and edit distance (ED). For all these tasks, our experimental results show that directly predicting the answers without CoT always fails (accuracy mostly below 60%). In contrast, autoregressive Transformers equipped with CoT can learn entire solutions given sufficient training demonstrations.
Researcher Affiliation | Collaboration | ¹School of EECS, Peking University; ²National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; ³Yuanpei College, Peking University; ⁴Center for Machine Learning Research, Peking University; ⁵Pazhou Lab
Pseudocode | Yes | All arithmetic expression problems are generated according to Algorithm 1. In Algorithm 1, we first create a number that serves as the answer to the problem. We then decompose the number using sampled operators sequentially, serving as the problem, until the maximum number of operators is met. The CoT procedure is precisely defined by reversing this problem generation process. (A hedged sketch of this generation procedure appears below the table.)
Open Source Code | No | The paper states 'We use the minGPT implementation for all the experiments,' referring to a third-party open-source project. However, it does not provide an explicit statement or link for the authors' specific source code or modifications for the work described in this paper.
Open Datasets | No | For each task, we construct three datasets with increasing difficulty... We generate 1M samples for each training dataset and 0.1M for testing while ensuring that duplicate samples between training and testing are removed. More details about the dataset construction can be found in Appendix H.
Dataset Splits | No | We generate 1M samples for each training dataset and 0.1M for testing while ensuring that duplicate samples between training and testing are removed. (A hedged split-construction sketch appears below the table.)
Hardware Specification | Yes | All models are trained on 4 V100 GPUs for 100 epochs.
Software Dependencies | No | The paper mentions 'We use the minGPT implementation for all the experiments' but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | For all experiments, we use standard Transformer models with hidden dimension d = 256, heads H = 4, and different model depths L. We adopt the AdamW optimizer [39] with β1 = 0.9, β2 = 0.999, lr = 10⁻⁴, and weight decay = 0.01 in all experiments. We use a fixed dropout ratio of 0.1 for all experiments to improve generalization. For CoT datasets, we optimize the negative log-likelihood loss on all tokens in the CoT steps and answers. For direct datasets, we optimize the negative log-likelihood loss on answer tokens. All models are trained on 4 V100 GPUs for 100 epochs. During inference, models trained on the direct datasets are required to output the answer directly, and models trained on CoT datasets will generate the whole CoT process token-by-token (using greedy search) until generating the End-of-Sentence token, where the output in the final step is regarded as the answer. We report the accuracy as the evaluation metric. Please refer to Appendix H for more training configuration details. (A hedged training and decoding sketch follows the table.)
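
The generation procedure cited in the Pseudocode row (Algorithm 1 of the paper) can be illustrated with a minimal Python sketch. This is not the authors' code: the operator set, the operand range, and the token format are assumptions, and only the overall idea, expanding a sampled answer backwards into an expression and defining the CoT as the reverse collapse, follows the paper's description.

```python
import random

# Minimal sketch of the generation idea described for Algorithm 1 in the paper.
# The operator set, operand range, and token format below are assumptions,
# not the authors' released implementation.

OPERATORS = ["+", "-", "*"]     # assumed operator set
VALUES = range(0, 10)           # assumed operand range


def generate_problem(num_operators: int):
    """Expand a sampled answer backwards into an arithmetic expression."""
    answer = random.choice(VALUES)
    expression = [answer]        # the expression as a flat token list
    collapse_order = []          # intermediate values, recorded in generation order

    for _ in range(num_operators):
        # Pick an integer leaf and split it into "a op b" whose value equals the leaf.
        leaves = [i for i, tok in enumerate(expression) if isinstance(tok, int)]
        idx = random.choice(leaves)
        value = expression[idx]
        op = random.choice(OPERATORS)
        if op == "+":
            a = random.choice(VALUES)
            b = value - a
        elif op == "-":
            b = random.choice(VALUES)
            a = value + b
        else:                    # "*": keep the split exact with a trivial factorisation
            a, b = 1, value
        expression[idx:idx + 1] = ["(", a, op, b, ")"]
        collapse_order.append(value)

    # The CoT reverses the generation order: it collapses the expression back
    # to the answer one operation at a time.
    cot_steps = list(reversed(collapse_order))
    problem = " ".join(str(tok) for tok in expression)
    return problem, cot_steps, answer


if __name__ == "__main__":
    print(generate_problem(num_operators=3))
```

The design point mirrored here is that generation runs from answer to problem, so every intermediate value needed for the CoT is known by construction.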
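
Building on the hypothetical generate_problem above, the quoted split construction (1M training samples, 0.1M test samples, with train/test duplicates removed) could look roughly as follows; the dedup key (the problem string) and the number of operators are assumptions.

```python
def build_splits(n_train=1_000_000, n_test=100_000, num_operators=4):
    """Sketch of the quoted 1M/0.1M split with train/test deduplication."""
    train = [generate_problem(num_operators) for _ in range(n_train)]
    train_keys = {problem for problem, _, _ in train}

    test = []
    while len(test) < n_test:
        sample = generate_problem(num_operators)
        if sample[0] not in train_keys:   # drop samples duplicated in training
            test.append(sample)
    return train, test
```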
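
The quoted training and inference setup can also be summarised as a hedged PyTorch sketch. The hyperparameters are the ones quoted in the Experiment Setup row; the model interface (model(ids) returning per-position logits), the loss-mask convention, and the decoding length cap are assumptions, since the paper builds on the minGPT implementation rather than releasing its own code.

```python
import torch
import torch.nn.functional as F


def make_optimizer(model):
    # AdamW with the quoted hyperparameters: lr = 10^-4, betas = (0.9, 0.999),
    # weight decay = 0.01.
    return torch.optim.AdamW(
        model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
    )


def sequence_loss(logits, targets, loss_mask):
    """Masked negative log-likelihood.

    For CoT datasets the mask covers all CoT-step and answer tokens;
    for direct datasets it covers only the answer tokens.
    """
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    ).view_as(targets)
    return (nll * loss_mask).sum() / loss_mask.sum().clamp(min=1)


@torch.no_grad()
def greedy_decode(model, prompt_ids, eos_id, max_new_tokens=512):
    """Generate the CoT process token-by-token until the End-of-Sentence token."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                       # assumed to return (B, T, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if (next_id == eos_id).all():
            break
    return ids
```

For direct-answer models the same loss function applies with a mask over answer tokens only, and inference reduces to reading off the predicted answer without intermediate steps.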