Chain-of-Thought Reasoning Without Prompting

Authors: Xuezhi Wang, Denny Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding effectively elicits reasoning capabilities from language models, which were previously obscured by standard greedy decoding.
Researcher Affiliation | Industry | Xuezhi Wang, Google DeepMind, xuezhiw@google.com; Denny Zhou, Google DeepMind, dennyzhou@google.com
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | We provide the full details of the experiment settings in both the experiment section and the appendix. We also attach our code in supplemental materials.
Open Datasets | No | The paper mentions using established public datasets such as GSM8K, MultiArith, and Year Parity, but does not explicitly describe training data splits; it only states that pre-trained models are used and evaluated on test sets.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or counts for its own experiments. It mentions using established benchmark datasets like GSM8K.
Hardware Specification | Yes | For Mistral and Gemma models, we use A100 GPU with 40 GB RAM to run the decoding experiments. ... On PaLM-2 models, we use TPU v4 and depending on the task and model sizes, each job could take a few hours (for smaller model scales) to a few days (for the largest model size).
Software Dependencies | No | The paper mentions using the "huggingface library" for Mistral and Gemma models, but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | For all experiments, the default input to the model is the standard QA format of Q: [question]\n A:. ... During decoding, we use k = 10 as default for the alternative top-k tokens at the first decoding position, and continue greedy decoding afterwards. ... We use an input sequence length of 256 and a maximum decoding step of 128 ... the output decoding step is set to 256 ... On math tasks we generate 200 new tokens for the pre-trained model and 400 new tokens for the instruction-tuned model, to make sure the responses do not get truncated in the middle. For the year parity task, we generate 50 new tokens for the pre-trained model and 100 new tokens for the instruction-tuned model.
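A minimal sketch of the decoding setup quoted in the Experiment Setup row, assuming the Hugging Face transformers API (which the paper reports using for Mistral and Gemma). The checkpoint name, example question, and variable names are illustrative and not taken from the authors' released code; the sketch only shows the branching step (top-k first tokens, then greedy continuation) and omits any downstream path selection.

```python
# Sketch of the described decoding: branch on the top-k tokens at the first
# decoding position, then continue greedily from each branch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "I have 3 apples, my dad has 2 more apples than me, how many apples do we have in total?"
prompt = f"Q: {question}\nA:"  # the standard QA format used in the paper
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

k = 10                 # default number of alternative top-k first tokens
max_new_tokens = 200   # e.g., 200 new tokens for a pre-trained model on math tasks

# Distribution over the first new token after the prompt.
with torch.no_grad():
    first_logits = model(**inputs).logits[0, -1]
topk_ids = torch.topk(first_logits, k).indices

# For each of the k candidate first tokens, decode the rest greedily.
decoding_paths = []
prompt_len = inputs["input_ids"].shape[1]
for token_id in topk_ids:
    branch = torch.cat([inputs["input_ids"], token_id.view(1, 1)], dim=-1)
    out = model.generate(
        branch,
        attention_mask=torch.ones_like(branch),
        do_sample=False,                 # greedy continuation after the branch point
        max_new_tokens=max_new_tokens - 1,
    )
    decoding_paths.append(
        tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)
    )

for i, path in enumerate(decoding_paths):
    print(f"--- path {i} ---\n{path}\n")
```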