Chain-of-Thought Reasoning Without Prompting
Authors: Xuezhi Wang, Denny Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding effectively elicits reasoning capabilities from language models, which were previously obscured by standard greedy decoding. |
| Researcher Affiliation | Industry | Xuezhi Wang Google DeepMind xuezhiw@google.com Denny Zhou Google DeepMind dennyzhou@google.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We provide the full details of the experiment settings in both the experiment section and the appendix. We also attach our code in supplemental materials. |
| Open Datasets | No | The paper mentions using established public datasets such as GSM8K, MultiArith, and Year Parity, but does not explicitly describe the training data splits, only that pre-trained models are used and evaluated on test sets. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or counts for its own experiments. It mentions using established benchmark datasets like GSM8K. |
| Hardware Specification | Yes | For Mistral and Gemma models, we use A100 GPU with 40 GB RAM to run the decoding experiments. ...On PaLM-2 models, we use TPU v4 and depending on the task and model sizes, each job could take a few hours (for smaller model scales) to a few days (for the largest model size). |
| Software Dependencies | No | The paper mentions using the "huggingface library" for Mistral and Gemma models, but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | For all experiments, the default input to the model is the standard QA format of Q: [question]\n A:. ...During decoding, we use k = 10 as default for the alternative top-k tokens at the first decoding position, and continue greedy decoding afterwards. ...we use an input sequence length of 256 and a maximum decoding step of 128...the output decoding step is set to 256...on math tasks we generate 200 new tokens for the pre-trained model and 400 new tokens for the instruction-tuned model, to make sure the responses do not get truncated in the middle. For the year parity task, we generate 50 new tokens for the pre-trained model and 100 new tokens for the instruction-tuned model. |
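
The experiment setup row describes the core decoding procedure: instead of committing to the single greedy token at the first decoding position, the model branches over the top-k (k = 10 by default) candidate first tokens, and each branch is then continued with greedy decoding. Below is a minimal sketch of that branching step using the Hugging Face `transformers` API, which the paper mentions using for Mistral and Gemma. The model name, prompt template, and token budget are illustrative placeholders drawn from the table, not the authors' released code (which is stated to be in the supplemental materials), and the paper's additional step of ranking the resulting paths by answer-token confidence is omitted here.

```python
# Sketch of CoT-decoding's branching step: top-k alternatives at the first
# decoding position, greedy continuation afterwards. Model name, prompt
# format, and token budget are assumptions taken from the table above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder: any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def cot_decode(question: str, k: int = 10, max_new_tokens: int = 200):
    """Return k continuations, one per top-k first token, each decoded greedily."""
    prompt = f"Q: {question}\nA:"  # standard QA format from the paper
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Logits for the first decoding position only.
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    top_k_ids = torch.topk(next_token_logits, k).indices  # k alternative first tokens

    branches = []
    prompt_len = inputs["input_ids"].shape[-1]
    for tok_id in top_k_ids:
        # Force one alternative first token, then greedy-decode the rest.
        branch_ids = torch.cat([inputs["input_ids"], tok_id.view(1, 1)], dim=-1)
        out = model.generate(
            branch_ids,
            do_sample=False,                      # greedy continuation
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,  # silence missing-pad warning
        )
        # Decode everything after the prompt (alternative first token included).
        branches.append(
            tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)
        )
    return branches
```

A usage call such as `cot_decode("What is 15 + 27?")` would return ten candidate continuations; the paper's full method then selects among these paths by the model's confidence on the answer tokens rather than treating all branches equally.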