Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Chain-of-Thought Reasoning Without Prompting
Authors: Xuezhi Wang, Denny Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding effectively elicits reasoning capabilities from language models, which were previously obscured by standard greedy decoding. |
| Researcher Affiliation | Industry | Xuezhi Wang, Google DeepMind, EMAIL; Denny Zhou, Google DeepMind, EMAIL |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We provide the full details of the experiment settings in both the experiment section and the appendix. We also attach our code in supplemental materials. |
| Open Datasets | No | The paper mentions using established public datasets such as GSM8K, MultiArith, and year parity, but does not explicitly describe the training data splits, only that pre-trained models are used and evaluated on test sets. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or counts for its own experiments. It mentions using established benchmark datasets like GSM8K. |
| Hardware Specification | Yes | For Mistral and Gemma models, we use A100 GPU with 40 GB RAM to run the decoding experiments. ... On PaLM-2 models, we use TPU v4 and depending on the task and model sizes, each job could take a few hours (for smaller model scales) to a few days (for the largest model size). |
| Software Dependencies | No | The paper mentions using the "huggingface library" for Mistral and Gemma models, but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | For all experiments, the default input to the model is the standard QA format of Q: [question]\n A:. ... During decoding, we use k = 10 as default for the alternative top-k tokens at the first decoding position, and continue greedy decoding afterwards. ... we use an input sequence length of 256 and a maximum decoding step of 128 ... the output decoding step is set to 256 ... on math tasks we generate 200 new tokens for the pre-trained model and 400 new tokens for the instruction-tuned model, to make sure the responses do not get truncated in the middle. For the year parity task, we generate 50 new tokens for the pre-trained model and 100 new tokens for the instruction-tuned model. |
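
The decoding procedure quoted in the Experiment Setup row (take the top-k alternative tokens at the first decoding position, then continue greedily) can be illustrated with a minimal sketch. This is not the authors' code: `branch_then_greedy`, `toy_score_fn`, and the `ScoreFn` interface are hypothetical names introduced only for illustration, and the toy scorer stands in for a real language model. Only the branching step described in the table is covered; anything else the paper does with the resulting paths is out of scope here.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface (an assumption, not from the paper): given the token
# ids seen so far, return one score per vocabulary id.
ScoreFn = Callable[[Sequence[int]], List[float]]


def branch_then_greedy(
    score_fn: ScoreFn,
    prompt: Sequence[int],
    k: int = 10,               # default number of first-position alternatives in the table
    max_new_tokens: int = 128,  # assumed to be at least 1
    eos_id: Optional[int] = None,
) -> List[List[int]]:
    """Return up to k continuations, one per top-k token at the first position."""
    first_scores = score_fn(prompt)
    # Rank vocabulary ids by score at the first decoding position and keep the top k.
    top_first = sorted(
        range(len(first_scores)), key=lambda i: first_scores[i], reverse=True
    )[:k]

    paths: List[List[int]] = []
    for first in top_first:
        path = [first]
        # After the first token, decode greedily until the length limit or EOS.
        while len(path) < max_new_tokens and path[-1] != eos_id:
            scores = score_fn(list(prompt) + path)
            path.append(max(range(len(scores)), key=lambda i: scores[i]))
        paths.append(path)
    return paths


def toy_score_fn(prefix: Sequence[int]) -> List[float]:
    """Toy stand-in for a model: prefers (last token + 1) mod 5, then +2, and so on."""
    vocab_size = 5
    target = (prefix[-1] + 1) % vocab_size
    return [-float((i - target) % vocab_size) for i in range(vocab_size)]
```

Calling `branch_then_greedy(toy_score_fn, [0], k=3, max_new_tokens=3)` yields three paths, each starting from a different first token and then following the toy scorer's greedy choice. With a real model, `score_fn` would wrap a forward pass that returns next-token logits.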