MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
Authors: Zayne Rea Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning. |
| Researcher Affiliation | Academia | Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett; Department of Computer Science, The University of Texas at Austin; zayne@utexas.edu |
| Pseudocode | Yes | This process is shown in Algorithm 1 in Appendix G.1. ... Algorithm 1 The recursive reasoning tree expansion algorithm. |
| Open Source Code | Yes | Project website can be found at https://github.com/Zayne-Sprague/MuSR |
| Open Datasets | Yes | Project website can be found at https://github.com/Zayne-Sprague/MuSR |
| Dataset Splits | No | The paper creates a new dataset and evaluates LLMs on it, but it does not specify explicit training/validation/test splits (e.g., percentages or counts) for reproducibility of the LLM experiments. The LLMs are evaluated zero-shot or few-shot on the generated examples. |
| Hardware Specification | Yes | For all local models, we ran them on a machine with 4x NVIDIA RTX A6000 48GB cards. |
| Software Dependencies | No | The paper mentions software like 'gpt-4', 'gpt-3.5-turbo chat endpoints', 'Llama2 and Vicuna models', 'Hugging Face's implementations (Wolf et al., 2020)', and 'bitsandbytes quantization (Dettmers et al., 2022)', but does not provide specific version numbers for Hugging Face or bitsandbytes, only citations to the general tools. |
| Experiment Setup | No | For the dataset creation, it states 'Temperature and top-p are set to 1.0.' for GPT-4. For model evaluation, it states 'No changes to the inference parameters were made when sampling from the models,' but does not explicitly list the specific values of these default parameters for reproduction. |
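
The two "No" rows for Software Dependencies and Experiment Setup both come down to values that were not written down. As a rough illustration of how such values could be captured alongside a run, the sketch below records the sampling settings quoted above for dataset creation together with whatever package versions are installed locally. It is not taken from the paper or its repository: the function names, the record layout, and the choice of the `transformers` and `bitsandbytes` distributions as the packages to check are all assumptions made for this example.

```python
from importlib import metadata

# Sampling settings quoted in the table for GPT-4 during dataset creation.
DATASET_CREATION_SAMPLING = {"temperature": 1.0, "top_p": 1.0}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_run_record(packages=("transformers", "bitsandbytes")):
    """Hypothetical helper: bundle sampling settings and package versions.

    The default package names are an assumption about which distributions
    back the 'Hugging Face' and 'bitsandbytes' mentions in the table.
    """
    return {
        "dataset_creation_sampling": dict(DATASET_CREATION_SAMPLING),
        "package_versions": installed_versions(packages),
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_run_record(), indent=2))
```

Saving a record like this next to each evaluation run would give the version numbers and parameter values that the table notes are absent. Default inference parameters for the evaluated models would still need to be read from the model or API configuration in use, since the source does not state them.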