Instance-adaptive Zero-shot Chain-of-Thought Prompting
Authors: Xiaosong Yuan, Chen Shen, Shaotian Yan, Xiaofeng Zhang, Liang Xie, Wenxiao Wang, Renchu Guan, Ying Wang, Jieping Ye
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted with LLaMA-2, LLaMA-3, and Qwen on math, logic, and commonsense reasoning tasks (e.g., GSM8k, MMLU, Causal Judgement) obtain consistent improvements, demonstrating that instance-adaptive zero-shot CoT prompting outperforms task-level methods that rely on curated prompts or sophisticated procedures, and showing the significance of the findings on the zero-shot CoT reasoning mechanism. |
| Researcher Affiliation | Collaboration | 1 College of Computer Science and Technology, Jilin University; 2 Key Laboratory of Symbolic Computation and Knowledge Engineering, MoE, Jilin University; 3 Alibaba Cloud Computing; 4 Shanghai Jiao Tong University; 5 College of Computer Science and Technology, Zhejiang University of Technology; 6 College of Software, Zhejiang University |
| Pseudocode | No | The paper describes algorithms (Sequential Substitution and Majority Vote) in text but does not provide them in pseudocode or a clearly labeled algorithm block; hedged sketches of both procedures are given after this table. |
| Open Source Code | No | We will release the codes and data as soon as possible. |
| Open Datasets | Yes | GSM8k [17] is a challenging dataset for assessing the capability of language models in multi-step math reasoning. SVAMP [29] is presented for one-step math reasoning, which is easier than GSM8k. CommonsenseQA [30] is designed to evaluate a model's capacity for commonsense reasoning with questions that demand commonsense knowledge. MMLU [31] assesses a model's multi-task abilities across natural language inference, commonsense reasoning, question answering, etc. Causal Judgement and Tracking Shuffled Objects are two sub-tasks in BBH [32]. |
| Dataset Splits | No | The paper mentions using 'training data' to find thresholds for IAP-ss, but does not specify exact train/validation/test splits (percentages or counts) for any dataset. |
| Hardware Specification | Yes | all the experiments are run on an 8x NVIDIA A100 GPU server. |
| Software Dependencies | No | The paper mentions using specific LLMs such as LLaMA-3-8B-Instruct and Qwen-14B-Chat, but it does not provide specific version numbers for underlying software libraries like PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We set the generation mode to greedy decoding to minimize irrelevant confounders during model inference, ensuring fixed answers to the same question under the same model and prompt; all the experiments are run on an 8x NVIDIA A100 GPU server. For IAP-ss, we obtain threshold values w.r.t. distinct LLMs on different datasets and compute the overall synthesized score as defined in Eq. 4. The λ1, λ2, and λ3 are hyperparameters that adjust the ratio of the different saliency scores and obey λ1 + λ2 + λ3 = 1. For IAP-mv, we select the top-k (hyperparameter, k = 3) values and take the majority result as the final answer (see the sketches below). |
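Since the paper provides no pseudocode, here is a minimal sketch of the sequential-substitution procedure (IAP-ss) as described in the setup row above. It assumes Eq. 4 is a convex combination of three saliency scores weighted by λ1, λ2, λ3; the helpers `score_fn` and `answer_fn` are hypothetical stand-ins for the paper's attention-saliency extraction and greedy-decoding inference, and `threshold` is the dataset- and model-specific value the paper derives from training data.

```python
from typing import Callable, Optional, Sequence

def synthesized_score(s1: float, s2: float, s3: float,
                      lambdas: tuple[float, float, float]) -> float:
    """Assumed form of Eq. 4: a convex combination of three saliency
    scores, with lambda1 + lambda2 + lambda3 = 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-8, "weights must sum to 1"
    return l1 * s1 + l2 * s2 + l3 * s3

def iap_ss(question: str,
           prompts: Sequence[str],
           score_fn: Callable[[str, str], tuple[float, float, float]],
           answer_fn: Callable[[str, str], str],
           threshold: float,
           lambdas: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> Optional[str]:
    """Sequential substitution: try candidate zero-shot CoT prompts in
    order and answer with the first one whose synthesized saliency score
    clears the model- and dataset-specific threshold."""
    for prompt in prompts:
        s1, s2, s3 = score_fn(question, prompt)  # hypothetical saliency extractor
        if synthesized_score(s1, s2, s3, lambdas) >= threshold:
            return answer_fn(question, prompt)   # hypothetical greedy-decoding call
    # No candidate cleared the threshold; fall back to the last prompt.
    return answer_fn(question, prompts[-1]) if prompts else None
```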
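The majority-vote variant (IAP-mv) can be sketched the same way under the same assumptions: rank candidate prompts by the synthesized score, keep the top-k (k = 3 in the paper), and return the most frequent answer. This reuses `synthesized_score` and the hypothetical `score_fn`/`answer_fn` helpers from the sketch above.

```python
from collections import Counter
from typing import Callable, Sequence

def iap_mv(question: str,
           prompts: Sequence[str],
           score_fn: Callable[[str, str], tuple[float, float, float]],
           answer_fn: Callable[[str, str], str],
           k: int = 3,
           lambdas: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> str:
    """Majority vote: rank prompts by their synthesized saliency score,
    answer with each of the top-k, and return the most common answer."""
    ranked = sorted(
        prompts,
        key=lambda p: synthesized_score(*score_fn(question, p), lambdas),
        reverse=True,
    )
    answers = [answer_fn(question, p) for p in ranked[:k]]
    return Counter(answers).most_common(1)[0][0]
```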