Instance-adaptive Zero-shot Chain-of-Thought Prompting
Authors: Xiaosong Yuan, Chen Shen, Shaotian Yan, Xiaofeng Zhang, Liang Xie, Wenxiao Wang, Renchu Guan, Ying Wang, Jieping Ye
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted with LLaMA-2, LLaMA-3, and Qwen on math, logic, and commonsense reasoning tasks (e.g., GSM8k, MMLU, Causal Judgement) obtain consistent improvements, demonstrating that instance-adaptive zero-shot CoT prompting outperforms task-level methods that rely on curated prompts or sophisticated procedures, and showing the significance of the findings on the zero-shot CoT reasoning mechanism. |
| Researcher Affiliation | Collaboration | 1 College of Computer Science and Technology, Jilin University; 2 Key Laboratory of Symbolic Computation and Knowledge Engineering, MoE, Jilin University; 3 Alibaba Cloud Computing; 4 Shanghai Jiao Tong University; 5 College of Computer Science and Technology, Zhejiang University of Technology; 6 College of Software, Zhejiang University |
| Pseudocode | No | The paper describes algorithms (Sequential Substitution and Majority Vote) in text but does not provide them in pseudocode or a clearly labeled algorithm block; hedged sketches of both procedures are given after this table. |
| Open Source Code | No | We will release the codes and data as soon as possible. |
| Open Datasets | Yes | GSM8k [17] is a challenging dataset for assessing the capability of language models in multi-step math reasoning. SVAMP [29] is presented for one-step math reasoning, which is easier than GSM8k. CommonsenseQA [30] is designed to evaluate a model's capacity for commonsense reasoning with questions that demand commonsense knowledge. MMLU [31] assesses a model's multi-task abilities across natural language inference, commonsense reasoning, question answering, etc. Causal Judgement and Tracking Shuffled Objects are two sub-tasks in BBH [32]. |
| Dataset Splits | No | The paper mentions using 'training data' to find thresholds for IAP-ss, but does not specify exact train/validation/test splits (percentages or counts) for any dataset. |
| Hardware Specification | Yes | all the experiments are run on an 8x NVIDIA A100 GPU server. |
| Software Dependencies | No | The paper mentions using specific LLMs such as LLaMA-3-8B-Instruct and Qwen-14B-Chat, but it does not provide specific version numbers for underlying software libraries like PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We set the generation mode to greedy decoding to minimize irrelevant confounders during model inference, ensuring fixed answers to the same question under the same model and prompt; all the experiments are run on an 8x NVIDIA A100 GPU server. For IAP-ss, we obtain threshold values w.r.t. distinct LLMs on different datasets and compute the overall synthesized score as defined in Eq. 4. The λ1, λ2, and λ3 are hyperparameters that adjust the ratio of the different saliency scores and obey λ1 + λ2 + λ3 = 1. For IAP-mv, we select the top-k (hyperparameter, k = 3) values and take the majority result as the final answer (see the sketches below). |
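Since the paper provides no pseudocode, here is a minimal sketch of the sequential-substitution procedure (IAP-ss) as described in the setup row above. It assumes Eq. 4 is a convex combination of three saliency scores weighted by λ1, λ2, λ3; the helpers `score_fn` and `answer_fn` are hypothetical stand-ins for the paper's attention-saliency extraction and greedy-decoding inference, and `threshold` is the dataset- and model-specific value the paper derives from training data.

```python
from typing import Callable, Optional, Sequence

def synthesized_score(s1: float, s2: float, s3: float,
                      lambdas: tuple[float, float, float]) -> float:
    """Assumed form of Eq. 4: a convex combination of three saliency
    scores, with lambda1 + lambda2 + lambda3 = 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-8, "weights must sum to 1"
    return l1 * s1 + l2 * s2 + l3 * s3

def iap_ss(question: str,
           prompts: Sequence[str],
           score_fn: Callable[[str, str], tuple[float, float, float]],
           answer_fn: Callable[[str, str], str],
           threshold: float,
           lambdas: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> Optional[str]:
    """Sequential substitution: try candidate zero-shot CoT prompts in
    order and answer with the first one whose synthesized saliency score
    clears the model- and dataset-specific threshold."""
    for prompt in prompts:
        s1, s2, s3 = score_fn(question, prompt)  # hypothetical saliency extractor
        if synthesized_score(s1, s2, s3, lambdas) >= threshold:
            return answer_fn(question, prompt)   # hypothetical greedy-decoding call
    # No candidate cleared the threshold; fall back to the last prompt.
    return answer_fn(question, prompts[-1]) if prompts else None
```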
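The majority-vote variant (IAP-mv) can be sketched the same way under the same assumptions: rank candidate prompts by the synthesized score, keep the top-k (k = 3 in the paper), and return the most frequent answer. This reuses `synthesized_score` and the hypothetical `score_fn`/`answer_fn` helpers from the sketch above.

```python
from collections import Counter
from typing import Callable, Sequence

def iap_mv(question: str,
           prompts: Sequence[str],
           score_fn: Callable[[str, str], tuple[float, float, float]],
           answer_fn: Callable[[str, str], str],
           k: int = 3,
           lambdas: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> str:
    """Majority vote: rank prompts by their synthesized saliency score,
    answer with each of the top-k, and return the most common answer."""
    ranked = sorted(
        prompts,
        key=lambda p: synthesized_score(*score_fn(question, p), lambdas),
        reverse=True,
    )
    answers = [answer_fn(question, p) for p in ranked[:k]]
    return Counter(answers).most_common(1)[0][0]
```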