Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation
Authors: Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We test SoT on 12 recently released LLMs. Not only does SoT provide considerable speed-ups (up to 2.39×), but it can also improve the answer quality in many cases (Fig. 1)." "Datasets. We evaluate SoT on two recent assistant-style datasets: (1) Vicuna-80 (Chiang et al., 2023), which contains 80 questions spanning nine categories, such as coding, math, writing, roleplay, and so on, and (2) WizardLM (Xu et al., 2023), which contains 218 questions spanning more categories and diverse difficulties." (See the pipeline sketch below the table.) |
| Researcher Affiliation | Collaboration | Xuefei Ning [1] (foxdoraame@gmail.com), Zinan Lin [2] (linzinan1995@gmail.com), Zixuan Zhou [1,4] (zhouzx21@mails.tsinghua.edu.cn), Zifu Wang [3] (zifu.wang@kuleuven.be), Huazhong Yang [1] (yanghz@tsinghua.edu.cn), Yu Wang [1] (yu-wang@tsinghua.edu.cn). [1] Department of Electronic Engineering, Tsinghua University, Beijing, China; [2] Microsoft Research, Redmond, Washington, USA; [3] ESAT-PSI, KU Leuven, Leuven, Belgium; [4] Infinigence-AI |
| Pseudocode | No | The paper includes structured prompt templates (Prompt 1, Prompt 2, Prompt 3) but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Website: https://sites.google.com/view/sot-llm Code: https://github.com/imagination-research/sot |
| Open Datasets | Yes | Datasets. We evaluate SoT on two recent assistant-style datasets: (1) Vicuna-80 (Chiang et al., 2023), which contains 80 questions spanning nine categories, such as coding, math, writing, roleplay, and so on, and (2) WizardLM (Xu et al., 2023), which contains 218 questions spanning more categories and diverse difficulties. |
| Dataset Splits | No | No explicit training/test/validation dataset splits (e.g., percentages or sample counts) were provided for the datasets used in the main LLM evaluation (Vicuna-80, WizardLM) or for the LIMA dataset used for router training. |
| Hardware Specification | Yes | "For example, it takes 22 seconds for Claude (Anthropic, 2023) (accessed through Slack API) and 43 seconds for Vicuna-33B V1.3 (a 33B LLaMA-based model, running locally on one NVIDIA A100 GPU) to answer the question in Fig. 1." "We run the profiling on the target GPU (NVIDIA A100-80G and NVIDIA RTX 3090) with CUDA 11.7, using the Hugging Face transformers library 4.28.1 and PyTorch 2.0.1. The host of the A100-80G has an Intel Xeon Platinum 8358P CPU and 1 TB memory. The host of the RTX 3090 has an Intel Xeon Gold 6246R CPU and 512 GB memory." |
| Software Dependencies | Yes | We run the profiling on the target GPU (NVIDIA A100-80G and NVIDIA RTX 3090) with CUDA 11.7, using the Hugging Face transformers library 4.28.1 and PyTorch 2.0.1. (Pinned versions are listed below the table.) |
| Experiment Setup | Yes | The finetuning is conducted using the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay of 0.01. The learning rate undergoes a warm-up phase during the first 1% of iterations to 5e-5 and then decays linearly. We train the model for 2 epochs using a batch size of 32. Input sequences are either padded or truncated to achieve a consistent length of 512 tokens. In the application of SoT, false positives... we employ the Tversky loss (Wang et al., 2023b) with parameters α = 0.7 and β = 0.3... We also incorporate label smoothing (Szegedy et al., 2016) with a factor of ϵ = 0.2. (See the training-objective sketch below the table.) |
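
The speed-up quoted in the Research Type row comes from SoT's two-stage decoding: the model first emits a short skeleton of the answer, then every skeleton point is expanded in parallel, and the expansions are concatenated. Below is a minimal Python sketch of that pipeline; the `complete()` helper and the prompt wording are illustrative placeholders, not the paper's Prompt 1/2 templates.

```python
# Minimal sketch of the Skeleton-of-Thought pipeline. `complete()` is a
# placeholder for any LLM completion call (API-based or local); the prompts
# paraphrase, rather than reproduce, the paper's templates.
from concurrent.futures import ThreadPoolExecutor
import re

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def skeleton_of_thought(question: str, max_points: int = 10) -> str:
    # Stage 1: ask for a concise skeleton -- a numbered list of short points.
    skeleton = complete(
        f"Provide a concise skeleton (numbered list of 3-5 word points, "
        f"at most {max_points} points) for answering: {question}"
    )
    points = re.findall(r"^\s*\d+\.\s*(.+)$", skeleton, flags=re.MULTILINE)

    # Stage 2: expand every point in parallel. Threads model parallel API
    # calls; a locally hosted model would batch these requests instead.
    def expand(point: str) -> str:
        return complete(
            f"Question: {question}\nSkeleton: {skeleton}\n"
            f"Expand point '{point}' in 1-2 sentences without repeating other points."
        )

    with ThreadPoolExecutor(max_workers=max(1, len(points))) as pool:
        expansions = list(pool.map(expand, points))

    # Concatenate the expanded points into the final answer.
    return "\n".join(
        f"{i + 1}. {point}: {expansion}"
        for i, (point, expansion) in enumerate(zip(points, expansions))
    )
```

Because the per-point expansions are independent, wall-clock latency is roughly that of the skeleton plus the longest single expansion, rather than the full sequential decode.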
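For convenience, the reported software stack reduces to a two-line `requirements.txt`; the CUDA 11.7 toolkit is assumed to be system-provided and matched by the corresponding PyTorch wheel.

```
# Profiling stack reported in the paper (CUDA 11.7 assumed system-provided)
torch==2.0.1
transformers==4.28.1
```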
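The router fine-tuning recipe in the Experiment Setup row maps onto standard PyTorch/Transformers primitives. The sketch below is a reconstruction from the stated hyperparameters only (AdamW with weight decay 0.01, warm-up over the first 1% of steps to a 5e-5 peak with linear decay, Tversky loss with α = 0.7 and β = 0.3, label smoothing ϵ = 0.2); the exact loss formulation and the model (a `torch.nn.Linear` stands in for the router) are assumptions, not the authors' code.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def tversky_loss(logits, targets, alpha=0.7, beta=0.3, smooth_eps=0.2, eps=1e-7):
    """Binary Tversky loss with label smoothing. Hyperparameter values are
    from the paper; the precise formulation the authors used may differ."""
    probs = torch.sigmoid(logits)
    # Label smoothing: pull hard 0/1 targets toward 0.5 by a factor smooth_eps.
    targets = targets.float() * (1 - smooth_eps) + 0.5 * smooth_eps
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()   # false positives, weighted by alpha
    fn = ((1 - probs) * targets).sum()   # false negatives, weighted by beta
    return 1 - tp / (tp + alpha * fp + beta * fn + eps)

# Optimizer and schedule as reported: AdamW, weight decay 0.01, peak LR 5e-5,
# warm-up over the first 1% of steps, then linear decay (2 epochs, batch 32).
model = torch.nn.Linear(512, 1)          # hypothetical stand-in for the router
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
total_steps = 2 * 100                    # 2 epochs x (hypothetical) 100 steps/epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=max(1, int(0.01 * total_steps)),
    num_training_steps=total_steps,
)
```

The asymmetric α > β weighting penalizes false positives more heavily than false negatives, consistent with the paper's stated concern about false positives when routing questions to SoT.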