Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation
Authors: Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We test SoT on 12 recently released LLMs. Not only does SoT provide considerable speed-ups (up to 2.39×), but it can also improve the answer quality in many cases (Fig. 1)." "Datasets. We evaluate SoT on two recent assistant-style datasets: (1) Vicuna-80 (Chiang et al., 2023), which contains 80 questions spanning nine categories, such as coding, math, writing, roleplay, and so on, and (2) WizardLM (Xu et al., 2023), which contains 218 questions spanning more categories and diverse difficulties." (See the pipeline sketch below the table.) |
| Researcher Affiliation | Collaboration | Xuefei Ning [1] (foxdoraame@gmail.com), Zinan Lin [2] (linzinan1995@gmail.com), Zixuan Zhou [1,4] (zhouzx21@mails.tsinghua.edu.cn), Zifu Wang [3] (zifu.wang@kuleuven.be), Huazhong Yang [1] (yanghz@tsinghua.edu.cn), Yu Wang [1] (yu-wang@tsinghua.edu.cn). [1] Department of Electronic Engineering, Tsinghua University, Beijing, China; [2] Microsoft Research, Redmond, Washington, USA; [3] ESAT-PSI, KU Leuven, Leuven, Belgium; [4] Infinigence-AI |
| Pseudocode | No | The paper includes structured prompt templates (Prompt 1, Prompt 2, Prompt 3) but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Website: https://sites.google.com/view/sot-llm Code: https://github.com/imagination-research/sot |
| Open Datasets | Yes | Datasets. We evaluate SoT on two recent assistant-style datasets: (1) Vicuna-80 (Chiang et al., 2023), which contains 80 questions spanning nine categories, such as coding, math, writing, roleplay, and so on, and (2) WizardLM (Xu et al., 2023), which contains 218 questions spanning more categories and diverse difficulties. |
| Dataset Splits | No | No explicit training/test/validation dataset splits (e.g., percentages or sample counts) were provided for the datasets used in the main LLM evaluation (Vicuna-80, WizardLM) or for the LIMA dataset used for router training. |
| Hardware Specification | Yes | "For example, it takes 22 seconds for Claude (Anthropic, 2023) (accessed through Slack API) and 43 seconds for Vicuna-33B V1.3 (a 33B LLaMA-based model, running locally on one NVIDIA A100 GPU) to answer the question in Fig. 1." "We run the profiling on the target GPU (NVIDIA A100-80G and NVIDIA RTX 3090) with CUDA 11.7, using the Hugging Face transformers library 4.28.1 and PyTorch 2.0.1. The host of the A100-80G has an Intel Xeon Platinum 8358P CPU and 1 TB memory. The host of the RTX 3090 has an Intel Xeon Gold 6246R CPU and 512 GB memory." |
| Software Dependencies | Yes | We run the profiling on the target GPU (NVIDIA A100-80G and NVIDIA RTX 3090) with CUDA 11.7, using the Hugging Face transformers library 4.28.1 and PyTorch 2.0.1. (Pinned versions are listed below the table.) |
| Experiment Setup | Yes | The finetuning is conducted using the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay of 0.01. The learning rate undergoes a warm-up phase during the first 1% of iterations to 5e-5 and then decays linearly. We train the model for 2 epochs using a batch size of 32. Input sequences are either padded or truncated to achieve a consistent length of 512 tokens. In the application of SoT, false positives... we employ the Tversky loss (Wang et al., 2023b) with parameters α = 0.7 and β = 0.3... We also incorporate label smoothing (Szegedy et al., 2016) with a factor of ϵ = 0.2. (See the training-objective sketch below the table.) |
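
The speed-up quoted in the Research Type row comes from SoT's two-stage decoding: the model first emits a short skeleton of the answer, then every skeleton point is expanded in parallel, and the expansions are concatenated. Below is a minimal Python sketch of that pipeline; the `complete()` helper and the prompt wording are illustrative placeholders, not the paper's Prompt 1/2 templates.

```python
# Minimal sketch of the Skeleton-of-Thought pipeline. `complete()` is a
# placeholder for any LLM completion call (API-based or local); the prompts
# paraphrase, rather than reproduce, the paper's templates.
from concurrent.futures import ThreadPoolExecutor
import re

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def skeleton_of_thought(question: str, max_points: int = 10) -> str:
    # Stage 1: ask for a concise skeleton -- a numbered list of short points.
    skeleton = complete(
        f"Provide a concise skeleton (numbered list of 3-5 word points, "
        f"at most {max_points} points) for answering: {question}"
    )
    points = re.findall(r"^\s*\d+\.\s*(.+)$", skeleton, flags=re.MULTILINE)

    # Stage 2: expand every point in parallel. Threads model parallel API
    # calls; a locally hosted model would batch these requests instead.
    def expand(point: str) -> str:
        return complete(
            f"Question: {question}\nSkeleton: {skeleton}\n"
            f"Expand point '{point}' in 1-2 sentences without repeating other points."
        )

    with ThreadPoolExecutor(max_workers=max(1, len(points))) as pool:
        expansions = list(pool.map(expand, points))

    # Concatenate the expanded points into the final answer.
    return "\n".join(
        f"{i + 1}. {point}: {expansion}"
        for i, (point, expansion) in enumerate(zip(points, expansions))
    )
```

Because the per-point expansions are independent, wall-clock latency is roughly that of the skeleton plus the longest single expansion, rather than the full sequential decode.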
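For convenience, the reported software stack reduces to a two-line `requirements.txt`; the CUDA 11.7 toolkit is assumed to be system-provided and matched by the corresponding PyTorch wheel.

```
# Profiling stack reported in the paper (CUDA 11.7 assumed system-provided)
torch==2.0.1
transformers==4.28.1
```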
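The router fine-tuning recipe in the Experiment Setup row maps onto standard PyTorch/Transformers primitives. The sketch below is a reconstruction from the stated hyperparameters only (AdamW with weight decay 0.01, warm-up over the first 1% of steps to a 5e-5 peak with linear decay, Tversky loss with α = 0.7 and β = 0.3, label smoothing ϵ = 0.2); the exact loss formulation and the model (a `torch.nn.Linear` stands in for the router) are assumptions, not the authors' code.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def tversky_loss(logits, targets, alpha=0.7, beta=0.3, smooth_eps=0.2, eps=1e-7):
    """Binary Tversky loss with label smoothing. Hyperparameter values are
    from the paper; the precise formulation the authors used may differ."""
    probs = torch.sigmoid(logits)
    # Label smoothing: pull hard 0/1 targets toward 0.5 by a factor smooth_eps.
    targets = targets.float() * (1 - smooth_eps) + 0.5 * smooth_eps
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()   # false positives, weighted by alpha
    fn = ((1 - probs) * targets).sum()   # false negatives, weighted by beta
    return 1 - tp / (tp + alpha * fp + beta * fn + eps)

# Optimizer and schedule as reported: AdamW, weight decay 0.01, peak LR 5e-5,
# warm-up over the first 1% of steps, then linear decay (2 epochs, batch 32).
model = torch.nn.Linear(512, 1)          # hypothetical stand-in for the router
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
total_steps = 2 * 100                    # 2 epochs x (hypothetical) 100 steps/epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=max(1, int(0.01 * total_steps)),
    num_training_steps=total_steps,
)
```

The asymmetric α > β weighting penalizes false positives more heavily than false negatives, consistent with the paper's stated concern about false positives when routing questions to SoT.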