CLLMs: Consistency Large Language Models
Authors: Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. Our code is available at https://github.com/hao-ai-lab/Consistency_LLM. (Abstract) |
| Researcher Affiliation | Academia | ¹Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; ²University of California, San Diego; ³School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. |
| Pseudocode | Yes | Algorithm 1 Generate dataset to train a CLLM, Algorithm 2 Training algorithm for a CLLM, Algorithm 3 Jacobi Decoding with KV Cache (a minimal Jacobi decoding sketch is given below the table) |
| Open Source Code | Yes | Our code is available at https://github.com/hao-ai-lab/Consistency_LLM. |
| Open Datasets | Yes | Benchmarks and Setup. We evaluate performance across three domain-specific tasks, including text-to-SQL (Spider) (Yu et al., 2018), Python code generation (CodeSearchNet-Python) (Husain et al., 2019), and graduate school math (GSM8K) (Cobbe et al., 2021). To test CLLMs' generalizability on open-domain conversational interactions and instruction-following scenarios, we also train CLLMs on ShareGPT data and perform evaluation on MT-bench (Zheng et al., 2023). The performance metrics are the greedy-answer problem solve rate (test@1) on GSM8K, MT-bench score, execution accuracy on Spider, as well as strict accuracy (pass@1) on HumanEval. Additionally, we also run evaluations of CLLMs' language modeling capability on raw-WikiText2 (Merity et al., 2016) and PTB (Pan et al., 2020). |
| Dataset Splits | No | The paper does not explicitly provide training, validation, and test dataset splits needed to reproduce the experiment. While it mentions 'training' and 'evaluation' on specific benchmarks, it does not specify how the data was divided into these sets for reproducibility. |
| Hardware Specification | Yes | Both training and evaluation are carried out on servers equipped with 8 NVIDIA A100 40GB GPUs and 128 AMD EPYC 7742 64-core processors. |
| Software Dependencies | No | The paper mentions specific LLM backbone models (LLaMA-2-7B, Deepseek-coder-7B-instruct) and discusses vLLM, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Results are measured with a batch size of 1 (Table 1 footnote). Consequently, the total loss for training a CLLM is: L(θ) = L_consistency + ω·L_AR, where ω represents a weighting coefficient (Section 3.2.2). A sketch of this combined objective appears below the table. |
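
For context on the Pseudocode row, here is a minimal sketch of the Jacobi decoding loop (Algorithm 3 in the paper, without the KV-cache optimization): an n-token block is initialized with a guess and updated greedily in parallel until it reaches a fixed point. It assumes a Hugging Face-style causal LM (`model(input_ids).logits`); the names `jacobi_decode_block`, `block_len`, and `max_iters` are illustrative, not the authors' implementation.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_len, max_iters=64, pad_id=0):
    """Sketch of decoding one n-token block via Jacobi iteration:
    start from an initial guess and apply parallel greedy updates
    until the block stops changing (a fixed point)."""
    device = prefix_ids.device
    # Initial guess for the n-token block (the paper initializes it with random tokens).
    guess = torch.full((1, block_len), pad_id, dtype=torch.long, device=device)
    for _ in range(max_iters):
        # One forward pass scores every position of the block in parallel.
        input_ids = torch.cat([prefix_ids, guess], dim=-1)
        logits = model(input_ids).logits
        # Greedy update: block position i is predicted from prefix + guess[:i].
        next_guess = logits[:, prefix_ids.shape[-1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(next_guess, guess):  # fixed point reached
            break
        guess = next_guess
    return guess
```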
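
And a hedged sketch of the combined training objective quoted in the Experiment Setup row, L(θ) = L_consistency + ω·L_AR. The consistency term is written here as a simple cross-entropy from a Jacobi intermediate state to its fixed-point tokens, which is only one possible instantiation of the paper's consistency loss; the function name, argument names, and the default `omega` value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cllm_total_loss(model, prefix_ids, intermediate_ids, fixed_point_ids,
                    target_ids, omega=1.0):
    """Sketch: total loss = consistency term (map an intermediate Jacobi state
    to its fixed point) + omega * standard autoregressive loss on ground-truth text."""
    # Consistency term: predict the fixed-point tokens from the intermediate state.
    logits = model(torch.cat([prefix_ids, intermediate_ids], dim=-1)).logits
    block_logits = logits[:, prefix_ids.shape[-1] - 1 : -1, :]
    loss_consistency = F.cross_entropy(
        block_logits.reshape(-1, block_logits.size(-1)), fixed_point_ids.reshape(-1)
    )
    # AR term: ordinary next-token prediction on the original data, to preserve quality.
    ar_logits = model(target_ids).logits[:, :-1, :]
    loss_ar = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)), target_ids[:, 1:].reshape(-1)
    )
    return loss_consistency + omega * loss_ar
```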