CLLMs: Consistency Large Language Models

Authors: Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. Our code is available at https://github.com/hao-ai-lab/Consistency_LLM. (Abstract)
Researcher Affiliation | Academia | ¹Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; ²University of California, San Diego; ³School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University.
Pseudocode | Yes | Algorithm 1: Generate dataset to train a CLLM; Algorithm 2: Training algorithm for a CLLM; Algorithm 3: Jacobi Decoding with KV Cache (an illustrative sketch of Jacobi decoding is given after the table).
Open Source Code | Yes | Our code is available at https://github.com/hao-ai-lab/Consistency_LLM.
Open Datasets | Yes | Benchmarks and Setup. We evaluate performance across three domain-specific tasks: text-to-SQL (Spider) (Yu et al., 2018), Python code generation (CodeSearchNet-Python) (Husain et al., 2019), and grade-school math (GSM8K) (Cobbe et al., 2021). To test CLLMs' generalizability on open-domain conversational interactions and instruction-following scenarios, we also train CLLMs on ShareGPT data and perform evaluation on MT-bench (Zheng et al., 2023). The performance metrics are the greedy-answer problem solve rate (test@1) on GSM8K, the MT-bench score, execution accuracy on Spider, and strict accuracy (pass@1) on HumanEval. Additionally, we also run evaluations of CLLMs' language modeling capability on raw-WikiText2 (Merity et al., 2016) and PTB (Pan et al., 2020).
Dataset Splits | No | The paper does not explicitly provide the training, validation, and test dataset splits needed to reproduce the experiments. While it mentions 'training' and 'evaluation' on specific benchmarks, it does not specify how the data was divided into these sets for reproducibility.
Hardware Specification | Yes | Both training and evaluation are carried out on servers equipped with 8 NVIDIA A100 40GB GPUs and 128 AMD EPYC 7742 64-core processors.
Software Dependencies | No | The paper mentions specific LLM backbone models (LLaMA-2-7B, Deepseek-coder-7B-instruct) and discusses vLLM, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | Results are measured with a batch size of 1 (Table 1 footnote). Consequently, the total loss for training a CLLM is L(θ) = L_consistency + ω · L_AR, where ω represents a weighting coefficient (Section 3.2.2); an illustrative sketch of this objective appears after the table.
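
For the pseudocode row above, the following is a minimal, illustrative sketch of greedy Jacobi decoding (without the KV cache of Algorithm 3). The `model(input_ids)` interface, the pad-token initialization, and all names are assumptions made for illustration only, not the authors' implementation.

# Minimal sketch of Jacobi decoding (fixed-point iteration) for a causal LM.
# Assumed interface: `model(input_ids)` returns logits of shape
# [batch, seq_len, vocab]; greedy decoding, batch size 1, no KV cache.
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_len, pad_id, max_iters=None):
    """Decode `block_len` tokens in parallel via Jacobi iteration.

    Starts from an arbitrary guess (pad tokens) and refines all block
    positions at once until the guess reaches a fixed point, which matches
    greedy autoregressive decoding for the same prefix.
    """
    device = prefix_ids.device
    guess = torch.full((1, block_len), pad_id, dtype=torch.long, device=device)
    max_iters = max_iters or block_len  # converges in at most block_len steps

    for _ in range(max_iters):
        input_ids = torch.cat([prefix_ids, guess], dim=1)
        logits = model(input_ids)                       # [1, seq, vocab]
        # The prediction for block position i comes from the logits at
        # position (prefix_len - 1 + i), i.e. the token preceding it.
        start = prefix_ids.shape[1] - 1
        new_guess = logits[:, start:start + block_len, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):               # fixed point reached
            break
        guess = new_guess
    return guess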
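
Likewise, a minimal sketch of the combined training objective quoted in the experiment setup row, L(θ) = L_consistency + ω · L_AR. The KL divergence used as the consistency distance, the detached fixed-point targets, and the tensor names are assumptions for illustration; the paper's exact distance measure and token masking may differ.

# Sketch of the CLLM training objective: a consistency term that pulls the
# model's predictions on an intermediate Jacobi state toward its predictions
# on the converged (fixed-point) state, plus omega times a standard
# autoregressive cross-entropy term that preserves generation quality.
import torch
import torch.nn.functional as F

def cllm_loss(logits_intermediate, logits_converged, ar_logits, ar_targets,
              omega=1.0):
    # Consistency term: KL between the distribution on the intermediate state
    # and the (detached) distribution on the fixed point.
    log_p_student = F.log_softmax(logits_intermediate, dim=-1)
    p_teacher = F.softmax(logits_converged.detach(), dim=-1)
    l_consistency = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    # AR term: ordinary next-token cross-entropy on the target tokens.
    l_ar = F.cross_entropy(ar_logits.reshape(-1, ar_logits.size(-1)),
                           ar_targets.reshape(-1))

    return l_consistency + omega * l_ar   # L(θ) = L_consistency + ω · L_AR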