CLLMs: Consistency Large Language Models
Authors: Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. Our code is available at https://github.com/hao-ai-lab/Consistency_LLM. (Abstract) |
| Researcher Affiliation | Academia | ¹Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; ²University of California, San Diego; ³School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. |
| Pseudocode | Yes | Algorithm 1 Generate dataset to train a CLLM, Algorithm 2 Training algorithm for a CLLM, Algorithm 3 Jacobi Decoding with KV Cache (a minimal Jacobi decoding sketch is given below the table) |
| Open Source Code | Yes | Our code is available at https://github.com/hao-ai-lab/Consistency_LLM. |
| Open Datasets | Yes | Benchmarks and Setup. We evaluate performance across three domain-specific tasks, including text-to-SQL (Spider) (Yu et al., 2018), Python code generation (CodeSearchNet-Python) (Husain et al., 2019), and graduate school math (GSM8K) (Cobbe et al., 2021). To test CLLMs' generalizability on open-domain conversational interactions and instruction-following scenarios, we also train CLLMs on ShareGPT data and perform evaluation on MT-bench (Zheng et al., 2023). The performance metrics are the greedy-answer problem solve rate (test@1) on GSM8K, MT-bench score, execution accuracy on Spider, as well as strict accuracy (pass@1) on HumanEval. Additionally, we also run evaluations of CLLMs' language modeling capability on raw-WikiText2 (Merity et al., 2016) and PTB (Pan et al., 2020). |
| Dataset Splits | No | The paper does not explicitly provide training, validation, and test dataset splits needed to reproduce the experiment. While it mentions 'training' and 'evaluation' on specific benchmarks, it does not specify how the data was divided into these sets for reproducibility. |
| Hardware Specification | Yes | Both training and evaluation are carried out on servers equipped with 8 NVIDIA A100 40GB GPUs and 128 AMD EPYC 7742 64-core processors. |
| Software Dependencies | No | The paper mentions specific LLM backbone models (LLaMA-2-7B, Deepseek-coder-7B-instruct) and discusses vLLM, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Results are measured with a batch size of 1 (Table 1 footnote). Consequently, the total loss for training a CLLM is: L(θ) = L_consistency + ω·L_AR, where ω represents a weighting coefficient (Section 3.2.2). A sketch of this combined objective appears below the table. |
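
For context on the Pseudocode row, here is a minimal sketch of the Jacobi decoding loop (Algorithm 3 in the paper, without the KV-cache optimization): an n-token block is initialized with a guess and updated greedily in parallel until it reaches a fixed point. It assumes a Hugging Face-style causal LM (`model(input_ids).logits`); the names `jacobi_decode_block`, `block_len`, and `max_iters` are illustrative, not the authors' implementation.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_len, max_iters=64, pad_id=0):
    """Sketch of decoding one n-token block via Jacobi iteration:
    start from an initial guess and apply parallel greedy updates
    until the block stops changing (a fixed point)."""
    device = prefix_ids.device
    # Initial guess for the n-token block (the paper initializes it with random tokens).
    guess = torch.full((1, block_len), pad_id, dtype=torch.long, device=device)
    for _ in range(max_iters):
        # One forward pass scores every position of the block in parallel.
        input_ids = torch.cat([prefix_ids, guess], dim=-1)
        logits = model(input_ids).logits
        # Greedy update: block position i is predicted from prefix + guess[:i].
        next_guess = logits[:, prefix_ids.shape[-1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(next_guess, guess):  # fixed point reached
            break
        guess = next_guess
    return guess
```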
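
And a hedged sketch of the combined training objective quoted in the Experiment Setup row, L(θ) = L_consistency + ω·L_AR. The consistency term is written here as a simple cross-entropy from a Jacobi intermediate state to its fixed-point tokens, which is only one possible instantiation of the paper's consistency loss; the function name, argument names, and the default `omega` value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cllm_total_loss(model, prefix_ids, intermediate_ids, fixed_point_ids,
                    target_ids, omega=1.0):
    """Sketch: total loss = consistency term (map an intermediate Jacobi state
    to its fixed point) + omega * standard autoregressive loss on ground-truth text."""
    # Consistency term: predict the fixed-point tokens from the intermediate state.
    logits = model(torch.cat([prefix_ids, intermediate_ids], dim=-1)).logits
    block_logits = logits[:, prefix_ids.shape[-1] - 1 : -1, :]
    loss_consistency = F.cross_entropy(
        block_logits.reshape(-1, block_logits.size(-1)), fixed_point_ids.reshape(-1)
    )
    # AR term: ordinary next-token prediction on the original data, to preserve quality.
    ar_logits = model(target_ids).logits[:, :-1, :]
    loss_ar = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)), target_ids[:, 1:].reshape(-1)
    )
    return loss_consistency + omega * loss_ar
```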