SIRIUS: Contextual Sparsity with Correction for Efficient LLMs

Authors: Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks. [...] SIRIUS is evaluated on 6 models with 8 difficult generation tasks in reasoning, deduction, and coding and shows consistent effectiveness and efficiency.
Researcher Affiliation | Collaboration | Yang Zhou¹, Zhuoming Chen¹, Zhaozhuo Xu², Xi Victoria Lin³, Beidi Chen¹,³ (¹Carnegie Mellon University, ²Stevens Institute of Technology, ³FAIR at Meta)
Pseudocode | Yes | Algorithm 1: Sirius
Open Source Code | Yes | We open-source our implementation of Sirius at https://github.com/Infini-AI-Lab/Sirius.git.
Open Datasets | Yes | Datasets: To comprehensively evaluate SIRIUS performance, we deploy six mainstream LLMs with sizes ranging from 7B to 13B: Llama-2-7B, Llama-2-13B, and Llama-3-8B with their instruction-finetuned counterparts, all from the Llama family. Following Wei et al. (2022) on LLM reasoning, we also test CS models on two popular types of reasoning generation tasks: arithmetic and commonsense reasoning. On the arithmetic side, besides GSM8K, we also evaluate CS models on AQuA-RAT. On the commonsense side, we use CSQA (Saha et al., 2018), StrategyQA (Geva et al., 2021), Date, and Sports, the last two from the BIG-Bench suite (BIG-bench authors, 2023). [...] We select HumanEval (Chen et al., 2021) and MBPP+ (Liu et al., 2023a) to evaluate SIRIUS.
Dataset Splits | No | The paper refers to using datasets and evaluating performance, but it does not explicitly state the specific training, validation, and test splits (e.g., percentages or exact counts) for any of the datasets used in the experiments.
Hardware Specification | Yes | Firstly, we consider the on-chip setting running Llama-3-8B-Instruct on a single GPU. [...] Results are shown in Table 3. On average, SIRIUS delivers the promised latency reduction from APU calculations. The speedup ratio on A40 and L40 closely aligns with the theoretical APU reported. On the other hand, A100 and H100 compute MLP more efficiently than they compute attention, making the latency ratio between computing MLP and attention not perfectly aligned with their ratio in parameter size. [...] We use a single L40 48GB with a PCIe bus bandwidth of 25 GB/s to run Llama-3-70B-Instruct with batch size 1.
Software Dependencies | No | We use torch.compile to optimize the inference latency and limit the overhead other than running model inference. The paper mentions this software component ("torch compile") but does not provide specific version numbers for it or for other key dependencies such as PyTorch, CUDA, or related libraries.
Experiment Setup | Yes | The default sparsity for both methods is 50% for the MLP component of the model (the whole MLP for coarse-grained sparsity, and the Up and Down linear layers only for fine-grained sparsity). [...] For arithmetic reasoning and coding, we use 50% neuron sparsity for both CSparse and FSparse. [...] Since commonsense reasoning tasks are generally less logically challenging, we lower the neuron sparsity level to 40%. [...] With kernel size 10, SIRIUS achieves 0.74 APU with 71.27% accuracy. [...] We ablate this setting on a 30% subsampled GSM8K dataset, and only strict accuracy is reported. Performance is measured by the task score, while efficiency is measured by Average Advance Length (AAL).
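For readers trying to reproduce the setup quoted in the Experiment Setup row, the following is a minimal sketch of what 50% fine-grained neuron sparsity (FSparse) could look like on a Llama-style gated MLP, with only the Up and Down linear layers sparsified. The function name, the SiLU gating, and the top-k selection rule based on gate magnitudes are our assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def fsparse_mlp(x, w_gate, w_up, w_down, sparsity=0.5):
    """Hypothetical fine-grained contextual sparsity for a single token.

    x: (hidden,); w_gate, w_up: (intermediate, hidden); w_down: (hidden, intermediate).
    Only the Up and Down projections are restricted to the selected neurons.
    """
    gate = F.silu(w_gate @ x)                 # gate activations for every neuron
    k = int(gate.numel() * (1.0 - sparsity))  # e.g. keep 50% of neurons
    idx = gate.abs().topk(k).indices          # contextually "important" neurons (assumed rule)
    up = w_up[idx] @ x                        # Up projection over the selected rows only
    h = gate[idx] * up                        # gated activations on the kept neurons
    return w_down[:, idx] @ h                 # Down projection over the selected columns only
```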
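The Experiment Setup row also measures efficiency by Average Advance Length (AAL) and mentions a kernel size of 10. Below is a heavily hedged sketch of one plausible reading of such a period-based correction loop, in which the sparse (CS) model drafts a short chunk and the full model periodically checks it. Every identifier here (decode, rescore, kernel_size) is hypothetical, and the acceptance rule is our assumption rather than the paper's Algorithm 1.

```python
def generate_with_correction(full_model, sparse_model, prompt_ids,
                             max_new_tokens, kernel_size=10):
    """Hypothetical periodic-correction loop; AAL = mean tokens kept per full-model call."""
    tokens = list(prompt_ids)
    kept_per_call = []
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        draft = sparse_model.decode(tokens, num_tokens=kernel_size)  # cheap sparse draft chunk
        checked = full_model.rescore(tokens, draft)                  # one parallel full-model pass
        # Keep the prefix where the two models agree, then take the full model's correction.
        n_ok = next((i for i, (d, c) in enumerate(zip(draft, checked)) if d != c),
                    len(draft))
        new = draft[:n_ok] + checked[n_ok:n_ok + 1]
        tokens.extend(new)
        kept_per_call.append(len(new))
    aal = sum(kept_per_call) / len(kept_per_call)  # reported efficiency metric
    return tokens, aal
```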
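Finally, since the Software Dependencies row flags that only "torch compile" is named without versions, here is a minimal, assumption-laden example of the kind of torch.compile usage that could reproduce the latency measurement setup. The checkpoint name (one of the models quoted above), the generation arguments, and the prompt are illustrative, and no specific library versions are implied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")
model.forward = torch.compile(model.forward, mode="reduce-overhead")  # cut framework overhead

inputs = tokenizer("Question: What is 12 * 7?", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```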