CHAI: Clustered Head Attention for Efficient LLM Inference

Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that CHAI is able to reduce the memory requirements for storing K,V cache by up to 21.4% and inference time latency by up to 1.73× without any fine-tuning required. CHAI achieves this with a maximum 3.2% deviation in accuracy across 3 different models (i.e., OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets. (See the KV-cache sizing sketch after this table.)
Researcher Affiliation | Collaboration | Saurabh Agarwal¹, Bilge Acun², Basil Hosmer², Mostafa Elhoushi², Yejin Lee², Shivaram Venkataraman¹, Dimitris Papailiopoulos¹, Carole-Jean Wu²; ¹University of Wisconsin-Madison, ²Meta FAIR.
Pseudocode | No | The paper includes a schematic (Figure 10) illustrating the CHAI flow, but no formal pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper states 'CHAI is built on top of Meta's xFormers (facebookresearch, 2023).' and cites a GitHub link for xFormers, but it does not provide a direct link to, or an explicit statement about the open-source availability of, the CHAI implementation described in the paper.
Open Datasets | Yes | In our case, we sample a small number of samples (1024) from the C4 (Raffel et al., 2020) dataset... We evaluate the models on five commonly used NLP tasks: PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), Arc-Challenge and Arc-Easy (Clark et al., 2018) and BoolQ (Clark et al., 2019).
Dataset Splits | No | The paper mentions using evaluation datasets (PIQA, HellaSwag, Arc-Challenge, Arc-Easy, BoolQ) but does not provide specific details on the train/validation/test splits, percentages, or sample counts for these datasets.
Hardware Specification | Yes | All our experiments are performed on servers with NVIDIA V100 GPUs. For OPT-66B we used eight GPUs on a single node, for LLAMA-33B we used four GPUs, and for LLAMA-7B, we used a single GPU for inference.
Software Dependencies | No | The paper mentions 'CHAI is built on top of Meta's xFormers (facebookresearch, 2023)', but it does not specify version numbers for xFormers or any other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | To determine the number of clusters, we propose an offline strategy we run once for each model. In our case, we sample a small number of samples (1024) from the C4 (Raffel et al., 2020) dataset and perform elbow-plot analysis by plotting clustering error (i.e. sum of squared distance from the closest cluster) as a function of number of clusters. ... for each new request, we determine the cluster membership using K-Means clustering once we have processed five tokens, using the observed activations. (See the clustering sketch after this table.)
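
The Research Type row quotes a memory saving of up to 21.4% for the K,V cache. As a purely illustrative aid, the Python sketch below shows back-of-the-envelope KV-cache sizing under the assumption that heads grouped into a cluster can share a single key-cache entry while values remain per head; the function name `kv_cache_bytes`, the LLaMA-7B-like shapes, and the figure of 20 shared key entries per layer are placeholders, not numbers or mechanisms taken from the paper.

```python
# Back-of-the-envelope KV-cache sizing (illustrative only).
# Assumption for this sketch: heads within a cluster share one K entry,
# while V is still stored per head. All dimensions below are placeholders.

def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2,
                   k_entries_per_layer=None):
    """Size in bytes of the K and V caches for one sequence (fp16 by default)."""
    if k_entries_per_layer is None:
        k_entries_per_layer = n_heads  # baseline: one K entry per head
    k = n_layers * k_entries_per_layer * head_dim * seq_len * bytes_per_elem
    v = n_layers * n_heads * head_dim * seq_len * bytes_per_elem
    return k + v

# Placeholder LLaMA-7B-like shapes: 32 layers, 32 heads, head_dim 128.
base = kv_cache_bytes(32, 32, 128, seq_len=2048)
# Hypothetical clustering: on average 20 shared K entries instead of 32 per layer.
clustered = kv_cache_bytes(32, 32, 128, seq_len=2048, k_entries_per_layer=20)
print(f"baseline {base / 2**20:.0f} MiB, clustered {clustered / 2**20:.0f} MiB, "
      f"saving {(1 - clustered / base) * 100:.1f}%")
```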
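
The Experiment Setup row describes a two-step procedure: an offline elbow-plot analysis to pick the number of clusters, followed by a per-request K-Means assignment once five tokens have been processed. The sketch below illustrates that flow; the choice of flattened per-head activations as features, the helper names `elbow_num_clusters` and `assign_clusters`, the `k_max` bound, and the 10% elbow threshold are assumptions made for illustration, not details specified in the paper.

```python
# Minimal sketch of the offline cluster-count selection and the runtime
# assignment described in the Experiment Setup row. Feature construction,
# k_max, and the elbow threshold are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def elbow_num_clusters(head_features: np.ndarray, k_max: int = 8) -> int:
    """head_features: (num_heads, feature_dim) array for one layer."""
    ks = list(range(1, min(k_max, len(head_features)) + 1))
    inertias = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(head_features)
        inertias.append(km.inertia_)  # sum of squared distances to closest center
    # Illustrative elbow rule: stop once the marginal reduction in clustering
    # error drops below 10% of the previous error.
    for i in range(1, len(inertias)):
        if inertias[i - 1] == 0 or (inertias[i - 1] - inertias[i]) / inertias[i - 1] < 0.10:
            return ks[i - 1]
    return ks[-1]

def assign_clusters(observed: np.ndarray, num_clusters: int) -> np.ndarray:
    """Runtime step: after ~5 tokens, group heads by their observed activations."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(observed)
    return km.labels_  # cluster id per head
```

Here `inertia_` is scikit-learn's sum of squared distances to the closest cluster center, which matches the clustering-error definition quoted in the row above.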