CHAI: Clustered Head Attention for Efficient LLM Inference
Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that CHAI is able to reduce the memory requirements for storing K,V cache by up to 21.4% and inference time latency by up to 1.73× without any fine-tuning required. CHAI achieves this with a maximum 3.2% deviation in accuracy across 3 different models (i.e. OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets. |
| Researcher Affiliation | Collaboration | Saurabh Agarwal¹, Bilge Acun², Basil Hosmer², Mostafa Elhoushi², Yejin Lee², Shivaram Venkataraman¹, Dimitris Papailiopoulos¹, Carole-Jean Wu²; ¹University of Wisconsin-Madison, ²Meta-FAIR. |
| Pseudocode | No | The paper includes a schematic (Figure 10) illustrating the CHAI flow, but no formal pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper states 'CHAI is built on top of Meta's xFormers (facebookresearch, 2023).' and cites a GitHub link for xFormers, but it does not provide a direct link or explicit statement about the open-source availability of the CHAI implementation code described in this paper. |
| Open Datasets | Yes | In our case, we sample a small number of samples (1024) from the C4 (Raffel et al., 2020) dataset... We evaluate the models on five commonly used NLP tasks: PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), Arc-Challenge and Arc-Easy (Clark et al., 2018) and BoolQ (Clark et al., 2019). |
| Dataset Splits | No | The paper mentions using evaluation datasets (PIQA, HellaSwag, Arc-Challenge, Arc-Easy, BoolQ) but does not provide specific details on the train/validation/test splits, percentages, or sample counts for these datasets. |
| Hardware Specification | Yes | All our experiments are performed on servers with NVIDIA V100 GPUs. For OPT-66B we used eight GPUs on a single node, for LLAMA-33B we used four GPUs, and for LLAMA-7B, we used a single GPU for inference. |
| Software Dependencies | No | The paper mentions 'CHAI is built on top of Meta's xFormers (facebookresearch, 2023)', but it does not specify version numbers for xFormers or any other software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | To determine the number of clusters, we propose an offline strategy we run once for each model. In our case, we sample a small number of samples (1024) from the C4 (Raffel et al., 2020) dataset and perform elbow-plot analysis by plotting clustering error (i.e. sum of squared distance from the closest cluster) as a function of number of clusters. ... for each new request, we determine the cluster membership using K-Means clustering once we have processed five tokens, using the observed activations. (An illustrative sketch of this two-stage procedure follows the table.) |
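
The experiment-setup row describes a two-stage clustering procedure: an offline elbow-plot analysis (run once per model, on activations sampled from 1024 C4 examples) to pick the number of clusters, and a per-request K-Means assignment of heads to clusters after the first five tokens. The sketch below is a minimal illustration of that procedure, not the authors' code (which is not released): the per-head activation representation, the array shapes, the function names `elbow_analysis` and `assign_clusters_online`, and the use of scikit-learn's `KMeans` are all assumptions made for illustration.

```python
# Hedged sketch of the two-stage clustering described in the Experiment Setup
# row. Shapes, helper names, and the scikit-learn K-Means choice are
# illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans


def elbow_analysis(head_activations: np.ndarray, max_clusters: int = 16) -> dict:
    """Offline step, run once per model (per layer, assumed).

    head_activations: (num_heads, feature_dim) array, e.g. per-head attention
    activations aggregated over ~1024 C4 samples (assumed representation).
    Returns the clustering error (inertia, i.e. sum of squared distances to
    the closest cluster centroid) for each candidate cluster count, which is
    then inspected as an elbow plot to choose the number of clusters.
    """
    errors = {}
    for k in range(1, max_clusters + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        km.fit(head_activations)
        errors[k] = km.inertia_  # sum of squared distance from closest cluster
    return errors


def assign_clusters_online(observed_activations: np.ndarray,
                           num_clusters: int) -> np.ndarray:
    """Runtime step, run once per request after five tokens are processed.

    observed_activations: (num_heads, feature_dim) activations observed for
    the first five tokens of the new request (assumed representation).
    Returns an array of length num_heads mapping each head to a cluster id,
    using the cluster count chosen offline via the elbow analysis.
    """
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return km.fit_predict(observed_activations)
```

As a usage example, one would call `elbow_analysis` once on the sampled C4 activations, read the cluster count off the resulting elbow plot, and then call `assign_clusters_online` with that count for each incoming request once its first five tokens have been processed.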