CHAI: Clustered Head Attention for Efficient LLM Inference
Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that CHAI is able to reduce the memory requirements for storing K,V cache by up to 21.4% and inference time latency by up to 1.73× without any fine-tuning required. CHAI achieves this with a maximum 3.2% deviation in accuracy across 3 different models (i.e. OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets. |
| Researcher Affiliation | Collaboration | Saurabh Agarwal¹, Bilge Acun², Basil Hosmer², Mostafa Elhoushi², Yejin Lee², Shivaram Venkataraman¹, Dimitris Papailiopoulos¹, Carole-Jean Wu²; ¹University of Wisconsin-Madison, ²Meta-FAIR. |
| Pseudocode | No | The paper includes a schematic (Figure 10) illustrating the CHAI flow, but no formal pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper states 'CHAI is built on top of Meta's xFormers (facebookresearch, 2023).' and cites a GitHub link for xFormers, but it does not provide a direct link or explicit statement about the open-source availability of the CHAI implementation code described in this paper. |
| Open Datasets | Yes | In our case, we sample a small number of samples (1024) from the C4 (Raffel et al., 2020) dataset... We evaluate the models on five commonly used NLP tasks: PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), Arc-Challenge and Arc-Easy (Clark et al., 2018) and BoolQ (Clark et al., 2019). |
| Dataset Splits | No | The paper mentions using evaluation datasets (PIQA, HellaSwag, Arc-Challenge, Arc-Easy, BoolQ) but does not provide specific details on the train/validation/test splits, percentages, or sample counts for these datasets. |
| Hardware Specification | Yes | All our experiments are performed on servers with NVIDIA V100 GPUs. For OPT-66B we used eight GPUs on a single node, for LLAMA-33B we used four GPUs, and for LLAMA-7B, we used a single GPU for inference. |
| Software Dependencies | No | The paper mentions 'CHAI is built on top of Meta's xFormers (facebookresearch, 2023)', but it does not specify version numbers for xFormers or any other software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | To determine the number of clusters, we propose an offline strategy we run once for each model. In our case, we sample a small number of samples (1024) from the C4 (Raffel et al., 2020) dataset and perform elbow-plot analysis by plotting clustering error (i.e. sum of squared distance from the closest cluster) as a function of number of clusters. ... for each new request, we determine the cluster membership using K-Means clustering once we have processed five tokens, using the observed activations. (An illustrative sketch of this two-stage procedure follows the table.) |
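
The experiment-setup row describes a two-stage clustering procedure: an offline elbow-plot analysis (run once per model, on activations sampled from 1024 C4 examples) to pick the number of clusters, and a per-request K-Means assignment of heads to clusters after the first five tokens. The sketch below is a minimal illustration of that procedure, not the authors' code (which is not released): the per-head activation representation, the array shapes, the function names `elbow_analysis` and `assign_clusters_online`, and the use of scikit-learn's `KMeans` are all assumptions made for illustration.

```python
# Hedged sketch of the two-stage clustering described in the Experiment Setup
# row. Shapes, helper names, and the scikit-learn K-Means choice are
# illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans


def elbow_analysis(head_activations: np.ndarray, max_clusters: int = 16) -> dict:
    """Offline step, run once per model (per layer, assumed).

    head_activations: (num_heads, feature_dim) array, e.g. per-head attention
    activations aggregated over ~1024 C4 samples (assumed representation).
    Returns the clustering error (inertia, i.e. sum of squared distances to
    the closest cluster centroid) for each candidate cluster count, which is
    then inspected as an elbow plot to choose the number of clusters.
    """
    errors = {}
    for k in range(1, max_clusters + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        km.fit(head_activations)
        errors[k] = km.inertia_  # sum of squared distance from closest cluster
    return errors


def assign_clusters_online(observed_activations: np.ndarray,
                           num_clusters: int) -> np.ndarray:
    """Runtime step, run once per request after five tokens are processed.

    observed_activations: (num_heads, feature_dim) activations observed for
    the first five tokens of the new request (assumed representation).
    Returns an array of length num_heads mapping each head to a cluster id,
    using the cluster count chosen offline via the elbow analysis.
    """
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return km.fit_predict(observed_activations)
```

As a usage example, one would call `elbow_analysis` once on the sampled C4 activations, read the cluster count off the resulting elbow plot, and then call `assign_clusters_online` with that count for each incoming request once its first five tokens have been processed.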