Efficient Streaming Language Models with Attention Sinks

Authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate StreamingLLM using four prominent recent model families: Llama-2 (Touvron et al., 2023b), MPT (Team, 2023), Pythia (Biderman et al., 2023), and Falcon (Almazrouei et al., 2023).
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, 2 Meta AI, 3 Carnegie Mellon University, 4 NVIDIA
Pseudocode | No | The paper provides conceptual diagrams and descriptions of its method (e.g., Figure 4 illustrating the KV cache) but does not include any formal pseudocode or algorithm blocks. (A hedged sketch of the described cache policy follows this table.)
Open Source Code | Yes | Code and datasets are provided at the linked repository: https://github.com/mit-han-lab/streaming-llm
Open Datasets | Yes | Code and datasets are provided via the linked repository: 'We have made our code and datasets available in this github repo.' We firstly evaluate StreamingLLM's language modeling perplexity using the concatenated PG19 (Rae et al., 2020) test set... We train the models on an 8x A6000 NVIDIA GPU server using the deduplicated Pile (Gao et al., 2020) dataset. (The perplexity metric is sketched after this table.)
Dataset Splits | No | The paper uses established datasets like PG19 and The Pile, which have their own splits, and mentions 'test set' and 'training samples'. However, it does not provide explicit details about the train/validation/test dataset splits (e.g., percentages or sample counts) within the paper itself.
Hardware Specification | Yes | We train the models on an 8x A6000 NVIDIA GPU server... and tested on a single NVIDIA A6000 GPU using the Llama-2-7B and Llama-2-13B models.
Software Dependencies | No | The paper mentions that 'Both methods are implemented using the Huggingface Transformers library (Wolf et al., 2020)' but does not provide specific version numbers for this or any other software dependency.
Experiment Setup | Yes | For Llama-2 models, the cache size is set at 2048, while for Falcon, Pythia, and MPT models, it's set at 1024. Apart from reducing the training batch size to 256, we retained all Pythia training configurations, including learning rate schedules, model initialization, and dataset permutations. Both models were trained for 143,000 steps. (A sketch of how these cache budgets split into sinks and a recent window follows this table.)
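
Since the paper gives no formal pseudocode (see the Pseudocode row), here is a minimal sketch of the cache policy it describes: keep the first few tokens as attention sinks plus a rolling window of the most recent tokens, evicting everything in between. The class name, interface, and the (batch, heads, seq_len, head_dim) tensor layout are assumptions for illustration, not the authors' released implementation.

```python
import torch

class SinkKVCache:
    """Sketch of the sink-plus-recent eviction policy described in the paper.

    Keeps the first `start_size` cached positions (attention sinks) and the
    most recent `recent_size` positions; everything in between is dropped.
    """

    def __init__(self, start_size=4, recent_size=2044, seq_dim=2):
        self.start_size = start_size    # number of attention-sink tokens
        self.recent_size = recent_size  # rolling window of recent tokens
        self.seq_dim = seq_dim          # sequence dimension in the KV tensors

    def __call__(self, past_key_values):
        # past_key_values: per-layer (key, value) tensor pairs.
        trimmed = []
        for k, v in past_key_values:
            seq_len = k.size(self.seq_dim)
            if seq_len <= self.start_size + self.recent_size:
                trimmed.append((k, v))  # still within the cache budget
                continue

            def keep(t):
                sinks = t.narrow(self.seq_dim, 0, self.start_size)
                recent = t.narrow(self.seq_dim,
                                  seq_len - self.recent_size,
                                  self.recent_size)
                return torch.cat([sinks, recent], dim=self.seq_dim)

            trimmed.append((keep(k), keep(v)))
        return trimmed
```

One detail this sketch omits: the paper assigns positional information relative to positions within the rolling cache rather than positions in the original text, so for rotary or relative position encodings the positions must be handled at attention time rather than baked into the cached keys.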
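
To make the cache sizes in the Experiment Setup row concrete, the following hypothetical helper splits each model family's total KV-cache budget into sink tokens plus a recent window, assuming the paper's default of four attention-sink tokens; the names and structure are illustrative, not taken from the released code.

```python
NUM_SINK_TOKENS = 4  # assumed default number of attention sinks

# Total KV-cache budgets quoted in the Experiment Setup row.
CACHE_BUDGETS = {
    "llama-2": 2048,
    "falcon": 1024,
    "pythia": 1024,
    "mpt": 1024,
}

def cache_config(family: str) -> dict:
    """Split a family's total cache budget into sinks plus a recent window."""
    total = CACHE_BUDGETS[family]
    return {
        "start_size": NUM_SINK_TOKENS,
        "recent_size": total - NUM_SINK_TOKENS,
        "total": total,
    }

print(cache_config("llama-2"))  # {'start_size': 4, 'recent_size': 2044, 'total': 2048}
```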
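
The Open Datasets row refers to language-modeling perplexity on the concatenated PG19 test set. As a reminder of the metric itself (a standard definition, not paper-specific code), perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation stream:

```python
import math

def perplexity(token_nlls):
    """Perplexity of an evaluation stream, e.g., the concatenated PG19 test
    set: exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```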