Efficient Streaming Language Models with Attention Sinks
Authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate StreamingLLM using four prominent recent model families: Llama-2 (Touvron et al., 2023b), MPT (Team, 2023), Pythia (Biderman et al., 2023), and Falcon (Almazrouei et al., 2023). |
| Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology 2 Meta AI 3 Carnegie Mellon University 4 NVIDIA |
| Pseudocode | No | The paper provides conceptual diagrams and descriptions of its methods (e.g., Figure 4 illustrating KV cache) but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and datasets are provided in the link. https://github.com/mit-han-lab/streaming-llm |
| Open Datasets | Yes | Code and datasets are provided in the link. We have made our code and datasets available in this github repo. We firstly evaluate StreamingLLM's language modeling perplexity using the concatenated PG19 (Rae et al., 2020) test set... We train the models on an 8x A6000 NVIDIA GPU server using the deduplicated Pile (Gao et al., 2020) dataset. (A hedged sketch of this streaming perplexity evaluation follows the table.) |
| Dataset Splits | No | The paper uses established datasets like PG19 and The Pile, which have their own splits, and mentions 'test set' and 'training samples'. However, it does not provide explicit details about the train/validation/test dataset splits (e.g., percentages or sample counts) within the paper itself. |
| Hardware Specification | Yes | We train the models on an 8x A6000 NVIDIA GPU server... and tested on a single NVIDIA A6000 GPU using the Llama-2-7B and Llama-2-13B models. |
| Software Dependencies | No | The paper mentions that 'Both methods are implemented using the Huggingface Transformers library (Wolf et al., 2020)' but does not provide specific version numbers for this or any other software dependency. |
| Experiment Setup | Yes | For Llama-2 models, the cache size is set at 2048, while for Falcon, Pythia, and MPT models, it is set at 1024. Apart from reducing the training batch size to 256, we retained all Pythia training configurations, including learning rate schedules, model initialization, and dataset permutations. Both models were trained for 143,000 steps. (A minimal cache-eviction sketch illustrating these cache sizes follows the table.) |
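
The Experiment Setup row above quotes cache sizes of 2048 (Llama-2) and 1024 (Falcon, Pythia, MPT). Below is a minimal sketch of the kind of rolling KV cache with attention sinks that StreamingLLM describes: keep the first few "sink" tokens plus the most recent tokens, evicting everything in between. This is not the authors' released implementation (see https://github.com/mit-han-lab/streaming-llm); the class name `SinkKVCache`, the default of 4 sink tokens, and the tuple-of-(key, value) cache layout are illustrative assumptions.

```python
# Hypothetical sketch of a rolling KV cache with attention sinks.
# Keys/values are assumed to be per-layer tensors shaped
# [batch, num_heads, seq_len, head_dim], in the legacy tuple-of-tuples
# cache format used by older Hugging Face Transformers versions.
import torch


class SinkKVCache:
    def __init__(self, cache_size: int = 2048, num_sink_tokens: int = 4):
        # cache_size mirrors the paper's settings: 2048 for Llama-2,
        # 1024 for Falcon, Pythia, and MPT. num_sink_tokens is illustrative.
        self.cache_size = cache_size
        self.num_sink = num_sink_tokens

    def evict(self, past_key_values):
        """Keep the first `num_sink` tokens plus the most recent tokens so the
        total cached length never exceeds `cache_size`."""
        if past_key_values is None:
            return None
        trimmed = []
        for k, v in past_key_values:
            seq_len = k.size(2)  # sequence dimension
            if seq_len <= self.cache_size:
                trimmed.append((k, v))
                continue
            recent = self.cache_size - self.num_sink
            k = torch.cat([k[:, :, : self.num_sink], k[:, :, -recent:]], dim=2)
            v = torch.cat([v[:, :, : self.num_sink], v[:, :, -recent:]], dim=2)
            trimmed.append((k, v))
        return tuple(trimmed)
```

The eviction rule is the whole idea: attention sinks (the initial tokens) are never dropped, while the rest of the cache behaves as a sliding window, so memory stays bounded regardless of stream length.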
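
The Open Datasets row mentions evaluating language-modeling perplexity on the concatenated PG19 test set. The sketch below shows one way such a streaming evaluation could be wired up with Hugging Face Transformers, reusing the hypothetical `SinkKVCache` from the previous block. The model name, the file path, token-by-token decoding, and the legacy tuple cache format are all assumptions for illustration (newer Transformers versions wrap `past_key_values` in `Cache` objects), and position handling is simplified relative to the paper, which re-assigns positions within the cache.

```python
# Hedged sketch of streaming perplexity evaluation on a long concatenated
# text, assuming the legacy tuple-format KV cache of older Transformers
# versions and the SinkKVCache sketch defined above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = (
    AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    .cuda()
    .eval()
)

cache = SinkKVCache(cache_size=2048, num_sink_tokens=4)
long_text = open("pg19_test_concat.txt").read()  # placeholder path
ids = tokenizer(long_text, return_tensors="pt").input_ids.cuda()

nll, count, past = 0.0, 0, None
with torch.no_grad():
    # Token-by-token for clarity; a real evaluation would batch this.
    for i in range(ids.size(1) - 1):
        out = model(ids[:, i : i + 1], past_key_values=past, use_cache=True)
        past = cache.evict(out.past_key_values)
        logp = torch.log_softmax(out.logits[:, -1], dim=-1)
        nll -= logp[0, ids[0, i + 1]].item()
        count += 1

print("perplexity:", torch.exp(torch.tensor(nll / count)).item())
```

This is a sketch of the evaluation protocol only; the reported numbers in the paper come from the authors' own code and settings.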