Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Authors: Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical findings demonstrate that we can effectively prune up to 80% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to 2× increase in inference throughput and even greater memory savings. |
| Researcher Affiliation | Collaboration | ETH Zürich, CSEM SA, University of Basel |
| Pseudocode | No | No explicit pseudocode or algorithm block labeled 'Pseudocode' or 'Algorithm' was found. The paper provides mathematical equations and descriptions of the method. |
| Open Source Code | Yes | We have also released the code as part of the supplementary material, including scripts on how to reproduce our results. Additionally, trained models will be released for further research. |
| Open Datasets | Yes | We fine-tune pretrained GPT-2 models, which support a context size of up to 1024 tokens, on a subset of the English Wikipedia 20220301.en and English BookCorpus datasets. We keep a separate test set where we report perplexity after training. All models shown, for a fair comparison, were fine-tuned using the same lightweight training setup as described in Appendix A. The datasets are provided by Hugging Face at https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus respectively. (A data-loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions training on subsets of Wikipedia and BookCorpus datasets and keeping a separate test set, but it does not specify explicit train/validation/test splits (e.g., percentages or counts for a validation set) for the main language modeling task. |
| Hardware Specification | Yes | We measure throughput using the optimal batch size on an NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | The paper mentions 'flash-attention as provided by the scaled_dot_product_attention in PyTorch 2.0', but does not list multiple key software components with their specific version numbers (e.g., Python, other libraries, CUDA). (An attention-kernel sketch follows the table.) |
| Experiment Setup | Yes | We fine-tune pretrained models on a subset of the English Wikipedia 20220301.en and English BookCorpus datasets, for a total of 25000 steps with a batch size of 6. The datasets are provided by Hugging Face at https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus respectively. We use a learning rate of 1e-4 for the small and medium models and 5e-5 for the large and xl models with the Adam optimizer. We do not use any weight decay or any scheduler for the learning rate. (A training-setup sketch follows the table.) |
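
The Wikipedia 20220301.en and BookCorpus subsets referenced above can be pulled through the Hugging Face `datasets` library. This is a minimal sketch under that assumption; the subset size and the held-out test fraction are not specified by the paper and are placeholders here.

```python
# Minimal sketch: loading the datasets cited in the paper via Hugging Face `datasets`.
# The subset size and test fraction below are assumptions, not values from the paper.
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train")  # snapshot named in the paper
books = load_dataset("bookcorpus", split="train")

# The paper keeps a separate test set; the 1% fraction here is a placeholder.
wiki_splits = wiki.train_test_split(test_size=0.01, seed=42)
train_set, test_set = wiki_splits["train"], wiki_splits["test"]
print(len(train_set), len(test_set), len(books))
```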
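
The throughput measurements rely on the flash-attention kernel exposed through `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0. A minimal sketch of invoking that kernel for causal (GPT-2 style) attention follows; the tensor shapes, half precision, and CUDA device are illustrative assumptions.

```python
# Sketch: causal attention through PyTorch 2.0's scaled_dot_product_attention,
# which dispatches to a flash-attention backend on supported GPUs.
import torch
import torch.nn.functional as F

# (batch, heads, sequence, head_dim) -- shapes and dtype are illustrative.
q = torch.randn(1, 12, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the flash-attention kernel (PyTorch 2.0 API).
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 12, 1024, 64])
```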
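
The fine-tuning hyperparameters in the last row translate directly into an optimizer configuration. The sketch below wires them up for a GPT-2 checkpoint from `transformers`; the model-size rule and the commented training loop are assumptions added for illustration, not the authors' released code.

```python
# Sketch of the reported fine-tuning setup: Adam, no weight decay, no LR
# scheduler, 25000 steps at batch size 6. Model loading and the loop skeleton
# are assumptions; only the hyperparameters come from the paper.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # or "gpt2-medium", "gpt2-large", "gpt2-xl"

# 1e-4 for small/medium, 5e-5 for large/xl (size check via hidden width is an assumed heuristic).
lr = 1e-4 if model.config.n_embd <= 1024 else 5e-5
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.0)

num_steps, batch_size = 25_000, 6
# for step, batch in zip(range(num_steps), train_dataloader):
#     loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```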