Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Authors: Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings demonstrate that we can effectively prune up to 80% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to 2× increase in inference throughput and even greater memory savings.
Researcher Affiliation | Collaboration | ETH Zürich; ML, CSEM SA; University of Basel
Pseudocode | No | No explicit pseudocode or algorithm block labeled 'Pseudocode' or 'Algorithm' was found. The paper provides mathematical equations and descriptions of the method.
Open Source Code | Yes | We have also released the code as part of the supplementary material, including scripts on how to reproduce our results. Additionally, trained models will be released for further research.
Open Datasets | Yes | We fine-tune pretrained GPT-2 models, that support a context size of up to 1024 tokens, on a subset of the English Wikipedia 20220301.en and English bookcorpus datasets. We keep a separate test set where we report perplexity after training. All models shown, for a fair comparison, were fine-tuned using the same lightweight training setup as described in Appendix A. The datasets are provided by huggingface at https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus respectively.
Dataset Splits | No | The paper mentions training on subsets of the Wikipedia and BookCorpus datasets and keeping a separate test set, but it does not specify explicit train/validation/test splits (e.g., percentages or counts for a validation set) for the main language modeling task.
Hardware Specification | Yes | We measure throughput using the optimal batch size on an NVIDIA RTX A5000 GPU.
Software Dependencies | No | The paper mentions 'flash-attention as provided by the scaled_dot_product_attention in pytorch-2.0', but does not list multiple key software components with their specific version numbers (e.g., Python, other libraries, CUDA).
Experiment Setup | Yes | We fine-tune pretrained models on a subset of the English Wikipedia 20220301.en and English bookcorpus datasets, for a total of 25000 steps with a batch size of 6. The datasets are provided by huggingface at https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus respectively. We use a learning rate of 1e-4 for the small and medium models and 5e-5 for the large and xl models with the Adam optimizer. We do not use any weight decay or any scheduler for the learning rate.
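The Research Type row quotes the paper's central claim that up to 80% of the context can be pruned with little loss on downstream tasks. As a rough illustration of why dropping cached key/value entries saves memory and attention compute, here is a hypothetical top-k pruning sketch; the paper instead learns which tokens to drop, so this is not the authors' mechanism.

```python
# Hypothetical illustration only: keep a fraction of cached key/value positions
# ranked by some importance score, shrinking both memory and attention FLOPs.
# The paper learns which tokens to drop; this top-k sketch is not their method.
import torch


def prune_kv_cache(keys, values, scores, keep_fraction=0.2):
    """keys/values: (batch, heads, seq_len, head_dim); scores: (batch, heads, seq_len)."""
    seq_len = keys.shape[-2]
    n_keep = max(1, int(seq_len * keep_fraction))
    idx = scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values  # keep original order
    idx_k = idx.unsqueeze(-1).expand(*idx.shape, keys.shape[-1])
    return keys.gather(-2, idx_k), values.gather(-2, idx_k)


# Example: pruning 80% of a 1024-token cache leaves 204 positions per head.
k = torch.randn(1, 12, 1024, 64)
v = torch.randn(1, 12, 1024, 64)
s = torch.rand(1, 12, 1024)
k_small, v_small = prune_kv_cache(k, v, s, keep_fraction=0.2)
print(k_small.shape)  # torch.Size([1, 12, 204, 64])
```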
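The Hardware Specification row states that throughput was measured at the optimal batch size on an NVIDIA RTX A5000. A generic way to take such a tokens-per-second measurement is sketched below; the batch size, iteration counts, and model are illustrative assumptions, not the authors' benchmark script.

```python
# Generic throughput-measurement sketch (not the authors' benchmark): time forward
# passes at a given batch size and report tokens per second on the GPU.
import time
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
batch_size, seq_len = 32, 1024          # sweep batch_size to find the optimal value
input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device="cuda")

with torch.no_grad():
    for _ in range(3):                   # warm-up iterations
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    n_iters = 20
    for _ in range(n_iters):
        model(input_ids)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

tokens_per_sec = n_iters * batch_size * seq_len / elapsed
print(f"{tokens_per_sec:,.0f} tokens/s")
```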
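The Software Dependencies row points to flash-attention as exposed through scaled_dot_product_attention in PyTorch 2.0. A short usage sketch of that API follows, with illustrative tensor shapes.

```python
# Illustrative use of torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+),
# which dispatches to a flash-attention kernel on supported GPUs for causal attention.
import torch
import torch.nn.functional as F

q = torch.randn(6, 12, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, dim)
k = torch.randn(6, 12, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(6, 12, 1024, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the flash kernel (PyTorch 2.0 context manager; later releases
# expose torch.nn.attention.sdpa_kernel instead).
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([6, 12, 1024, 64])
```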
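The Open Datasets and Experiment Setup rows together specify fine-tuning pretrained GPT-2 models on the Hugging Face wikipedia (20220301.en) and bookcorpus datasets for 25,000 steps with batch size 6, Adam at 1e-4 or 5e-5, and no weight decay or learning-rate scheduler. A minimal sketch of that setup, assuming the transformers and datasets libraries; the tokenization and batching details below are assumptions, not the authors' released scripts.

```python
# Minimal fine-tuning sketch using the hyperparameters quoted above; batching and
# tokenization details are assumptions, not the authors' released code.
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)   # "small" model, lr 1e-4
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token                    # GPT-2 has no pad token

# One of the two datasets named in the paper; newer `datasets` releases may need
# different arguments to load it.
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Adam, no weight decay, no LR scheduler (per the Experiment Setup row).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)

def batches(dataset, batch_size=6, seq_len=1024):
    """Yield (batch_size, seq_len) token batches; 1024 is GPT-2's maximum context."""
    buf = []
    for example in dataset:
        ids = tokenizer(example["text"], truncation=True, max_length=seq_len,
                        padding="max_length", return_tensors="pt")["input_ids"]
        buf.append(ids)
        if len(buf) == batch_size:
            yield torch.cat(buf).to(device)
            buf = []

model.train()
for step, input_ids in zip(range(25_000), batches(wiki)):
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```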