Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Authors: Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical findings demonstrate that we can effectively prune up to 80% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to 2× increase in inference throughput and even greater memory savings. |
| Researcher Affiliation | Collaboration | ETH Zürich; ML, CSEM SA; University of Basel |
| Pseudocode | No | No explicit pseudocode or algorithm block labeled 'Pseudocode' or 'Algorithm' was found. The paper provides mathematical equations and descriptions of the method. |
| Open Source Code | Yes | We have also released the code as part of the supplementary material, including scripts on how to reproduce our results. Additionally, trained models will be released for further research. |
| Open Datasets | Yes | We fine-tune pretrained GPT-2 models, that support a context size of up to 1024 tokens, on a subset of the English Wikipedia 20220301.en and English bookcorpus datasets. We keep a separate test set where we report perplexity after training. All models shown, for a fair comparison, were fine-tuned using the same lightweight training setup as described in Appendix A. The datasets are provided by huggingface at https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus respectively. |
| Dataset Splits | No | The paper mentions training on subsets of Wikipedia and BookCorpus datasets and keeping a separate test set, but it does not specify explicit train/validation/test splits (e.g., percentages or counts for a validation set) for the main language modeling task. |
| Hardware Specification | Yes | We measure throughput using the optimal batch size on an NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | The paper mentions 'flash-attention as provided by the scaled_dot_product_attention in pytorch-2.0', but does not list multiple key software components with their specific version numbers (e.g., Python, other libraries, CUDA). |
| Experiment Setup | Yes | We fine-tune pretrained models on a subset of the English Wikipedia 20220301.en and English bookcorpus datasets, for a total of 25000 steps with a batch size of 6. The datasets are provided by huggingface at https://huggingface.co/datasets/wikipedia and https://huggingface.co/datasets/bookcorpus respectively. We use a learning rate of 1e-4 for the small and medium models and 5e-5 for the large and xl models with the Adam optimizer. We do not use any weight decay or any scheduler for the learning rate. |
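
The fine-tuning settings quoted in the Experiment Setup row can be collected into a small configuration object, which may help when checking a reproduction attempt against the reported values. The sketch below is illustrative only: the class name `FineTuneConfig`, its field names, and the `sequences_seen` helper are hypothetical and do not come from the paper's released code. It also assumes the batch size of 6 is the global batch size with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass
from typing import Optional

# Learning rates quoted in the Experiment Setup row, keyed by GPT-2 model size.
LEARNING_RATES = {"small": 1e-4, "medium": 1e-4, "large": 5e-5, "xl": 5e-5}


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the reported fine-tuning settings."""

    model_size: str
    total_steps: int = 25_000            # "a total of 25000 steps"
    batch_size: int = 6                  # "a batch size of 6"
    optimizer: str = "adam"              # "the Adam optimizer"
    weight_decay: float = 0.0            # no weight decay is used
    lr_scheduler: Optional[str] = None   # no learning-rate scheduler is used
    max_context: int = 1024              # context size supported by GPT-2

    @property
    def learning_rate(self) -> float:
        # Raises KeyError for sizes the setup does not mention.
        return LEARNING_RATES[self.model_size]

    @property
    def sequences_seen(self) -> int:
        # Assumption: batch_size is global and there is no gradient accumulation.
        return self.total_steps * self.batch_size


if __name__ == "__main__":
    for size in LEARNING_RATES:
        cfg = FineTuneConfig(model_size=size)
        print(size, cfg.learning_rate, cfg.sequences_seen)
```

Keeping the learning rate as a lookup keyed by model size mirrors how the setup reports it (one value for small and medium, another for large and xl) and makes an unsupported size fail loudly rather than silently fall back to a default.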