Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Authors: Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results show that Twilight can adaptively prune up to 98% tokens with nearly no accuracy loss in both long- and medium-context scenarios, leading to a 1.4× speedup over state-of-the-art sparse attention mechanisms. (...) 5 Evaluation In this section, we perform quantitative experiments to demonstrate that equipping state-of-the-art (SOTA) sparse attention algorithms with Twilight could improve efficiency while preserving accuracy. We present the accuracy and efficiency results in Section 5.1 and Section 5.2, respectively. At last, we perform ablation studies in Section 5.3.
Researcher Affiliation	Academia	Tsinghua University; Massachusetts Institute of Technology; University of California, Berkeley
Pseudocode	Yes	Algorithm 1 Top-p via Binary Search.
Open Source Code	Yes	https://github.com/tsinghua-ideal/Twilight (...) Our code and scripts are released at https://github.com/tsinghua-ideal/Twilight.
Open Datasets	Yes	We evaluate Twilight on two types of benchmarks: long-context, which includes Longbench [1] and RULER [14], and medium-context (500 to 2k tokens), which includes GSM8K [4], COQA [33], and the perplexity on the PG-19 dataset [32].
Dataset Splits	No	The paper uses standard benchmark datasets like Longbench, RULER, GSM8K, COQA, and PG-19. While these benchmarks typically have predefined splits, the paper does not explicitly state the specific training/test/validation percentages or methodologies used for its experiments within the text.
Hardware Specification	Yes	We evaluate the efficiency of Twilight on both the self-attention operator and the end-to-end decoding stage on a single A100 GPU.
Software Dependencies	No	The paper mentions software like PyTorch's scaled-dot-product-attention (SDPA), FlashAttention-2 (FA2) [5], Memory Efficient Attention [21], FlashInfer [52], CUDA, and OpenAI Triton [39]. However, it does not provide specific version numbers for these software components, which are necessary for reproducible dependency information.
Experiment Setup	Yes	The hyperparameter p of Twilight is set to 0.95 for LLaMA-2/3 and 0.85 for Longchat, which will be explored in Section 5.3. (...) For DS, we use the optimized configurations tuned for each model provided by its official repository. (...) Following the baselines, we do not apply any sparse methods to the first two layers to ensure fair comparison. (...) we choose p = 0.85 for Longchat-7B-v1.5-32k.
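
The Pseudocode row above names "Algorithm 1 Top-p via Binary Search" but does not reproduce it. As a rough illustration of the general idea only, and not the paper's algorithm or its GPU implementation, the sketch below binary-searches a weight threshold so that the kept attention weights cover at least a fraction p of the total mass. The function names, the fixed iteration count, and the assumption that the weights are already normalized (for example, softmax output) are all hypothetical choices made for this example.

```python
def top_p_threshold(weights, p, iters=30):
    """Binary-search a threshold t so that the weights >= t sum to at least p.

    Assumption: `weights` are non-negative and sum to roughly 1.
    This is an illustrative sketch, not the paper's Algorithm 1.
    """
    lo, hi = 0.0, max(weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        # Mass retained if every weight below `mid` were pruned.
        mass = sum(w for w in weights if w >= mid)
        if mass >= p:
            lo = mid  # still enough mass: try a higher (more aggressive) threshold
        else:
            hi = mid  # pruned too much: lower the threshold
    return lo  # `lo` always satisfies the mass >= p condition


def top_p_select(weights, p):
    """Return the indices of the weights kept by the top-p threshold."""
    t = top_p_threshold(weights, p)
    return [i for i, w in enumerate(weights) if w >= t]


if __name__ == "__main__":
    example = [0.5, 0.3, 0.1, 0.05, 0.05]
    print(top_p_select(example, 0.85))
    print(top_p_select(example, 0.95))
```

With the example weights, p = 0.85 keeps the three largest entries, while p = 0.95 keeps all five because the two tied 0.05 entries are both needed to reach the target mass. Searching over a threshold rather than sorting the weights is what makes this formulation attractive for parallel hardware, since each step is only a masked sum.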