Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PASS: Pruning Attention Heads with Almost-sure Sparsity Targets

Authors: Dujian Ding, Ganesh Jawahar, Laks V. S. Lakshmanan

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on IWSLT14 German-to-English translation and GLUE benchmark tasks demonstrate that our approaches outperform the SOTA by achieving up to 1.33 higher BLEU scores, 1.44% higher accuracy, and 60% higher attention speedups.
Researcher Affiliation | Collaboration | Dujian Ding (EMAIL), Department of Computer Science, The University of British Columbia; Ganesh Jawahar (EMAIL), Google DeepMind; Laks V.S. Lakshmanan (EMAIL), Department of Computer Science, The University of British Columbia.
Pseudocode | No | The paper includes mathematical formulations and derivations but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codebase is available at https://github.com/DujianDing/PASS.
Open Datasets | Yes | We evaluate our methods with encoder-decoder (ED) Transformer models and BERT models on IWSLT14 German-to-English translation (Cettolo et al., 2014) and GLUE benchmark tasks (Wang et al., 2018).
Dataset Splits | No | The paper mentions training epochs for models (e.g., "30 training epochs for ED Transformer; 3 fine-tuning epochs for BERT-base") and references benchmark tasks, but does not explicitly provide specific training, validation, or test split percentages or counts.
Hardware Specification | Yes | All experiments are conducted on a high-performance compute cluster equipped with NVIDIA P100 GPUs (each with 12GB GPU RAM). Comparison results are summarized in Table 5, where we report the average latency achieved by subnetworks of K unpruned heads on CPU and GPU devices separately. On CPU-only devices, the subnetworks can achieve speed-ups with up to 48 heads unpruned out of 72. In contrast, on devices specialized for in-parallel matmul operations such as GPUs, the indexing approach may cause non-negligible overheads, and head pruning achieves speed-ups only with high-sparsity targets, such as when fewer than 8 heads are retained, as shown in Table 5. (Using a 72-head Encoder-Decoder Transformer on 2 Intel Broadwell CPUs @ 2.2GHz and 1 NVIDIA P100 Pascal GPU.)
Software Dependencies | No | We use the fairseq toolkit (Ott et al., 2019) to implement a 6-layer ED Transformer with 72 heads in total, and the Hugging Face codebase (Wolf et al., 2020) to implement a 12-layer BERT-base with 144 heads in total. The paper names these software tools but does not specify their version numbers.
Experiment Setup | Yes | Detailed hyper-parameter settings are in Appendix B. We test all methods on both architectures with target tasks (30 training epochs for ED Transformer; 3 fine-tuning epochs for BERT-base, as in Li et al. (2021)). Table 6: Hyper-parameters — ED Transformer: λbase = 1, λ0 = 2, learning rate for Φ = 0.2; BERT-base: λbase = 1e-5, λ0 = 1000, learning rate for Φ = 0.5. We choose #n_step as 1,000 in all experiments. λc is set to 0 during the first 20,000 iterations and the last 7,000 iterations when training ED Transformer models. For BERT-base models, λc is set to 0 except for iterations between 2,000 and 5,000. In our implementation, we empirically clip all ϕ_i's to the range [-5, 5] to prevent gradients from vanishing to excessively small values.
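The λc schedule and ϕ clipping quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the active-phase value of λc, and the `total_steps` parameter are hypothetical, not taken from the paper; only the warm-up/cool-down boundaries (20,000 / 7,000 for ED Transformer) and the [-5, 5] clipping range come from the quoted text.

```python
def lambda_c_schedule(step, total_steps, warmup=20_000, cooldown=7_000,
                      active_value=1.0):
    """Return the coefficient lambda_c for the current training step.

    Mirrors the quoted ED Transformer schedule: lambda_c is zeroed
    during the first `warmup` and last `cooldown` iterations, and set
    to `active_value` (an illustrative placeholder) in between.
    """
    if step < warmup or step >= total_steps - cooldown:
        return 0.0
    return active_value


def clip_phi(phis, bound=5.0):
    """Clip the gate parameters phi_i to [-bound, bound].

    The paper empirically clips to [-5, 5] to keep gradients from
    vanishing; plain lists are used here instead of tensors for brevity.
    """
    return [max(-bound, min(bound, p)) for p in phis]
```

For example, with a hypothetical 100,000-step run, `lambda_c_schedule(10_000, 100_000)` returns 0.0 (inside the warm-up window) while `lambda_c_schedule(50_000, 100_000)` returns the active value.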