Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PASS: Pruning Attention Heads with Almost-sure Sparsity Targets

Authors: Dujian Ding, Ganesh Jawahar, Laks V. S. Lakshmanan

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on IWSLT14 German-to-English translation and GLUE benchmark tasks demonstrate that our approaches outperform the SOTA by achieving up to 1.33 higher BLEU scores, 1.44% higher accuracy, and 60% higher attention speedups.
Researcher Affiliation | Collaboration | Dujian Ding (EMAIL), Department of Computer Science, The University of British Columbia; Ganesh Jawahar (EMAIL), Google DeepMind; Laks V.S. Lakshmanan (EMAIL), Department of Computer Science, The University of British Columbia.
Pseudocode | No | The paper includes mathematical formulations and derivations but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codebase is available at https://github.com/DujianDing/PASS.
Open Datasets | Yes | We evaluate our methods with encoder-decoder (ED) Transformer models and BERT models on IWSLT14 German-to-English translation (Cettolo et al., 2014) and GLUE benchmark tasks (Wang et al., 2018).
Dataset Splits | No | The paper mentions training epochs for models (e.g., "30 training epochs for ED Transformer; 3 fine-tuning epochs for BERT-base") and references benchmark tasks, but does not explicitly provide specific training, validation, or test split percentages or counts.
Hardware Specification | Yes | All experiments are conducted on a high-performance compute cluster equipped with NVIDIA P100 GPUs (each with 12GB GPU RAM). Comparison results are summarized in Table 5, where we report the average latency achieved by subnetworks of K unpruned heads on CPU and GPU devices separately. On CPU-only devices, the subnetworks can achieve speed-ups with up to 48 heads unpruned out of 72. In contrast, on devices specialized for in-parallel matmul operations such as GPUs, the indexing approach may cause non-negligible overheads, and head pruning achieves speed-ups only with high-sparsity targets, such as when fewer than 8 heads are retained, as shown in Table 5. (Using a 72-head Encoder-Decoder Transformer on 2 Intel Broadwell CPUs @ 2.2GHz and 1 NVIDIA P100 Pascal GPU.)
Software Dependencies | No | We use the fairseq toolkit (Ott et al., 2019) to implement a 6-layer ED Transformer with 72 heads in total, and the Hugging Face codebase (Wolf et al., 2020) to implement a 12-layer BERT-base with 144 heads in total. The paper names these software tools but does not specify their version numbers.
Experiment Setup | Yes | Detailed hyper-parameter settings are in Appendix B. We test all methods on both architectures with target tasks (30 training epochs for ED Transformer; 3 fine-tuning epochs for BERT-base, as in Li et al. (2021)). Table 6: Hyper-parameters — ED Transformer: λbase = 1, λ0 = 2, learning rate for Φ = 0.2; BERT-base: λbase = 1e-5, λ0 = 1000, learning rate for Φ = 0.5. We choose #n_step as 1,000 in all experiments. λc is set to 0 during the first 20,000 iterations and the last 7,000 iterations when training ED Transformer models. For BERT-base models, λc is set to 0 except for iterations between 2,000 and 5,000. In our implementation, we empirically clip all ϕ_i's to the range [-5, 5] to prevent gradients from vanishing to excessively small values.
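The λc schedule and ϕ clipping quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the active-phase value of λc, and the `total_steps` parameter are hypothetical, not taken from the paper; only the warm-up/cool-down boundaries (20,000 / 7,000 for ED Transformer) and the [-5, 5] clipping range come from the quoted text.

```python
def lambda_c_schedule(step, total_steps, warmup=20_000, cooldown=7_000,
                      active_value=1.0):
    """Return the coefficient lambda_c for the current training step.

    Mirrors the quoted ED Transformer schedule: lambda_c is zeroed
    during the first `warmup` and last `cooldown` iterations, and set
    to `active_value` (an illustrative placeholder) in between.
    """
    if step < warmup or step >= total_steps - cooldown:
        return 0.0
    return active_value


def clip_phi(phis, bound=5.0):
    """Clip the gate parameters phi_i to [-bound, bound].

    The paper empirically clips to [-5, 5] to keep gradients from
    vanishing; plain lists are used here instead of tensors for brevity.
    """
    return [max(-bound, min(bound, p)) for p in phis]
```

For example, with a hypothetical 100,000-step run, `lambda_c_schedule(10_000, 100_000)` returns 0.0 (inside the warm-up window) while `lambda_c_schedule(50_000, 100_000)` returns the active value.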