Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Multi-head Temporal Latent Attention
Authors: Keqi Deng, Phil Woodland
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. |
| Researcher Affiliation | Academia | Keqi Deng, Philip C. Woodland, Department of Engineering, University of Cambridge, Trumpington St., Cambridge, UK |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. The methodology is described through textual explanations and mathematical equations. |
| Open Source Code | Yes | The code is fully open-sourced: https://github.com/D-Keqi/mtla |
| Open Datasets | Yes | The ST task uses the MuST-C [16] v1.0 English-German (En-De) dataset, with data preprocessing following the Fairseq example. The text summarisation task is conducted on the XSum [31] dataset. For the ASR task, the AMI [8] dataset is employed. For the SLU task, the SLURP [5] dataset is used to evaluate intent classification. More details of the datasets used are given in Appendix C. ... The following licenses apply to the datasets used in this paper: CC-BY-NC-ND-4.0: https://spdx.org/licenses/CC-BY-NC-ND-4.0 applies to MuST-C data. CC-BY-SA-4.0: https://spdx.org/licenses/CC-BY-SA-4.0 applies to XSum data. CC BY 4.0: https://spdx.org/licenses/CC-BY-4.0 applies to AMI data. CC BY-NC 4.0: https://spdx.org/licenses/CC-BY-NC-4.0 applies to SLURP data. |
| Dataset Splits | Yes | The data set statistics for the datasets used in the experiments are shown in Table 7. The MuST-C [16] v1.0 En-De dataset comprises English-German speech translation data collected from TED Talks. ... Train set (train): duration 400.0 hours, German words 3880K. Test sets (dev / tst-COMMON): duration 2.3 / 4.1 hours, German words 26K / 44K. |
| Hardware Specification | Yes | All inference speed tests are conducted on the same NVIDIA RTX 6000 Ada GPU. ... Model training was performed on a single NVIDIA RTX 6000 Ada GPU with 48GB of memory. |
| Software Dependencies | No | The experiments are conducted using a Transformer-based decoder-only architecture, implemented within the Fairseq [33] toolkit. No specific version number for Fairseq or any other software dependencies is provided. |
| Experiment Setup | Yes | The decoder used for all tasks shares the same configuration: 9 layers, 512 attention dimensions, 2048 feed-forward dimensions, and 8 attention heads. ... For the ST task, training follows the Fairseq example, using a learning rate of 2e-3, 10,000 warm-up steps, and a maximum of 100,000 update steps. Each batch corresponds to 320,000 frames of Fbank features, which is approximately 53 minutes of speech. The MHA, MLA, and MTLA models all have 78M parameters. During inference, each batch corresponds to 50,000 frames of Fbank features, and the beam size is set to 50. |
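
The batch-size figures in the Experiment Setup row can be sanity-checked with a short calculation. The sketch below assumes a 10 ms Fbank frame shift, which is a common default but is not stated in the report; under that assumption, 320,000 frames comes to roughly 53 minutes, matching the quoted figure. The function and constant names are illustrative only.

```python
# Assumption: one Fbank frame every 10 ms (not stated in the report).
FRAME_SHIFT_SECONDS = 0.010


def frames_to_minutes(num_frames: int, frame_shift_s: float = FRAME_SHIFT_SECONDS) -> float:
    """Convert a number of feature frames to minutes of audio."""
    return num_frames * frame_shift_s / 60.0


# Batch sizes quoted in the Experiment Setup row.
train_batch_minutes = frames_to_minutes(320_000)
infer_batch_minutes = frames_to_minutes(50_000)

# Per-head dimension implied by 512 attention dimensions and 8 heads.
head_dim = 512 // 8

print(f"training batch:  {train_batch_minutes:.1f} min")
print(f"inference batch: {infer_batch_minutes:.1f} min")
print(f"per-head dimension: {head_dim}")
```

With a different frame shift, the minute counts scale proportionally, so the agreement with the quoted 53 minutes is only evidence for, not confirmation of, the 10 ms assumption.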