Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
BlockDecoder: Boosting ASR Decoders with Context and Merger Modules
Authors: Darshan Prabhu, Preethi Jyothi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | BLOCKDECODER, a novel decoder architecture comprising two distinct components: a text encoder that is purely text-based, and a MERGER that combines information from the audio encoder and text encoder to generate output tokens... As a result, BLOCKDECODER yields a significant speedup (~2x) compared to traditional decoders, across diverse datasets, languages, and speech tasks, without any degradation in performance. The paper details extensive experiments on diverse datasets such as Librispeech, Tedlium2, AISHELL, and Mozilla Common Voice, comparing performance metrics (WER, CER, RTF) against baselines in Tables 1 to 5 and presenting ablation studies. |
| Researcher Affiliation | Academia | Darshan Prabhu, Department of CSE, IIT Bombay; Preethi Jyothi, Department of CSE, IIT Bombay |
| Pseudocode | No | The paper describes the architecture and computations mathematically and in text (e.g., Section 3.2.1 for Text Encoder and Section 3.2.2 for MERGER computations), but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. Figure 2 provides a schematic overview of the architecture. |
| Open Source Code | Yes | The code is available at https://github.com/csalt-research/blockdecoder. |
| Open Datasets | Yes | Datasets. We show experiments on two tasks: ASR and Spoken Language Understanding (SLU). For ASR, we use: (1) Librispeech [38] consisting of 1000 hours of English read audiobooks with 100-hour and 960-hour training splits, (2) Tedlium2 [39] consisting of 200 hours of TED talk recordings, (3) Aishell [40] containing 170 hours of Mandarin Chinese speech data, and (4) Mozilla Common Voice [41], a multilingual dataset with durations ranging from 10 to 2500 hours per language. For SLU, we use the SLURP corpus [42], a 60-hour multi-domain English dataset evaluated for intent classification and entity recognition. |
| Dataset Splits | Yes | Datasets. We show experiments on two tasks: ASR and Spoken Language Understanding (SLU). For ASR, we use: (1) Librispeech [38] consisting of 1000 hours of English read audiobooks with 100-hour and 960-hour training splits, (2) Tedlium2 [39] consisting of 200 hours of TED talk recordings, (3) Aishell [40] containing 170 hours of Mandarin Chinese speech data, and (4) Mozilla Common Voice [41], a multilingual dataset with durations ranging from 10 to 2500 hours per language. We select five languages with training data spanning 100 to 400 hours. For SLU, we use the SLURP corpus [42], a 60-hour multi-domain English dataset evaluated for intent classification and entity recognition. Tables 1, 2, 4, and 5 report results on the 'Test Clean' and 'Test Other' splits for Librispeech, 'Test' for Tedlium2, AISHELL, and MCV, and 'Test Acc.' and 'SLU-F1' for SLURP, indicating the use of standard, well-defined splits for these benchmark datasets. |
| Hardware Specification | Yes | All our experiments are conducted using the ESPnet toolkit [43] on NVIDIA A100 and A6000 GPUs. |
| Software Dependencies | No | All our experiments are conducted using the ESPnet toolkit [43] on NVIDIA A100 and A6000 GPUs. While ESPnet is mentioned, no specific version number for the toolkit or any other key software libraries (like PyTorch, TensorFlow, CUDA) is provided, making exact replication difficult without knowing the specific software environment. |
| Experiment Setup | Yes | Implementation Details. All our experiments are conducted using the ESPnet toolkit [43] on NVIDIA A100 and A6000 GPUs. Across all experiments, we apply 3-way speed perturbation with ratios {0.9,1.0,1.1}, along with SpecAugment [44]. Our experimental setup follows the recommended configurations in ESPnet recipes. Across all experiments, we employ standard efficient inference techniques such as KV-caching and Automatic Mixed Precision (AMP). For SLU experiments, we first train the model with an ASR objective, where the output label sequences are sentences with intent and entity-related tags. Then, during inference, we first decode the sequence as in ASR, and then compute SLU metrics by parsing the decoded output. A detailed summary of the hyperparameters used for each dataset is available in Appendix F. |
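
The Experiment Setup row mentions 3-way speed perturbation with ratios {0.9, 1.0, 1.1}. As a rough illustration of what that augmentation does, the sketch below resamples a waveform once per ratio, producing three copies of each utterance. It is not the ESPnet implementation or the authors' code: the function names `speed_perturb` and `three_way_speed_perturb` are hypothetical, and the linear-interpolation resampler is a simple stand-in chosen so the example needs only the Python standard library.

```python
from typing import Dict, List, Sequence


def speed_perturb(samples: Sequence[float], ratio: float) -> List[float]:
    """Return `samples` played `ratio` times faster.

    Assumption: plain linear interpolation stands in for a real resampler.
    A ratio above 1 shortens the signal; a ratio below 1 lengthens it.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    n_in = len(samples)
    if n_in == 0:
        return []
    n_out = max(1, int(round(n_in / ratio)))
    out: List[float] = []
    for i in range(n_out):
        pos = i * ratio          # fractional index into the input signal
        lo = int(pos)
        if lo >= n_in - 1:       # past the last interval: hold the final sample
            out.append(float(samples[-1]))
            continue
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[lo + 1])
    return out


def three_way_speed_perturb(
    samples: Sequence[float],
    ratios: Sequence[float] = (0.9, 1.0, 1.1),  # ratios quoted in the table
) -> Dict[float, List[float]]:
    """Build one perturbed copy of the waveform per speed ratio."""
    return {r: speed_perturb(samples, r) for r in ratios}


if __name__ == "__main__":
    waveform = [float(i) for i in range(100)]  # toy ramp, not real audio
    for ratio, copy in three_way_speed_perturb(waveform).items():
        print(ratio, len(copy))
```

With three ratios the training set grows to roughly three times its original number of utterances, and the copy at ratio 1.0 is the unmodified signal.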