Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

BlockDecoder: Boosting ASR Decoders with Context and Merger Modules

Authors: Darshan Prabhu, Preethi Jyothi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	BLOCKDECODER, a novel decoder architecture comprising two distinct components: a text encoder that is purely text-based, and a MERGER that combines information from the audio encoder and text encoder to generate output tokens... As a result, BLOCKDECODER yields a significant speedup (~2x) compared to traditional decoders, across diverse datasets, languages, and speech tasks, without any degradation in performance. The paper details extensive experiments on diverse datasets like Librispeech, Tedlium2, AISHELL, and Mozilla Common Voice, comparing performance metrics (WER, CER, RTF) against baselines in tables (e.g., Tables 1, 2, 3, 4, and 5) and presenting ablation studies.
Researcher Affiliation	Academia	Darshan Prabhu, Department of CSE, IIT Bombay, EMAIL; Preethi Jyothi, Department of CSE, IIT Bombay, EMAIL
Pseudocode	No	The paper describes the architecture and computations mathematically and in text (e.g., Section 3.2.1 for Text Encoder and Section 3.2.2 for MERGER computations), but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. Figure 2 provides a schematic overview of the architecture.
Open Source Code	Yes	The code is available at https://github.com/csalt-research/blockdecoder.
Open Datasets	Yes	Datasets. We show experiments on two tasks: ASR and Spoken Language Understanding (SLU). For ASR, we use: (1) Librispeech [38] consisting of 1000 hours of English read audiobooks with 100-hour and 960-hour training splits, (2) Tedlium2 [39] consisting of 200 hours of TED talk recordings, (3) Aishell [40] containing 170 hours of Mandarin Chinese speech data, and (4) Mozilla Common Voice [41], a multilingual dataset with durations ranging from 10 to 2500 hours per language. For SLU, we use the SLURP corpus [42], a 60-hour multi-domain English dataset evaluated for intent classification and entity recognition.
Dataset Splits	Yes	Datasets. We show experiments on two tasks: ASR and Spoken Language Understanding (SLU). For ASR, we use: (1) Librispeech [38] consisting of 1000 hours of English read audiobooks with 100-hour and 960-hour training splits, (2) Tedlium2 [39] consisting of 200 hours of TED talk recordings, (3) Aishell [40] containing 170 hours of Mandarin Chinese speech data, and (4) Mozilla Common Voice [41], a multilingual dataset with durations ranging from 10 to 2500 hours per language. We select five languages with training data spanning 100 to 400 hours. For SLU, we use the SLURP corpus [42], a 60-hour multi-domain English dataset evaluated for intent classification and entity recognition. Tables 1, 2, 4, and 5 report results on the 'Test Clean' and 'Test Other' splits for Librispeech, the 'Test' split for Tedlium2, AISHELL, and MCV, and 'Test Acc.' and 'SLU-F1' for SLURP, indicating the use of standard, well-defined splits for these benchmark datasets.
Hardware Specification	Yes	All our experiments are conducted using the ESPnet toolkit [43] on NVIDIA A100 and A6000 GPUs.
Software Dependencies	No	All our experiments are conducted using the ESPnet toolkit [43] on NVIDIA A100 and A6000 GPUs. While ESPnet is mentioned, no specific version number for the toolkit or any other key software libraries (like PyTorch, TensorFlow, CUDA) is provided, making exact replication difficult without knowing the specific software environment.
Experiment Setup	Yes	Implementation Details. All our experiments are conducted using the ESPnet toolkit [43] on NVIDIA A100 and A6000 GPUs. Across all experiments, we apply 3-way speed perturbation with ratios {0.9, 1.0, 1.1}, along with SpecAugment [44]. Our experimental setup follows the recommended configurations in ESPnet recipes. Across all experiments, we employ standard efficient inference techniques such as KV-caching and Automatic Mixed Precision (AMP). For SLU experiments, we first train the model with an ASR objective, where the output label sequences are sentences with intent and entity-related tags. Then, during inference, we first decode the sequence as in ASR, and then compute SLU metrics by parsing the decoded output. A detailed summary of the hyperparameters used for each dataset is available in Appendix F.
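
The responses quoted above refer to WER, CER, and RTF without defining them. The sketch below illustrates the usual definitions: word and character error rate as edit distance divided by reference length, and real-time factor as decoding time divided by audio duration. It is not taken from the paper or from ESPnet. The function names are hypothetical, and the text normalization (whitespace tokenization for WER, spaces removed for CER) is an assumption that may differ from the scoring used in the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, deletions all cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_item == hyp_item else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes whitespace tokenization (hypothetical choice)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes spaces are ignored (hypothetical choice)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def rtf(decode_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1 mean decoding is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return decode_seconds / audio_seconds
```

Under these definitions, a roughly 2x decoding speedup with unchanged accuracy would appear as a lower RTF at the same WER or CER. Note that wall-clock RTF depends on hardware and batch size, so values are only comparable when measured in the same environment.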