Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Overcoming Non-monotonicity in Transducer-based Streaming Generation

Authors: Zhengrui Ma, Yang Feng, Min Zhang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our MonoAttn-Transducer effectively handles non-monotonic alignments in streaming scenarios, offering a robust solution for complex generation tasks. We conduct experiments on both speech-to-text and speech-to-speech simultaneous translation to demonstrate the generality of our approach across various modalities.
Researcher Affiliation | Academia | 1Institute of Computing Technology, Chinese Academy of Sciences 2University of Chinese Academy of Sciences 3School of Future Science and Engineering, Soochow University. Correspondence to: Zhengrui Ma <EMAIL>, Yang Feng (Corresponding Author) <EMAIL>.
Pseudocode | Yes | Algorithm 1 Training Algorithm of MonoAttn-Transducer
Open Source Code | Yes | Code is available at https://github.com/ictnlp/MonoAttn-Transducer.
Open Datasets | Yes | We conduct experiments on two language pairs of the MuST-C speech-to-text translation datasets: English to German (En-De) and English to Spanish (En-Es) (Di Gangi et al., 2019). For speech-to-speech experiments, we evaluate models on the CVSS-C French to English (Fr-En) dataset (Jia et al., 2022).
Dataset Splits | No | The paper mentions using specific datasets (MuST-C, CVSS-C) and describes partitioning the *test set* into easy, medium, and hard subsets for analysis, but it does not explicitly provide the train/validation/test splits (e.g., percentages or exact counts) for the entire datasets needed to reproduce the experimental setup.
Hardware Specification | Yes | Empirically, we found MonoAttn-Transducer is 1.33 times slower than the Transducer baseline with the same configuration on an Nvidia L40 GPU. ...the peak memory usage of the Transducer baseline is 28GB, while MonoAttn-Transducer exhibits a slightly higher peak usage of 32GB when the total number of source frames is fixed at 40,000 on a single Nvidia L40 GPU.
Software Dependencies | Yes | We use the SimulEval toolkit (Ma et al., 2020b) for evaluation. ... We use SimulEval v1.1.4 for evaluation in all the experiments.
Experiment Setup | Yes | The speech encoder consists of two layers of causal 2D convolution followed by 16 chunk-wise Transformer layers with pre-norm. Each convolution layer has a 3x3 kernel with 64 channels and a stride size of 2, resulting in a downsampling ratio of 4. ... The chunk size is adjusted within the set {320, 640, 960, 1280} ms. ... The predictor comprises two autoregressive Transformer layers with post-norm... All Transformer layers described above are configured with a 512 embedding dimension, 8 attention heads, and a 2048 FFN dimension. ... We set the dropout rate to 0.1, weight decay to 0.01, and clip gradient norms exceeding 5.0. The dropout rates for activation and attention are both set to 0.1. The pretraining spans 50k updates with a batch size of 160k tokens. The learning rate gradually warms up to 5e-4 within 4k steps. Finetuning involves training for 20k updates... we optimize models using the Adam optimizer (Kingma & Ba, 2015).
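The encoder configuration quoted in the Experiment Setup row can be sketched in PyTorch. This is a minimal reconstruction for illustration, not the authors' code: the mel-bin count (80), the zero padding, and the omission of causal masking and the chunk-wise attention mask are all assumptions on my part. It shows how two stride-2 convolutions yield the stated 4x temporal downsampling, and how the quoted Transformer dimensions (512 embedding, 8 heads, 2048 FFN, pre-norm) map onto a standard layer.

```python
import torch
import torch.nn as nn


class ConvSubsampler(nn.Module):
    """Two 3x3, stride-2 convolutions with 64 channels, giving the 4x
    temporal downsampling described in the setup. The paper specifies
    *causal* convolutions; symmetric padding is used here for brevity."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, mel_bins)
        y = self.conv(x.unsqueeze(1))          # (B, C, T/4, F/4)
        b, c, t, f = y.shape
        return y.permute(0, 2, 1, 3).reshape(b, t, c * f)


# One Transformer layer with the quoted dimensions (pre-norm, 512-dim
# embeddings, 8 heads, 2048 FFN, dropout 0.1). The paper stacks 16 of
# these chunk-wise; restricting attention to {320, 640, 960, 1280} ms
# chunks would be done via an attention mask, omitted here.
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, norm_first=True, batch_first=True,
)

feats = torch.randn(2, 400, 80)   # 80 mel bins is an assumed feature size
sub = ConvSubsampler()(feats)
print(sub.shape)                  # (2, 100, 1280): 400 frames -> 100 (4x)
```

A projection from the subsampler's output width (64 channels x 20 reduced mel bins = 1280) down to the 512-dim Transformer width would sit between the two modules; the paper does not spell that detail out, so it is left implicit here.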