Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation
Authors: Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive evaluation on six language pairs, our 500M-parameter speech-to-text model outperforms the Seamless baseline, achieving under 7% BLEU degradation at 1.5 s average lag and under 3% at 3 s. We further demonstrate SimulMEGA's versatility by extending it to streaming TTS with a unidirectional backbone, yielding superior latency-quality trade-offs. |
| Researcher Affiliation | Academia | Chenyang Le EMAIL Bing Han EMAIL Jinshun Li EMAIL Songyong Chen EMAIL Yanmin Qian EMAIL Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China |
| Pseudocode | No | The paper does not contain any explicit pseudocode blocks or algorithms. It includes architectural diagrams in Figure 1 and Figure 2, but no structured algorithmic steps. |
| Open Source Code | Yes | The code can be found in https://github.com/nethermanpro/simulmega. |
| Open Datasets | Yes | We collect various open-source speech recognition datasets covering the six languages, including LibriSpeech [31], Multilingual LibriSpeech (MLS) [32], VoxPopuli [33], Common Voice [34], WenetSpeech [35], KeSpeech [36] and Emilia [37]. Then we create pseudo translation labels for these data by translating the transcription into different languages through a cloud text-to-text translation API. The dataset consists of approximately 100K hours of training data. For TTS experiment, we use a subset of ST training set that only contains Chinese and English data due to the language compatibility of the CosyVoice 2. For S2TT evaluation, we employ case-sensitive BLEU [40] with punctuation as our primary quality metric and average lagging (AL) [43] for latency assessment. Our evaluation spans two benchmark datasets: CoVoST2 [44] and Fleurs [45]. |
| Dataset Splits | No | The paper describes the specific datasets used for training and evaluation (100K hours of training data, Co Vo ST2, Fleurs) and how evaluations are reported across language pairs, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for the 100K hours of training data they compiled, nor does it provide file names or URLs for custom splits. |
| Hardware Specification | Yes | Each experiment is trained in FP16 with 8 Nvidia H800 GPUs. [...] The evaluation is performed on a fully idle machine with a single Nvidia-H100 GPU. |
| Software Dependencies | No | The paper mentions several software tools and models used (e.g., sacrebleu, Whisper-Large-V3, Paraformer, AdamW optimizer, vLLM, TensorRT) but does not provide specific version numbers for these software dependencies, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | We train the offline model for 1M steps, which takes around 1 week. In stage 2 training, the chunk-AR blocks are frozen. The router and MoE Refiner module are randomly initialized. The stage 2 training takes about 2 days. We use AdamW optimizer [39] and linear learning rate scheduler with 5000 steps of warmup. The maximum learning rate is 1e-4 in stage 1 training and 1e-5 in stage 2 training. In our experiment, we set w_r = w_p = 0.2 to prioritize the offline training. [...] In our implementation, we apply zero-mean unit-variance Gaussian noise (σ_R = 1) to the pre-Sigmoid logits. [...] where we set a small normalization weight w_n = 0.01. [...] the MoE refiner consists of N_refiner = 6 layers, with the hidden size matching that of the base model. [...] we employ Low-Rank Adaptation (LoRA) [38] (α = 64) at the chunk-AR blocks of the encoder. |
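
The Experiment Setup row quotes a linear learning rate scheduler with 5000 warmup steps and zero-mean unit-variance Gaussian noise added to the pre-Sigmoid logits. The sketch below illustrates one plausible reading of those two settings in plain Python. It is not taken from the authors' repository: the function names `linear_warmup_lr` and `noisy_sigmoid_gate` are hypothetical, and the linear decay to zero over `total_steps` is an assumption, since the quoted text does not describe the schedule after warmup.

```python
import math
import random


def linear_warmup_lr(step, max_lr, warmup_steps=5000, total_steps=1_000_000):
    """Return the learning rate at `step`.

    Ramps linearly from 0 to `max_lr` over `warmup_steps`, then decays
    linearly to 0 at `total_steps`. The decay shape and `total_steps` are
    assumptions; the source only states "linear learning rate scheduler
    with 5000 steps of warmup".
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return max_lr * remaining / (total_steps - warmup_steps)


def noisy_sigmoid_gate(logit, sigma=1.0, rng=random):
    """Add zero-mean Gaussian noise with std `sigma` to a logit, then apply a sigmoid.

    `rng` is any object with a `gauss(mu, sigma)` method, for example the
    `random` module or a seeded `random.Random` instance.
    """
    noisy = logit + rng.gauss(0.0, sigma)
    return 1.0 / (1.0 + math.exp(-noisy))


if __name__ == "__main__":
    # Stage 1 uses a maximum learning rate of 1e-4 according to the table.
    for step in (0, 2500, 5000, 500_000, 1_000_000):
        print(step, linear_warmup_lr(step, max_lr=1e-4))

    # Seeded generator so the noisy gate values are repeatable across runs.
    rng = random.Random(0)
    print([round(noisy_sigmoid_gate(0.0, sigma=1.0, rng=rng), 3) for _ in range(3)])
```

For stage 2, the same schedule function would be called with `max_lr=1e-5`; the appropriate `total_steps` for that stage is not given in the quoted text and would have to be taken from the paper or the released code.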