Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise.
Researcher Affiliation | Collaboration | Umberto Cappellazzo (Imperial College London); Minsu Kim (Meta AI); Pingchuan Ma (Meta AI); Honglie Chen (Meta AI); Xubo Liu (Meta AI); Stavros Petridis (Imperial College London, NatWest AI Research); Maja Pantic (Imperial College London, NatWest AI Research)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical formulations for the MoME module in Section 3.2 but not a step-by-step algorithm.
Open Source Code | No | The code will be publicly available upon acceptance and the training data is described in Section 4.1.
Open Datasets | Yes | We train and evaluate MoME on LRS2 [40] and LRS3 [41] datasets. LRS2 includes 225 hours of video clips. LRS3 contains 433 hours of transcribed English video clips.
Dataset Splits | Yes | We train and evaluate MoME on LRS2 [40] and LRS3 [41] datasets. For a fair comparison with previous methods, we apply the same compression rates as in [23].
Hardware Specification | Yes | measured on an NVIDIA L40 46GB GPU
Software Dependencies | No | The paper mentions using pre-trained models like Whisper [83], AV-HuBERT [73], and Llama 3 family [92], and the AdamW optimizer. However, it does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup | Yes | We train our model for 10 epochs with the AdamW optimizer with cosine annealing scheduler and weight decay set to 0.1 using NVIDIA H200 GPUs. The learning rate is set to 1e-3 for ASR and AVSR tasks, and 5e-4 for VSR. For decoding, we use beam search with a beam width of 15 and temperature of 0.6.
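
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration, together with a generic cosine annealing schedule. This is only a sketch and not the authors' code, which was not released at the time of scoring: the names `TrainConfig` and `cosine_lr` are hypothetical, and the minimum learning rate of 0, the absence of warmup, and the per-step (rather than per-epoch) schedule are assumptions that the quoted text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    task: str                   # "asr", "avsr", or "vsr"
    epochs: int = 10            # reported number of training epochs
    weight_decay: float = 0.1   # reported AdamW weight decay
    beam_width: int = 15        # reported beam search width
    temperature: float = 0.6    # reported decoding temperature

    @property
    def base_lr(self) -> float:
        # Reported: 1e-3 for ASR and AVSR, 5e-4 for VSR.
        return 5e-4 if self.task == "vsr" else 1e-3


def cosine_lr(cfg: TrainConfig, step: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Standard cosine annealing from cfg.base_lr down to min_lr.

    min_lr=0.0 and the lack of warmup are assumptions, not values from the paper.
    """
    # Clamp the step so the schedule stays flat outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (cfg.base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(TrainConfig("vsr"), step, total_steps)` starts at the reported 5e-4 and decays toward the assumed minimum; `total_steps` depends on batch size and dataset size, neither of which is given in the quoted setup.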