Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Authors: Sungnyun Kim, Kangwook Jang, Sangmin Bae, Sungwoo Cho, Se-Young Yun

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The paper's experimental sections (5. Experiments and Results; 5.1 Implementation Details; 5.2 Computation Cost; 5.3 Robust AVSR Benchmark Results; 5.4 Expert and Group Load Analysis; 5.5 Multilingual Audio-Visual Speech Tasks) support this classification. Table 2 presents MoHAVE's robust performance on the AVSR benchmark under diverse noisy conditions, demonstrating exceptional robustness across different noise types and SNR levels: N-WER of 5.8% for BASE and 4.5% for LARGE.
Researcher Affiliation Academia 1 KAIST AI, Republic of Korea; 2 School of Electrical Engineering, KAIST, Republic of Korea. Correspondence to: Se-Young Yun <EMAIL>.
Pseudocode No The paper describes the methodology using narrative text and mathematical formulations (e.g., equations 1-5, 7-15) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets Yes For the robust AVSR benchmark, we utilize the LRS3 dataset (Afouras et al., 2018b)... We use 8 languages (excluding English) for multilingual AVSR and 6 languages for X-to-English audio-visual speech-to-text translation (AVS2TT) tasks. We assess the models using WER for transcription and the BLEU score (Papineni et al., 2002) for translation. For multilingual evaluations, the MuAViC dataset (Anwar et al., 2023) is used, featuring 1,200 hours of audio-visual content from 8,000+ speakers across 9 languages, sourced from LRS3-TED (Afouras et al., 2018b) and mTEDx (Elizabeth et al., 2021). We extract audio noise samples from the MUSAN (Snyder et al., 2015) dataset, targeting different noise types such as babble, music, and natural noises, along with speech noise from LRS3. LRS3 is further augmented with realistic background audio from the DEMAND dataset (Thiemann et al., 2013).
Dataset Splits Yes Following the experimental setup of Shi et al. (2022b), we extract audio noise samples from the MUSAN (Snyder et al., 2015) dataset, targeting different noise types such as babble, music, and natural noises, along with speech noise from LRS3. These noises are randomly augmented into the audio data, corrupting 25% of the training set with a signal-to-noise ratio (SNR) sampled from N(0, 5). We measure performance using the word error rate (WER), primarily under noisy conditions with SNRs of {-10, -5, 0, 5, 10} dB, specifically N-WER (Kim et al., 2024), which highlights the significance of visual cues in noise-corrupted environments. We use 8 languages (excluding English) for multilingual AVSR and 6 languages for X-to-English audio-visual speech-to-text translation (AVS2TT) tasks, ensuring no speaker overlap between training and test sets.
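The quoted augmentation recipe (corrupt 25% of training utterances at an SNR drawn from N(0, 5), read here as mean 0 dB and standard deviation 5 dB) can be sketched as below. This is a minimal illustration under those assumptions; the function and variable names are not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Gain g such that 10*log10(p_speech / (g^2 * p_noise)) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

def augment_batch(waveforms, noise_pool, corrupt_prob=0.25, snr_mean=0.0, snr_std=5.0):
    """Corrupt a fraction of utterances with noise at SNR ~ N(snr_mean, snr_std) dB."""
    out = []
    for wav in waveforms:
        if rng.random() < corrupt_prob:
            noise = noise_pool[rng.integers(len(noise_pool))]
            snr_db = rng.normal(snr_mean, snr_std)
            wav = mix_at_snr(wav, noise, snr_db)
        out.append(wav)
    return out
```

In practice the noise pool would hold MUSAN babble/music/natural clips plus LRS3 speech noise, as the quote describes.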
Hardware Specification No The paper discusses computational cost in terms of FLOPs (Table 1) but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions using the Adam optimizer (Kingma, 2014) but does not provide specific software dependencies or library versions (e.g., Python, PyTorch, TensorFlow versions) used for implementation.
Experiment Setup Yes Our MoHAVE framework is developed in two configurations: BASE and LARGE. The BASE model consists of 12 Transformer (Vaswani et al., 2017) encoder layers and 6 decoder layers, while the LARGE model incorporates 24 encoder layers and 9 decoder layers. Both models' audio-visual encoders are derived from the AV-HuBERT-BASE/-LARGE models, pretrained on a noise-augmented corpus of LRS3 (Afouras et al., 2018b) + VoxCeleb2 (Chung et al., 2018). Our MoE implementation activates the top-2 out of 8 experts in every FFN layer within the decoder (Jiang et al., 2024), while the hierarchical architecture engages the top-1 expert from each audio and visual group. To facilitate expert group specialization, load biasing is used, with audio or video randomly dropped with 25% probability. We initialize our model using the pretrained checkpoint from (Shi et al., 2022a) and fine-tune it on the LRS3 train set for 120K steps. The encoder remains frozen for the first 90K steps, allowing only the AVSR decoder to be trained, after which the entire model is fine-tuned for the remaining 30K steps. Our fine-tuning setup follows the configurations from (Shi et al., 2022b). The Adam optimizer (Kingma, 2014) is used with a learning rate of 5e-4 and a polynomial decay schedule with an initial warmup. Each training step processes 8,000 audio-visual frames, equivalent to 320 seconds of speech data. For inference, we use beam search with a beam size of 50. Here, c_B and c_Z are set to 1e-2 and 1e-3, respectively, in line with (Fedus et al., 2022; Zoph et al., 2022), and c_S is also set to 1e-2.
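The hierarchical routing quoted above (experts split into an audio group and a visual group, a group-level router weighting the two groups, and a per-group router selecting the top-1 expert inside each group) can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation; all class and weight names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class HierarchicalMoEFFN:
    """Sketch of hierarchical expert routing: a group router weights the
    audio and visual expert groups, and each group's router picks its
    top-1 expert per token. Weight shapes and init are illustrative."""

    def __init__(self, d_model=16, d_ff=32, experts_per_group=4, n_groups=2, seed=0):
        rng = np.random.default_rng(seed)
        self.Wg = rng.standard_normal((d_model, n_groups)) * 0.02          # group router
        self.We = rng.standard_normal((n_groups, d_model, experts_per_group)) * 0.02  # per-group routers
        # Each expert is a two-layer ReLU FFN.
        self.W1 = rng.standard_normal((n_groups, experts_per_group, d_model, d_ff)) * 0.02
        self.W2 = rng.standard_normal((n_groups, experts_per_group, d_ff, d_model)) * 0.02

    def __call__(self, x):  # x: (tokens, d_model)
        group_w = softmax(x @ self.Wg)                    # (tokens, n_groups)
        out = np.zeros_like(x)
        for g in range(self.Wg.shape[1]):                 # g=0: audio, g=1: visual
            probs = softmax(x @ self.We[g])               # (tokens, experts_per_group)
            top = probs.argmax(axis=-1)                   # top-1 expert per token
            for e in range(probs.shape[1]):
                mask = top == e
                if mask.any():
                    h = np.maximum(x[mask] @ self.W1[g, e], 0.0)
                    out[mask] += (group_w[mask, g] * probs[mask, e])[:, None] * (h @ self.W2[g, e])
        return out
```

With 4 experts per group this matches the quoted budget of 2 active experts out of 8 total per token, one from each modality group.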