Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Objective Soups: Multilingual Multi-Task Modeling for Speech Processing
Authors: A F M Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on CoVoST v2, LibriSpeech, and AISHELL-1 reveal that a bi-level recipe separating recognition and translation tasks consistently outperforms standard flat optimization. Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models. |
| Researcher Affiliation | Collaboration | 1 Rensselaer Polytechnic Institute; 2 IBM Research; 3 The Chinese University of Hong Kong; 4 University of Rochester; 5 Cornell Tech |
| Pseudocode | Yes | Algorithm 1: VS-MSP for multilingual multi-task MSP. Algorithm 2: VC-MSP for multilingual multi-task MSP. Algorithm 3: VM-MSP for multilingual multi-task MSP. |
| Open Source Code | Yes | Our code has been released at https://github.com/afmsaif/Objective_Soups. |
| Open Datasets | Yes | Extensive experiments on CoVoST v2, LibriSpeech, and AISHELL-1 reveal that a bi-level recipe separating recognition and translation tasks consistently outperforms standard flat optimization. Dataset. We evaluate our training algorithms on a combined dataset of LibriSpeech [42], AISHELL v1 [7], and CoVoST v2 [57]. |
| Dataset Splits | Yes | Our approach involved splitting the LibriSpeech dataset, allocating 860 hours for self-supervised pre-training and using the 100-hour train-clean-100 subset for supervised training. The trained models are tested on the AISHELL test dataset and the LibriSpeech test-clean dataset. During training using the CoVoST dataset, we use equal batch sizes across all languages and tasks to ensure balanced training. For high-resource En, we use the full data without up-sampling, while applying up-sampling for low-resource languages: 4x for Ca and Es, and 2x for Fr and De. |
| Hardware Specification | Yes | All simulations were run on two NVIDIA A5000 GPUs and two NVIDIA A4500 GPUs, with an Intel i9-7920X CPU and 128 GB of DDR4 memory. |
| Software Dependencies | No | All models are trained with the AdamW optimizer, using a backbone learning rate of $\alpha = 5 \times 10^{-5}$ and a head/decoder learning rate of $\beta = 5 \times 10^{-4}$. For Conformer-based models, transcripts are tokenized with SentencePiece [31] using a 1,000-token word-level vocabulary for every language except Chinese, where we employ a 4,930-token character-level model. |
| Experiment Setup | Yes | All models are trained with the AdamW optimizer, using a backbone learning rate of $\alpha = 5 \times 10^{-5}$ and a head/decoder learning rate of $\beta = 5 \times 10^{-4}$. For VC-MSP, the bilevel penalty is initialized to $\eta = 0$ and increased by 0.02 each epoch. For VM-MSP, we set $\eta_1 = 0.1$ and $\eta_2 = 0$, with each being incremented by 0.02 per epoch. ... The SSL pre-training phase starts with a learning rate of $\alpha = 5 \times 10^{-4}$ for 100 epochs, annealed by a factor of 0.1 every 20 epochs. Fine-tuning uses a maximum learning rate of $\beta = 5 \times 10^{-5}$, with a scheduler reducing the learning rate by a factor of 0.1 if the test loss does not improve within 10 epochs. All multi-objective models (VS-MSP, VC-MSP, and VM-MSP) and joint PT+FT models are trained for 200 epochs. For PT+FT, we pre-train the model for 200 epochs and fine-tune it for an additional 100 epochs. A batch size of 256 and AdamW optimizer are used for both self-supervised and supervised training. |
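
The schedules quoted in the Dataset Splits and Experiment Setup rows can be written out as a minimal sketch. This is not the authors' implementation (see the released repository for that). The function names, the zero-based epoch indexing, and the reading of "increased by 0.02 each epoch" as a linear ramp starting from the initial value are all assumptions made for illustration, as is the choice of plain Python with no framework.

```python
# Illustrative sketch only. All names below are hypothetical, and the
# zero-based epoch indexing is an assumption, not taken from the paper.

def penalty_at_epoch(initial, epoch, increment=0.02):
    """Bilevel penalty after `epoch` epochs, assuming a linear ramp.

    The quoted setup uses initial=0 for VC-MSP, and initial=0.1 and
    initial=0 for the two VM-MSP penalties.
    """
    return initial + increment * epoch


def step_annealed_lr(base_lr, epoch, factor=0.1, every=20):
    """SSL pre-training rate: multiplied by `factor` every `every` epochs."""
    return base_lr * factor ** (epoch // every)


# Up-sampling factors quoted for CoVoST: En is used as-is,
# Ca and Es are repeated 4x, Fr and De 2x.
UPSAMPLING = {"En": 1, "Ca": 4, "Es": 4, "Fr": 2, "De": 2}


def upsample(examples, language):
    """Repeat a language's examples by its quoted up-sampling factor."""
    return list(examples) * UPSAMPLING[language]
```

Under these assumptions, `penalty_at_epoch(0.0, 10)` gives the VC-MSP penalty after ten epochs and `step_annealed_lr(5e-4, 45)` gives the pre-training rate after two annealing steps. The fine-tuning scheduler, which reduces the rate only when the test loss stalls for 10 epochs, depends on observed losses and is not modeled here.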