Mu²SLAM: Multitask, Multilingual Speech and Language Models

Authors: Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on the multilingual CoVoST AST (Wang et al., 2021b), VoxPopuli ASR (Wang et al., 2021a) and XTREME benchmarks show that our joint speech-text pre-trained models can achieve competitive results on both speech and text tasks. On CoVoST AST, Mu²SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points.
Researcher Affiliation | Industry | Google Research, Google LLC, USA. Correspondence to: Yong Cheng <chengyong@google.com>.
Pseudocode | No | The paper describes its approach and objectives using natural language and mathematical equations, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology, nor does it include any links to a code repository.
Open Datasets | Yes | Following mSLAM (Bapna et al., 2022), we use the same unlabeled speech data of approximately 429k hours in 51 languages. The mC4 dataset spanning 101 languages is used as unlabeled text data. ASR data come from VoxPopuli, MLS, Babel, CoVoST and FLEURS. We only have two sources for AST data, CoVoST and FLEURS. We collect MT data from WMT and TED.
Dataset Splits | No | The paper states that 'we pick the best model based on validation sets' and shows results on 'XNLI dev accuracy', indicating the use of validation sets from standard benchmarks. However, it does not explicitly provide specific percentages, sample counts, or detailed splitting methodology for these validation sets within the text.
Hardware Specification | Yes | These two models run on 256 TPUv4 chips for 1.5M steps. All fine-tuning experiments are conducted on 64 TPUv4 chips.
Software Dependencies | No | The paper mentions optimizers like 'Adam optimizer (Kingma & Ba, 2014)' and 'AdamW (Loshchilov & Hutter, 2017)', and metrics like 'SacreBLEU script (Post, 2018)'. However, it does not specify programming languages, libraries, or specific software versions (e.g., Python 3.x, TensorFlow 2.x) used for implementation.
Experiment Setup | Yes | The batch sizes per TPU for speech-only, text-only, AST, ASR and MT data are 4, 8, 1, 1 and 1. We mask approximately 50% of the speech frames with spans of length up to 10 (Chung et al., 2021). However, for text inputs, we mask a continuous span of around 50% of words, except for MT tasks where the mask ratio is 25%. The loss coefficients for speech-only and text-only data are set to 1. The loss coefficients for the text-to-speech and alignment tasks are 0.1, while speech-to-text tasks need a slightly higher loss coefficient of 0.3 for the decoder loss. For each fine-tuning task, we use grid search to tune the hyperparameters, including batch sizes per TPU over {2, 4, 8}, learning rates over {0.5, 1, 2, 3, 5}, dropout ratios for encoder inputs and the Transformer decoder over {0.1, 0.3}, and warm-up steps over {4k, 8k, 16k}. The warm-up steps are universally set to 16k. The pre-trained spm-level model favors a larger dropout of 0.3. For AST, ASR and MT, we randomly incorporate synonym noise into decoder inputs; the noise ratio is set to 0.06.
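
For readability, the pre-training settings and the fine-tuning grid search quoted above can be collected into a single configuration sketch. This is an illustrative summary of the reported values only; the dictionary layout, key names, and the iter_finetune_configs helper are presentation assumptions, not code from the paper.

```python
import itertools
import random

# Pre-training settings as quoted in the paper (values taken from the text above).
PRETRAIN_CONFIG = {
    "batch_size_per_tpu": {"speech_only": 4, "text_only": 8, "ast": 1, "asr": 1, "mt": 1},
    "mask": {
        "speech_frames_ratio": 0.5,   # ~50% of speech frames masked
        "speech_span_max_len": 10,    # spans of length up to 10
        "text_span_ratio": 0.5,       # continuous span of ~50% of words
        "text_span_ratio_mt": 0.25,   # MT tasks use a 25% mask ratio
    },
    "loss_coefficients": {
        "speech_only": 1.0,
        "text_only": 1.0,
        "text_to_speech": 0.1,
        "alignment": 0.1,
        "speech_to_text_decoder": 0.3,
    },
    "decoder_synonym_noise_ratio": 0.06,  # AST / ASR / MT decoder inputs
}

# Fine-tuning grid search reported in the paper.
FINETUNE_GRID = {
    "batch_size_per_tpu": [2, 4, 8],
    "learning_rate": [0.5, 1, 2, 3, 5],
    "dropout": [0.1, 0.3],                  # encoder inputs and Transformer decoder
    "warmup_steps": [4_000, 8_000, 16_000],  # paper notes 16k was used universally
}

def iter_finetune_configs(grid=FINETUNE_GRID):
    """Enumerate every hyperparameter combination in the grid (hypothetical helper)."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

if __name__ == "__main__":
    configs = list(iter_finetune_configs())
    print(f"{len(configs)} fine-tuning configurations in the grid")  # 3 * 5 * 2 * 3 = 90
    print(random.choice(configs))
```

Enumerating the grid this way simply makes explicit that the reported search spans 90 candidate settings per fine-tuning task; how the paper actually scheduled or pruned these runs is not specified.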