ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Authors: Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST 2 evaluation set. We have conducted a comprehensive ablation study on bridging the gap between speech and language representations, including tasks, losses, and strategies, as well as comparisons with previous works. We conduct experiments on the CoVoST 2 [36] dataset, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages.
Researcher Affiliation | Collaboration | Chenyang Le, Shanghai Jiao Tong University, Shanghai, China, nethermanpro@sjtu.edu.cn; Yao Qian, Microsoft Cloud and AI, Redmond, WA, US, yaoqian@microsoft.com; Long Zhou, Microsoft Research Asia, Beijing, China, lozhou@microsoft.com; Shujie Liu, Microsoft Research Asia, Beijing, China, shujliu@microsoft.com; Yanmin Qian, Shanghai Jiao Tong University, Shanghai, China, yanminqian@sjtu.edu.cn; Michael Zeng, Microsoft Cloud and AI, Redmond, WA, US, nzeng@microsoft.com; Xuedong Huang, Microsoft Cloud and AI, Redmond, WA, US, xdh@microsoft.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/nethermanpro/ComSL.
Open Datasets | Yes | We conduct experiments on the CoVoST 2 [36] dataset, a large-scale multilingual speech translation corpus... Mozilla Common Voice [3] (version 11), a large-scale multilingual ASR dataset from the same source as CoVoST 2, is used to extract data. (A hedged loading sketch follows the table.)
Dataset Splits | No | The paper mentions using a validation set to save the best checkpoint ('We save the checkpoint that has highest BLEU score on the validation set.'), but it does not specify the splits themselves: no exact percentages, sample counts, or splitting methodology for the training, validation, or test sets are given for reproducibility.
Hardware Specification | Yes | It takes about 3 days to train on 4*8 Nvidia Tesla V100 GPUs with 32 GB of memory each.
Software Dependencies | No | The paper mentions software such as DeepSpeed ZeRO [27], activation checkpointing [10], and sacreBLEU [25], but does not provide version numbers for these components. (A sacreBLEU usage sketch follows the table.)
Experiment Setup | Yes | We build two versions of the model, named ComSL Medium and ComSL Large; they differ in the size of the Whisper encoder used to initialize the speech Transformer blocks. ComSL Medium uses the Whisper medium encoder, which contains 24 layers of Transformer blocks with 1024 hidden dimensions and 16 heads. ComSL Large uses the Whisper large encoder, which contains 32 layers of Transformer blocks with 1280 hidden dimensions and 20 heads. Both versions have a two-layer convolution adapter and an mBART model initialized from the mbart-large-50-many-to-many-mmt checkpoint. During inference, we run beam search with beam size 5. The input length of an audio recording is limited to 11 seconds. (A structural sketch of the composite model follows the table.)
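
The Open Datasets row names CoVoST 2 and Common Voice but no loading recipe. Below is a minimal loading sketch using the Hugging Face `datasets` library; the `facebook/covost2` config name, the `fr_en` direction, and the need for a locally downloaded Common Voice archive are assumptions drawn from that loader, not from the paper.

```python
from datasets import load_dataset

# Hypothetical recipe: the Hub's facebook/covost2 loader builds CoVoST 2 on
# top of a manually downloaded Common Voice archive (the paper additionally
# draws ASR data from Common Voice version 11).
covost_fr_en = load_dataset(
    "facebook/covost2",
    "fr_en",                             # one of the 21 X->En directions
    data_dir="path/to/common_voice/fr",  # local Common Voice download
)
sample = covost_fr_en["train"][0]
print(sample["sentence"], "->", sample["translation"])
```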
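
The Software Dependencies row notes that sacreBLEU [25] is used without a pinned version. For reference, here is a minimal sketch of how corpus-level BLEU is typically computed with that library (the sentences are hypothetical placeholders):

```python
import sacrebleu  # version unpinned in the paper

hypotheses = ["the cat sits on the mat"]   # system outputs (placeholder)
references = [["the cat sat on the mat"]]  # one inner list per reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```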
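
The Experiment Setup row describes the composite layout: a Whisper encoder, a two-layer convolution adapter, and an mBART model. Below is a minimal structural sketch of that data flow written with Hugging Face Transformers. The kernel sizes, strides, activation, and the use of `inputs_embeds` for decoding are assumptions, not the paper's values; the authors' actual implementation is at https://github.com/nethermanpro/ComSL.

```python
import torch
import torch.nn as nn
from transformers import (MBart50TokenizerFast, MBartForConditionalGeneration,
                          WhisperModel)

# Speech Transformer blocks initialized from the Whisper medium encoder
# (24 layers, 1024 hidden dimensions, 16 heads), as in ComSL Medium.
speech_encoder = WhisperModel.from_pretrained("openai/whisper-medium").encoder

class ConvAdapter(nn.Module):
    """Two-layer convolution adapter bridging speech and text representations.
    Kernel size, stride, and activation are assumptions, not the paper's values."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)        # (batch, time, dim) -> (batch, dim, time)
        x = self.act(self.conv1(x))  # each stride-2 conv halves the time axis
        x = self.act(self.conv2(x))
        return x.transpose(1, 2)

adapter = ConvAdapter()
mbart = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")

@torch.no_grad()
def translate(input_features: torch.Tensor) -> torch.Tensor:
    # input_features: (batch, 80, 3000) log-mel features; the paper limits
    # the audio to 11 seconds, padded to Whisper's fixed 30-second window.
    speech_repr = speech_encoder(input_features).last_hidden_state
    adapted = adapter(speech_repr)
    # Hand the adapted speech representation to mBART in place of token
    # embeddings and decode with beam size 5, per the paper's inference setup.
    return mbart.generate(
        inputs_embeds=adapted,
        num_beams=5,
        forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    )
```

This sketch reflects only the inference-time data flow; the multi-task training losses and bridging strategies that the paper ablates are not shown.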