Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Authors: Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs. Adding the mined data achieves significant improvements in BLEU score on the CoVoST2 and MuST-C test sets with respect to a very competitive baseline. (A hedged BLEU-scoring sketch follows the table.)
Researcher Affiliation | Industry | Paul-Ambroise Duquenne, Facebook AI Research, padqn@fb.com; Hongyu Gong, Facebook AI Research, hygong@fb.com; Holger Schwenk, Facebook AI Research, schwenk@fb.com
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper states "All the mined multimodal corpora will be made freely available." but it does not explicitly state that the source code for the methodology described in the paper will be released, nor does it provide a link to such code.
Open Datasets | Yes | We use the Dev and Test set of the CoVoST2 corpus (Wang et al., 2020a), whose statistics are summarized in Table 1. ... CoVoST is a large-scale multilingual speech translation corpus based on Common Voice (Ardila et al., 2019). ... We used Librivox as our set of unlabeled speech data. Librivox is a repository of open-domain audio books in different languages. ... As English texts, we use five snapshots from Common Crawl as processed in CCNet (Wenzek et al., 2019). ... We further evaluate the quality of the mined data in En-xx directions using the MuST-C dataset (Di Gangi et al., 2019b).
Dataset Splits | Yes | We use the Dev and Test set of the CoVoST2 corpus (Wang et al., 2020a), whose statistics are summarized in Table 1. ... Table 1: Statistics of the CoVoST2 speech translation corpus used to train and evaluate the speech encoders (train/dev/test): En 430/26/25 h of audio, 289k/16k/16k sentences; De 184/21/22 h, 128k/14k/14k; Es 113/22/23 h, 79k/13k/13k; Fr 264/22/23 h, 207k/15k/15k; Ru 18/10/11 h, 12k/6k/6k. (These figures are restated as a lookup dict after the table.)
Hardware Specification | Yes | The learning rate to finetune the XLSR transformer is set to 10^-4, and training was performed on 24 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions software tools such as Flashlight (footnote 4) and fairseq (footnote 2, for wav2vec), and deep learning frameworks are implied by model names (e.g., PyTorch for fairseq), but it does not provide specific version numbers for these dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | The learning rate to finetune the XLSR transformer is set to 10^-4, and training was performed on 24 Tesla V100 GPUs. ... We tune layer norm and multi-head attention parameters on the train set in each language direction, while other model parameters are frozen during the fine-tuning stage. ... The S2T Transformer is trained on the combination of MuST-C and mined data for 200k steps and finetuned on MuST-C data only for 100k steps. (A selective fine-tuning sketch follows the table.)
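
The Research Type row reports BLEU gains on the CoVoST2 and MuST-C test sets. As a minimal, hedged sketch of how such a comparison is typically scored, the snippet below runs corpus-level BLEU with sacrebleu over hypothesis/reference text files; the file names and the assumption of detokenized, one-sentence-per-line output are illustrative and not taken from the paper.

```python
# Minimal sketch: corpus-level BLEU for a speech-translation system,
# as typically reported for CoVoST2 / MuST-C test sets.
# File names below are hypothetical placeholders.
import sacrebleu

def corpus_bleu(hyp_path: str, ref_path: str) -> float:
    """Read detokenized hypotheses and references (one sentence per line)
    and return the corpus-level BLEU score."""
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    assert len(hyps) == len(refs), "hypothesis/reference count mismatch"
    # sacrebleu takes a list of hypotheses and a list of reference streams
    return sacrebleu.corpus_bleu(hyps, [refs]).score

if __name__ == "__main__":
    baseline = corpus_bleu("baseline.de.hyp", "covost2.test.de.ref")     # hypothetical paths
    with_mined = corpus_bleu("mined.de.hyp", "covost2.test.de.ref")
    print(f"baseline BLEU: {baseline:.1f}  |  + mined data: {with_mined:.1f}")
```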
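The Table 1 figures quoted under Dataset Splits can be kept as a small lookup structure for sanity checks. The dictionary below only restates the numbers from that quote (hours of audio and sentence counts per split); it is not an official artifact of the paper.

```python
# CoVoST2 statistics as quoted from Table 1 of the paper, restated as a plain dict:
# language -> split -> (hours of audio, #sentences). Not an official artifact.
COVOST2_SPLITS = {
    "en": {"train": (430, 289_000), "dev": (26, 16_000), "test": (25, 16_000)},
    "de": {"train": (184, 128_000), "dev": (21, 14_000), "test": (22, 14_000)},
    "es": {"train": (113,  79_000), "dev": (22, 13_000), "test": (23, 13_000)},
    "fr": {"train": (264, 207_000), "dev": (22, 15_000), "test": (23, 15_000)},
    "ru": {"train": ( 18,  12_000), "dev": (10,  6_000), "test": (11,  6_000)},
}

def total_hours(split: str) -> int:
    """Sum audio hours over all languages for a given split."""
    return sum(stats[split][0] for stats in COVOST2_SPLITS.values())

if __name__ == "__main__":
    for split in ("train", "dev", "test"):
        print(f"{split}: {total_hours(split)} h of audio across En/De/Es/Fr/Ru")
```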
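The Experiment Setup row states that only the layer-norm and multi-head-attention parameters of the XLSR transformer are tuned (learning rate 10^-4) while all other parameters stay frozen. Below is a minimal PyTorch-style sketch of that selective fine-tuning, assuming name-based matching of modules; the substring checks and the placeholder `xlsr_model` are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of the selective fine-tuning described above: train only
# layer-norm and self-attention parameters at lr = 1e-4, freeze the rest.
# The name-based matching is an assumption about how the XLSR checkpoint
# labels its modules, not the authors' implementation.
import torch

def freeze_except_ln_and_attn(model: torch.nn.Module) -> None:
    """Freeze all parameters except those in layer-norm and self-attention modules."""
    for name, param in model.named_parameters():
        param.requires_grad = ("layer_norm" in name) or ("self_attn" in name)

def build_optimizer(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Optimizer over the unfrozen (trainable) parameters only."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# Usage (xlsr_model is a placeholder for a loaded wav2vec2/XLSR encoder):
# freeze_except_ln_and_attn(xlsr_model)
# optimizer = build_optimizer(xlsr_model, lr=1e-4)
```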