Pre-training for Speech Translation: CTC Meets Optimal Transport

Authors: Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino, Benjamin Lecouteux, Didier Schwab

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data.
Researcher Affiliation | Collaboration | 1 Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France; 2 Meta AI. Correspondence to: Phuong-Hang Le <hang.le@univ-grenoble-alpes.fr>.
Pseudocode | No | The paper describes the optimal transport and CTC algorithms but does not provide formal pseudocode or algorithm blocks with numbered steps (see the illustrative sketch after the table).
Open Source Code | Yes | Our code is available at github.com/formiel/fairseq.
Open Datasets | Yes | We evaluate the pre-training methods presented in this paper on the standard MuST-C (Di Gangi et al., 2019) and CoVoST-2 (Wang et al., 2020c) datasets.
Dataset Splits | Yes | It is important to note that all our analyses are conducted on the dev splits of these datasets to prevent overfitting their test sets. Then, only the best-performing models will be selected for comparison with existing methods on the test sets.
Hardware Specification | Yes | Medium ASR/ST models were trained on 8 NVIDIA V100 GPUs while large ASR/ST ones were trained on 32 A100 GPUs. All MT models were trained on 8 V100 GPUs.
Software Dependencies | No | The paper mentions several software components, including the FAIRSEQ S2T toolkit, the Adam optimizer, PyTorch's nn.functional.interpolate, the g2p_en package, and SentencePiece, but it does not provide specific version numbers for these dependencies, which limits reproducibility.
Experiment Setup | Yes | For the analysis, we use a medium architecture with hidden dimension d = 512. In the final experiments where we aim to reach state-of-the-art performance for comparison with existing methods, we also use the large variant where d = 1024. ... We set α = 0.1 in all experiments. ... we use this value [γ = 1.0] in all experiments. ... We used the Adam optimizer (Kingma & Ba, 2015) with learning rate linearly increased for the first N warmup steps to a value ηmax, then decreased proportionally to the inverse square root of the step counter. ηmax is set to 2×10⁻³ in medium ASR/ST experiments and to 5×10⁻⁴ in experiments using the large architecture. For MT experiments, ηmax = 5×10⁻³. N is set to 10000 in ASR/ST experiments and to 4000 in MT experiments. Label smoothing is set to 0.1 (Szegedy et al., 2016) for models using cross-entropy loss. (A sketch of this learning-rate schedule follows the table.)
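Since the paper provides no formal pseudocode, the following is only a rough, illustrative sketch of a joint CTC + optimal-transport objective of the kind the title refers to, written in PyTorch for a single unbatched utterance. Everything specific here is an assumption made for illustration: the Sinkhorn solver with uniform marginals, the entropy-regularization value, the squared-Euclidean cost, and the use of α as the CTC weight are not taken from the authors' implementation (which is available at github.com/formiel/fairseq).

```python
# Illustrative sketch only: joint CTC + Sinkhorn-OT loss for one utterance.
# The weighting scheme (alpha * CTC + OT) and all hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def sinkhorn_ot_loss(speech_feats, text_feats, eps=0.1, n_iters=50):
    """Entropy-regularized OT cost between two feature sequences.

    speech_feats: (T, d) speech-encoder states for one utterance.
    text_feats:   (S, d) embeddings of the corresponding transcript tokens.
    """
    # Pairwise squared Euclidean cost matrix, shape (T, S).
    cost = torch.cdist(speech_feats, text_feats, p=2) ** 2
    T, S = cost.shape
    # Uniform marginals over the two sequences (an assumption of this sketch).
    mu = torch.full((T,), 1.0 / T, device=cost.device)
    nu = torch.full((S,), 1.0 / S, device=cost.device)
    # Sinkhorn iterations in log space for numerical stability.
    log_K = -cost / eps
    log_u = torch.zeros_like(mu)
    log_v = torch.zeros_like(nu)
    for _ in range(n_iters):
        log_u = torch.log(mu) - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = torch.log(nu) - torch.logsumexp(log_K + log_u[:, None], dim=0)
    # Transport plan P_ij = u_i * K_ij * v_j, then the transported cost <P, C>.
    plan = torch.exp(log_u[:, None] + log_K + log_v[None, :])
    return (plan * cost).sum()


def ctc_plus_ot_loss(log_probs, targets, input_lengths, target_lengths,
                     speech_feats, text_feats, alpha=0.1):
    """Joint objective: alpha * CTC + OT (weighting scheme assumed, not from the paper).

    log_probs: (T, 1, V) log-softmax outputs over the vocabulary for the CTC head.
    """
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    ot = sinkhorn_ot_loss(speech_feats, text_feats)
    return alpha * ctc + ot
```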
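The learning-rate schedule quoted in the Experiment Setup row (linear warmup to ηmax over N steps, then inverse-square-root decay) can be written as a small function. This is a minimal sketch assuming the common fairseq-style parameterization, with the medium ASR/ST values (ηmax = 2×10⁻³, N = 10000) plugged in as defaults; it is not the authors' code.

```python
# Minimal sketch of the described schedule: linear warmup to eta_max over
# warmup_steps, then decay proportional to the inverse square root of the step.
def inverse_sqrt_lr(step, eta_max=2e-3, warmup_steps=10000):
    if step < warmup_steps:
        # Linear warmup from 0 to eta_max.
        return eta_max * step / warmup_steps
    # Inverse-square-root decay, equal to eta_max at the end of warmup.
    return eta_max * (warmup_steps / step) ** 0.5
```

For example, with these defaults, inverse_sqrt_lr(40000) gives 2e-3 * sqrt(10000 / 40000) = 1e-3.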