Pre-training for Speech Translation: CTC Meets Optimal Transport
Authors: Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino, Benjamin Lecouteux, Didier Schwab
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. |
| Researcher Affiliation | Collaboration | 1Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France 2Meta AI. Correspondence to: Phuong-Hang Le <hang.le@univ-grenoble-alpes.fr>. |
| Pseudocode | No | The paper describes the optimal transport and CTC algorithms but does not provide formal pseudocode or algorithm blocks with numbered steps. |
| Open Source Code | Yes | Our code is available at github.com/formiel/fairseq. |
| Open Datasets | Yes | We evaluate the pre-training methods presented in this paper on the standard MuST-C (Di Gangi et al., 2019) and CoVoST-2 (Wang et al., 2020c) datasets. |
| Dataset Splits | Yes | It is important to note that all our analyses are conducted on the dev splits of these datasets to prevent overfitting their test sets. Then, only the best-performing models will be selected for comparison with existing methods on the test sets. |
| Hardware Specification | Yes | Medium ASR/ST models were trained on 8 NVIDIA V100 GPUs while large ASR/ST ones were trained on 32 A100 GPUs. All MT models were trained on 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions several software components, such as the FAIRSEQ S2T toolkit, the Adam optimizer, PyTorch's nn.functional.interpolate, the g2p_en package, and SentencePiece, but it does not consistently provide version numbers for these dependencies, which limits reproducibility. |
| Experiment Setup | Yes | For the analysis, we use a medium architecture with hidden dimension d = 512. In the final experiments where we aim to reach state-of-the-art performance for comparison with existing methods, we also use the large variant where d = 1024. ... We set α = 0.1 in all experiments. ... we use this value [γ = 1.0] in all experiments. ... We used the Adam optimizer (Kingma & Ba, 2015) with learning rate linearly increased for the first N warmup steps to a value ηmax, then decreased proportionally to the inverse square root of the step counter. ηmax is set to 2×10⁻³ in medium ASR/ST experiments and to 5×10⁻⁴ in experiments using the large architecture. For MT experiments, ηmax = 5×10⁻³. N is set to 10000 in ASR/ST experiments and to 4000 in MT experiments. Label smoothing is set to 0.1 (Szegedy et al., 2016) for models using cross-entropy loss. |
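The Experiment Setup row describes a linear-warmup, inverse-square-root learning-rate schedule (the standard `inverse_sqrt` scheduler shipped with fairseq). Below is a minimal Python sketch of that schedule using the medium ASR/ST values quoted above (ηmax = 2×10⁻³, N = 10000); the function name and defaults are illustrative assumptions, and the exact fairseq implementation may differ in minor details.

```python
def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000, peak_lr: float = 2e-3) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step).

    Defaults follow the medium ASR/ST setting quoted above; the large and MT
    settings would swap in peak_lr = 5e-4 or 5e-3 and warmup_steps = 4_000.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first `warmup_steps` updates.
        return peak_lr * step / warmup_steps
    # Inverse-square-root decay, continuous with the warmup phase at step == warmup_steps.
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)
```

For example, `inverse_sqrt_lr(40_000)` returns 1e-3: the learning rate halves for every fourfold increase in the step counter after warmup.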
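The Pseudocode row notes that the paper describes its CTC and optimal-transport components without formal pseudocode. As a purely illustrative aid, here is a hedged sketch of how a CTC loss and an entropic OT (Sinkhorn) alignment term over encoder states could be combined. The function names, the squared-Euclidean cost, the Sinkhorn parameters, and the `ot_weight` scalar are assumptions for illustration, not the authors' formulation (in particular, this is not a faithful rendering of how the paper uses α = 0.1).

```python
import torch
import torch.nn.functional as F


def sinkhorn_cost(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.1) -> torch.Tensor:
    """Entropic OT between uniform marginals for a single (m, n) cost matrix.

    Returns <P, cost>, the transport cost under the Sinkhorn-regularized plan P.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    m, n = cost.shape
    a = torch.full((m,), 1.0 / m, device=cost.device)
    b = torch.full((n,), 1.0 / n, device=cost.device)
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u + 1e-8)
        u = a / (K @ v + 1e-8)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)    # transport plan
    return (P * cost).sum()


def ctc_plus_ot_loss(log_probs, targets, input_lengths, target_lengths,
                     speech_states, text_states, ot_weight: float = 1.0):
    """Illustrative combination of a CTC term and an OT alignment term.

    log_probs:     (T, 1, V) log-probabilities from the speech encoder's CTC head.
    targets:       (1, S) target token ids for CTC (single example).
    speech_states: (T_s, d) speech-encoder hidden states (single example).
    text_states:   (T_t, d) text-encoder hidden states (single example).
    ot_weight is a placeholder scalar, not the paper's α.
    """
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Squared Euclidean cost between the two encoders' hidden states.
    cost = torch.cdist(speech_states, text_states, p=2) ** 2
    ot = sinkhorn_cost(cost)
    return ctc + ot_weight * ot
```

The sketch operates on a single unbatched example to keep the Sinkhorn iteration readable; a batched version would vectorize the cost matrices and marginals over the batch dimension.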