DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
Authors: Qingkai Fang, Yan Zhou, Yang Feng
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the CVSS Fr→En benchmark demonstrate that DASpeech can achieve comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53× speedup compared to the autoregressive baseline. |
| Researcher Affiliation | Academia | Qingkai Fang (1,2), Yan Zhou (1,2), Yang Feng (1,2); (1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); (2) University of Chinese Academy of Sciences, Beijing, China; {fangqingkai21b,zhouyan23z,fengyang}@ict.ac.cn |
| Pseudocode | No | The paper describes its algorithms (e.g., the forward algorithm, backward algorithm, and Viterbi decoding over the decoder graph) using mathematical notation and descriptive text, but it does not provide any explicitly labeled pseudocode or algorithm blocks. (A hedged sketch of the DAG forward pass appears after this table.) |
| Open Source Code | Yes | Code is publicly available at https://github.com/ictnlp/DASpeech. |
| Open Datasets | Yes | We conduct experiments on the CVSS dataset [4], a large-scale S2ST corpus containing speech-to-speech translation pairs from 21 languages to English. |
| Dataset Splits | Yes | For the weight of TTS loss µ, we experiment with µ ∈ {1.0, 2.0, 5.0, 10.0} and choose µ = 5.0 according to results on the dev set. (An illustrative loss-weighting snippet follows the table.) |
| Hardware Specification | Yes | All models are trained on 4 RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions several software tools and libraries, such as fairseq, the ASR-BLEU toolkit, SacreBLEU, the SentencePiece toolkit, the Adam optimizer, and the HiFi-GAN vocoder, but it does not specify their version numbers. |
| Experiment Setup | Yes | For model regularization, we set dropout to 0.1 and weight decay to 0.01, and no label smoothing is used. ... During finetuning, we train the entire model for 50k updates with a batch of 320k audio frames. The learning rate warms up to 1e-3 within 4k steps. We use Adam optimizer [23] for both pretraining and finetuning. For the weight of TTS loss µ, we experiment with µ ∈ {1.0, 2.0, 5.0, 10.0} and choose µ = 5.0 according to results on the dev set. (A sketch of the warmup schedule follows the table.) |
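The forward algorithm noted in the Pseudocode row is the core of the DA-Transformer decoder that DASpeech builds on: a dynamic program that marginalizes the target likelihood over all vertex paths through the decoder's directed acyclic graph. Below is a minimal, hedged sketch of that recursion; the function name, tensor shapes, and arguments are illustrative assumptions, not the repo's actual API.

```python
import torch

def dag_log_likelihood(emit_logp: torch.Tensor,
                       trans_logp: torch.Tensor,
                       target: torch.Tensor) -> torch.Tensor:
    """Forward-algorithm sketch for a DAG decoder (illustrative only).

    emit_logp:  (L, V) log P(token | vertex) for L graph vertices
    trans_logp: (L, L) log transition probabilities; entries for
                non-forward edges (v >= u) should be -inf so paths
                only move left to right through the DAG
    target:     (n,)   target token ids

    Returns log P(target), marginalized over all vertex paths of
    length n that start at vertex 0 and end at vertex L - 1.
    """
    L = emit_logp.size(0)
    n = target.size(0)
    # alpha[u] = log prob of emitting target[:i+1] with token i at vertex u
    alpha = torch.full((L,), float("-inf"))
    alpha[0] = emit_logp[0, target[0]]  # paths must start at vertex 0
    for i in range(1, n):
        # transition to a later vertex, then emit the next target token
        alpha = torch.logsumexp(alpha.unsqueeze(1) + trans_logp, dim=0)
        alpha = alpha + emit_logp[:, target[i]]
    return alpha[L - 1]  # paths must end at the final vertex
```

At inference the paper instead decodes the most probable path (Viterbi, i.e., the same recursion with `max` in place of `logsumexp` plus backtracking) and feeds that path's decoder hidden states to the FastSpeech 2 acoustic decoder.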
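The TTS loss weight µ quoted in the Dataset Splits and Experiment Setup rows implies a weighted-sum training objective. A minimal sketch, assuming a plain weighted sum (the exact combination is defined in the fairseq-based repo, not in the quote):

```python
import torch

def total_loss(dat_loss: torch.Tensor,
               tts_loss: torch.Tensor,
               mu: float = 5.0) -> torch.Tensor:
    """Combine the translation (DA-Transformer) loss with the TTS loss.

    mu = 5.0 was chosen from {1.0, 2.0, 5.0, 10.0} on the CVSS dev set;
    the plain-sum form here is an assumption based on the quoted
    "weight of TTS loss".
    """
    return dat_loss + mu * tts_loss
```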
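The quoted learning-rate behavior ("warms up to 1e-3 within 4k steps" under Adam) is consistent with fairseq's standard inverse-sqrt scheduler. A hedged sketch follows; the post-warmup decay shape is an assumption, since the quote specifies only the warmup target and duration.

```python
def inverse_sqrt_lr(step: int,
                    peak_lr: float = 1e-3,
                    warmup_steps: int = 4000) -> float:
    """Fairseq-style inverse-sqrt schedule (assumed, not confirmed).

    Linearly warms up to peak_lr over warmup_steps, then decays
    proportionally to 1 / sqrt(step).
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (warmup_steps / step) ** 0.5
```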