Divergence-Guided Simultaneous Speech Translation
Authors: Xinjie Chen, Kai Fan, Wei Luo, Linlin Zhang, Libo Zhao, Xinggao Liu, Zhongqiang Huang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2Alibaba DAMO Academy 3South China University of Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/cxjfluffy/DiG-SST. |
| Open Datasets | Yes | We conduct experiments on the widely used MuST-C V1 corpus: English→{German, Spanish, French} (En→{De, Es, Fr}) (Gangi et al. 2019), detailed in Table 1. |
| Dataset Splits | Yes | Table 1: The statistics (sentences) of three language pairs in MuST-C. Train: En-De 234K, En-Es 270K, En-Fr 280K; Dev: 1423 / 1316 / 1412; Tst-COMMON: 2641 / 2502 / 2632. |
| Hardware Specification | Yes | Training was conducted on 4 V100 GPUs, each with a batch size of 3.2M audio frames. |
| Software Dependencies | No | The paper mentions software tools like 'wav2vec 2.0', 'sacreBLEU', the 'SimulEval' toolkit, 'SentencePiece', and 'Montreal Forced Aligner', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Both the translation encoder and decoder employ 6 transformer layers, each with dimensions of 512 and 8 attention heads... Training was conducted on 4 V100 GPUs, each with a batch size of 3.2M audio frames. The translation model was trained for up to 40 epochs with early stopping after 20 non-improving epochs, followed by a 10-epoch policy module training with the translation model frozen. |
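The two-stage schedule quoted in the Experiment Setup row (up to 40 epochs with early stopping after 20 non-improving epochs, then 10 epochs of policy-module training with the translation model frozen) can be sketched as a generic early-stopping loop. This is a minimal illustration, not the paper's actual training code; the `evaluate` callback and all names are hypothetical stand-ins.

```python
def train_with_early_stopping(evaluate, max_epochs=40, patience=20):
    """Train for up to max_epochs, stopping after `patience` non-improving epochs.

    `evaluate` stands in for one training epoch plus a dev-set evaluation,
    mapping an epoch index to a validation score (higher is better).
    Returns (best_epoch, best_score).
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch in range(max_epochs):
        score = evaluate(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # 20 consecutive epochs without dev improvement
    return best_epoch, best_score

# Stage 1: translation model, up to 40 epochs with patience 20.
# Stage 2: policy module trained for a fixed 10 epochs with the
# translation model's parameters frozen (no early stopping is
# mentioned for this stage in the quoted setup).
```

The reported per-GPU batch size of 3.2M audio frames suggests frame-count-based batching (as in fairseq's `--max-tokens`), but the paper excerpt does not state the exact flag, so that detail is left out of the sketch.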