DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation
Authors: Weiting Tan, Jingyu Zhang, Lingfeng Shen, Daniel Khashabi, Philipp Koehn
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) translations on the CVSS benchmark, while attaining over 14× speedup for En-Es and 5× speedup for En-Fr translations compared to autoregressive baselines. |
| Researcher Affiliation | Academia | Department of Computer Science, Johns Hopkins University {wtan12, jzhan237, lshen30, danielk, phi}@jhu.edu |
| Pseudocode | Yes | Algorithm 1 Latent Diffusion Model Training; Algorithm 2 Normalized Units Construction |
| Open Source Code | Yes | Code available at: https://github.com/steventan0110/DiffNorm. |
| Open Datasets | Yes | We perform experiments using the established CVSS-C datasets [24], which are created from COVOST2 by employing advanced text-to-speech models to synthesize translation texts into speech [59]. CVSS-C comprises aligned speech in multiple languages along with their respective transcriptions. |
| Dataset Splits | Yes | Table 1 (Data statistics for CVSS benchmarks): Train: En-Es 79,012 (length 256), En-Fr 207,364 (length 228); Valid: En-Es 13,212 (length 296), En-Fr 14,759 (length 264); Test: En-Es 13,216 (length 308), En-Fr 14,759 (length 283). |
| Hardware Specification | Yes | We train the VAE model using a learning rate of 5e-4 with distributed data-parallel (DDP) on 4 A100 GPUs, where we set the maximum batch token to be 15000. |
| Software Dependencies | No | The paper mentions software such as Fairseq, the Adam optimizer, HiFi-GAN, and wav2vec 2.0, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For optimization, we use the Adam [26] optimizer with betas (0.9, 0.98) and we apply gradient clipping by setting clip-norm=2.0. During training, we apply dropout with a probability of 0.1. We train the VAE model using a learning rate of 5e-4 with distributed data-parallel (DDP) on 4 A100 GPUs, where we set the maximum batch token to be 15000. |
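
The sketch below is a minimal illustration of the Experiment Setup row above, wiring the reported hyperparameters (Adam with betas (0.9, 0.98), clip-norm 2.0, dropout 0.1, learning rate 5e-4) into plain PyTorch. The model, loss, and training step are hypothetical placeholders, not the authors' VAE or diffusion model; their actual runs go through Fairseq with a 15,000-token batch cap on 4 A100 GPUs under DDP.

```python
# Minimal sketch, assuming plain PyTorch. It only mirrors the hyperparameters
# quoted in the table (Adam betas (0.9, 0.98), clip-norm 2.0, dropout 0.1, lr 5e-4).
# The model and loss below are hypothetical stand-ins, not the paper's architecture.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1),  # dropout p=0.1
    num_layers=6,
)
# For the reported multi-GPU setup, the model would additionally be wrapped in
# nn.parallel.DistributedDataParallel after being moved to the local GPU.

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
criterion = nn.MSELoss()  # placeholder objective

def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """One optimization step using the reported gradient-clipping setting."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)  # clip-norm=2.0
    optimizer.step()
    return loss.item()
```

In the paper's setup this loop is handled by Fairseq's trainer rather than written by hand; the sketch only makes the quoted optimization settings concrete.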