Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Authors: Heting Gao, Kaizhi Qian, Junrui Ni, Chuang Gan, Mark A. Hasegawa-Johnson, Shiyu Chang, Yang Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that DIFFS4L can significantly improve the performance of SSL models, such as reducing the WER of the HuBERT pretrained model by 6.26 percentage points in the English ASR task.
Researcher Affiliation | Collaboration | 1 University of Illinois at Urbana-Champaign, IL, USA; 2 MIT-IBM Watson AI Lab, MA, USA; 3 University of California, Santa Barbara, CA, USA. Correspondence to: Heting Gao <hgao17@illinois.edu>.
Pseudocode | No | Figure 1 presents an 'algorithm overview' as a diagram, and the steps are described in prose, but there is no structured pseudocode or algorithm block.
Open Source Code | Yes | The code is available at github.com/Hertin/DiffS4L.
Open Datasets | Yes | Pretraining Dataset: For the experiments in English, the methods to be evaluated are pretrained on the LibriSpeech-960 dataset (Panayotov et al., 2015). We consider two settings, the low-resource setting and the high-resource setting. For the low-resource setting, the seed dataset D0 for training Steps 1 and 2 contains only 100 hours of real speech from the train-clean-100 subset.
Dataset Splits | Yes | For each language in MLS, we sample 100 hours from the training split for pretraining and use the limited supervision subset for finetuning. Both cases use the provided dev and test splits for validation and testing. (A sketch of this duration-based sampling appears after the table.)
Hardware Specification | Yes | The training of WAV2VEC2 models requires 64 Tesla V100-SXM2-32GB GPUs and that of HUBERT models requires 32 GPUs. The model is trained for 40k updates on two V100-SXM2-32GB GPUs.
Software Dependencies | No | The entire training pipeline is constructed based on two existing code repositories: FAIRSEQ (Ott et al., 2019) and PRODIFF (Huang et al., 2022b). While the frameworks are named, specific version numbers are not provided for these software dependencies.
Experiment Setup | Yes | All the WAV2VEC2.0/HUBERT models are trained for 400k updates with a learning rate of 5 × 10^-4. Each batch contains 1.4M audio samples. The synthesizer is optimized for the weighted sum of the L1 reconstruction loss and the structural similarity index (SSIM) loss (Huang et al., 2022b), with the weight being 0.5 for each loss. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10^-9 and an inverse square root scheduler with 2000 warmup updates. (An optimizer/loss sketch appears after the table.)
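
For reference, the 100-hours-per-language sampling described in the Dataset Splits row can be sketched as below. This is a minimal illustration, assuming fairseq-style tsv manifests (audio path and sample count per line) and 16 kHz audio; the manifest names and the sample_subset helper are hypothetical and may differ from the released DiffS4L code.

```python
# Hypothetical sketch: pick utterances from an MLS training-split manifest
# until roughly 100 hours of audio have been collected.
import random

TARGET_HOURS = 100.0
SAMPLE_RATE = 16000  # assumed sample rate of the pretraining audio

def sample_subset(manifest_in: str, manifest_out: str, seed: int = 0) -> float:
    """Randomly select lines from a fairseq-style tsv manifest
    (path \t num_samples) until ~TARGET_HOURS of audio is reached."""
    with open(manifest_in) as f:
        root = f.readline().strip()                     # first line: audio root dir
        rows = [line.rstrip("\n").split("\t") for line in f]

    random.Random(seed).shuffle(rows)

    picked, total_sec = [], 0.0
    for path, num_samples in rows:
        if total_sec >= TARGET_HOURS * 3600:
            break
        picked.append((path, num_samples))
        total_sec += int(num_samples) / SAMPLE_RATE

    with open(manifest_out, "w") as f:
        f.write(root + "\n")
        for path, num_samples in picked:
            f.write(f"{path}\t{num_samples}\n")
    return total_sec / 3600  # hours actually selected

# Example (hypothetical file names):
# hours = sample_subset("mls_german_train.tsv", "mls_german_100h.tsv")
```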
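The synthesizer optimization quoted in the Experiment Setup row can likewise be sketched in PyTorch. Only the Adam betas and epsilon, the 2000 warmup updates of the inverse-square-root schedule, and the 0.5/0.5 weighting of L1 and SSIM losses come from the paper; the stand-in model, the simplified global SSIM, and the base learning rate are assumptions, and the actual implementation lives in the FAIRSEQ/PRODIFF-based pipeline.

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def inverse_sqrt_factor(warmup_updates: int = 2000):
    """Inverse-square-root LR schedule with linear warmup (fairseq-style)."""
    def factor(step: int) -> float:
        step = max(step, 1)
        if step < warmup_updates:
            return step / warmup_updates
        return (warmup_updates / step) ** 0.5
    return factor

def ssim_loss(pred: torch.Tensor, target: torch.Tensor,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Simplified (global, non-windowed) SSIM loss; ProDiff uses a windowed variant."""
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim

def synthesizer_loss(pred_mel: torch.Tensor, target_mel: torch.Tensor) -> torch.Tensor:
    # Weighted sum of L1 reconstruction and SSIM losses, 0.5 each (per the paper).
    return 0.5 * F.l1_loss(pred_mel, target_mel) + 0.5 * ssim_loss(pred_mel, target_mel)

# Stand-in for the ProDiff-style mel-spectrogram synthesizer (80 mel bins assumed).
model = torch.nn.Conv1d(80, 80, kernel_size=3, padding=1)

# Base learning rate is a placeholder: the quoted 5e-4 refers to the
# wav2vec2/HuBERT models, and the row does not give the synthesizer's base LR.
optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-9)
scheduler = LambdaLR(optimizer, lr_lambda=inverse_sqrt_factor(2000))

# One illustrative update on random tensors shaped (batch, mels, frames).
pred, target = model(torch.randn(4, 80, 200)), torch.randn(4, 80, 200)
loss = synthesizer_loss(pred, target)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```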