ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we will evaluate CONTENTVEC on an extensive set of content-related tasks. In particular, we would like to investigate whether disentangling speakers has benefit in real-world tasks, and how large the benefit would be. Further experimental details can be found in Appendix B.
Researcher Affiliation | Collaboration | 1) MIT-IBM Watson AI Lab; 2) University of Illinois at Urbana-Champaign; 3) Massachusetts Institute of Technology; 4) University of California, Santa Barbara.
Pseudocode | No | The paper describes its approach using text and diagrams (e.g., Figure 1) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our code is available at https://github.com/auspicious3000/contentvec
Open Datasets | Yes | CONTENTVEC and all the baselines are trained on the Librispeech dataset (Panayotov et al., 2015).
Dataset Splits | Yes | The best model is selected based on the lowest validation masked prediction loss.
Hardware Specification | No | Our model is trained for 100k steps using 36 GPUs, with a batch size of at most 76 seconds of audio per GPU. The paper states the number of GPUs but not their model (e.g., NVIDIA A100 or V100), CPU details, or any other specific hardware configuration.
Software Dependencies | No | The paper mentions several software components and frameworks, such as fairseq, WAV2VEC 2.0, HiFi-GAN, and k-means clustering, but does not specify their version numbers, which are required for full reproducibility.
Experiment Setup | Yes | The speech representation network of CONTENTVEC has the same architecture as HUBERT: 7 temporal convolutional feature extraction blocks followed by 12 transformer layers with a model dimension of 768. During training, each layer is independently dropped with a probability of 0.05. ... The contrastive loss is imposed at the last-but-five layer, the temperature k is set to 0.1, and the contrastive loss weight λ = 1e-5 · num_train_steps, which linearly increases to 10 when training for 100k steps. The masking strategy is the same as in Wav2Vec 2.0 (Baevski et al., 2020b), with the masking probability set to 0.08. ... Our model is trained for 100k steps using 36 GPUs, with a batch size of at most 76 seconds of audio per GPU. ... The learning rate is set to 5e-4.
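
The contrastive-loss and weight-schedule details quoted in the Experiment Setup row can be made concrete with a short sketch. The Python snippet below is a hypothetical illustration, not the authors' implementation: the function names, the cosine-similarity scoring, and the exact ramp endpoint (reaching 10 at 100k steps) are assumptions based only on the quoted text.

```python
# Minimal, hypothetical sketch of two details quoted above:
#   (1) an InfoNCE-style contrastive loss with temperature 0.1, and
#   (2) a linear ramp of the contrastive-loss weight lambda over 100k steps.
# Names and the cosine-similarity scoring are illustrative assumptions,
# not taken from the ContentVec repository.
import torch
import torch.nn.functional as F


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Pull `anchor` toward `positive`, push it away from `negatives`.

    anchor:    (batch, dim)        representation from one view
    positive:  (batch, dim)        same content, other view
    negatives: (batch, n_neg, dim) distractor representations
    """
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)                # (batch,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # (batch, n_neg)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)  # the positive sits at index 0


def contrastive_weight(step, total_steps=100_000, max_weight=10.0):
    """Linearly ramp lambda from 0 to `max_weight` over `total_steps`."""
    return max_weight * min(step / total_steps, 1.0)


# Usage sketch inside a training loop:
#   loss = masked_prediction_loss + contrastive_weight(step) * contrastive_loss(a, p, n)
```

Dividing the similarity logits by a small temperature (here 0.1) sharpens the softmax over the positive and negative scores, which is the standard role of the temperature in InfoNCE-style contrastive losses.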