ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers
Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will evaluate CONTENTVEC on an extensive set of content-related tasks. In particular, we would like to investigate whether disentangling speakers is beneficial in real-world tasks, and how large the benefit would be. Further experimental details can be found in Appendix B. |
| Researcher Affiliation | Collaboration | MIT-IBM Watson AI Lab; University of Illinois at Urbana-Champaign; Massachusetts Institute of Technology; University of California, Santa Barbara. |
| Pseudocode | No | The paper describes its approach using text and diagrams (e.g., Figure 1) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our code is available at https://github.com/auspicious3000/contentvec |
| Open Datasets | Yes | CONTENTVEC and all the baselines are trained on the Librispeech dataset (Panayotov et al., 2015). |
| Dataset Splits | Yes | The best model is selected based on the lowest validation masked prediction loss. |
| Hardware Specification | No | Our model is trained for 100k steps using 36 GPUs, with a batch size of at most 76 seconds of audio per GPU. The paper specifies the number of GPUs but does not provide the specific GPU models (e.g., NVIDIA A100, V100), CPU details, or other hardware configuration. |
| Software Dependencies | No | The paper mentions several software components and frameworks such as fairseq, WAV2VEC 2.0, HiFi-GAN, and k-means clustering, but it does not specify version numbers for these, which are required for full reproducibility. |
| Experiment Setup | Yes | The speech representation network of CONTENTVEC has the same architecture as HUBERT, which has 7 temporal convolutional feature extraction blocks followed by 12 transformer layers of model dimension 768. During training, each layer is independently dropped with a probability of 0.05. ... The contrastive loss is imposed at the fifth-to-last layer, the temperature κ is set to 0.1, and the contrastive loss weight λ = 1e-5 × num_train_steps, which linearly increases to 10 when training for 100k steps. The masking strategy is the same as in Wav2Vec 2.0 (Baevski et al., 2020b), with the masking probability set to 0.08. ... Our model is trained for 100k steps using 36 GPUs, with a batch size of at most 76 seconds of audio per GPU... The learning rate is set to 5e-4. |
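
The Experiment Setup row quotes the key hyperparameters in prose. The sketch below collects them into a configuration object and pairs it with a generic temperature-scaled (InfoNCE-style) contrastive loss for illustration only. It is not the authors' fairseq implementation; the names `ContentVecConfig` and `contrastive_loss`, and the assumption that the loss weight ramps linearly from 0 to its maximum over the 100k training steps, are ours.

```python
# Minimal sketch (not the authors' code) of the hyperparameters quoted above,
# plus a generic temperature-scaled contrastive loss between two views.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class ContentVecConfig:
    # Values quoted from the paper's experiment setup.
    num_conv_blocks: int = 7          # temporal convolutional feature extractor
    num_transformer_layers: int = 12
    model_dim: int = 768
    layer_drop: float = 0.05          # each transformer layer dropped independently
    mask_prob: float = 0.08           # wav2vec 2.0-style masking probability
    contrastive_temperature: float = 0.1
    learning_rate: float = 5e-4
    max_updates: int = 100_000
    # Assumption: the contrastive loss weight ramps linearly to 10 over training.
    max_contrastive_weight: float = 10.0

    def contrastive_weight(self, step: int) -> float:
        """Loss weight at a given training step under the linear-ramp assumption."""
        return self.max_contrastive_weight * min(step / self.max_updates, 1.0)


def contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss between two augmented views (T, D) of one utterance.

    Frames at the same time index are positives; all other frames in the
    utterance serve as negatives. Generic sketch, not the paper's exact layer.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (T, T) similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```

In the released codebase the objective is computed inside the fairseq training loop; this standalone version only mirrors the quoted temperature of 0.1, while masking (probability 0.08) would be applied during data preparation rather than in the loss itself.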