Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Authors: Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu

ICML 2023 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we will evaluate the proposed NFA model's performance on three kinds of utterance-level speech tasks, namely speaker, emotion, and language recognition, by comparing it to SSL models such as wav2vec 2.0, HuBERT, and WavLM.
Researcher Affiliation | Academia | 1 Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China. 2 Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China.
Pseudocode | Yes | Algorithm 1: Training procedure of the proposed NFA model. (A generic EM sketch of this kind of loading-matrix update appears after the table.)
Open Source Code | No | The paper does not include any statement about releasing the source code or provide a link to a code repository.
Open Datasets | Yes | We followed the SUPERB protocol (Yang et al., 2021) using the VoxCeleb1 (Nagrani et al., 2017) training split to train the model and used the test split to evaluate speaker verification performance. (A sketch of the standard VoxCeleb1 trial-list scoring follows the table.)
Dataset Splits | No | The paper mentions training and test splits for various datasets (e.g., 'VoxCeleb1 training split', 'LibriSpeech splits for training and evaluation', 'VoxCeleb1 train-test split'), but it does not provide specific details about validation splits (e.g., percentages, sample counts, or an explicit mention of a validation set).
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions software such as fairseq and sklearn, but it does not provide version numbers for these or other dependencies.
Experiment Setup | Yes | λ in Eq. 14 is set to 0.01 for all models. After the optimization steps in Algorithm 1 were done, we re-trained the loading matrix T for each task with EM using unlabeled task-related data. Unless otherwise stated, the acoustic features were extracted from layer 6 for the base SSL models (HuBERT, WavLM, and Wav2Vec2-XLS-R) and layer 9 for the large SSL models. The number of clusters in K-means is 100, and the rank of the loading matrix is 300 for all NFA models. (See the feature-extraction and clustering sketch after the table.)
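Algorithm 1 itself is not reproduced on this page. For orientation, below is a minimal NumPy sketch of the textbook EM update for a loading matrix T in a linear-Gaussian factor model x_n = T w_n + eps_n with identity noise covariance. This is a deliberate simplification, not the paper's NFA algorithm, which additionally involves the K-means cluster posteriors.

```python
import numpy as np

def em_loading_matrix(X, rank=300, n_iter=10, seed=0):
    """Generic EM for x_n = T w_n + eps_n, with w_n ~ N(0, I) and
    eps_n ~ N(0, I). Identity noise is a simplification; this is
    NOT the paper's Algorithm 1.

    X: (N, D) array of centered utterance-level features.
    Returns the (D, rank) loading matrix T.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    T = 0.1 * rng.standard_normal((D, rank))
    for _ in range(n_iter):
        # E-step: posterior of each latent w_n (the precision L is
        # shared across utterances because the noise covariance is I).
        L = np.eye(rank) + T.T @ T
        L_inv = np.linalg.inv(L)
        W = X @ T @ L_inv                 # (N, rank) posterior means
        Eww = N * L_inv + W.T @ W         # sum_n E[w_n w_n^T]
        # M-step: T = (sum_n x_n E[w_n]^T) (sum_n E[w_n w_n^T])^{-1}
        T = np.linalg.solve(Eww, W.T @ X).T
    return T
```

Re-training T with task-related data, as the paper describes, would amount to running this kind of loop on features from the new task while keeping the rest of the model fixed.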
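The speaker verification results follow the SUPERB protocol on the VoxCeleb1 test split, whose official trial list (veri_test.txt) stores one trial per line as 'label utt1 utt2'. Below is a minimal sklearn-based sketch of scoring such a list and computing the equal error rate; cosine similarity is an assumption about the back-end, and `embeddings` is a hypothetical lookup from utterance path to a unit-length vector.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_from_trials(trial_file, embeddings):
    """Score VoxCeleb1-style trials ('label utt1 utt2' per line) with
    cosine similarity and return the equal error rate (EER).

    embeddings: hypothetical dict mapping utterance path -> unit-norm
    np.ndarray (how the embeddings are produced is left to the front-end).
    """
    labels, scores = [], []
    with open(trial_file) as f:
        for line in f:
            label, utt1, utt2 = line.split()
            labels.append(int(label))
            scores.append(float(np.dot(embeddings[utt1], embeddings[utt2])))
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))   # operating point where FAR ~= FRR
    return 0.5 * (fpr[i] + fnr[i])
```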
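To make the stated setup concrete, here is a minimal sketch of extracting layer-6 frame features from a base SSL model and fitting 100-cluster K-means. The paper used fairseq checkpoints; the Hugging Face transformers port and the `facebook/hubert-base-ls960` checkpoint name below are assumptions for illustration.

```python
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

# Assumed checkpoint: the paper used fairseq models, not this HF port.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def layer6_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Frame-level features from transformer layer 6 of a base SSL model
    (the paper extracts layer 6 for base models and layer 9 for large ones).

    waveform_16k: (1, num_samples) float tensor sampled at 16 kHz.
    """
    out = model(waveform_16k, output_hidden_states=True)
    # hidden_states[0] is the CNN front-end output, so index 6 is layer 6.
    return out.hidden_states[6].squeeze(0)

# 100 K-means clusters, matching the stated setup; frame features from
# the task data would be stacked row-wise before calling fit().
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0)
```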