Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations
Authors: Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will evaluate the proposed NFA model's performance on three kinds of utterance-level speech tasks, namely speaker, emotion, and language recognition, by comparing it to SSL models such as wav2vec 2.0, HuBERT, and WavLM. |
| Researcher Affiliation | Academia | 1Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China. 2Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China. |
| Pseudocode | Yes | Algorithm 1 Training procedure of the proposed NFA model |
| Open Source Code | No | The paper does not include any statement about releasing the source code or provide a link to a code repository. |
| Open Datasets | Yes | We followed the SUPERB protocol (Yang et al., 2021) using the VoxCeleb1 (Nagrani et al., 2017) training split to train the model and used the test split to evaluate speaker verification performance. |
| Dataset Splits | No | The paper mentions training and test splits for various datasets (e.g., 'VoxCeleb1 training split', 'LibriSpeech splits for training and evaluation', 'VoxCeleb1 train-test split'), but it does not provide specific details about validation splits (e.g., percentages, sample counts, or explicit mention of a validation set). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions software like fairseq and sklearn, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | λ in Eq. 14 is set to 0.01 for all models. After the optimization steps in Algorithm 1 were done, we re-trained the loading matrix T for each task with EM using unlabeled task-related data. Unless otherwise stated, the acoustic features were extracted from layer 6 for the base SSL models (HuBERT, WavLM, and Wav2Vec2-XLS-R) and layer 9 for the large SSL models. The number of clusters in K-means is 100, and the rank of the loading matrix is 300 for all NFA models (see the sketch below the table). |
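
The Experiment Setup row fixes a few concrete hyperparameters: λ = 0.01, 100 K-means clusters over frame-level SSL features, and a loading matrix T of rank 300. Below is a minimal sketch of that configuration, assuming scikit-learn's `KMeans` (the paper mentions sklearn but not the exact API) and placeholder feature shapes and variable names; it is an illustration of the stated setup, not the authors' implementation.

```python
# Hedged sketch of the clustering step and hyperparameters reported in the
# Experiment Setup row. Array shapes, variable names, and the choice of
# scikit-learn's KMeans are assumptions, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans

N_CLUSTERS = 100     # number of K-means clusters reported in the paper
LOADING_RANK = 300   # rank of the loading matrix T reported in the paper
LAMBDA = 0.01        # weight lambda in Eq. 14, set to 0.01 for all models

# Hypothetical stand-in for frame-level features extracted from layer 6 of a
# base SSL model (e.g., HuBERT); shape: (num_frames, feature_dim).
features = np.random.randn(10_000, 768).astype(np.float32)

# Cluster the frames; the cluster indices serve as discrete frame labels.
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
frame_labels = kmeans.fit_predict(features)

# Placeholder initialization of the loading matrix T (feature_dim x rank).
# Per the paper, T is subsequently re-trained per task with EM on unlabeled
# task-related data after the Algorithm 1 optimization steps.
T = np.random.randn(features.shape[1], LOADING_RANK).astype(np.float32)
```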