Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Authors: Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu

ICML 2023 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we will evaluate the proposed NFA model's performance on three kinds of utterance-level speech tasks, namely speaker, emotion, and language recognition, by comparing it to SSL models such as wav2vec 2.0, HuBERT, and WavLM.
Researcher Affiliation | Academia | 1 Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China. 2 Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China.
Pseudocode | Yes | Algorithm 1: Training procedure of the proposed NFA model. (A generic EM sketch of this kind of loading-matrix update appears after the table.)
Open Source Code | No | The paper does not include any statement about releasing the source code or provide a link to a code repository.
Open Datasets | Yes | We followed the SUPERB protocol (Yang et al., 2021) using the VoxCeleb1 (Nagrani et al., 2017) training split to train the model and used the test split to evaluate speaker verification performance. (A sketch of the standard VoxCeleb1 trial-list scoring follows the table.)
Dataset Splits | No | The paper mentions training and test splits for various datasets (e.g., 'VoxCeleb1 training split', 'LibriSpeech splits for training and evaluation', 'VoxCeleb1 train-test split'), but it does not provide specific details about validation splits (e.g., percentages, sample counts, or an explicit mention of a validation set).
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions software such as fairseq and sklearn, but it does not provide version numbers for these or other dependencies.
Experiment Setup | Yes | λ in Eq. 14 is set to 0.01 for all models. After the optimization steps in Algorithm 1 were done, we re-trained the loading matrix T for each task with EM using unlabeled task-related data. Unless otherwise stated, the acoustic features were extracted from layer 6 for the base SSL models (HuBERT, WavLM, and Wav2Vec2-XLS-R) and layer 9 for the large SSL models. The number of clusters in K-means is 100, and the rank of the loading matrix is 300 for all NFA models. (See the feature-extraction and clustering sketch after the table.)
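Algorithm 1 itself is not reproduced on this page. For orientation, below is a minimal NumPy sketch of the textbook EM update for a loading matrix T in a linear-Gaussian factor model x_n = T w_n + eps_n with identity noise covariance. This is a deliberate simplification, not the paper's NFA algorithm, which additionally involves the K-means cluster posteriors.

```python
import numpy as np

def em_loading_matrix(X, rank=300, n_iter=10, seed=0):
    """Generic EM for x_n = T w_n + eps_n, with w_n ~ N(0, I) and
    eps_n ~ N(0, I). Identity noise is a simplification; this is
    NOT the paper's Algorithm 1.

    X: (N, D) array of centered utterance-level features.
    Returns the (D, rank) loading matrix T.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    T = 0.1 * rng.standard_normal((D, rank))
    for _ in range(n_iter):
        # E-step: posterior of each latent w_n (the precision L is
        # shared across utterances because the noise covariance is I).
        L = np.eye(rank) + T.T @ T
        L_inv = np.linalg.inv(L)
        W = X @ T @ L_inv                 # (N, rank) posterior means
        Eww = N * L_inv + W.T @ W         # sum_n E[w_n w_n^T]
        # M-step: T = (sum_n x_n E[w_n]^T) (sum_n E[w_n w_n^T])^{-1}
        T = np.linalg.solve(Eww, W.T @ X).T
    return T
```

Re-training T with task-related data, as the paper describes, would amount to running this kind of loop on features from the new task while keeping the rest of the model fixed.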
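The speaker verification results follow the SUPERB protocol on the VoxCeleb1 test split, whose official trial list (veri_test.txt) stores one trial per line as 'label utt1 utt2'. Below is a minimal sklearn-based sketch of scoring such a list and computing the equal error rate; cosine similarity is an assumption about the back-end, and `embeddings` is a hypothetical lookup from utterance path to a unit-length vector.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_from_trials(trial_file, embeddings):
    """Score VoxCeleb1-style trials ('label utt1 utt2' per line) with
    cosine similarity and return the equal error rate (EER).

    embeddings: hypothetical dict mapping utterance path -> unit-norm
    np.ndarray (how the embeddings are produced is left to the front-end).
    """
    labels, scores = [], []
    with open(trial_file) as f:
        for line in f:
            label, utt1, utt2 = line.split()
            labels.append(int(label))
            scores.append(float(np.dot(embeddings[utt1], embeddings[utt2])))
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))   # operating point where FAR ~= FRR
    return 0.5 * (fpr[i] + fnr[i])
```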
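To make the stated setup concrete, here is a minimal sketch of extracting layer-6 frame features from a base SSL model and fitting 100-cluster K-means. The paper used fairseq checkpoints; the Hugging Face transformers port and the `facebook/hubert-base-ls960` checkpoint name below are assumptions for illustration.

```python
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

# Assumed checkpoint: the paper used fairseq models, not this HF port.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def layer6_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Frame-level features from transformer layer 6 of a base SSL model
    (the paper extracts layer 6 for base models and layer 9 for large ones).

    waveform_16k: (1, num_samples) float tensor sampled at 16 kHz.
    """
    out = model(waveform_16k, output_hidden_states=True)
    # hidden_states[0] is the CNN front-end output, so index 6 is layer 6.
    return out.hidden_states[6].squeeze(0)

# 100 K-means clusters, matching the stated setup; frame features from
# the task data would be stacked row-wise before calling fit().
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0)
```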