SSDM: Scalable Speech Dysfluency Modeling

Authors: Jiachen Lian, Xuanru Zhou, Zoe Ezzes, Jet Vonk, Brittany Morin, David Paul Baquirin, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate phonetic transcription (forced alignment) performance using simulated data from VCTK++ [1] and our proposed Libri-Dys dataset. The framewise F1 score and dPER [1] are used as evaluation metrics. Five types of training data are used: VCTK++, LibriTTS (100%) [106], Libri-Dys (30%), Libri-Dys (60%), and Libri-Dys (100%). (A framewise-F1 sketch appears after the table.)
Researcher Affiliation | Academia | Jiachen Lian¹, Xuanru Zhou², Zoe Ezzes³, Jet Vonk³, Brittany Morin³, David Baquirin³, Zachary Miller³, Maria Luisa Gorno Tempini³, Gopala Anumanchipalli¹ (¹UC Berkeley, ²Zhejiang University, ³UCSF)
Pseudocode | Yes | Algorithm 1: Find Longest Common Subsequence (LCS). (A runnable LCS sketch appears after the table.)
Open Source Code | No | For code, we are waiting for further approval.
Open Datasets | Yes | Data is open-sourced at https://bit.ly/4aoLdWU.
Dataset Splits | No | For training, we use the VCTK++ [1] and Libri-Dys datasets. For testing, we randomly sample 10% of the training data. The paper does not explicitly describe a separate validation split or how it is derived. (A split sketch appears after the table.)
Hardware Specification | Yes | The training is conducted using two A6000 GPUs.
Software Dependencies | No | The paper mentions software such as WavLM, the Glow algorithm, and the Adam optimizer, but does not provide version numbers for general software dependencies (e.g., Python or PyTorch/TensorFlow versions).
Experiment Setup | Yes | In Eq. 2, τ = 2. In Eq. 4, a = b = 1 and m_row = 3. In Eq. 6 and Eq. 7, we simply set K1 = K2 = 1. In Eq. 8, λ1 = λ2 = λ3 = 1. In Eq. 12 and Eq. 13, δ = 0.9. ... We use the Adam optimizer and decay the learning rate from 0.001 by a factor of 0.9 every 10 steps until convergence. (A PyTorch sketch of this recipe appears after the table.)
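
To make the framewise F1 metric from the Research Type row concrete, here is a minimal sketch. It assumes both the reference and predicted alignments have been expanded to one phoneme label per frame; the toy label sequences and the micro-averaging mode are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical framewise F1 for forced alignment: one phoneme label per frame.
from sklearn.metrics import f1_score

ref_frames = ["sil", "k", "k", "ae", "ae", "t", "sil"]  # reference alignment
hyp_frames = ["sil", "k", "ae", "ae", "ae", "t", "sil"]  # predicted alignment

# Micro-averaged F1 over frames (an assumption; with a single label per
# frame this reduces to frame accuracy, here 6/7 matching frames).
print(f"framewise F1: {f1_score(ref_frames, hyp_frames, average='micro'):.3f}")
```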
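
The Pseudocode row names Algorithm 1, Find Longest Common Subsequence. Below is a minimal dynamic-programming version, assuming Algorithm 1 follows the standard textbook formulation; since the paper aligns phoneme sequences, the example uses phoneme tokens.

```python
# Standard DP formulation of LCS (assumed to match Algorithm 1's intent).
def lcs(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

print(lcs(["k", "ae", "ae", "t"], ["k", "ae", "t", "s"]))  # ['k', 'ae', 't']
```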
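
The Dataset Splits row describes holding out a random 10% of the training data for testing. The sketch below reconstructs that split; the seed and the placeholder utterance IDs are assumptions, and no validation split is modeled because the paper does not describe one.

```python
# Illustrative 90/10 train/test split over utterance IDs (IDs and seed are
# placeholders, not from the paper).
import random

utterances = [f"utt_{i:05d}" for i in range(1000)]
random.seed(0)
random.shuffle(utterances)
cut = int(0.1 * len(utterances))
test_set, train_set = utterances[:cut], utterances[cut:]
print(len(train_set), len(test_set))  # 900 100
```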
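
The optimization recipe in the Experiment Setup row maps directly onto PyTorch: Adam starting at lr = 0.001, decayed by a factor of 0.9 every 10 steps. Whether a "step" is an optimizer update or an epoch is an assumption here, as is the stand-in model and dummy loss.

```python
# Sketch of the stated recipe: Adam, lr 1e-3, StepLR decay of 0.9 every 10 steps.
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the actual SSDM model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.9)

for step in range(30):
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()  # lr becomes 9e-4 after step 10, 8.1e-4 after step 20, ...
```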