SSDM: Scalable Speech Dysfluency Modeling
Authors: Jiachen Lian, Xuanru Zhou, Zoe Ezzes, Jet Vonk, Brittany Morin, David Paul Baquirin, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate phonetic transcription (forced alignment) performance using simulated data from VCTK++[1] and our proposed Libri-Dys dataset. The framewise F1 score and dPER[1] are used as evaluation metrics (a framewise-F1 sketch appears after the table). Five types of training data are used: VCTK++, LibriTTS (100%, [106]), Libri-Dys (30%), Libri-Dys (60%), and Libri-Dys (100%). |
| Researcher Affiliation | Academia | Jiachen Lian¹, Xuanru Zhou², Zoe Ezzes³, Jet Vonk³, Brittany Morin³, David Baquirin³, Zachary Miller³, Maria Luisa Gorno Tempini³, Gopala Anumanchipalli¹ (¹UC Berkeley, ²Zhejiang University, ³UCSF) |
| Pseudocode | Yes | Algorithm 1: Find Longest Common Subsequence (LCS). A standard dynamic-programming LCS sketch appears after the table. |
| Open Source Code | No | For code, we are waiting for further approval. |
| Open Datasets | Yes | Data is open-sourced at https://bit.ly/4aoLdWU. |
| Dataset Splits | No | For training, we use the VCTK++[1] and Libri-Dys datasets. For testing, we randomly sample 10% of the training data (a sketch of this protocol appears after the table). The paper does not describe a separate validation split or how it is derived. |
| Hardware Specification | Yes | The training is conducted using two A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like WavLM, the Glow algorithm, and the Adam optimizer but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | In Eq. 2, τ = 2. In Eq. 4, a = b = 1, m_row = 3. In Eq. 6 and Eq. 7, we simply set K1 = K2 = 1. In Eq. 8, λ1 = λ2 = λ3 = 1. In Eq. 12 and Eq. 13, δ = 0.9. ... We use the Adam optimizer and decay the learning rate from 0.001 by a factor of 0.9 every 10 steps until convergence (an optimizer-schedule sketch appears after the table). |
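The framewise F1 metric quoted in the Research Type row compares frame-level phoneme predictions against frame-level references. The paper does not spell out the averaging mode, so the `macro` averaging and the integer label encoding below are assumptions; this is a minimal sketch, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score

def framewise_f1(ref_frames: np.ndarray, hyp_frames: np.ndarray) -> float:
    """Framewise F1 between reference and predicted frame-level phoneme labels.

    Both arrays hold one integer phoneme ID per frame. Macro averaging is an
    assumption; the paper does not state which averaging mode it uses.
    """
    assert ref_frames.shape == hyp_frames.shape
    return f1_score(ref_frames, hyp_frames, average="macro")

# Hypothetical usage: 5 frames, 3 phoneme classes.
ref = np.array([0, 0, 1, 2, 2])
hyp = np.array([0, 1, 1, 2, 2])
print(framewise_f1(ref, hyp))
```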
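The paper's Algorithm 1 is a longest-common-subsequence routine. Its exact variant is not reproduced here; the sketch below is the textbook dynamic-programming LCS, applied to hypothetical ARPAbet-style phoneme sequences.

```python
def longest_common_subsequence(a: list[str], b: list[str]) -> list[str]:
    """Standard DP solution: O(len(a) * len(b)) time and space.

    dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover one LCS (not necessarily unique).
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Hypothetical reference vs. dysfluent phoneme sequence (repeated "P").
print(longest_common_subsequence(["P", "L", "IY", "Z"], ["P", "P", "L", "IY", "S", "Z"]))
# -> ['P', 'L', 'IY', 'Z']
```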
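The test split is described only as a random 10% sample of the training data, with no validation split. A minimal sketch of that protocol, where the seed and the utterance-ID representation are assumptions:

```python
import random

def split_train_test(utterances: list[str], test_frac: float = 0.10, seed: int = 0):
    """Randomly holds out `test_frac` of the utterances for testing.

    The seed is an assumption; the paper specifies neither a seed nor a
    validation split.
    """
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train, test = split_train_test([f"utt_{i:04d}" for i in range(1000)])
print(len(train), len(test))  # 900 100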
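The reported optimization schedule (Adam, initial learning rate 0.001, decayed by a factor of 0.9 every 10 steps) maps directly onto standard PyTorch primitives. A minimal sketch, assuming PyTorch as the framework (the paper does not name one) and using a placeholder model and loss:

```python
import torch
from torch import nn

model = nn.Linear(512, 40)  # placeholder; the SSDM model itself is not yet released
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.9 every 10 steps, as reported.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for step in range(100):  # the paper trains "until convergence"; 100 steps here
    x = torch.randn(8, 512)          # dummy batch
    loss = model(x).pow(2).mean()    # dummy loss standing in for Eq. 8's weighted sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```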