Generative Pre-training for Speech with Flow Matching

Authors: Alexander H. Liu, Matthew Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
Researcher Affiliation | Collaboration | Alexander H. Liu¹, Matt Le², Apoorv Vyas², Bowen Shi², Andros Tjandra², Wei-Ning Hsu² (¹MIT CSAIL, ²Meta AI)
Pseudocode | No | The paper describes the inference process and model architecture in text and diagrams, but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper provides links to 'Audio samples' and 'supplementary materials' (e.g., https://voicebox.metademolab.com/speechflow.html, https://openreview.net/forum?id=KpoQSgxbKH), but it contains no explicit statement about releasing the source code for the described methodology and no direct link to a code repository.
Open Datasets | Yes | We fine-tuned and tested SpeechFlow on the benchmark dataset VoiceBank-DEMAND (VB-DMD; Valentini-Botinhao et al., 2017) ... We also trained our model using 100 hours of noisy speech from the Deep Noise Suppression Challenge 2020 (DNS2020; Reddy et al., 2020) ... We tested the fine-tuned model on LibriMix (Cosentino et al., 2020) 16 kHz min. ... on filtered LS (Panayotov et al., 2015) test-clean.
Dataset Splits | No | The paper mentions training on certain datasets and testing on others, and describes random cropping of training data, but it does not specify explicit training/validation/test splits (e.g., percentages or exact counts) for its own experimental setup, nor does it describe predefined splits in enough detail for reproducibility beyond naming the test sets of certain benchmarks.
Hardware Specification | Yes | We pre-train SpeechFlow for 600k steps on 32 V100 GPUs with a batch size of 75 seconds per GPU with FP16. ... We fine-tuned SpeechFlow on a single V100 GPU...
Software Dependencies | No | The paper references various methods and frameworks through their associated papers (e.g., 'Adam optimizer (Kingma & Ba, 2014)', 'Transformer encoder (Vaswani et al., 2017)', 'torchdiffeq (Chen, 2018)'). While it names software and publication years, it does not provide specific version numbers (e.g., PyTorch 1.9, CUDA 11.1) for the key dependencies needed to reproduce the environment.
Experiment Setup | Yes | We use Adam optimizer (Kingma & Ba, 2014) with the learning rate warming up linearly to 5e-5 for the first 5k steps and linearly decaying to 1e-5 for the rest of the training. For masking, we set p_drop = 10%, n_mask ∼ U[70%, 100%], and l_mask = 10. ... The learning rate is set to peak at 2e-5 after 5k updates, then linearly decay to 0.
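The optimizer schedule and masking settings quoted in the Experiment Setup row can be made concrete with a short PyTorch sketch. This is a minimal reconstruction under our own assumptions, not the authors' code (none is released): the helper names lr_lambda and sample_mask are invented here, and the reading of p_drop as the probability of dropping the condition entirely, with masked spans laid down in chunks of length l_mask, is our interpretation of the quoted hyperparameters.

```python
# Hedged sketch of the pre-training LR schedule and masking quoted above.
# Helper names and the chunk-based masking loop are our assumptions.
import torch

TOTAL_STEPS = 600_000    # pre-training steps quoted in the paper
WARMUP_STEPS = 5_000
PEAK_LR, FINAL_LR = 5e-5, 1e-5

def lr_lambda(step: int) -> float:
    """Linear warm-up to PEAK_LR, then linear decay to FINAL_LR,
    returned as a multiplier of the base (peak) learning rate."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    frac = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return (PEAK_LR + frac * (FINAL_LR - PEAK_LR)) / PEAK_LR

model = torch.nn.Linear(128, 128)   # stand-in for the Transformer backbone
opt = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Training loop would call opt.step() followed by sched.step() each update.

def sample_mask(n_frames: int,
                p_drop: float = 0.10,     # drop the condition entirely 10% of the time (our reading)
                n_mask_lo: float = 0.70,  # masked fraction ~ U[70%, 100%]
                n_mask_hi: float = 1.00,
                l_mask: int = 10) -> torch.Tensor:
    """Return a boolean mask over frames; True = frame is masked (to be generated)."""
    if torch.rand(()) < p_drop:
        return torch.ones(n_frames, dtype=torch.bool)
    ratio = n_mask_lo + (n_mask_hi - n_mask_lo) * torch.rand(())
    target = int(ratio * n_frames)
    mask = torch.zeros(n_frames, dtype=torch.bool)
    while int(mask.sum()) < target:
        start = int(torch.randint(0, max(1, n_frames - l_mask + 1), ()))
        mask[start:start + l_mask] = True  # lay down chunks of l_mask frames
    return mask
```

The fine-tuning schedule quoted in the same row follows the same shape, with the peak at 2e-5 after 5k updates and a linear decay to 0.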