Generative Pre-training for Speech with Flow Matching

Authors: Alexander H. Liu, Matthew Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
Researcher Affiliation | Collaboration | Alexander H. Liu¹, Matt Le², Apoorv Vyas², Bowen Shi², Andros Tjandra², Wei-Ning Hsu² (¹MIT CSAIL, ²Meta AI)
Pseudocode | No | The paper describes the inference process and model architecture in text and diagrams, but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper provides links to 'Audio samples' and 'supplementary materials' (e.g., https://voicebox.metademolab.com/speechflow.html, https://openreview.net/forum?id=KpoQSgxbKH), but it contains no explicit statement about releasing the source code for the described methodology and no direct link to a code repository.
Open Datasets | Yes | We fine-tuned and tested SpeechFlow on the benchmark dataset VoiceBank-DEMAND (VB-DMD; Valentini-Botinhao et al., 2017) ... We also trained our model using 100 hours of noisy speech from the Deep Noise Suppression Challenge 2020 (DNS2020; Reddy et al., 2020) ... We tested the fine-tuned model on LibriMix (Cosentino et al., 2020) 16 kHz min. ... on filtered LS (Panayotov et al., 2015) test-clean.
Dataset Splits | No | The paper mentions training on certain datasets and testing on others, and describes random cropping of training data, but it does not specify explicit training/validation/test splits (e.g., percentages or exact counts) for its own experimental setup, nor does it describe predefined splits in enough detail for reproducibility beyond naming the test sets of certain benchmarks.
Hardware Specification | Yes | We pre-train SpeechFlow for 600k steps on 32 V100 GPUs with a batch size of 75 seconds per GPU with FP16. ... We fine-tuned SpeechFlow on a single V100 GPU...
Software Dependencies | No | The paper references various methods and frameworks through their associated papers (e.g., 'Adam optimizer (Kingma & Ba, 2014)', 'Transformer encoder (Vaswani et al., 2017)', 'torchdiffeq (Chen, 2018)'). While it names software and publication years, it does not provide specific version numbers (e.g., PyTorch 1.9, CUDA 11.1) for the key dependencies needed to reproduce the environment.
Experiment Setup | Yes | We use Adam optimizer (Kingma & Ba, 2014) with the learning rate warming up linearly to 5e-5 for the first 5k steps and linearly decaying to 1e-5 for the rest of the training. For masking, we set p_drop = 10%, n_mask ∼ U[70%, 100%], and l_mask = 10. ... The learning rate is set to peak at 2e-5 after 5k updates, then linearly decay to 0.
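The optimizer schedule and masking settings quoted in the Experiment Setup row can be made concrete with a short PyTorch sketch. This is a minimal reconstruction under our own assumptions, not the authors' code (none is released): the helper names lr_lambda and sample_mask are invented here, and the reading of p_drop as the probability of dropping the condition entirely, with masked spans laid down in chunks of length l_mask, is our interpretation of the quoted hyperparameters.

```python
# Hedged sketch of the pre-training LR schedule and masking quoted above.
# Helper names and the chunk-based masking loop are our assumptions.
import torch

TOTAL_STEPS = 600_000    # pre-training steps quoted in the paper
WARMUP_STEPS = 5_000
PEAK_LR, FINAL_LR = 5e-5, 1e-5

def lr_lambda(step: int) -> float:
    """Linear warm-up to PEAK_LR, then linear decay to FINAL_LR,
    returned as a multiplier of the base (peak) learning rate."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    frac = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return (PEAK_LR + frac * (FINAL_LR - PEAK_LR)) / PEAK_LR

model = torch.nn.Linear(128, 128)   # stand-in for the Transformer backbone
opt = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Training loop would call opt.step() followed by sched.step() each update.

def sample_mask(n_frames: int,
                p_drop: float = 0.10,     # drop the condition entirely 10% of the time (our reading)
                n_mask_lo: float = 0.70,  # masked fraction ~ U[70%, 100%]
                n_mask_hi: float = 1.00,
                l_mask: int = 10) -> torch.Tensor:
    """Return a boolean mask over frames; True = frame is masked (to be generated)."""
    if torch.rand(()) < p_drop:
        return torch.ones(n_frames, dtype=torch.bool)
    ratio = n_mask_lo + (n_mask_hi - n_mask_lo) * torch.rand(())
    target = int(ratio * n_frames)
    mask = torch.zeros(n_frames, dtype=torch.bool)
    while int(mask.sum()) < target:
        start = int(torch.randint(0, max(1, n_frames - l_mask + 1), ()))
        mask[start:start + l_mask] = True  # lay down chunks of l_mask frames
    return mask
```

The fine-tuning schedule quoted in the same row follows the same shape, with the peak at 2e-5 after 5k updates and a linear decay to 0.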