Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiment |
| Researcher Affiliation | Industry | Fundamental AI Research (FAIR), Meta |
| Pseudocode | No | The paper includes diagrams and descriptions of the model architecture and training process, but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to audio samples (https://voicebox.metademolab.com) but does not provide an explicit statement or link to open-source code for the Voicebox model itself. |
| Open Datasets | Yes | "We train the English-only model on 60K hours ASR-transcribed English audiobooks and the multilingual model on 50K hours of multilingual audiobooks in 6 languages: English (En), French (Fr), German (De), Spanish (Es), Polish (Pl) and Portuguese (Pt)." |
| Dataset Splits | Yes | The model is tested on the dev-clean split of LibriSpeech. A 100-hour subset of the 60K-hour English data is taken, and 2,703 random utterances (matching the size of dev-clean) are set aside as a validation split. |
| Hardware Specification | No | The paper mentions "8 GPUs" for vocoder training but does not give specific hardware details such as GPU models, CPU types, or memory specifications for running the experiments. |
| Software Dependencies | No | The paper mentions software such as PyTorch [45] and torchdiffeq [9], but does not specify version numbers or other key software dependencies with the version information required for reproducibility. |
| Experiment Setup | Yes | "VB-En/VB-Multi audio models are trained for 500K/750K updates with an effective batch size of 240K frames... The Adam [33] optimizer is used with a peak learning rate of 1e-4, linearly warmed up for 5K steps... The audio/duration sequence is masked with pdrop = 0.3/0.2... puncond is set to 0.2 for audio/duration models." |
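The Experiment Setup row quotes a peak learning rate of 1e-4 with a 5K-step linear warmup and a conditioning-dropout probability puncond = 0.2. A minimal Python sketch of those two ingredients follows; it is an illustration, not the paper's implementation. The post-warmup schedule is not stated in the quoted excerpt, so the rate is simply held at the peak here, and `maybe_drop_condition` is a hypothetical helper name.

```python
import random


def warmup_lr(step, peak_lr=1e-4, warmup_steps=5000):
    """Linear learning-rate warmup to peak_lr over warmup_steps.

    The behavior after warmup (e.g. any decay) is not specified in the
    excerpt, so the rate is held at the peak as a placeholder.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


def maybe_drop_condition(cond, p_uncond=0.2, rng=random):
    """Classifier-free-guidance-style conditioning dropout: with
    probability p_uncond, replace the conditioning input with None
    (the unconditional case); otherwise pass it through unchanged."""
    return None if rng.random() < p_uncond else cond
```

In practice such a schedule would typically be wired into the optimizer via something like PyTorch's `LambdaLR`, but the pure-Python form above makes the arithmetic explicit.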