Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiment |
| Researcher Affiliation | Industry | Fundamental AI Research (FAIR), Meta |
| Pseudocode | No | The paper includes diagrams and descriptions of the model architecture and training process, but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to audio samples (https://voicebox.metademolab.com) but does not provide an explicit statement or link to open-source code for the Voicebox model itself. |
| Open Datasets | Yes | "We train the English-only model on 60K hours ASR-transcribed English audiobooks and the multilingual model on 50K hours of multilingual audiobooks in 6 languages: English (En), French (Fr), German (De), Spanish (Es), Polish (Pl) and Portuguese (Pt)." |
| Dataset Splits | Yes | The model is tested on the dev-clean split of LibriSpeech. A 100-hour subset of the 60K-hour English data is taken, and 2,703 random utterances (matching the size of dev-clean) are set aside as a validation split. |
| Hardware Specification | No | The paper mentions "8 GPUs" for vocoder training but does not give specific hardware details such as GPU models, CPU types, or memory specifications for running the experiments. |
| Software Dependencies | No | The paper mentions software such as PyTorch [45] and torchdiffeq [9], but does not specify version numbers or other key software dependencies with the version information required for reproducibility. |
| Experiment Setup | Yes | "VB-En/VB-Multi audio models are trained for 500K/750K updates with an effective batch size of 240K frames... The Adam [33] optimizer is used with a peak learning rate of 1e-4, linearly warmed up for 5K steps... The audio/duration sequence is masked with pdrop = 0.3/0.2... puncond is set to 0.2 for audio/duration models." |
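The Experiment Setup row quotes a peak learning rate of 1e-4 with a 5K-step linear warmup and a conditioning-dropout probability puncond = 0.2. A minimal Python sketch of those two ingredients follows; it is an illustration, not the paper's implementation. The post-warmup schedule is not stated in the quoted excerpt, so the rate is simply held at the peak here, and `maybe_drop_condition` is a hypothetical helper name.

```python
import random


def warmup_lr(step, peak_lr=1e-4, warmup_steps=5000):
    """Linear learning-rate warmup to peak_lr over warmup_steps.

    The behavior after warmup (e.g. any decay) is not specified in the
    excerpt, so the rate is held at the peak as a placeholder.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


def maybe_drop_condition(cond, p_uncond=0.2, rng=random):
    """Classifier-free-guidance-style conditioning dropout: with
    probability p_uncond, replace the conditioning input with None
    (the unconditional case); otherwise pass it through unchanged."""
    return None if rng.random() < p_uncond else cond
```

In practice such a schedule would typically be wired into the optimizer via something like PyTorch's `LambdaLR`, but the pure-Python form above makes the arithmetic explicit.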