Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental |
| Researcher Affiliation | Industry | Fundamental AI Research (FAIR), Meta |
| Pseudocode | No | The paper includes diagrams and descriptions of the model architecture and training process, but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to audio samples (https://voicebox.metademolab.com) but does not include an explicit statement or link to the open-source code for the Voicebox model itself. |
| Open Datasets | Yes | We train the English-only model on 60K hours of ASR-transcribed English audiobooks and the multilingual model on 50K hours of multilingual audiobooks in 6 languages: English (En), French (Fr), German (De), Spanish (Es), Polish (Pl) and Portuguese (Pt). |
| Dataset Splits | Yes | The model is tested on the dev-clean split of Librispeech. We then take a 100 hour subset of the 60K hour-English data and set aside 2,703 random utterances (to match the size of dev-clean) which is used as a validation split. |
| Hardware Specification | No | The paper mentions '8 GPUs' for vocoder training but does not provide specific hardware details such as GPU models, CPU types, or memory specifications for running experiments. |
| Software Dependencies | No | The paper mentions software like PyTorch [45] and torchdiffeq [9], but does not specify their version numbers or other key software dependencies with version information required for reproducibility. |
| Experiment Setup | Yes | VB-En/VB-Multi audio models are trained for 500K/750K updates with an effective batch size of 240K frames... The Adam [33] optimizer is used with a peak learning rate of 1e-4, linearly warmed up for 5K steps... The audio/duration sequence is masked with p_drop = 0.3/0.2... p_uncond is set to 0.2 for the audio/duration models. (A hedged sketch of this optimizer configuration follows the table.) |
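
The optimizer settings quoted in the Experiment Setup row map onto a standard PyTorch training schedule. The sketch below is a minimal illustration under assumptions: the model is a placeholder (the Voicebox architecture and training code are not released), the learning rate is held constant after warmup because the excerpt only specifies the warmup, and the flow-matching loss is elided.

```python
import torch

# Hypothetical stand-in for the audio model; the actual Voicebox Transformer
# trained with flow matching is not publicly available.
model = torch.nn.Linear(80, 80)

PEAK_LR = 1e-4           # peak learning rate quoted in the paper
WARMUP_STEPS = 5_000     # linear warmup steps quoted in the paper
TOTAL_UPDATES = 500_000  # VB-En audio model updates (750K for VB-Multi)

optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate over WARMUP_STEPS updates.
    # The post-warmup schedule is not stated in the excerpt, so it is
    # held constant here (an assumption, not the authors' schedule).
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_UPDATES):
    # ... compute the flow-matching loss on an ~240K-frame batch,
    #     call loss.backward(), then:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```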