Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiment |
| Researcher Affiliation | Industry | Fundamental AI Research (FAIR), Meta |
| Pseudocode | No | The paper includes diagrams and descriptions of the model architecture and training process, but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to audio samples (https://voicebox.metademolab.com) but does not include an explicit statement or link to the open-source code for the Voicebox model itself. |
| Open Datasets | Yes | We train the English-only model on 60K hours ASR-transcribed English audiobooks and the multilingual model on 50K hours of multilingual audiobooks in 6 languages: English (En), French (Fr), German (De), Spanish (Es), Polish (Pl) and Portuguese (Pt). |
| Dataset Splits | Yes | The model is tested on the dev-clean split of Librispeech. We then take a 100 hour subset of the 60K hour-English data and set aside 2,703 random utterances (to match the size of dev-clean) which is used as a validation split. |
| Hardware Specification | No | The paper mentions '8 GPUs' for vocoder training but does not provide specific hardware details such as GPU models, CPU types, or memory specifications for running experiments. |
| Software Dependencies | No | The paper mentions software like PyTorch [45] and torchdiffeq [9], but does not specify their version numbers or other key software dependencies with version information required for reproducibility. |
| Experiment Setup | Yes | VB-En/VB-Multi audio models are trained for 500K/750K updates with an effective batch size of 240K frames... The Adam [33] optimizer is used with a peak learning rate of 1e-4, linearly warmed up for 5K steps... The audio/duration sequence is masked with p_drop = 0.3/0.2... p_uncond is set to 0.2 for audio/duration models. |
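
The Experiment Setup row quotes a peak learning rate of 1e-4 reached by linear warmup over 5K steps. The sketch below illustrates only that warmup arithmetic. The function name `warmup_lr` and the 0-indexed step convention are hypothetical, and because the quoted text elides what happens after warmup, the code simply holds the rate at its peak as a placeholder rather than claiming to reproduce the paper's full schedule.

```python
PEAK_LR = 1e-4        # peak learning rate quoted in the Experiment Setup row
WARMUP_STEPS = 5_000  # linear warmup length quoted in the same row


def warmup_lr(step: int,
              peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Return the learning rate for a 0-indexed update step (hypothetical helper)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Assumption: the post-warmup behaviour is elided in the quoted text,
    # so this placeholder keeps the rate constant at the peak.
    return peak_lr


if __name__ == "__main__":
    for s in (0, 2_499, 4_999, 10_000):
        print(s, warmup_lr(s))
```

To use this with a framework scheduler that expects a multiplicative factor (for example PyTorch's `torch.optim.lr_scheduler.LambdaLR`), one option is to pass `lambda step: warmup_lr(step) / PEAK_LR` with the optimizer's base rate set to `PEAK_LR`.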