Robust Speech Recognition via Large-Scale Weak Supervision

Authors: Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our main findings are summarized in Figure 2 and Table 1. |
| Researcher Affiliation | Industry | OpenAI, San Francisco, CA 94110, USA. Correspondence to: Alec Radford <alec@openai.com>, Jong Wook Kim <jongwook@openai.com>. |
| Pseudocode | No | The paper describes the approach and procedures in text and diagrams (Figure 1), but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | We are releasing models and inference code to serve as a foundation for further work on robust speech processing. (A usage sketch follows the table.) |
| Open Datasets | Yes | We reuse a wide set of existing speech datasets to check whether Whisper is able to generalize well across domains, tasks, and languages. ... LibriSpeech (Panayotov et al., 2015): We used the test-clean and test-other splits from the LibriSpeech ASR corpus. (See the data-loading sketch below the table.) |
| Dataset Splits | No | Early stopping based on the validation loss was used to select model checkpoints for each dataset size. (However, the specific splits for the main 680,000-hour training dataset are not detailed.) |
| Hardware Specification | No | The paper acknowledges 'OpenAI for their critical work on software and hardware infrastructure this project used', but does not provide specific details on the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | Yes | Finally, we are grateful to the developers of the many software packages used throughout this project including, but not limited, to Numpy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011). |
| Experiment Setup | Yes | We train a suite of models in order to study the scaling properties of Whisper ranging from 39M to 1550M params. Models were trained with AdamW (Loshchilov & Hutter, 2017) and gradient norm clipping (Pascanu et al., 2013) with a linear learning rate decay to zero after a warmup over the first 2048 updates. A batch size of 256 segments was used, and the models are trained for 2^20 updates which is between two and three passes over the dataset. (A training-configuration sketch follows the table.) |
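
On the "Open Source Code" row: the released models and inference code are publicly available (the openai/whisper repository, installable as the `openai-whisper` package). A minimal usage sketch is below; the model size `"base"` and the filename `"sample.wav"` are placeholder choices, and the snippet assumes the package and its FFmpeg dependency are installed, none of which the paper itself specifies.

```python
# Minimal sketch of running a released Whisper checkpoint via the openai-whisper package.
# "base" and "sample.wav" are placeholders, not values taken from the paper.
import whisper

model = whisper.load_model("base")        # released checkpoints span roughly 39M ("tiny") to 1550M ("large") parameters
result = model.transcribe("sample.wav")   # performs language detection and transcription in one call
print(result["text"])
```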
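On the "Open Datasets" row: the LibriSpeech test-clean and test-other splits used for evaluation are freely downloadable. The sketch below fetches them with torchaudio; torchaudio is not mentioned in the paper and is only one convenient loader, chosen here as an assumption.

```python
# Sketch: downloading the LibriSpeech evaluation splits with torchaudio
# (the paper does not prescribe a loader; torchaudio is an assumption).
import torchaudio

test_clean = torchaudio.datasets.LIBRISPEECH(root="data", url="test-clean", download=True)
test_other = torchaudio.datasets.LIBRISPEECH(root="data", url="test-other", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = test_clean[0]
print(sample_rate, transcript)
```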
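On the "Experiment Setup" row: the quoted hyperparameters (AdamW, gradient norm clipping, linear learning rate decay to zero after a 2048-update warmup, batch size of 256 segments, 2^20 updates) map onto a standard PyTorch training configuration. The sketch below uses a stand-in linear model, a dummy loss, a placeholder peak learning rate, and a placeholder clipping norm of 1.0; those specifics are assumptions, not values from the paper.

```python
# A minimal PyTorch sketch of the reported optimization setup; the model, loss,
# peak learning rate, and clipping norm are placeholders, not values from the paper.
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 2048      # warmup over the first 2048 updates (from the paper)
TOTAL_STEPS = 2 ** 20    # 2^20 total updates (from the paper)
BATCH_SIZE = 256         # 256 segments per batch (from the paper)
PEAK_LR = 1e-3           # placeholder; the paper varies the peak LR with model size

model = torch.nn.Linear(80, 128)                  # stand-in for the Whisper encoder-decoder
optimizer = AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    """Linear warmup for WARMUP_STEPS, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(10):                            # a few dummy updates; the real run uses TOTAL_STEPS
    features = torch.randn(BATCH_SIZE, 80)
    targets = torch.randn(BATCH_SIZE, 128)
    loss = F.mse_loss(model(features), targets)   # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping (norm value assumed)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```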