Robust Speech Recognition via Large-Scale Weak Supervision
Authors: Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main findings are summarized in Figure 2 and Table 1. |
| Researcher Affiliation | Industry | 1OpenAI, San Francisco, CA 94110, USA. Correspondence to: Alec Radford <alec@openai.com>, Jong Wook Kim <jongwook@openai.com>. |
| Pseudocode | No | The paper describes the approach and procedures in text and diagrams (Figure 1), but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | We are releasing models and inference code to serve as a foundation for further work on robust speech processing. |
| Open Datasets | Yes | We reuse a wide set of existing speech datasets to check whether Whisper is able to generalize well across domains, tasks, and languages. ... LibriSpeech (Panayotov et al., 2015): We used the test-clean and test-other splits from the LibriSpeech ASR corpus. |
| Dataset Splits | No | Early stopping based on the validation loss was used to select model checkpoints for each dataset size. (However, the train/validation/test splits of the main 680,000-hour training dataset are not detailed.) |
| Hardware Specification | No | The paper acknowledges 'OpenAI for their critical work on software and hardware infrastructure this project used', but does not provide specific details on the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | Yes | Finally, we are grateful to the developers of the many software packages used throughout this project including, but not limited to, NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011). |
| Experiment Setup | Yes | We train a suite of models in order to study the scaling properties of Whisper ranging from 39M to 1550M params. Models were trained with AdamW (Loshchilov & Hutter, 2017) and gradient norm clipping (Pascanu et al., 2013) with a linear learning rate decay to zero after a warmup over the first 2048 updates. A batch size of 256 segments was used, and the models are trained for 2^20 updates which is between two and three passes over the dataset. (See the sketch below this table.) |
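
The training recipe quoted in the Experiment Setup row maps naturally onto a PyTorch optimizer and scheduler configuration. The following is a minimal, hypothetical sketch of that setup (AdamW, gradient norm clipping, linear decay to zero after a 2048-update warmup, batch size 256, 2^20 total updates); the model, data, peak learning rate, and clipping threshold are placeholders not taken from the paper.

```python
# Minimal sketch (not the authors' code) of the optimization recipe quoted above.
import torch

WARMUP_STEPS = 2048
TOTAL_STEPS = 2 ** 20      # ~1.05M updates, per the paper
BATCH_SIZE = 256           # segments per batch
PEAK_LR = 1e-3             # assumption: peak LR is not stated in the quoted text
CLIP_NORM = 1.0            # assumption: clipping threshold is not stated

model = torch.nn.Linear(80, 512)   # stand-in for a Whisper-style encoder-decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    """Linear warmup over the first 2048 updates, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(100):  # shortened demo loop; training runs for TOTAL_STEPS
    batch = torch.randn(BATCH_SIZE, 80)   # dummy features in place of audio segments
    loss = model(batch).pow(2).mean()     # dummy loss in place of the seq2seq objective
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

With a batch of 256 roughly 30-second segments, 2^20 updates works out to the "two and three passes" over the 680,000-hour dataset that the quoted text describes, which is why the schedule is defined in update steps rather than epochs.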