Continuous pseudo-labeling from the start

Authors: Dan Berrebbi, Ronan Collobert, Samy Bengio, Navdeep Jaitly, Tatiana Likhomanenko

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we show how we can do this by dynamically controlling the evolution of PLs during the training process in ASR. To the best of our knowledge, this is the first study that shows the feasibility of generating PLs from the very start of the training. We are able to achieve this using two techniques that avoid instabilities which lead to degenerate models that do not generalize. Firstly, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Secondly, we find that by sampling transcriptions from the predictive distribution, rather than only using the best transcription, we can stabilize training further. With these techniques, our ST models match prior works without an external language model. All our experiments are performed using the LibriSpeech dataset (Panayotov et al., 2015). (Both techniques are sketched after this table.)
Researcher Affiliation | Collaboration | Dan Berrebbi (Carnegie Mellon University, dberrebb@andrew.cmu.edu); Ronan Collobert, Samy Bengio, Navdeep Jaitly, Tatiana Likhomanenko (Apple, {collobert,bengio,njaitly,antares}@apple.com)
Pseudocode | Yes | Algorithm 1: slimIPL algorithm and our proposed changes (red deletions and green additions)
Open Source Code | No | We aim to open source the code of our method and experiments soon.
Open Datasets | Yes | All our experiments are performed using the LibriSpeech dataset (Panayotov et al., 2015). We use the train-clean-360 and train-other-500 regular subsets as unlabeled data, and consider either a subset of 10h randomly drawn from train-clean-100, or the full 100h set (train-clean-100) as labeled data. Comparisons with existing works are also provided using the 10h subset from Libri-Light (Kahn et al., 2020b). In addition, we evaluate the final configuration of our methods on the Common Voice dataset (Ardila et al., 2020) for French, where we sample 10h and 100h from the train set to use as labeled data and the rest as unlabeled data (see Appendix A.3).
Dataset Splits | Yes | We use the train-clean-360 and train-other-500 regular subsets as unlabeled data, and consider either a subset of 10h randomly drawn from train-clean-100, or the full 100h set (train-clean-100) as labeled data. All hyper-parameters and model selections are performed using dev-clean and dev-other sets. We report final token (TER) or word (WER) error rates on test-clean and test-other sets.
Hardware Specification | Yes | All models are trained on TF32 tensor cores of 8 Ampere A100 40GB GPUs for a maximum of 500k updates.
Software Dependencies | No | The paper mentions software components like 'Adagrad optimizer', 'CTC loss', and 'SpecAugment', but does not provide specific version numbers for these or other software dependencies required for replication.
Experiment Setup | Yes | All models are trained with CTC loss and Adagrad optimizer with linear warmup period of 64k steps, constant learning rate of 0.03, and step-wise (by 2) learning rate decay at the end of training. All models are trained on tf32 tensor cores of 8 Ampere A100 40GB GPUs for a maximum of 500k updates. We use either a static batch of 8 examples or a dynamic batch that packs 290s of audio per GPU. By default we use C = 1000, λ = 1, M = 0.
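
The Research Type row above names two stabilizing techniques. The first sketch covers the second technique, sampling transcriptions from the predictive distribution rather than only using the best transcription. It is a minimal PyTorch illustration, not the authors' code: it assumes a CTC acoustic model, frame-wise sampling as one plausible way to sample from a CTC predictive distribution, and placeholder blank index, vocabulary size, and temperature.

```python
# Minimal sketch (assumptions, not the authors' implementation) of producing a
# pseudo-label either greedily or by sampling each frame's token from the
# model's predictive distribution.
import itertools
import torch

BLANK = 0  # assumed CTC blank index


def collapse_ctc(frame_tokens):
    """Standard CTC best-path rule: merge repeated tokens, then drop blanks."""
    return [t for t, _ in itertools.groupby(frame_tokens) if t != BLANK]


def pseudo_label(logits, sample=True, temperature=1.0):
    """logits: [T, V] frame-wise scores from an acoustic model.

    sample=False gives the usual greedy (best-path) transcription;
    sample=True draws each frame's token from the softmax distribution,
    i.e. a sampled rather than best transcription.
    """
    if sample:
        dist = torch.distributions.Categorical(logits=logits / temperature)
        frames = dist.sample()          # stochastic: one token per frame
    else:
        frames = logits.argmax(dim=-1)  # deterministic best path
    return collapse_ctc(frames.tolist())


# Usage with random logits standing in for a real model's output:
fake_logits = torch.randn(50, 29)       # 50 frames, 29-symbol vocabulary (placeholder sizes)
print(pseudo_label(fake_logits, sample=False))
print(pseudo_label(fake_logits, sample=True))
```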
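The other technique, a curriculum that uses online changes in PLs to control membership of the PL cache, is the subject of the paper's Algorithm 1 (a modified slimIPL). The sketch below is only a schematic reconstruction under assumptions: `model.train_step` and `model.pseudo_label` are hypothetical placeholders, and the change-based membership rule (keep an utterance in the cache only while its pseudo-label is still evolving, measured by a crude edit ratio) stands in for the paper's exact curriculum.

```python
# Schematic sketch (not the paper's Algorithm 1) of a slimIPL-style cache for
# continuous pseudo-labeling. `model.train_step` and `model.pseudo_label` are
# hypothetical placeholders for an ASR training loop.
import random


def change_ratio(a, b):
    """Crude stand-in for a WER-like distance between two token sequences."""
    if not a and not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / max(len(a), len(b))


def train_with_cache(model, labeled, unlabeled, cache_size=1000, reuse_prob=0.5,
                     change_threshold=0.1, steps=500_000):
    cache = []  # at most cache_size (utterance, pseudo_label) pairs
    for step in range(steps):
        model.train_step(random.choice(labeled))       # supervised update

        if cache and random.random() < reuse_prob:
            # Train on a cached (possibly stale) pseudo-label ...
            idx = random.randrange(len(cache))
            utt, pl = cache[idx]
            # ... and use the online change of the PL as a curriculum signal:
            # keep the utterance cached only while its PL is still evolving.
            new_pl = model.pseudo_label(utt)
            if change_ratio(pl, new_pl) > change_threshold:
                cache[idx] = (utt, new_pl)
            else:
                cache.pop(idx)
        else:
            utt = random.choice(unlabeled)
            pl = model.pseudo_label(utt)               # could also be sampled, as sketched above
            if len(cache) < cache_size:
                cache.append((utt, pl))
        model.train_step((utt, pl))                    # unsupervised update on the pseudo-label
```

In the paper this kind of cache logic runs from the very first updates rather than after a long supervised-only phase, which is the point of generating PLs from the start.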
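For the Open Datasets and Dataset Splits rows, the 10h/100h labeled and 860h unlabeled split can be assembled from the public LibriSpeech subsets. The sketch below uses torchaudio's LibriSpeech loader as one possible tool (the paper does not say which loader was used), and the random 10h draw is only an approximation, since the authors' exact utterance selection is not specified.

```python
# Sketch of the labeled/unlabeled LibriSpeech split described above, using
# torchaudio as an assumed (not confirmed) data-loading choice.
import random
from torch.utils.data import ConcatDataset, Subset
from torchaudio.datasets import LIBRISPEECH

ROOT = "./data"  # assumed download location

labeled_100h = LIBRISPEECH(ROOT, url="train-clean-100", download=True)
unlabeled = ConcatDataset([
    LIBRISPEECH(ROOT, url="train-clean-360", download=True),
    LIBRISPEECH(ROOT, url="train-other-500", download=True),
])
dev_sets = {name: LIBRISPEECH(ROOT, url=name, download=True)
            for name in ("dev-clean", "dev-other")}

# Random 10h labeled subset: train-clean-100 utterances average roughly 12-13 s,
# so about 2,900 utterances approximates 10 hours (an assumption, not the
# paper's exact selection procedure).
indices = random.sample(range(len(labeled_100h)), k=2_900)
labeled_10h = Subset(labeled_100h, indices)
```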
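The Hardware Specification row reports training on the TF32 tensor cores of A100 GPUs. Assuming a PyTorch implementation (the paper does not name the framework), matching that numeric setup typically means enabling TF32 explicitly, since depending on the PyTorch version it is not on by default.

```python
# Opt into TF32 math on Ampere GPUs in PyTorch (an assumption about the stack,
# not a detail confirmed by the paper).
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for float32 matmuls
torch.backends.cudnn.allow_tf32 = True         # TF32 inside cuDNN convolutions
```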
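The Software Dependencies row notes that SpecAugment, CTC loss, and the Adagrad optimizer are named without versions. As one illustration of the augmentation piece, the sketch below applies frequency and time masking with torchaudio; the library choice and mask sizes are assumptions, not the authors' configuration.

```python
# SpecAugment-style masking with torchaudio (assumed tooling and mask sizes).
import torch
import torchaudio.transforms as T

spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=27),  # mask up to 27 frequency bins
    T.TimeMasking(time_mask_param=100),      # mask up to 100 frames
)

mel = torch.randn(1, 80, 1200)               # fake (batch, mel bins, frames) features
augmented = spec_augment(mel)
```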
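Finally, the Experiment Setup row specifies CTC training with Adagrad, a 64k-step linear warmup to a constant learning rate of 0.03, and a step-wise (by 2) decay at the end of training. A minimal sketch of that schedule, again assuming PyTorch, follows; the step at which the decay kicks in (400k here) and the placeholder model are assumptions, since the quote only says "at the end of training".

```python
# Sketch of the optimizer and learning-rate schedule described above (assumed
# PyTorch; the decay point and model are placeholders).
import torch

WARMUP, DECAY_AT, PEAK_LR = 64_000, 400_000, 0.03


def lr_scale(step):
    if step < WARMUP:
        return step / WARMUP          # linear warmup to the constant rate
    if step < DECAY_AT:
        return 1.0                    # constant learning rate of 0.03
    return 0.5                        # step-wise decay by 2 near the end of training


model = torch.nn.Linear(80, 29)       # placeholder acoustic model
optimizer = torch.optim.Adagrad(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)  # CTC training criterion
```

The dynamic batching of 290 s of audio per GPU and the slimIPL hyper-parameters C = 1000, λ = 1, M = 0 from the same row are not shown in this sketch.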