Jointly Learning Visual and Auditory Speech Representations from Raw Data

Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments with models and datasets of different sizes. We find that, when fine-tuning our pre-trained models for VSR and ASR with only 30 hours of labelled data from LRS3 (Afouras et al., 2018b), RAVEn surpasses recent self-supervised methods by a large margin in most settings. Coupling pre-training with self-training reaches 23.8% WER for VSR on LRS3, even outperforming a method trained on 3,000× more transcribed hours (Serdyuk et al., 2021).
Researcher Affiliation | Collaboration | 1 Imperial College London, 2 Meta AI
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/ahaliassos/raven.
Open Datasets | Yes | For pre-training, we conduct experiments with LRS3 (Afouras et al., 2018b) (without the labels) as well as a combination of LRS3 and an English-only version of VoxCeleb2 (Chung et al., 2018) curated by Shi et al. (2022), which we refer to as LRS3+Vox2-en. The former features 433 hours of footage and the latter 1,759 hours. For fine-tuning, we use the full LRS3 with the labels as our high-resource labelled data setting, as well as a 30-hour subset (the trainval partition) as our low-resource setting. We present results for the LRS2 dataset (Chung et al., 2017) in Appendix B.
Dataset Splits | Yes | Ablations are performed with our Base model in the low-resource setting with LRS3 pre-training, using the validation set from Shi et al. (2022) (as there is no official development set). When using a language model, β is chosen from {0.1, 0.2, 0.3, 0.4} using the validation set.
Hardware Specification | Yes | We train the Base model with 32 A100 GPUs, and the Large model with 128.
Software Dependencies | No | We use the ESPnet framework (Watanabe et al., 2018) for decoding. We set the beam size to 40. The final score used to choose the most likely sequence is given by S = λ·S_ctc + (1 − λ)·S_att + β·S_LM, where S_ctc and S_att denote the scores from the CTC and attention branches, respectively, and λ = 0.1. S_LM is an optional score from the language model, incorporated through shallow fusion (Watanabe et al., 2017). When using a language model, β is chosen from {0.1, 0.2, 0.3, 0.4} using the validation set. (A minimal scoring sketch is given below the table.)
Experiment Setup | Yes | Table 6 provides the default setting for pre-training. We use the AdamW (Loshchilov & Hutter, 2019) optimiser with linear learning rate warmup (Goyal et al., 2017) and a cosine decay schedule (Loshchilov & Hutter, 2017). During training, we apply random spatial cropping of size 88 × 88 followed by horizontal flipping with probability 0.5. These augmentations are applied in a time-consistent manner across the video clips to maintain temporal coherence. (A time-consistent augmentation sketch is given below the table.)
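
To make the decoding rule from the Software Dependencies row concrete, below is a minimal sketch of the joint CTC/attention scoring with an optional shallow-fusion language-model term, S = λ·S_ctc + (1 − λ)·S_att + β·S_LM with λ = 0.1. The function names, the tuple layout of the hypotheses, and the example scores are illustrative assumptions; this is not the ESPnet implementation or API.

```python
# Sketch of the joint scoring rule S = λ·S_ctc + (1 − λ)·S_att + β·S_lm used to
# rank beam-search hypotheses. The paper uses ESPnet with beam size 40, λ = 0.1,
# and β tuned on the validation set; everything below is illustrative.

def hypothesis_score(s_ctc: float, s_att: float, s_lm: float = 0.0,
                     lam: float = 0.1, beta: float = 0.0) -> float:
    """Combine CTC, attention, and (optional) LM log-scores for one hypothesis."""
    return lam * s_ctc + (1.0 - lam) * s_att + beta * s_lm


def pick_best(hypotheses, lam=0.1, beta=0.0):
    """Return the hypothesis with the highest combined score.

    `hypotheses` is an iterable of (tokens, s_ctc, s_att, s_lm) tuples,
    e.g. the surviving beams after beam search.
    """
    return max(hypotheses,
               key=lambda h: hypothesis_score(h[1], h[2], h[3], lam, beta))


# Example: without an LM, beta = 0; with shallow fusion, beta would be chosen
# from {0.1, 0.2, 0.3, 0.4} on the validation set.
beams = [("hello world", -4.2, -3.1, -2.0), ("hello word", -4.0, -3.6, -2.5)]
best = pick_best(beams, lam=0.1, beta=0.3)
```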
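
The time-consistent augmentation described in the Experiment Setup row (a single 88 × 88 crop position and a single flip decision shared by every frame of a clip) could be sketched as follows. The array layout, function name, and input clip size are assumptions for illustration, not the authors' pipeline.

```python
import random
import numpy as np


def augment_clip(frames: np.ndarray, crop_size: int = 88, flip_prob: float = 0.5):
    """Time-consistent augmentation sketch for one video clip.

    `frames` has shape (T, H, W) or (T, H, W, C). One crop offset and one flip
    decision are sampled per clip and applied to every frame, keeping the
    augmentation consistent over time.
    """
    _, h, w = frames.shape[:3]
    # Sample a single crop offset for the whole clip.
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    frames = frames[:, top:top + crop_size, left:left + crop_size]

    # Flip all frames (or none) with probability 0.5.
    if random.random() < flip_prob:
        frames = frames[:, :, ::-1]

    return frames


# Example: a 25-frame, 96x96 grayscale clip becomes 88x88 after augmentation.
clip = np.random.rand(25, 96, 96)
aug = augment_clip(clip)
assert aug.shape == (25, 88, 88)
```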