Jointly Learning Visual and Auditory Speech Representations from Raw Data
Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with models and datasets of different sizes. We find that, when fine-tuning our pre-trained models for VSR and ASR with only 30 hours of labelled data from LRS3 (Afouras et al., 2018b), RAVEn surpasses recent self-supervised methods by a large margin in most settings. Coupling pre-training with self-training reaches 23.8% WER for VSR on LRS3, even outperforming a method trained on 3,000× more transcribed hours (Serdyuk et al., 2021). |
| Researcher Affiliation | Collaboration | 1 Imperial College London; 2 Meta AI |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/ahaliassos/raven. |
| Open Datasets | Yes | For pre-training, we conduct experiments with LRS3 (Afouras et al., 2018b) (without the labels) as well as a combination of LRS3 and an English-only version of VoxCeleb2 (Chung et al., 2018) curated by Shi et al. (2022), which we refer to as LRS3+Vox2-en. The former features 433 hours of footage and the latter 1,759 hours. For fine-tuning, we use the full LRS3 with the labels as our high-resource labelled data setting, as well as a 30-hour subset (the trainval partition) as our low-resource setting. We present results for the LRS2 dataset (Chung et al., 2017) in Appendix B. |
| Dataset Splits | Yes | Ablations are performed with our Base model in the low-resource setting with LRS3 pre-training using the validation set from Shi et al. (2022) (as there is no official development set). When using a language model, β is chosen from {0.1, 0.2, 0.3, 0.4} using the validation set. |
| Hardware Specification | Yes | We train the Base model with 32 A100 GPUs, and the Large model with 128. |
| Software Dependencies | No | We use the ESPnet framework (Watanabe et al., 2018) for decoding. We set the beam size to 40. The final score used to choose the most likely sequence is given by S = λ·S_ctc + (1 − λ)·S_att + β·S_LM, where S_ctc and S_att denote the scores from the CTC and attention branches, respectively, and λ = 0.1. S_LM is an optional score from the language model, incorporated through shallow fusion (Watanabe et al., 2017). When using a language model, β is chosen from {0.1, 0.2, 0.3, 0.4} using the validation set. (A score-combination sketch is given after the table.) |
| Experiment Setup | Yes | Table 6 provides the default setting for pre-training. We use the AdamW (Loshchilov & Hutter, 2019) optimiser with linear learning rate warmup (Goyal et al., 2017) and a cosine decay schedule (Loshchilov & Hutter, 2017). During training, we apply random spatial cropping of size (88 × 88) followed by horizontal flipping with probability 0.5. These augmentations are applied in a time-consistent manner across the video clips to maintain temporal coherence. (Sketches of the schedule and the augmentation follow the table.) |
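
For context on the decoding row above, here is a minimal sketch of the quoted score combination. It is illustrative only: the paper uses ESPnet's joint CTC/attention beam search, and the function name, the example scores, and the way scores are passed in are assumptions; λ = 0.1 and the β grid come from the quoted text.

```python
def combined_score(s_ctc, s_att, s_lm=None, lam=0.1, beta=0.3):
    """Combine per-hypothesis scores as S = λ·S_ctc + (1 − λ)·S_att + β·S_LM.

    s_ctc, s_att: scores from the CTC and attention branches.
    s_lm: optional language-model score added via shallow fusion.
    lam: fixed at 0.1 in the quoted setup.
    beta: tuned on the validation set over {0.1, 0.2, 0.3, 0.4}.
    """
    score = lam * s_ctc + (1.0 - lam) * s_att
    if s_lm is not None:
        score += beta * s_lm
    return score


# Example: rank two hypothetical beam-search hypotheses by combined score.
hyps = [
    {"ctc": -4.2, "att": -3.9, "lm": -2.1},
    {"ctc": -4.0, "att": -4.3, "lm": -1.8},
]
best = max(hyps, key=lambda h: combined_score(h["ctc"], h["att"], h["lm"]))
```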
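
The experiment-setup row mentions AdamW with linear warmup and cosine decay. The sketch below shows one common way to compute such a schedule; the warmup length, total step count, and base learning rate are placeholders rather than values taken from the paper.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear learning-rate warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup from (roughly) zero up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining training steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example with placeholder values: 2,000 warmup steps out of 100,000 total.
print(lr_at_step(1_000, base_lr=3e-4, warmup_steps=2_000, total_steps=100_000))
```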
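
The same row describes time-consistent augmentation: an (88 × 88) random crop and a horizontal flip with probability 0.5, applied identically to every frame of a clip. A minimal NumPy sketch, with the function name and the clip layout assumed here, could look like this:

```python
import random
import numpy as np

def augment_clip(frames, crop_size=88, flip_prob=0.5):
    """Randomly crop and horizontally flip a video clip in a time-consistent
    way: the crop position and the flip decision are sampled once per clip
    and applied to every frame, preserving temporal coherence.

    frames: array of shape (T, H, W) or (T, H, W, C), with H, W >= crop_size.
    """
    _, h, w = frames.shape[:3]
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    clip = frames[:, top:top + crop_size, left:left + crop_size]
    if random.random() < flip_prob:
        clip = clip[:, :, ::-1]  # flip the width axis of every frame
    return np.ascontiguousarray(clip)

# Example: a dummy grayscale clip of 16 frames at 96 × 96 pixels.
clip = np.zeros((16, 96, 96), dtype=np.float32)
augmented = augment_clip(clip)  # shape: (16, 88, 88)
```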