Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Authors: Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on two datasets: LRS3 (Afouras et al., 2018b) with 433 hours of transcribed English videos and VoxCeleb2 (Chung et al., 2018) with 2442 hours of unlabeled multilingual videos. |
| Researcher Affiliation | Collaboration | Bowen Shi (Toyota Technological Institute at Chicago); Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed (Meta AI); bshi@ttic.edu, {wnhsu,kushall,abdo}@fb.com |
| Pseudocode | No | The paper includes illustrations of the model architecture (e.g., Figure 1, Figure A.1) but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are available at https://github.com/facebookresearch/av_hubert |
| Open Datasets | Yes | We conduct experiments on two datasets: LRS3 (Afouras et al., 2018b) with 433 hours of transcribed English videos and VoxCeleb2 (Chung et al., 2018) with 2442 hours of unlabeled multilingual videos. |
| Dataset Splits | Yes | As no official development set is provided, we randomly select 1,200 sequences from trainval as the validation set (about 1 hour) for early stopping and hyperparameter tuning. |
| Hardware Specification | Yes | We train on 32 and 64 V100-GPUs for BASE and LARGE. |
| Software Dependencies | No | The paper mentions software such as fairseq and the Adam optimizer, but it does not specify version numbers for these or any other key software components, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | We set both p_m and p_a to 0.5 for modality dropout at training time. To extract features for clustering, both modalities are used. We adopt the strategy used in wav2vec 2.0 (Baevski et al., 2020) to generate masks, where p% of all frames are randomly selected as start indices and the subsequent l frames are masked. In iterations 1-4, we mask the fused features and set p/l to 8/10, respectively, as we observe this practice generates higher-quality cluster assignments (see section E.2). In the last iteration, we set p/l to 6/5 for video and 8/10 for audio (see section D). We train the model with Adam (Kingma & Ba, 2015), warming up the learning rate to a peak of 0.002 over the first 8% of updates and then linearly decaying it. Videos are batched together so as not to exceed 1,000 image frames (40 seconds) per GPU. Both the BASE and LARGE models are updated for 400K and 600K steps at each iteration in the 433h and 1759h unlabeled settings, respectively. |
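
The quoted setup only gives the modality-dropout probabilities p_m = p_a = 0.5. One consistent reading (an assumption here, not a quote) is that p_m is the probability of keeping both streams, and that when one stream is dropped, audio is kept with probability p_a. A minimal sketch under that assumption, with hypothetical names rather than the repository's actual API:

```python
import torch

def drop_modality(feat_audio, feat_video, p_m=0.5, p_a=0.5, training=True):
    """Hypothetical sketch of modality dropout.

    Assumed semantics: with probability p_m both streams are kept; otherwise
    the audio stream alone is kept with probability p_a, else the video stream.
    The dropped stream is zeroed before the two are concatenated.
    """
    if training and torch.rand(1).item() >= p_m:
        if torch.rand(1).item() < p_a:
            feat_video = torch.zeros_like(feat_video)   # audio-only input
        else:
            feat_audio = torch.zeros_like(feat_audio)   # video-only input
    return torch.cat([feat_audio, feat_video], dim=-1)  # fused audio-visual features
```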
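The masking strategy in the quote follows wav2vec 2.0: p% of frames are sampled as span starts and the subsequent l frames are masked, so p/l = 8/10 means 8% of frames each start a 10-frame span. A minimal sketch of that sampling, using a hypothetical helper name and omitting the edge-case handling of the actual fairseq implementation:

```python
import numpy as np

def compute_span_mask(num_frames, mask_prob=0.08, mask_length=10, rng=None):
    """Hypothetical sketch of wav2vec 2.0-style span masking.

    mask_prob (p) is the fraction of frames chosen as span starts; each start
    masks the next mask_length (l) frames. Spans may overlap and are clipped
    at the sequence end.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_frames, dtype=bool)
    num_starts = int(round(mask_prob * num_frames))
    starts = rng.choice(num_frames, size=num_starts, replace=False)
    for s in starts:
        mask[s : s + mask_length] = True
    return mask
```

For example, the last-iteration video setting p/l = 6/5 from the quote would correspond to `compute_span_mask(num_frames, 0.06, 5)`.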
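The learning-rate schedule in the quote is a linear warm-up to a peak of 0.002 over the first 8% of updates followed by linear decay. A small sketch, assuming decay to zero at the final step (the quote does not state the end value):

```python
def lr_at_step(step, total_steps, peak_lr=2e-3, warmup_frac=0.08):
    """Linear warm-up to peak_lr over the first warmup_frac of updates,
    then linear decay; the end value of 0.0 is an assumption."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```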