u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

Authors: Wei-Ning Hsu, Bowen Shi

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate u-HuBERT on audio, visual, and audio-visual speech, where larger quantities of unlabeled and labeled data are available. Two audio-visual and one audio-only datasets are used for pre-training: (1) LRS3 [Afouras et al., 2018] with 433 hours of English audio-visual speech, (2) VoxCeleb2-En (VC2-En) with 1,326 hours of English YouTube audio-visual speech filtered from VoxCeleb2 [Chung et al., 2018] by Shi et al. [2022a], and (3) TED-LIUM release 3 (TD) [Hernandez et al., 2018] with 452 hours of English audio collected from the same domain as LRS3. Table 2: Speech recognition results on LRS3 test.
Researcher Affiliation | Industry | Wei-Ning Hsu, Meta AI, wnhsu@meta.com; Bowen Shi, Meta AI, bshi@meta.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codes and models are available at https://github.com/facebookresearch/av_hubert
Open Datasets | Yes | Two audio-visual and one audio-only datasets are used for pre-training: (1) LRS3 [Afouras et al., 2018] with 433 hours of English audio-visual speech, (2) VoxCeleb2-En (VC2-En) with 1,326 hours of English YouTube audio-visual speech filtered from VoxCeleb2 [Chung et al., 2018] by Shi et al. [2022a], and (3) TED-LIUM release 3 (TD) [Hernandez et al., 2018] with 452 hours of English audio collected from the same domain as LRS3.
Dataset Splits | Yes | LRS3 trainval and pretrain, combined for 433 hours, are used for training, with the same 1,200 utterances as Shi et al. [2022a] split for validation.
Hardware Specification | Yes | We then pre-train u-HuBERT on the combined data for 1M updates on 64 32GB V100 GPUs using the Adam optimizer [Kingma and Ba, 2015] and a learning rate of 0.002.
Software Dependencies | No | The paper mentions software components and algorithms such as the Adam optimizer, Transformer, k-means, t-SNE, and specific tokenization/evaluation methods, but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | We then pre-train u-HuBERT on the combined data for 1M updates on 64 32GB V100 GPUs using the Adam optimizer [Kingma and Ba, 2015] and a learning rate of 0.002. Gradient norm is clipped at 1.0. A batch size of maximal 40 seconds per GPU is used. When pre-trained on multimodal data, audio and video are dropped with a probability of 0.25 and 0.25, respectively.
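
To make the quoted hyperparameters concrete, below is a minimal PyTorch sketch of the optimization settings (Adam with learning rate 0.002, gradient-norm clipping at 1.0) and a per-utterance modality dropout in which audio and video are each dropped with probability 0.25. The toy encoder, feature dimensions, additive fusion, the dummy loss, and the rule that both streams are never dropped together are illustrative assumptions, not the released u-HuBERT implementation.

```python
# Illustrative sketch only: toy mixed-modal encoder with modality dropout,
# Adam (lr 0.002), and gradient-norm clipping at 1.0, as quoted above.
import random
import torch
import torch.nn as nn

class ToyMixedModalEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(80, dim)   # e.g. log-mel audio features (assumed shape)
        self.video_proj = nn.Linear(512, dim)  # e.g. lip-ROI video embeddings (assumed shape)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, audio, video, p_drop=0.25):
        # Per-utterance modality dropout: drop each stream with probability 0.25;
        # keeping at least one stream is an assumption of this sketch.
        drop_a = random.random() < p_drop
        drop_v = random.random() < p_drop
        if drop_a and drop_v:
            drop_v = False
        a = torch.zeros_like(audio) if drop_a else audio
        v = torch.zeros_like(video) if drop_v else video
        fused = self.audio_proj(a) + self.video_proj(v)  # simple additive fusion (assumed)
        return self.backbone(fused)

model = ToyMixedModalEncoder()
opt = torch.optim.Adam(model.parameters(), lr=2e-3)  # learning rate 0.002 as quoted

audio = torch.randn(4, 100, 80)   # (batch, frames, audio feature dim)
video = torch.randn(4, 100, 512)  # (batch, frames, video feature dim)

out = model(audio, video)
loss = out.pow(2).mean()          # dummy loss standing in for the masked-prediction objective
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipped at 1.0
opt.step()
opt.zero_grad()
```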