u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Authors: Wei-Ning Hsu, Bowen Shi
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate u-HuBERT on audio, visual, and audio-visual speech, where larger quantities of unlabeled and labeled data are available. Two audio-visual and one audio-only datasets are used for pre-training: (1) LRS3 [Afouras et al., 2018] with 433 hours of English audio-visual speech, (2) VoxCeleb2-En (VC2-En) with 1,326 hours of English YouTube audio-visual speech filtered from VoxCeleb2 [Shi et al., 2022a], and (3) TED-LIUM release 3 (TD) [Hernandez et al., 2018] with 452 hours of English audio collected from the same domain as LRS3. Table 2: Speech recognition results on LRS3 test. |
| Researcher Affiliation | Industry | Wei-Ning Hsu, Meta AI (wnhsu@meta.com); Bowen Shi, Meta AI (bshi@meta.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are available at https://github.com/facebookresearch/av_hubert |
| Open Datasets | Yes | Two audio-visual and one audio-only datasets are used for pre-training: (1) LRS3 [Afouras et al., 2018] with 433 hours of English audio-visual speech, (2) VoxCeleb2-En (VC2-En) with 1,326 hours of English YouTube audio-visual speech filtered from VoxCeleb2 [Chung et al., 2018] by Shi et al. [2022a], and (3) TED-LIUM release 3 (TD) [Hernandez et al., 2018] with 452 hours of English audio collected from the same domain as LRS3. |
| Dataset Splits | Yes | LRS3 trainval and pretrain, combined for 433 hours, are used for training, with the same 1,200 utterances as Shi et al. [2022a] split for validation. |
| Hardware Specification | Yes | We then pre-train u-HuBERT on the combined data for 1M updates on 64 32GB V100 GPUs using the Adam optimizer [Kingma and Ba, 2015] and a learning rate of 0.002. |
| Software Dependencies | No | The paper mentions software components and algorithms such as the Adam optimizer, Transformer, k-means, and t-SNE, as well as specific tokenization/evaluation methods, but does not provide version numbers for any software dependencies. |
| Experiment Setup | Yes | We then pre-train u-HuBERT on the combined data for 1M updates on 64 32GB V100 GPUs using the Adam optimizer [Kingma and Ba, 2015] and a learning rate of 0.002. Gradient norm is clipped at 1.0. A batch size of at most 40 seconds per GPU is used. When pre-trained on multimodal data, audio and video are dropped with a probability of 0.25 and 0.25, respectively. |
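
The Experiment Setup row quotes the reported hyperparameters (1M updates, Adam with a 0.002 learning rate, gradient norm clipping at 1.0, batches of at most 40 seconds per GPU, and 0.25/0.25 modality dropout). The sketch below is a minimal, hypothetical PyTorch rendering of those numbers, not the released av_hubert/fairseq recipe: the stub model, the fake batching, the constant learning rate, and the rule that at least one modality is always kept are assumptions made for illustration only.

```python
# Hypothetical sketch of the quoted u-HuBERT pre-training setup.
# Not the released av_hubert recipe: model, batching, and the
# "keep at least one modality" guard are illustrative assumptions.
import random
import torch
import torch.nn as nn

AUDIO_DROP_P = 0.25      # audio-drop probability quoted in the setup
VIDEO_DROP_P = 0.25      # video-drop probability quoted in the setup
MAX_BATCH_SECONDS = 40   # per-GPU batch cap quoted in the setup
LEARNING_RATE = 0.002    # Adam learning rate quoted in the setup
GRAD_CLIP_NORM = 1.0     # gradient-norm clipping quoted in the setup
TOTAL_UPDATES = 1_000_000  # paper trains for 1M updates; toy loop below runs far fewer


def modality_dropout(audio, video):
    """Drop each stream independently; never dropping both is an assumption."""
    drop_audio = random.random() < AUDIO_DROP_P
    drop_video = random.random() < VIDEO_DROP_P
    if drop_audio and drop_video:   # assumed guard: keep at least one modality
        drop_audio = False
    if drop_audio:
        audio = torch.zeros_like(audio)
    if drop_video:
        video = torch.zeros_like(video)
    return audio, video


class TinyMixedModalModel(nn.Module):
    """Stand-in for the real Transformer: fuses both streams, predicts cluster targets."""

    def __init__(self, audio_dim=80, video_dim=512, num_clusters=500):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 256)
        self.video_proj = nn.Linear(video_dim, 256)
        self.head = nn.Linear(256, num_clusters)

    def forward(self, audio, video):
        fused = self.audio_proj(audio) + self.video_proj(video)
        return self.head(fused)


model = TinyMixedModalModel()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()

for step in range(10):  # toy loop; the paper runs TOTAL_UPDATES
    # Fake a 40-second batch; frame counts and feature dims are arbitrary here.
    num_frames = MAX_BATCH_SECONDS * 100
    audio = torch.randn(1, num_frames, 80)
    video = torch.randn(1, num_frames, 512)
    targets = torch.randint(0, 500, (1, num_frames))

    audio, video = modality_dropout(audio, video)
    logits = model(audio, video)                       # (batch, frames, clusters)
    loss = criterion(logits.transpose(1, 2), targets)  # CE over cluster targets

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)
    optimizer.step()
```

The sketch uses a constant learning rate because the quoted setup only states the value 0.002; scheduling, masking, target-cluster generation, and multi-GPU data parallelism from the actual recipe are omitted.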