Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on public benchmarks LRS3 and LRS2 show that our GILA outperforms the supervised learning state-of-the-art. |
| Researcher Affiliation | Academia | 1Nanyang Technological University, Singapore 2University of Aberdeen, UK 3University of Science and Technology of China, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is at https://github.com/YUCHEN005/GILA. |
| Open Datasets | Yes | We conduct experiments on two large-scale publicly available datasets, LRS3 [Afouras et al., 2018b] and LRS2 [Chung et al., 2017]. |
| Dataset Splits | No | All hyper-parameters are tuned on the validation set. The paper mentions the use of a "validation set" but does not specify the exact split percentages or sample counts within the paper's text. |
| Hardware Specification | Yes | Our training follows the finetuning configurations in [Shi et al., 2022a] and takes 1.3 days on 4 V100-32GB GPUs, which is much more efficient than AV-HuBERT pre-training (∼15.6 days on 64 V100 GPUs). |
| Software Dependencies | No | The paper mentions several models and frameworks (e.g., Transformer, ResNet-18, wav2vec2.0) but does not provide specific version numbers for any software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | For model configurations, our baseline follows AV-HuBERT LARGE [Shi et al., 2022a] with 24 Transformer encoder layers and 9 decoder layers. For fair comparison, we build the GILA with 3 GI model layers, 12 Transformer encoder layers and 9 decoder layers. All other model configurations are the same as AV-HuBERT LARGE. The numbers of parameters in our baseline and GILA are 476M and 465M, respectively. We also use Conformer as our backbone, with a convolution kernel size of 31. The system inputs are log filterbank features for the audio stream and lip regions-of-interest (ROIs) for the video stream. To sample A-V frame pairs in CL contrastive learning, we first sample starting indexes from (X_A^0, X_V^3) with probability of 0.4 and from (X_A^3, X_V^0) with 0.45 respectively, and then cut out 10 consecutive frames after each sampled index. To calculate the contrastive loss, we use the same VQ module as in wav2vec2.0 [Baevski et al., 2020], and set the temperature parameter τ to 0.1. We further use data augmentation to improve noise robustness, where we add MUSAN noise [Snyder et al., 2015] following prior work [Shi et al., 2022b], and report WER results on both clean and noisy test sets. The weighting parameters λ_WL^i (i ∈ {1, 2, 3}) / λ_CL^{0,3} / λ_CL^{3,0} are set to 0.001/0.08/0.01 respectively. All hyper-parameters are tuned on the validation set. Our training follows the finetuning configurations in [Shi et al., 2022a] and takes 1.3 days on 4 V100-32GB GPUs, which is much more efficient than AV-HuBERT pre-training (∼15.6 days on 64 V100 GPUs). The details of data augmentation, model and training configurations follow previous work [Shi et al., 2022b]. A minimal code sketch of the contrastive sampling and loss described here follows the table. |
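
The sketch below is a hedged reading of the contrastive-learning setup quoted in the Experiment Setup row, not the authors' implementation (see https://github.com/YUCHEN005/GILA for the released code). The helper names `sample_start_indexes`, `cut_segments`, and `contrastive_loss` are hypothetical, the i.i.d. per-index interpretation of the 0.4/0.45 sampling probabilities and the InfoNCE form of the loss are assumptions, and the wav2vec2.0 VQ module is omitted for brevity.

```python
import torch
import torch.nn.functional as F

SEGMENT_LEN = 10       # cut out 10 consecutive frames after each sampled start index
TAU = 0.1              # temperature of the contrastive loss
P_START_A0_V3 = 0.40   # start-index sampling probability for the (X_A^0, X_V^3) pair
P_START_A3_V0 = 0.45   # start-index sampling probability for the (X_A^3, X_V^0) pair


def sample_start_indexes(num_frames: int, p: float) -> torch.Tensor:
    """Pick start indexes i.i.d. with probability p (one reading of the quoted setup)."""
    candidates = torch.rand(num_frames - SEGMENT_LEN + 1) < p
    return torch.nonzero(candidates).flatten()


def cut_segments(x: torch.Tensor, starts: torch.Tensor) -> torch.Tensor:
    """Stack SEGMENT_LEN-frame windows of x (T, D), one per start index."""
    return torch.stack([x[s:s + SEGMENT_LEN] for s in starts])  # (N, SEGMENT_LEN, D)


def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style loss over paired frames with temperature TAU.

    anchor, positive: (N, D); frame i of `positive` is the positive for frame i
    of `anchor`, and all other frames in the batch serve as negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / TAU                        # (N, N) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


# Usage on one utterance: x_a0, x_v3 stand in for (T, D) audio / video features
# from the quoted (X_A^0, X_V^3) layer pair.
T, D = 120, 256
x_a0, x_v3 = torch.randn(T, D), torch.randn(T, D)
starts = sample_start_indexes(T, P_START_A0_V3)
if starts.numel() > 0:
    seg_a = cut_segments(x_a0, starts).reshape(-1, D)
    seg_v = cut_segments(x_v3, starts).reshape(-1, D)
    loss_cl_03 = contrastive_loss(seg_a, seg_v)
```

With the quoted weights, the overall objective would presumably add 0.001 · (L_WL^1 + L_WL^2 + L_WL^3) + 0.08 · L_CL^{0,3} + 0.01 · L_CL^{3,0} to the recognition loss; the exact loss composition should be checked against the released code.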