Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on public benchmarks LRS3 and LRS2 show that our GILA outperforms the supervised learning state-of-the-art. |
| Researcher Affiliation | Academia | 1Nanyang Technological University, Singapore 2University of Aberdeen, UK 3University of Science and Technology of China, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is at https://github.com/YUCHEN005/GILA. |
| Open Datasets | Yes | We conduct experiments on two large-scale publicly available datasets, LRS3 [Afouras et al., 2018b] and LRS2 [Chung et al., 2017]. |
| Dataset Splits | No | All hyper-parameters are tuned on the validation set. The paper mentions the use of a "validation set" but does not specify the exact split percentages or sample counts within the paper's text. |
| Hardware Specification | Yes | Our training follows the finetuning configurations in [Shi et al., 2022a] and takes 1.3 days on 4 V100-32GB GPUs, which is much more efficient than AV-HuBERT pre-training (∼15.6 days on 64 V100 GPUs). |
| Software Dependencies | No | The paper mentions several models and frameworks (e.g., Transformer, ResNet-18, wav2vec2.0) but does not provide specific version numbers for any software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | For model configurations, our baseline follows AV-HuBERT LARGE [Shi et al., 2022a] with 24 Transformer encoder layers and 9 decoder layers. For fair comparison, we build the GILA with 3 GI model layers, 12 Transformer encoder layers and 9 decoder layers. All other model configurations are the same as AV-HuBERT LARGE. The numbers of parameters in our baseline and GILA are 476M and 465M, respectively. We also use Conformer as our backbone, with a convolution kernel size of 31. The system inputs are log filterbank features for the audio stream and lip regions-of-interest (ROIs) for the video stream. To sample A-V frame pairs in CL contrastive learning, we first sample starting indexes from (X_A^0, X_V^3) with probability of 0.4 and from (X_A^3, X_V^0) with 0.45 respectively, and then cut out 10 consecutive frames after each sampled index. To calculate the contrastive loss, we use the same VQ module as in wav2vec2.0 [Baevski et al., 2020], and set the temperature parameter τ to 0.1. We further use data augmentation to improve noise robustness, where we add MUSAN noise [Snyder et al., 2015] following prior work [Shi et al., 2022b], and report WER results on both clean and noisy test sets. The weighting parameters λ_WL^i (i ∈ {1, 2, 3}) / λ_CL^{0,3} / λ_CL^{3,0} are set to 0.001/0.08/0.01 respectively. All hyper-parameters are tuned on the validation set. Our training follows the finetuning configurations in [Shi et al., 2022a] and takes 1.3 days on 4 V100-32GB GPUs, which is much more efficient than AV-HuBERT pre-training (∼15.6 days on 64 V100 GPUs). The details of data augmentation, model and training configurations follow previous work [Shi et al., 2022b]. A minimal code sketch of the contrastive sampling and loss described here follows the table. |
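
The sketch below is a hedged reading of the contrastive-learning setup quoted in the Experiment Setup row, not the authors' implementation (see https://github.com/YUCHEN005/GILA for the released code). The helper names `sample_start_indexes`, `cut_segments`, and `contrastive_loss` are hypothetical, the i.i.d. per-index interpretation of the 0.4/0.45 sampling probabilities and the InfoNCE form of the loss are assumptions, and the wav2vec2.0 VQ module is omitted for brevity.

```python
import torch
import torch.nn.functional as F

SEGMENT_LEN = 10       # cut out 10 consecutive frames after each sampled start index
TAU = 0.1              # temperature of the contrastive loss
P_START_A0_V3 = 0.40   # start-index sampling probability for the (X_A^0, X_V^3) pair
P_START_A3_V0 = 0.45   # start-index sampling probability for the (X_A^3, X_V^0) pair


def sample_start_indexes(num_frames: int, p: float) -> torch.Tensor:
    """Pick start indexes i.i.d. with probability p (one reading of the quoted setup)."""
    candidates = torch.rand(num_frames - SEGMENT_LEN + 1) < p
    return torch.nonzero(candidates).flatten()


def cut_segments(x: torch.Tensor, starts: torch.Tensor) -> torch.Tensor:
    """Stack SEGMENT_LEN-frame windows of x (T, D), one per start index."""
    return torch.stack([x[s:s + SEGMENT_LEN] for s in starts])  # (N, SEGMENT_LEN, D)


def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style loss over paired frames with temperature TAU.

    anchor, positive: (N, D); frame i of `positive` is the positive for frame i
    of `anchor`, and all other frames in the batch serve as negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / TAU                        # (N, N) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


# Usage on one utterance: x_a0, x_v3 stand in for (T, D) audio / video features
# from the quoted (X_A^0, X_V^3) layer pair.
T, D = 120, 256
x_a0, x_v3 = torch.randn(T, D), torch.randn(T, D)
starts = sample_start_indexes(T, P_START_A0_V3)
if starts.numel() > 0:
    seg_a = cut_segments(x_a0, starts).reshape(-1, D)
    seg_v = cut_segments(x_v3, starts).reshape(-1, D)
    loss_cl_03 = contrastive_loss(seg_a, seg_v)
```

With the quoted weights, the overall objective would presumably add 0.001 · (L_WL^1 + L_WL^2 + L_WL^3) + 0.08 · L_CL^{0,3} + 0.01 · L_CL^{3,0} to the recognition loss; the exact loss composition should be checked against the released code.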