Visual Hallucination Elevates Speech Recognition

Authors: Fang Zhang, Yongxin Zhu, Xiangxiang Wang, Huang Chen, Xing Sun, Linli Xu

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5% → 12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines, while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input. (A worked relative-reduction calculation follows the table.)
Researcher Affiliation | Collaboration | Fang Zhang (1,2), Yongxin Zhu (1,2), Xiangxiang Wang (3), Huang Chen (3), Xing Sun (3), Linli Xu (1,2); 1: School of Computer Science and Technology, University of Science and Technology of China; 2: State Key Laboratory of Cognitive Intelligence; 3: Tencent YouTu Lab
Pseudocode | No | The paper describes the model and its training process in detail and includes architectural diagrams (Figure 2, Figure 3), but it does not provide any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository for the methodology described.
Open Datasets | Yes | Our methodology is assessed utilizing two comprehensive, publicly accessible audio-visual datasets, namely LRS2 (Son Chung et al. 2017) and LRS3 (Afouras, Chung, and Zisserman 2018).
Dataset Splits | Yes | The LRS3 dataset (Afouras, Chung, and Zisserman 2018), extracted from TED and TEDx presentations, includes 118,516 utterances in the pre-training set (408 hours), 31,982 in the training-validation set (30 hours), and 1,321 in the test set (0.9 hours).
Hardware Specification | No | The paper does not explicitly state the hardware used to run its experiments (e.g., specific GPU or CPU models, memory amounts).
Software Dependencies | No | The paper mentions various models and optimizers (e.g., Adam, Wav2Vec, MoCo v2) but does not provide version numbers for any software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | For our DFVGM, the model consists of 6 encoder layers and 6 decoder layers, where the number of attention heads is 4, the hidden dimension d_model is 256, and the feed-forward dimension is 1024. We set K_a = 800, K_v = 1000 for the codebooks. We train our DFVGM with an Adam (Kingma and Ba 2014) optimizer with hyperparameters β1 = 0.9, β2 = 0.98. The dropout is set to 0.2 and the label smoothing weight to 0.1. The hyperparameter R is set to 2. The architectures of our audio encoder and visual encoder are inherited from previous work (Pan et al. 2022). Our fusion module simply concatenates along the feature dimension. Our decoder is composed of a Transformer seq2seq decoder and a CTC decoder. An additional pretrained language model is used during training and inference. For the final training, the relative weight µ is set to 0.5, tuned on the validation set. The fusion module and decoder are trained for 50 epochs, with an initial learning rate of 1e-5. For the loss functions L_gt and L_pseudo, the relative CTC weight is 0.1. During inference, we use beam search with a beam size of 10. (A configuration sketch collecting these values follows the table.)
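
The Research Type row quotes an average relative WER reduction of 53% (30.5% → 12.9%). Below is a minimal worked calculation of a relative WER reduction; note that applying it directly to the two averaged figures gives roughly 58%, so we assume the 53% headline is obtained by averaging per-dataset reductions rather than from the averaged WERs. The function name is ours, not from the paper.

```python
# Sketch: computing a relative Word Error Rate (WER) reduction.
# 30.5 and 12.9 are the averaged AO-baseline and proposed-method WERs quoted
# in the Research Type row; the paper's 53% figure is assumed (not stated here)
# to come from averaging per-dataset reductions, hence the small gap.

def relative_wer_reduction(wer_baseline: float, wer_new: float) -> float:
    """Fractional WER reduction relative to the baseline."""
    return (wer_baseline - wer_new) / wer_baseline

print(f"{relative_wer_reduction(30.5, 12.9):.1%}")  # -> 57.7% on the averaged WERs
```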
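
The Experiment Setup row reports the DFVGM architecture and the fine-tuning hyperparameters. Below is a minimal configuration sketch in Python that collects those reported values in one place; the class and field names (DFVGMConfig, FineTuneConfig, audio_codebook_size, and so on) are our own, since no code is released, and anything the row does not state (batch size, learning-rate schedule, vocabulary, the exact form in which µ combines L_gt and L_pseudo) is deliberately left out.

```python
# Minimal sketch of the hyperparameters reported in the Experiment Setup row.
# Only values explicitly quoted from the paper appear here; names are ours.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class DFVGMConfig:
    # Transformer backbone of the DFVGM
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    num_attention_heads: int = 4
    d_model: int = 256
    d_feedforward: int = 1024
    # Codebook sizes (K_a for audio tokens, K_v for visual tokens)
    audio_codebook_size: int = 800
    visual_codebook_size: int = 1000
    # Adam optimizer and regularization
    adam_betas: Tuple[float, float] = (0.9, 0.98)
    dropout: float = 0.2
    label_smoothing: float = 0.1
    # Hyperparameter R from the paper
    R: int = 2


@dataclass
class FineTuneConfig:
    # Fusion module + decoder training
    epochs: int = 50
    initial_lr: float = 1e-5
    # Relative weight µ between L_gt and L_pseudo, tuned on the validation set
    mu: float = 0.5
    # Relative CTC weight inside L_gt and L_pseudo
    ctc_weight: float = 0.1
    # Beam search width at inference
    beam_size: int = 10
```

The audio and visual encoder architectures (inherited from Pan et al. 2022) and the external pretrained language model are referenced by the paper but not parameterized here.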