Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

AAAI 2024

Reproducibility assessment — each entry gives the variable, the assessed result, and the LLM response supporting it:

Research Type: Experimental
LLM response: Experimental Results, Table 2: The performance comparison of different methods on AVSR, ASR, and VSR tasks.

Researcher Affiliation: Collaboration
LLM response: Qiushi Zhu (1), Jie Zhang (1, corresponding author), Yu Gu (2), Yuchen Hu (3), Lirong Dai (1) — (1) NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China; (2) Tencent AI Lab; (3) Nanyang Technological University, Singapore.

Pseudocode: No
LLM response: The paper describes the model structure and pre-training task with mathematical equations for the loss functions, but it contains no structured pseudocode or algorithm blocks. (An illustrative sketch of the contrastive objective such models build on follows this assessment.)

Open Source Code: Yes
LLM response: The pre-processed data and code are available at https://github.com/zqs01/multichannel-wav2vec2.

Open Datasets: Yes
LLM response: For the multichannel audio-visual dataset, we utilize the MISP2021-AVSR dataset (Chen, Du et al. 2022; Chen, Zhou et al. 2022), which contains a total of 141.24 hours of audio and video data collected from 253 Chinese speakers (98 male and 165 female) speaking Mandarin in 34 real home TV rooms. In addition, we also utilize the single-channel Chinese Mandarin dataset WenetSpeech (Zhang, Lv et al. 2022) in the pre-training phase, which contains 10,000+ hours of multi-domain Chinese data.

Dataset Splits: No
LLM response: The paper mentions evaluating on a "Dev" set for AVSD in Table 4, but it does not describe how the validation split was created (e.g., percentages, sample counts, or splitting methodology).

Hardware Specification: No
LLM response: The paper does not provide hardware details such as GPU/CPU models, memory amounts, or the types of computing resources used to run the experiments.

Software Dependencies: No
LLM response: The paper mentions the Dlib and BeamformIt tools and the Adam optimizer, but it does not specify software libraries or dependencies with version numbers. (A hypothetical sketch of the usual Dlib lip-cropping step follows this assessment.)

Experiment Setup: Yes
LLM response: The embedding layer dimension of the decoder is set to 768, the feedforward network dimension to 3072, and the number of attention heads to 12. The Adam optimizer is used with a learning rate of 0.001, together with a label-smoothed cross-entropy loss (smoothing factor 0.1); elsewhere the paper reports an Adam learning rate of 1e-4. (These hyperparameters are wired together in the sketch following this assessment.)
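
On the pseudocode point above: the pre-training objective is stated in the paper only as equations. As a rough illustration of the wav2vec 2.0-style contrastive loss that such frameworks build on, here is a minimal PyTorch sketch; the tensor shapes, the temperature value, and the function name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    """wav2vec 2.0-style contrastive loss (illustrative sketch only).

    context:   (B, T, D) transformer outputs at masked time steps
    quantized: (B, T, D) quantized latent targets (the positives)
    negatives: (B, T, K, D) K distractors sampled from other time steps
    """
    # Candidates = positive followed by the K negatives: (B, T, 1+K, D).
    candidates = torch.cat([quantized.unsqueeze(2), negatives], dim=2)
    # Scaled cosine similarity between each context vector and each candidate.
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    # The positive candidate always sits at index 0.
    labels = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```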
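On the software-dependencies point: Dlib is typically used in AVSR pipelines to locate facial landmarks and crop the lip region before visual feature extraction. The sketch below shows that standard usage; the predictor filename and landmark indices 48–67 are Dlib's 68-point conventions, the margin is an arbitrary choice, and nothing here is taken from the authors' pipeline.

```python
import dlib
import numpy as np

# Standard Dlib components; the 68-point predictor model must be downloaded
# separately (shape_predictor_68_face_landmarks.dat).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip_region(frame: np.ndarray, margin: int = 12) -> np.ndarray:
    """Crop the mouth region from a video frame (illustrative only)."""
    faces = detector(frame, 1)
    if not faces:
        raise ValueError("no face detected in frame")
    landmarks = predictor(frame, faces[0])
    # Points 48-67 outline the mouth in Dlib's 68-point landmark scheme.
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    return frame[max(int(y0), 0):int(y1), max(int(x0), 0):int(x1)]
```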
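Finally, a minimal PyTorch sketch that wires together the quoted experiment-setup hyperparameters (decoder embedding dimension 768, feedforward dimension 3072, 12 attention heads, label smoothing 0.1, Adam). The number of decoder layers and the vocabulary size are placeholders not given in the excerpt.

```python
import torch
import torch.nn as nn

EMBED_DIM, FFN_DIM, NUM_HEADS = 768, 3072, 12  # values quoted from the paper
NUM_LAYERS, VOCAB_SIZE = 6, 5000               # placeholders, not given in the excerpt

decoder_layer = nn.TransformerDecoderLayer(
    d_model=EMBED_DIM,
    nhead=NUM_HEADS,
    dim_feedforward=FFN_DIM,
    batch_first=True,
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=NUM_LAYERS)
output_proj = nn.Linear(EMBED_DIM, VOCAB_SIZE)

# Label-smoothed cross-entropy with the reported smoothing factor of 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# The excerpt quotes two Adam learning rates (0.001 and 1e-4), presumably for
# different training stages; 0.001 is used here.
optimizer = torch.optim.Adam(
    list(decoder.parameters()) + list(output_proj.parameters()), lr=0.001
)
```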