Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper reports experimental results; see, e.g., Table 2, which compares the performance of different methods on the AVSR, ASR, and VSR tasks. |
| Researcher Affiliation | Collaboration | Qiushi Zhu¹, Jie Zhang¹*, Yu Gu², Yuchen Hu³, Lirong Dai¹ (¹NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China; ²Tencent AI Lab; ³Nanyang Technological University, Singapore) |
| Pseudocode | No | The paper describes the model structure and pre-training tasks with mathematical equations for the loss functions, but it contains no structured pseudocode or algorithm blocks; a hedged sketch of the kind of contrastive objective such equations describe appears after this table. |
| Open Source Code | Yes | The pre-processed data and code are available at https://github.com/zqs01/multichannel-wav2vec2. |
| Open Datasets | Yes | For the multichannel audio-visual dataset, we utilize the MISP2021-AVSR (Chen, Du et al. 2022; Chen, Zhou et al. 2022) dataset, which contains a total of 141.24 hours of audio and video data collected from 253 Chinese (98 males and 165 females) speaking Mandarin in 34 real home TV rooms. In addition, we also utilize the single-channel Chinese Mandarin dataset WenetSpeech (Zhang, Lv et al. 2022) in the pre-training phase, which contains 10,000+ hours of multi-domain Chinese data. |
| Dataset Splits | No | Table 4 reports AVSD results on a "Dev" set, but the paper does not explicitly describe how the validation split was constructed: no percentages, sample counts, or split methodology are given in the text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory amounts, or types of computing resources used for running the experiments. |
| Software Dependencies | No | The paper mentions the Dlib and BeamformIt tools and the Adam optimizer, but it does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | The embedding layer dimension of the decoder is set to 768, the feedforward network dimension to 3072, and the number of attention heads to 12. The Adam optimizer is used throughout, with quoted learning rates of 0.001 and 1e-4 for different training stages, and a label smoothing cross-entropy loss with a smoothing factor of 0.1; a configuration sketch follows this table. |
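As flagged in the Pseudocode row, the paper states its pre-training objectives only as loss equations. The following is a minimal sketch of a wav2vec2-style contrastive (InfoNCE) loss, the family of objective such equations typically express; the function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' multichannel formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    """wav2vec2-style InfoNCE loss over masked time steps (hypothetical sketch).

    context:   (B, T, D) transformer outputs at the masked positions
    quantized: (B, T, D) quantized target representations for those positions
    negatives: (B, T, K, D) K distractor targets sampled from other time steps
    """
    # Stack the positive target in front of the K negatives: (B, T, K + 1, D)
    targets = torch.cat([quantized.unsqueeze(2), negatives], dim=2)
    # Cosine similarity of each context vector with every candidate target
    logits = F.cosine_similarity(context.unsqueeze(2), targets, dim=-1) / temperature
    # The positive target sits at index 0 along the candidate axis
    labels = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

# Toy usage with random tensors, just to show the expected shapes
B, T, K, D = 2, 5, 10, 768
loss = contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                        torch.randn(B, T, K, D))
```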
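For the Experiment Setup row, here is a minimal PyTorch sketch wiring together the quoted hyperparameters. The module classes and the assignment of the two learning rates to pre-training versus decoder fine-tuning are assumptions made for illustration; the authors' released repository is the authoritative reference.

```python
import torch
import torch.nn as nn

# Placeholder blocks matching the quoted dimensions: 768-dim embeddings,
# 3072-dim feedforward layers, 12 attention heads (not the paper's full model).
encoder = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                     dim_feedforward=3072, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model=768, nhead=12,
                                     dim_feedforward=3072, batch_first=True)

# Adam with the two quoted learning rates (assumed here: 1e-4 for
# pre-training, 0.001 for fine-tuning with the decoder).
pretrain_opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
finetune_opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# Label-smoothed cross-entropy with the quoted smoothing factor of 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```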