Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation

Authors: Chih-Chun Yang, Wan-Cyuan Fan, Cheng-Fu Yang, Yu-Chiang Frank Wang

AAAI 2022, pp. 3036-3044 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments on different recognition and synthesis tasks to show that our model performs favorably against state-of-the-art approaches on each individual task, while ours is a unified solution that is able to jointly tackle the aforementioned audio-visual learning tasks.
Researcher Affiliation | Collaboration | (1) Department of Computer Science and Information Engineering, National Taiwan University, Taiwan, R.O.C.; (2) Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, R.O.C.; (3) ASUS Intelligent Cloud Services, Taiwan, R.O.C.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the methodology is openly available.
Open Datasets | Yes | LRW (Chung and Zisserman 2016): Known for its variety of speaking styles and head poses across subjects, LRW is an English-speaking video dataset collected from BBC programs with more than 1,000 speakers. ... LRW-1000 (Yang et al. 2019): LRW-1000 is a Mandarin-speaking video dataset collected from more than 2,000 subjects with a vocabulary size of 1,000.
Dataset Splits | No | The paper uses the standard LRW and LRW-1000 datasets but does not provide the train/validation/test split percentages, sample counts, or splitting methodology needed for reproduction. It mentions 'validation' only in the context of a loss function, not dataset splitting.
Hardware Specification | No | The paper does not specify the hardware used to run its experiments, such as GPU or CPU models or memory capacity. It only states, 'We implement our model using Pytorch.'
Software Dependencies | No | The paper mentions 'Pytorch' and 'Adam W' but provides no version numbers for these or any other software dependencies, so the software environment cannot be reproduced exactly (a hypothetical version pin is sketched below the table).
Experiment Setup | Yes | AdamW is used as the optimizer for training, with weight decay 5×10⁻⁴ as regularization. For the linguistic and synthesis modules, initial learning rates of 3×10⁻⁴ and 1×10⁻⁴ with a schedule of reduction are applied, respectively (a PyTorch sketch of this configuration follows the table).
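
The dependency gap noted above is easy to state concretely. Below is a minimal sketch of the kind of environment check a reproduction would record; the Python and PyTorch versions are assumptions chosen purely for illustration, since the paper pins none.

```python
# Hypothetical environment pin. The paper names PyTorch but gives no
# version, so the values below are illustrative assumptions, not the
# authors' actual setup.
import sys
import torch

EXPECTED_PYTHON = (3, 8)  # assumed minimum interpreter version
EXPECTED_TORCH = "1.8"    # assumed PyTorch release line

assert sys.version_info[:2] >= EXPECTED_PYTHON, (
    f"Expected Python >= {EXPECTED_PYTHON}, got {sys.version_info[:2]}"
)
assert torch.__version__.startswith(EXPECTED_TORCH), (
    f"Expected torch {EXPECTED_TORCH}.x, got {torch.__version__}"
)
```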
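
The experiment-setup details translate directly into PyTorch. The following is a minimal sketch of that configuration, assuming placeholder modules; the paper does not name its learning-rate scheduler, so ReduceLROnPlateau stands in here for the stated "schedule of reduction".

```python
# Sketch of the reported training configuration: AdamW, weight decay
# 5e-4, initial LRs 3e-4 (linguistic module) and 1e-4 (synthesis module).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Placeholders standing in for the paper's linguistic and synthesis modules.
linguistic_module = torch.nn.Linear(512, 500)
synthesis_module = torch.nn.Linear(512, 512)

optimizer = AdamW(
    [
        {"params": linguistic_module.parameters(), "lr": 3e-4},
        {"params": synthesis_module.parameters(), "lr": 1e-4},
    ],
    weight_decay=5e-4,  # regularization, as stated in the paper
)

# The scheduler type is an assumption; the paper only says the learning
# rates are reduced on a schedule.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)

# Typical per-epoch usage, driven by a (hypothetical) validation loss:
# scheduler.step(val_loss)
```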