Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro (pp. 1174-1182)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results verify the effectiveness of the proposed method in lip reading and in distinguishing the homophenes."
Researcher Affiliation | Academia | Image and Video Systems Lab, KAIST, South Korea. {ms.k, sedne246, ymro}@kaist.ac.kr
Pseudocode | No | The paper includes architectural diagrams but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | "We conduct the experiments on both word- and sentence-level lip reading databases. For the word-level lip reading, we use LRW (Chung and Zisserman 2016) and LRW-1000 (Yang et al. 2019) datasets, which are in English and Mandarin, respectively. For the sentence-level lip reading, we use LRS2 (Chung et al. 2017) dataset."
Dataset Splits | No | The paper mentions using the LRW, LRW-1000, and LRS2 datasets and states that "LRW is an English word-level lip reading dataset which includes 500 words with a maximum of 1,000 training videos each" and that the authors "finally train and test on the LRS2 dataset." However, it does not explicitly provide specific training, validation, and test splits (e.g., percentages or counts) for the experiments.
Hardware Specification | Yes | "We use four Titan RTX GPUs (24GB) and Intel Xeon Gold 6130 CPU."
Software Dependencies | No | The paper mentions optimizers and loss functions such as the "Adam W optimizer" and the "hybrid CTC/Attention loss function", but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages.
Experiment Setup | Yes | "MVM is empirically designed with 8 heads (h) and 112 slots (N). For the word-level lip reading, it is applied in 4 different levels... For the sentence-level lip reading, MVM is applied in 4 different levels... We use Adam W optimizer (Loshchilov and Hutter 2017), batch size of 200, 64, and 40 for LRW, LRW-1000, and LRS2, respectively, with initial learning rate of 0.0001, and α is set to 16."
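To make the reported configuration concrete, the sketch below gives one plausible PyTorch-style reading of a multi-head visual-audio memory with 8 heads and 112 slots, as described in the Experiment Setup row. It is a minimal illustration, not the authors' code: the module name, the 512-dimensional feature size, the cosine-similarity addressing, the averaging over heads, and the use of α = 16 as a softmax scaling factor are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadVisualAudioMemory(nn.Module):
    """Hypothetical sketch of a multi-head key-value memory for lip reading.

    Visual features address a learnable key memory that is split across
    several heads; the resulting addressing weights read a shared value
    memory (assumed to hold audio-derived slots), so audio-like
    representations can be recalled from visual input alone.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8,
                 num_slots: int = 112, alpha: float = 16.0):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.alpha = alpha  # assumed here to scale the cosine similarities
        # Per-head key memory: (heads, slots, head_dim)
        self.key_mem = nn.Parameter(torch.randn(num_heads, num_slots, self.head_dim))
        # Shared value memory: (slots, dim)
        self.value_mem = nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (batch, time, dim)
        b, t, _ = visual_feat.shape
        q = visual_feat.view(b, t, self.num_heads, self.head_dim)
        # Cosine-similarity addressing per head -> (batch, time, heads, slots)
        q = F.normalize(q, dim=-1)
        k = F.normalize(self.key_mem, dim=-1)
        addr = torch.softmax(
            self.alpha * torch.einsum('bthd,hsd->bths', q, k), dim=-1)
        # Average the per-head addressing and read the shared value memory
        recalled = torch.einsum('bts,sd->btd', addr.mean(dim=2), self.value_mem)
        return recalled  # audio-like features with the same shape as the input


if __name__ == "__main__":
    mvm = MultiHeadVisualAudioMemory()
    # e.g. a batch of 2 clips, 29 frames, 512-d visual features (shapes assumed)
    out = mvm(torch.randn(2, 29, 512))
    print(out.shape)  # torch.Size([2, 29, 512])
```

Under the same assumptions, the quoted optimizer setup would correspond to something like `torch.optim.AdamW(model.parameters(), lr=1e-4)`, with the batch size chosen per dataset (200 for LRW, 64 for LRW-1000, 40 for LRS2).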