Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro
AAAI 2022, pp. 1174-1182 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results verify the effectiveness of the proposed method in lip reading and in distinguishing the homophenes. |
| Researcher Affiliation | Academia | Image and Video Systems Lab, KAIST, South Korea {ms.k, sedne246, ymro}@kaist.ac.kr |
| Pseudocode | No | The paper includes architectural diagrams but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a link or any other concrete means of accessing source code for the described methodology. |
| Open Datasets | Yes | We conduct the experiments on both word- and sentence-level lip reading databases. For the word-level lip reading, we use LRW (Chung and Zisserman 2016) and LRW-1000 (Yang et al. 2019) datasets, which are in English and Mandarin, respectively. For the sentence-level lip reading, we use LRS2 (Chung et al. 2017) dataset. |
| Dataset Splits | No | The paper mentions using the LRW, LRW-1000, and LRS2 datasets, stating "LRW is an English word-level lip reading dataset which includes 500 words with a maximum of 1,000 training videos each." and "finally train and test on the LRS2 dataset." However, it does not explicitly provide the training, validation, and test splits (e.g., percentages or counts) used in the experiments. |
| Hardware Specification | Yes | We use four Titan RTX GPUs (24GB) and Intel Xeon Gold 6130 CPU. |
| Software Dependencies | No | The paper mentions an optimizer and loss function, the "AdamW optimizer" and a "hybrid CTC/Attention loss function" (see the optimizer/loss sketch after the table), but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | MVM is empirically designed with 8 heads (h) and 112 slots (N); a hedged sketch of such a multi-head memory read follows the table. For the word-level lip reading, it is applied in 4 different levels... For the sentence-level lip reading, MVM is applied in 4 different levels... We use AdamW optimizer (Loshchilov and Hutter 2017), batch size of 200, 64, and 40 for LRW, LRW-1000, and LRS2, respectively, with initial learning rate of 0.0001, and α is set to 16. |
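
The Experiment Setup row reports 8 heads and 112 memory slots for MVM. As a rough illustration of what a multi-head key-value memory read of this shape looks like, here is a minimal PyTorch sketch: a visual query addresses a learned key memory per head, and the resulting attention weights read saved (audio-derived) representations from a value memory. The class name `MultiHeadMemory` and the feature dimension `dim=512` are hypothetical, and the sketch omits the paper's audio-side memory saving and the multi-level placement described in the row; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadMemory(nn.Module):
    """Minimal sketch of a multi-head key-value memory read.

    Slot count N=112 and head count h=8 follow the reported setup;
    everything else (names, dims, scaling) is an assumption.
    """

    def __init__(self, dim=512, n_slots=112, n_heads=8):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        # One learned key/value memory per head.
        self.key_mem = nn.Parameter(torch.randn(n_heads, n_slots, self.head_dim))
        self.value_mem = nn.Parameter(torch.randn(n_heads, n_slots, self.head_dim))

    def forward(self, visual_feat):
        # visual_feat: (batch, time, dim) features from a lip encoder.
        b, t, d = visual_feat.shape
        q = visual_feat.view(b, t, self.n_heads, self.head_dim)  # split into heads
        # Addressing scores between each query and every memory slot.
        scores = torch.einsum('bthd,hnd->bthn', q, self.key_mem) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)
        # Read stored representations from the value memory and merge heads.
        read = torch.einsum('bthn,hnd->bthd', attn, self.value_mem)
        return read.reshape(b, t, d)
```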
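
The Software Dependencies and Experiment Setup rows mention AdamW with a learning rate of 0.0001 and a hybrid CTC/Attention loss for sentence-level training. Below is a minimal sketch of how those pieces are commonly wired together, assuming PyTorch (the paper does not name its framework); the stand-in model and the 0.5 mixing weight are assumptions, not reported values.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the full lip-reading network, which is not released.
model = nn.Linear(512, 500)

# Optimizer and learning rate as reported; other AdamW
# hyperparameters are left at PyTorch defaults.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss()

def hybrid_ctc_attention_loss(ctc_log_probs, targets, input_lens, target_lens,
                              att_logits, att_targets, w=0.5):
    # Convex combination of the CTC loss (over frame-level log-probs) and the
    # attention decoder's cross-entropy loss. The weight w=0.5 is an
    # assumption; the paper does not report this value.
    ctc = ctc_criterion(ctc_log_probs, targets, input_lens, target_lens)
    att = att_criterion(att_logits, att_targets)
    return w * ctc + (1.0 - w) * att
```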