Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers
Authors: Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, Mingli Song
AAAI 2020, pp. 6917–6924 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed method achieves the new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively. |
| Researcher Affiliation | Collaboration | Ya Zhao¹, Rui Xu¹, Xinchao Wang², Peng Hou³, Haihong Tang³, Mingli Song¹ (¹Zhejiang University, ²Stevens Institute of Technology, ³Alibaba Group) |
| Pseudocode | No | The paper describes the method using equations and textual explanations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about the availability of its source code or a link to a code repository. |
| Open Datasets | Yes | CMLR (Zhao, Xu, and Song 2019): it is currently the largest Chinese Mandarin lip reading dataset. ... LRS2 (Afouras et al. 2018): it contains more than 45,000 spoken sentences from BBC television. ... https://www.vipazoo.cn/CMLR.html, http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html |
| Dataset Splits | Yes | LRS2: ... The dataset has a pre-train set that contains sentences annotated with the alignment boundaries of every word. We follow the provided dataset partition in experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper discusses model architectures and components like GRU, LSTM, and CNN models, but does not provide specific version numbers for software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | Implementation Details Lip Reader CMLR: The input images are 64x128 in dimension. ... We use a two-layer bidirectional GRU (Cho et al. 2014) with a cell size of 256 for the encoder and a two-layer uni-directional GRU with a cell size of 512 for the decoder. ... The initial learning rate was 0.0003 and decreased by 50% every time the training error did not improve for 4 epochs. LRS2: The input images are 112x112 pixels covering the region around the mouth. ... The encoder contains 3 layers of bi-directional LSTM (Hochreiter and Schmidhuber 1997) with a cell size of 256, and the decoder contains 3 layers of uni-directional LSTM with a cell size of 512. ... The initial learning rate was 0.0008 for pre-training, 0.0001 for training, and decreased by 50% every time the training error did not improve for 3 epochs. The balance weights used in both datasets are shown in Table 1. |
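
The Experiment Setup row quotes enough architectural detail to reconstruct the skeleton of the sequence model. Below is a minimal PyTorch sketch of the CMLR configuration: a two-layer bidirectional GRU encoder with cell size 256, a two-layer unidirectional GRU decoder with cell size 512, and the quoted learning-rate schedule. The CNN frontend, the attention mechanism, the vocabulary size, and the choice of Adam as optimizer are assumptions not stated in the quoted text; only the layer counts, cell sizes, initial learning rate, and decay rule come from the paper.

```python
# Minimal sketch of the CMLR lip-reader configuration quoted above.
# Assumptions (not in the quoted text): CNN frontend feature size,
# vocabulary size, Adam optimizer, and the absence of attention here.
import torch
import torch.nn as nn

VOCAB_SIZE = 4000   # hypothetical; the quoted setup gives no vocabulary size
FEAT_DIM = 512      # hypothetical dimension of the CNN frontend features

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Paper: two-layer bidirectional GRU, cell size 256
        self.gru = nn.GRU(FEAT_DIM, 256, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, time, FEAT_DIM) visual features from the CNN frontend
        outputs, hidden = self.gru(x)   # outputs: (batch, time, 512)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 512)
        # Paper: two-layer unidirectional GRU, cell size 512
        self.gru = nn.GRU(512, 512, num_layers=2, batch_first=True)
        self.out = nn.Linear(512, VOCAB_SIZE)

    def forward(self, tokens, hidden=None):
        emb = self.embed(tokens)                 # (batch, steps, 512)
        outputs, hidden = self.gru(emb, hidden)
        return self.out(outputs), hidden         # per-step character logits

encoder, decoder = Encoder(), Decoder()
params = list(encoder.parameters()) + list(decoder.parameters())
# Optimizer choice is an assumption; the quoted text gives only the schedule.
optimizer = torch.optim.Adam(params, lr=3e-4)   # initial LR 0.0003 (CMLR)
# "decreased by 50% every time the training error did not improve for 4 epochs"
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=4)
```

Calling `scheduler.step(train_error)` once per epoch implements the quoted decay rule: the learning rate halves whenever the training error fails to improve for 4 consecutive epochs. For the LRS2 configuration, the same skeleton would swap `nn.GRU` for `nn.LSTM` with 3 layers, an initial learning rate of 8e-4 for pre-training (1e-4 for training), and a patience of 3.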