Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Authors: Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, Mingli Song (pp. 6917-6924)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed method achieves the new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.
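The result row above is stated in character error rate (CER), i.e. the Levenshtein edit distance between the predicted and reference character sequences divided by the reference length. A minimal sketch of the metric (the function name and example strings are illustrative, not from the paper):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0

print(cer("lip reading", "lip reeding"))  # one substitution over 11 chars -> ~0.0909
```

A lower CER is better, so the reported margins mean the method removes 7.66 and 2.75 absolute percentage points of character errors relative to the baseline on CMLR and LRS2.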
Researcher Affiliation | Collaboration | Ya Zhao (1), Rui Xu (1), Xinchao Wang (2), Peng Hou (3), Haihong Tang (3), Mingli Song (1); (1) Zhejiang University, (2) Stevens Institute of Technology, (3) Alibaba Group
Pseudocode | No | The paper describes the method using equations and textual explanations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the availability of its source code or a link to a code repository.
Open Datasets | Yes | CMLR (Zhao, Xu, and Song 2019): it is currently the largest Chinese Mandarin lip reading dataset. ... LRS2 (Afouras et al. 2018): it contains more than 45,000 spoken sentences from BBC television. ... https://www.vipazoo.cn/CMLR.html ... http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html
Dataset Splits | Yes | LRS2: ... The dataset has a pre-train set that contains sentences annotated with the alignment boundaries of every word. We follow the provided dataset partition in experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper discusses model architectures and components such as GRU, LSTM, and CNN models, but does not provide specific version numbers for the software dependencies or libraries used in the implementation.
Experiment Setup | Yes | Implementation Details. Lip Reader. CMLR: The input images are 64x128 in dimension. ... We use a two-layer bidirectional GRU (Cho et al. 2014) with a cell size of 256 for the encoder and a two-layer uni-directional GRU with a cell size of 512 for the decoder. ... The initial learning rate was 0.0003 and decreased by 50% every time the training error did not improve for 4 epochs. LRS2: The input images are 112x112 pixels covering the region around the mouth. ... The encoder contains 3 layers of bi-directional LSTM (Hochreiter and Schmidhuber 1997) with a cell size of 256, and the decoder contains 3 layers of uni-directional LSTM with a cell size of 512. ... The initial learning rate was 0.0008 for pre-training, 0.0001 for training, and decreased by 50% every time the training error did not improve for 3 epochs. The balance weights used in both datasets are shown in Table 1.
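Both setups quoted above use the same learning-rate schedule: halve the rate whenever the training error fails to improve for a fixed number of epochs (patience 4 on CMLR from 0.0003; patience 3 on LRS2 from 0.0008/0.0001). A self-contained sketch of that plateau rule, assuming the counter resets after each decay (the paper does not specify this detail, and the class name is illustrative):

```python
class PlateauLRScheduler:
    """Halve the learning rate when the monitored training error has not
    improved for `patience` consecutive epochs. Sketch of the schedule
    described in the paper, not the authors' code."""

    def __init__(self, lr: float, patience: int, factor: float = 0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, error: float) -> float:
        """Call once per epoch with the training error; returns the lr to use."""
        if error < self.best:
            self.best = error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # assumed: counter resets after a decay
        return self.lr

# CMLR-style setup: lr = 0.0003, patience = 4
sched = PlateauLRScheduler(lr=3e-4, patience=4)
for err in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]:  # 4 non-improving epochs after the best
    lr = sched.step(err)
print(lr)  # 0.00015
```

This mirrors the behavior of ready-made schedulers such as PyTorch's `ReduceLROnPlateau` with `factor=0.5`, which is a plausible (but unconfirmed) way the authors implemented it.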