Lipper: Synthesizing Thy Speech Using Multi-View Lipreading

Authors: Yaman Kumar, Rohit Jain, Khwaja Mohd. Salik, Rajiv Ratn Shah, Yifang Yin, Roger Zimmermann

AAAI 2019, pp. 2588-2595

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We show this by presenting an exhaustive set of experiments for speaker-dependent, out-of-vocabulary and speaker-independent settings. Further, we compare the delay values of Lipper with other speechreading systems in order to show the real-time nature of audio produced. We also perform a user study for the audios produced in order to understand the level of comprehensibility of audios produced using Lipper."
Researcher Affiliation: Collaboration. Yaman Kumar (Adobe, ykumar@adobe.com); Rohit Jain (MIDAS Lab, NSIT-Delhi, rohitj.co@nsit.net.in); Khwaja Mohd. Salik (MIDAS Lab, NSIT-Delhi, khwajam.co@nsit.net.in); Rajiv Ratn Shah (MIDAS Lab, IIIT-Delhi, rajivratn@iiitd.ac.in); Yifang Yin (NUS, Singapore, yifang@comp.nus.edu.sg); Roger Zimmermann (NUS, Singapore, rogerz@comp.nus.edu.sg).
Pseudocode: No. The paper describes the system architecture and models (e.g., Figures 1, 2, and 3) but does not include structured pseudocode or algorithm blocks.
Open Source Code: No. The paper does not provide a statement or link indicating that the code for the described methodology has been open-sourced.
Open Datasets: Yes. "For training and testing Lipper, we use all the speakers of the OuluVS2 database (Anina et al. 2015) for speech-reconstruction purposes. OuluVS2 is a multi-view audiovisual dataset with 53 speakers of various ethnicities like European, Indian, Chinese and American."
Dataset Splits: Yes. "For making a text predicting model, we tried two different train-test data configurations: 1. In the first configuration, we randomly divided all the encoded audios of all the speakers into train, test and validation data with the ratio as (70, 10 and 20) respectively."
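To make the reported split concrete, here is a minimal Python sketch of such a random 70/10/20 train/test/validation division; the function name, seed, and list-based input are this report's illustrative assumptions, not details from the paper.

```python
import random

def split_dataset(samples, seed=0):
    """Randomly split samples into (train, test, validation) at 70/10/20.

    The ratio follows the paper's first configuration; the seed and
    function name are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.70 * n)
    n_test = int(0.10 * n)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # remaining ~20%
    return train, test, val
```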
Hardware Specification: Yes. "MIDAS lab gratefully acknowledges the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for this research."
Software Dependencies: No. The paper describes the models and optimization techniques used (e.g., VGG-16, STCNN+BiGRU, Adam optimization, cross-entropy loss) and signal-processing methods (LPC, LSPs), but it does not provide version numbers for any software libraries, frameworks, or programming languages used in the implementation.
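While no libraries or versions are named, the LPC-to-LSP audio encoding that the paper describes can be sketched in Python; using librosa and the spectrum package here is an assumption of this report, and the LPC order of 12 is likewise illustrative.

```python
import numpy as np
import librosa                 # provides librosa.lpc
from spectrum import poly2lsf  # LPC polynomial -> line spectral frequencies

def frame_to_lsp(frame, order=12):
    """Encode one audio frame as line spectral pairs (LSPs).

    The LPC order of 12 is an illustrative assumption; the paper
    does not report the order it used.
    """
    # Fit linear-prediction coefficients to the frame, then convert
    # the prediction polynomial to the LSP/LSF representation.
    a = librosa.lpc(frame.astype(np.float64), order=order)
    return np.asarray(poly2lsf(a))
```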
Experiment Setup: Yes. "For training the model, we used lip-region images of size 224x224. While training, we use the batch size as 100 and then we train the system for 30 epochs with Adam optimization. We use 60 epochs for training and 20 epochs for finetuning the network. The network was trained with batch size as ten and number of epochs as twenty." The quoted settings come from different parts of the paper and appear to describe different models in the pipeline.
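For orientation only, the first quoted configuration (224x224 lip crops, VGG-16, batch size 100, 30 epochs, Adam) could be reproduced roughly as in the Keras sketch below; the output head, loss, and commented-out data pipeline are placeholder assumptions, since the paper does not specify them at this level of detail.

```python
import tensorflow as tf

# VGG-16 backbone on 224x224 lip-region crops, per the paper's description.
base = tf.keras.applications.VGG16(include_top=False,
                                   weights="imagenet",
                                   input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128),  # output dimensionality is a placeholder assumption
])

# Batch size 100, 30 epochs, Adam optimization, as quoted above;
# the regression loss is an assumption, not stated for this model.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
# model.fit(train_frames, train_targets, batch_size=100, epochs=30)
```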