Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Lipper: Synthesizing Thy Speech Using Multi-View Lipreading
Authors: Yaman Kumar, Rohit Jain, Khwaja Mohd. Salik, Rajiv Ratn Shah, Yifang Yin, Roger Zimmermann
AAAI 2019, pp. 2588-2595 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show this by presenting an exhaustive set of experiments for speaker-dependent, out-of-vocabulary and speaker-independent settings. Further, we compare the delay values of Lipper with other speechreading systems in order to show the real-time nature of audio produced. We also perform a user study for the audios produced in order to understand the level of comprehensibility of audios produced using Lipper. |
| Researcher Affiliation | Collaboration | Yaman Kumar (Adobe); Rohit Jain (MIDAS Lab, NSIT-Delhi); Khwaja Mohd. Salik (MIDAS Lab, NSIT-Delhi); Rajiv Ratn Shah (MIDAS Lab, IIIT-Delhi); Yifang Yin (NUS, Singapore); Roger Zimmermann (NUS, Singapore) |
| Pseudocode | No | The paper describes the system architecture and models (e.g., Figures 1, 2, 3) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the code for the methodology described. |
| Open Datasets | Yes | For training and testing Lipper, we use all the speakers of Oulu VS2 database (Anina et al. 2015) for speech-reconstruction purposes. Oulu VS2 is a multi-view audiovisual dataset with 53 speakers of various ethnicities like European, Indian, Chinese and American. |
| Dataset Splits | Yes | For making a text predicting model, we tried two different train-test data configurations: 1. In the first configuration, we randomly divided all the encoded audios of all the speakers into train, test and validation data with the ratio as (70, 10 and 20) respectively. |
| Hardware Specification | Yes | MIDAS lab gratefully acknowledges the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for this research. |
| Software Dependencies | No | The paper describes the models and optimization techniques used (e.g., VGG-16, STCNN+Bi GRU, Adam optimization, cross-entropy loss) and signal processing methods (LPC, LSPs) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | For training the model, we used lip-region images of size 224x224. While training, we use the batch size as 100 and then we train the system for 30 epochs with Adam optimization. ... We use 60 epochs for training and 20 epochs for finetuning the network. ... The network was trained with batch size as ten and number of epochs as twenty. |
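
The "Dataset Splits" row above reports a pooled random 70/10/20 train/test/validation split over the encoded audios of all speakers. The sketch below illustrates such a split; the sample identifiers, helper name, and random seed are assumptions for illustration, not the authors' code (which, per the "Open Source Code" row, is not publicly released).

```python
# Minimal sketch of a pooled 70/10/20 train/test/validation split,
# as described in the "Dataset Splits" row. Identifiers are hypothetical.
import random

def split_samples(samples, ratios=(0.7, 0.1, 0.2), seed=42):
    """Randomly partition samples into train/test/validation by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(ratios[0] * n)
    n_test = int(ratios[1] * n)
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]
    return train, test, val

if __name__ == "__main__":
    # Hypothetical encoded-audio identifiers pooled over all 53 OuluVS2 speakers.
    encoded_audios = [f"speaker{s:02d}_utt{u:02d}"
                      for s in range(1, 54) for u in range(1, 11)]
    train, test, val = split_samples(encoded_audios)
    print(len(train), len(test), len(val))
```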
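The "Experiment Setup" row quotes 224x224 lip-region inputs, a batch size of 100, 30 epochs, and Adam optimization, and the "Software Dependencies" row mentions a VGG-16 backbone trained with cross-entropy loss. The following is a minimal Keras sketch of that configuration; the output size, pooling head, and stand-in data are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of the reported training configuration: 224x224 inputs,
# batch size 100, 30 epochs, Adam, cross-entropy, VGG-16 backbone (assumed head).
import numpy as np
import tensorflow as tf

NUM_CLASSES = 10  # hypothetical output size; not stated in the quoted setup

def build_model():
    base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                       input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    model = build_model()
    # Tiny random stand-in batch; the real training used lip-region frames.
    x = np.random.rand(4, 224, 224, 3).astype("float32")
    y = tf.keras.utils.to_categorical(np.random.randint(0, NUM_CLASSES, 4), NUM_CLASSES)
    model.fit(x, y, batch_size=100, epochs=30, verbose=0)
```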