Flow-Based Unconstrained Lip to Speech Generation
Authors: Jinzheng He, Zhou Zhao, Yi Ren, Jinglin Liu, Baoxing Huai, Nicholas Yuan
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the superiority of our proposed method through objective and subjective evaluation on the Lip2Wav-Chemistry-Lectures and Lip2Wav-Chess-Analysis datasets. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, China 2Huawei Cloud |
| Pseudocode | No | The paper describes the model architecture and components but does not provide pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a demo video at https://glowlts.github.io/, but no explicit statement or link for open-source code for the methodology. |
| Open Datasets | Yes | In this paper, we focus on more challenging unconstrained, real-world settings and conduct experiments on Lip2Wav-Chemistry-Lectures and Lip2Wav-Chess-Analysis datasets proposed in Prajwal et al. (2020), which are the currently largest datasets for unconstrained settings. |
| Dataset Splits | No | The paper mentions using Lip2Wav-Chemistry-Lectures and Lip2Wav-Chess-Analysis datasets but does not explicitly provide training, validation, and test split percentages or sample counts. |
| Hardware Specification | Yes | All measurements are conducted with 1 NVIDIA 2080Ti GPU. |
| Software Dependencies | No | The paper states "Our implementation is based on PyTorch" but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We use 4 feed-forward Transformer blocks with 2 attention heads and a dropout of 0.1 in our condition module. For our flow-based decoder, we use 12 flow blocks in the training and inference process. Each flow block includes 1 actnorm layer, 1 invertible 1×1 conv layer, and 4 affine coupling layers. We optimize our model using the Adam (Kingma and Ba 2014) optimizer with an initial learning rate of 2 × 10^-4 and weight decay of 1 × 10^-6 in both stages. It takes about 200k steps for the first stage of training and about 100k steps for the second stage. |
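For orientation, the sketch below shows how the hyperparameters quoted in the Experiment Setup row could be wired up in PyTorch. Only the counts and optimizer settings (4 Transformer blocks, 2 heads, dropout 0.1, 12 flow blocks, Adam with lr 2e-4 and weight decay 1e-6, 200k + 100k steps) come from the paper; the hidden size, module names such as `FlowBlock`, and the flow-block internals are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the reported training configuration. Module names,
# feature sizes, and the flow-block internals are assumptions; only the
# block counts and optimizer settings come from the paper's setup.
import torch
import torch.nn as nn

HIDDEN = 256  # assumed hidden size (not stated in the excerpt)

# Condition module: 4 feed-forward Transformer blocks, 2 heads, dropout 0.1.
condition_module = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=HIDDEN, nhead=2, dim_feedforward=4 * HIDDEN,
        dropout=0.1, batch_first=True,
    ),
    num_layers=4,
)

class FlowBlock(nn.Module):
    """Placeholder for one flow block: 1 actnorm layer, 1 invertible 1x1
    convolution, and 4 affine coupling layers (internals omitted here)."""
    def __init__(self, channels: int):
        super().__init__()
        self.actnorm_scale = nn.Parameter(torch.ones(channels))
        self.actnorm_bias = nn.Parameter(torch.zeros(channels))
        # Orthogonal init stands in for the invertible 1x1 convolution weight.
        self.inv_conv_weight = nn.Parameter(
            torch.linalg.qr(torch.randn(channels, channels))[0]
        )
        # Crude stand-ins for the 4 affine coupling layers.
        self.couplings = nn.ModuleList(
            nn.Linear(channels // 2, channels) for _ in range(4)
        )

# Flow-based decoder: 12 flow blocks, as reported.
decoder = nn.ModuleList(FlowBlock(HIDDEN) for _ in range(12))

# Adam with lr 2e-4 and weight decay 1e-6, used in both training stages.
params = list(condition_module.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=2e-4, weight_decay=1e-6)

# Reported schedule: ~200k steps for stage one, ~100k steps for stage two.
STAGE1_STEPS, STAGE2_STEPS = 200_000, 100_000
```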