Lip to Speech Synthesis with Visual Context Attentional GAN
Authors: Minsu Kim, Joanna Hong, Yong Man Ro
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that the proposed VCA-GAN outperforms the existing state-of-the-art and is able to effectively synthesize speech in the multi-speaker setting, which has been barely handled in previous works. |
| Researcher Affiliation | Academia | Minsu Kim, Joanna Hong, Yong Man Ro Image and Video Systems Lab KAIST {ms.k, joanna2587, ymro}@kaist.ac.kr |
| Pseudocode | No | The paper describes the architecture and equations of the proposed model but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement by the authors that they are releasing the code for the described methodology, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | The effectiveness of the proposed framework is evaluated on three public benchmark databases, GRID [7], TCD-TIMIT [8], and LRW [9], in both the constrained-speaker setting and the multi-speaker setting. |
| Dataset Splits | Yes | unseen-speaker setting: 15, 8, and 10 subjects are used for training, validation, and test, respectively. The dataset split from [10, 12] is used. |
| Hardware Specification | Yes | Titan-RTX is utilized for the computing. |
| Software Dependencies | No | The paper mentions software components and tools used (e.g., Adam optimizer, Griffin-Lim algorithm) but does not provide specific version numbers for any key software dependencies or libraries. |
| Experiment Setup | Yes | For the visual encoder, one 3D convolution layer and ResNet-18 [48], a popular architecture in lip reading [49], are utilized. Three generators are used (i.e., n=3) and 2 upsample layers are applied at the last two generators. Each generator is composed of 6, 3, and 3 Residual blocks, respectively. The global visual encoder is designed with a 2-layer bi-GRU and one linear layer. For the audio encoder, 2 convolution layers with stride 2 and one Residual block are utilized. The postnet is composed of three 1D Residual blocks and two 1D convolution layers. Finally, the discriminators are basically composed of 2, 3, and 4 Residual blocks. Architectural details can be found in the supplementary. All the audio in the dataset is resampled to 16 kHz, high-pass filtered with a 55 Hz cutoff frequency, and transformed into a mel-spectrogram using 80 mel-filter banks (i.e., F=80). For the datasets composed of 25 fps video (i.e., GRID and LRW), the audio is converted into a mel-spectrogram using a window size of 640 and a hop size of 160. For the 30 fps video (i.e., TCD-TIMIT), a window size of 532 and a hop size of 133 are used. Thus, the resulting mel-spectrogram has four times the frame rate of the video. The images are cropped to the center of the lips and resized to 112 × 112. During training, a contiguous sequence is randomly sampled with a size of 40 and 50 for GRID and TCD-TIMIT, respectively. We use the Adam optimizer [50] with a 0.0001 learning rate. The α, λ_recon, and λ_sync are empirically set to 2, 50, and 0.5, respectively. The temperature parameter τ is set to 1. For the GAN loss, the non-saturating adversarial loss [34] with R1 regularization [51] is used. |
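
The architectural description in the Experiment Setup row can be made concrete with a short sketch. The following PyTorch module is a hypothetical reconstruction of the visual front-end (one 3D convolution layer followed by ResNet-18); the kernel sizes, strides, and the grayscale 112 × 112 input are assumptions based on common lip-reading front-ends, since the paper defers the exact layer specifications to its supplementary material.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # torchvision >= 0.13 API assumed


class VisualFrontEnd(nn.Module):
    """Hypothetical sketch of the '3D conv + ResNet-18' visual encoder; details are assumed."""

    def __init__(self):
        super().__init__()
        # One 3D convolution over (T, H, W), as in common lip-reading front-ends (assumed kernel/stride).
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # ResNet-18 trunk applied frame-wise; drop its first conv (replaced by the 3D conv) and final fc.
        trunk = resnet18(weights=None)
        self.resnet = nn.Sequential(*list(trunk.children())[1:-1])

    def forward(self, x):            # x: (B, 1, T, 112, 112) grayscale lip crops
        feat = self.conv3d(x)        # (B, 64, T, 28, 28)
        b, c, t, h, w = feat.shape
        feat = feat.transpose(1, 2).reshape(b * t, c, h, w)
        feat = self.resnet(feat)     # (B*T, 512, 1, 1)
        return feat.view(b, t, -1)   # (B, T, 512) per-frame visual features
```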
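The audio preprocessing reported above (16 kHz resampling, 55 Hz high-pass filtering, 80 mel-filter banks, window/hop of 640/160 for 25 fps video and 532/133 for 30 fps video) can be sketched as follows. This is a minimal reconstruction assuming librosa and SciPy; the filter order, the power-to-dB scaling, and the function name `audio_to_mel` are illustrative assumptions rather than details from the paper.

```python
import librosa
from scipy.signal import butter, filtfilt


def audio_to_mel(path, sr=16000, cutoff_hz=55, n_mels=80, win_length=640, hop_length=160):
    """Resample to 16 kHz, high-pass filter at 55 Hz, and compute an 80-bin mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr)                      # resample to 16 kHz
    b, a = butter(5, cutoff_hz, btype="highpass", fs=sr)  # 55 Hz high-pass (filter order assumed)
    y = filtfilt(b, a, y)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win_length, win_length=win_length,
        hop_length=hop_length, n_mels=n_mels,
    )
    return librosa.power_to_db(mel)                       # (80, T) log-mel frames


# GRID / LRW (25 fps video):  win=640, hop=160 -> 100 mel frames/s = 4x the video frame rate
# TCD-TIMIT (30 fps video):   win=532, hop=133 -> ~120 mel frames/s = 4x the video frame rate
```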
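Finally, the setup row states that the GAN objective is the non-saturating adversarial loss with R1 regularization. Below is a minimal PyTorch sketch of those two terms; the R1 weight `gamma` and the helper names are assumptions, as the excerpt does not report them.

```python
import torch
import torch.nn.functional as F


def d_loss_nonsaturating(d_real_logits, d_fake_logits):
    # Non-saturating adversarial loss, discriminator side: softplus(-D(x)) + softplus(D(G(v)))
    return F.softplus(-d_real_logits).mean() + F.softplus(d_fake_logits).mean()


def g_loss_nonsaturating(d_fake_logits):
    # Non-saturating adversarial loss, generator side: softplus(-D(G(v)))
    return F.softplus(-d_fake_logits).mean()


def r1_penalty(discriminator, real_mel, gamma=1.0):
    # R1 regularization: gradient penalty on real samples only; gamma is an assumed value.
    real_mel = real_mel.detach().requires_grad_(True)
    d_real = discriminator(real_mel)
    grad, = torch.autograd.grad(outputs=d_real.sum(), inputs=real_mel, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(1).sum(1).mean()
```

In the reported setup, these adversarial terms would be combined with the reconstruction and synchronization losses weighted by λ_recon = 50 and λ_sync = 0.5 and optimized with Adam at a learning rate of 0.0001.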