GAIA: Zero-shot Talking Avatar Generation
Authors: Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): Benefitting from the disentanglement between motion and appearance, GAIA enables two common scenarios: the video-driven generation which aims to generate results with the appearance from a reference image and the motion from a driving video, and the speech-driven generation where the motion is predicted from a speech clip. The video-driven generation evaluates the VAE, while the speech-driven one evaluates the whole GAIA system. We compare GAIA with state-of-the-art methods for the two scenarios in Sec. 5.2, and further make detailed analyses in Sec. 5.3 to understand the model better. (See the pipeline sketch below the table.) |
| Researcher Affiliation | Industry | Microsoft {tianyuhe,junliangguo,v-runyiyu,v-yuchiwang,xuta}@microsoft.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper provides a URL (https://microsoft.github.io/GAIA) which is a project page, not a direct link to a source-code repository. It does not explicitly state that source code is available at this link or elsewhere. |
| Open Datasets | Yes | For high-quality public datasets, we collect High Definition Talking Face Dataset (HDTF) (Zhang et al., 2021) and Casual Conversation datasets v1&v2 (CC v1&v2) (Hazirbas et al., 2021; Porgali et al., 2023) |
| Dataset Splits | Yes | We train our model on the union of the datasets described in Sec. 3, and we randomly sample 100 videos from them as the validation set. (See the split sketch below the table.) |
| Hardware Specification | Yes | For both the VAE and the diffusion model, we adopt Adam (Kingma & Ba, 2015) optimizer and train our models on 16 V100 GPUs. |
| Software Dependencies | No | The paper mentions several tools and models (e.g., 'wav2vec 2.0', 'Adam optimizer', 'Conformer', 'CLIP', '3DDFA', 'dlib') but does not provide specific version numbers for any software dependencies or libraries needed for reproducibility. |
| Experiment Setup | Yes | The learning rate is set to 4.5e-6 and keeps constant during training. [...] The learning rate starts from 1.0e-4 and follows the inverse square root schedule. [...] We use the resolution of 256×256 for all the settings. (See the schedule sketch below the table.) |
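
The Research Type row quotes the paper's two evaluation scenarios: video-driven generation (appearance from a reference image, motion from a driving video) and speech-driven generation (motion predicted from speech). Below is a minimal sketch of how the two could be wired together; the `encode_appearance`, `encode_motion`, `sample`, and `decode` names are illustrative assumptions, not the authors' actual interfaces.

```python
# Minimal sketch of GAIA's two inference scenarios (hypothetical interfaces).
# encode_appearance / encode_motion / decode stand in for the VAE stage;
# diffusion.sample stands in for the speech-conditioned motion predictor.

def video_driven(vae, reference_image, driving_video):
    """Appearance from a reference image, motion copied from a driving video."""
    appearance = vae.encode_appearance(reference_image)
    frames = []
    for frame in driving_video:
        motion = vae.encode_motion(frame)            # per-frame motion latent
        frames.append(vae.decode(appearance, motion))
    return frames

def speech_driven(vae, diffusion, reference_image, speech_clip):
    """Appearance from a reference image, motion predicted from a speech clip."""
    appearance = vae.encode_appearance(reference_image)
    motion_sequence = diffusion.sample(speech_clip, appearance)  # predicted motion latents
    return [vae.decode(appearance, motion) for motion in motion_sequence]
```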
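
The Dataset Splits row notes that 100 videos are randomly sampled from the training union as validation. A minimal sketch of such a hold-out split follows, assuming a flat list of video paths; the seed and exact sampling procedure are not specified in the paper.

```python
import random

def split_validation(video_paths, num_val=100, seed=0):
    """Hold out `num_val` videos for validation; the seed is an assumption."""
    rng = random.Random(seed)
    val = set(rng.sample(video_paths, num_val))
    train = [p for p in video_paths if p not in val]
    return train, sorted(val)

# Toy usage with placeholder paths; the real union spans HDTF, CC v1&v2, and other collected data.
train_videos, val_videos = split_validation([f"video_{i:05d}.mp4" for i in range(16000)])
```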
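
The Experiment Setup row quotes a constant 4.5e-6 learning rate for the VAE and an inverse square root schedule starting from 1.0e-4 for the diffusion model. Below is a minimal sketch of an inverse square root schedule using PyTorch's `LambdaLR`; the warmup length is an assumption, since the paper does not state one.

```python
import torch

def inverse_sqrt_lambda(warmup_steps=1000):
    """Multiplicative LR factor: linear warmup, then decay proportional to 1/sqrt(step).

    The warmup length is an assumption; the paper only gives the starting LR
    (1.0e-4) and says it follows the inverse square root schedule.
    """
    def factor(step):
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps
        return (warmup_steps / step) ** 0.5
    return factor

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the diffusion model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-4)  # peak LR from the paper
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_lambda())

for step in range(5):  # toy loop: step the optimizer, then the schedule
    optimizer.step()
    scheduler.step()
```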