Zero-shot High-fidelity and Pose-controllable Character Animation
Authors: Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiment results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations. |
| Researcher Affiliation | Collaboration | Bingwen Zhu (1,2), Fanyi Wang (3), Tianyi Lu (1,2), Peng Liu (3), Jingwen Su (3), Jinxiu Liu (4), Yanhao Zhang (3), Zuxuan Wu (1,2), Guo-Jun Qi (5), and Yu-Gang Jiang (1,2). 1: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2: Shanghai Collaborative Innovation Center of Intelligent Visual Computing; 3: OPPO AI Center; 4: South China University of Technology; 5: Westlake University |
| Pseudocode | Yes | Algorithm 1 Pose-aware embedding optimization. Input: source character image I_s, source character pose p_s, text prompt C, target pose sequence P = {p_i}_{i=1}^{N}, number of frames N, timestep T. Output: optimized source embeddings {e_{s,t}}_{t=1}^{T}, optimized pose-aware embeddings {{e_{x_i,t}}_{t=1}^{T}}_{i=1}^{N}, and latent code Z_T. (A hedged Python sketch of this optimization loop is given after the table.) |
| Open Source Code | No | The paper mentions leveraging "the official open source code of DisCo" for comparison, but it does not provide any statement or link indicating that its own method's source code is publicly available. |
| Open Datasets | Yes | To further make a comprehensive quantitative performance comparison, we also follow the experimental settings in MagicAnimate, and evaluate both image fidelity and video quality on two benchmark datasets, namely TikTok [Jafarian and Park, 2021] and TED-talks [Siarohin et al., 2021]. |
| Dataset Splits | No | The paper states: "For quantitative analysis, we first randomly sample 50 in-the-wild image-text pairs and 10 different desired pose sequences to conduct evaluations," but does not provide specific details on training, validation, or test data splits for reproducibility. |
| Hardware Specification | Yes | All experiments are performed on a single NVIDIA A100 GPU. |
| Software Dependencies | Yes | We implement PoseAnimate based on the public pre-trained weights of ControlNet [Zhang et al., 2023] and Stable Diffusion [Rombach et al., 2022] v1.5. (A hedged loading sketch using the diffusers library follows the table.) |
| Experiment Setup | Yes | For each generated character animation, we generate N = 16 frames at a unified 512 × 512 resolution. In the experiment, we use the DDIM sampler [Song et al., 2020] with the default hyperparameters: number of diffusion steps T = 50 and guidance scale w = 7.5. For the pose-aware control module, the loss function used to optimize the text embeddings is MSE. The optimization iterations are 250 in total with n = 5 inner iterations per step, and the optimizer is Adam. (See the settings sketch after the table.) |
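
Since the authors do not release code, the following is a minimal Python sketch of how the per-timestep embedding optimization in Algorithm 1 could look. The `denoise` and `ddim_step` callables are hypothetical stand-ins for the paper's ControlNet-conditioned Stable Diffusion UNet and DDIM update; only the loop shape (MSE loss, Adam, T = 50 timesteps × n = 5 inner iterations = 250 total) is taken from the reported setup.

```python
# Minimal sketch of Algorithm 1 (pose-aware embedding optimization).
# `denoise` and `ddim_step` are HYPOTHETICAL stand-ins for the paper's
# ControlNet-conditioned UNet and DDIM update; only the loop structure
# (MSE loss, Adam, 50 timesteps x 5 inner iterations = 250 total)
# follows the reported experiment setup.
import torch
import torch.nn.functional as F

def optimize_pose_aware_embedding(z_T, src_emb, src_pose, tgt_pose,
                                  timesteps, denoise, ddim_step,
                                  n_inner=5, lr=1e-3):
    """Tune a target-pose embedding so that denoising under the target
    pose tracks the source character's denoising trajectory."""
    emb = src_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    z = z_T.clone()
    for t in timesteps:                        # T = 50 DDIM steps
        with torch.no_grad():                  # reference prediction
            eps_src = denoise(z, t, src_emb, src_pose)
        for _ in range(n_inner):               # n = 5 inner iterations
            eps_tgt = denoise(z, t, emb, tgt_pose)
            loss = F.mse_loss(eps_tgt, eps_src)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                  # advance the latent code
            z = ddim_step(z, t, denoise(z, t, emb, tgt_pose))
    return emb.detach()
```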
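
For the software stack, a plausible reconstruction with the public diffusers library is sketched below. The two Hugging Face checkpoint IDs are assumptions; the paper only names ControlNet and Stable Diffusion v1.5 generically, not specific repositories.

```python
# Hedged reconstruction of the stated dependencies: public ControlNet and
# Stable Diffusion v1.5 weights with a 50-step DDIM sampler. The two
# Hugging Face model IDs are ASSUMED public checkpoints (the paper does
# not name specific repositories).
import torch
from diffusers import (ControlNetModel, DDIMScheduler,
                       StableDiffusionControlNetPipeline)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")  # the paper reports a single NVIDIA A100 GPU
```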
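
Finally, the reported generation settings (N = 16 frames, 512 × 512, T = 50, w = 7.5) map onto that pipeline as follows. The prompt and pose-frame paths are hypothetical; `pipe` is the pipeline built in the previous sketch.

```python
# Sketch of the reported generation settings: N = 16 frames at 512 x 512,
# T = 50 DDIM steps, guidance scale w = 7.5. Prompt and pose-frame paths
# are HYPOTHETICAL; `pipe` is the pipeline from the previous sketch.
from PIL import Image

pose_frames = [Image.open(f"poses/frame_{i:02d}.png").resize((512, 512))
               for i in range(16)]               # N = 16 target poses
animation = [pipe("a full-body photo of the source character",
                  image=pose, height=512, width=512,
                  num_inference_steps=50,        # T = 50
                  guidance_scale=7.5).images[0]  # w = 7.5
             for pose in pose_frames]
```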