Zero-shot High-fidelity and Pose-controllable Character Animation

Authors: Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiment results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations."
Researcher Affiliation | Collaboration | Bingwen Zhu (1,2), Fanyi Wang (3), Tianyi Lu (1,2), Peng Liu (3), Jingwen Su (3), Jinxiu Liu (4), Yanhao Zhang (3), Zuxuan Wu (1,2), Guo-Jun Qi (5), and Yu-Gang Jiang (1,2). Affiliations: (1) Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; (2) Shanghai Collaborative Innovation Center of Intelligent Visual Computing; (3) OPPO AI Center; (4) South China University of Technology; (5) Westlake University
Pseudocode | Yes | Algorithm 1: Pose-aware embedding optimization. Input: source character image $I_s$, source character pose $p_s$, text prompt $C$, target pose sequence $P = \{p_i\}_{i=1}^{N}$, number of frames $N$, and timestep $T$. Output: optimized source embeddings $\{\tilde{e}_{s,t}\}_{t=1}^{T}$, optimized pose-aware embeddings $\{\{\tilde{e}_{x_i,t}\}_{t=1}^{T}\}_{i=1}^{N}$, and latent code $Z_T$. (A hedged sketch of this optimization loop appears after the table.)
Open Source Code | No | The paper mentions leveraging "the official open source code of DisCo" for comparison, but it does not provide any statement or link indicating that its own source code is publicly available.
Open Datasets | Yes | "To further make a comprehensive quantitative performance comparison, we also follow the experimental settings in MagicAnimate, and evaluate both image fidelity and video quality on two benchmark datasets, namely TikTok [Jafarian and Park, 2021] and TED-talks [Siarohin et al., 2021]."
Dataset Splits | No | The paper states: "For quantitative analysis, we first randomly sample 50 in-the-wild image-text pairs and 10 different desired pose sequences to conduct evaluations," but it does not provide specific details on training, validation, or test data splits for reproducibility.
Hardware Specification | Yes | "All experiments are performed on a single NVIDIA A100 GPU."
Software Dependencies | Yes | "We implement PoseAnimate based on the public pre-trained weights of ControlNet [Zhang et al., 2023] and Stable Diffusion [Rombach et al., 2022] v1.5." (A sketch of loading these weights appears after the table.)
Experiment Setup | Yes | For each generated character animation, the authors generate N = 16 frames at a unified 512×512 resolution, using the DDIM sampler [Song et al., 2020] with the default hyperparameters: T = 50 diffusion steps and guidance scale w = 7.5. For the pose-aware control module, the loss function for optimizing the text embedding $e_{\text{text}}$ is MSE; optimization runs for 250 iterations in total with n = 5 inner iterations per step, using the Adam optimizer. (These settings are expressed as a code sketch after the table.)
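
The structure of Algorithm 1 can be pictured as follows. This is a minimal PyTorch sketch assuming only the settings the paper reports (Adam, an MSE objective, 250 iterations with n = 5 inner steps); the embedding shapes and the `diffusion_recon_loss` stand-in are hypothetical, since no official implementation is available.

```python
# Minimal sketch of Algorithm 1 (pose-aware embedding optimization).
# Shapes follow SD v1.5's CLIP text encoder (77 tokens, width 768);
# the loss below is a dummy stand-in for the paper's MSE objective.
import torch

T, N = 50, 16                  # diffusion timesteps, animation frames
TOKENS, D = 77, 768            # CLIP token count and embedding width

# One source embedding per timestep, and one pose-aware embedding per
# (frame, timestep) pair; in the paper these would be initialized from
# the text prompt C rather than from random noise as here.
e_src = torch.nn.Parameter(torch.randn(T, TOKENS, D))
e_pose = torch.nn.Parameter(torch.randn(N, T, TOKENS, D))
optimizer = torch.optim.Adam([e_src, e_pose], lr=1e-3)

def diffusion_recon_loss(emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in: the real objective is an MSE between the
    UNet's noise prediction (conditioned on the embedding and the pose)
    and the target noise recovered from the source image."""
    return (emb ** 2).mean()

for step in range(250):        # 250 optimization iterations (paper setting)
    for _ in range(5):         # n = 5 inner iterations per step
        optimizer.zero_grad()
        loss = diffusion_recon_loss(e_src) + diffusion_recon_loss(e_pose)
        loss.backward()
        optimizer.step()
```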
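For the stated dependencies, the public weights can be loaded with Hugging Face diffusers as sketched below. The checkpoint IDs are common community mirrors and are an assumption; the paper does not name specific repositories.

```python
# Hedged sketch: loading ControlNet + Stable Diffusion v1.5 via diffusers.
# Checkpoint IDs are assumed community mirrors, not cited by the paper.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",   # ControlNet [Zhang et al., 2023]
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",      # Stable Diffusion v1.5
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")                               # single NVIDIA A100 in the paper
```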
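The experiment-setup hyperparameters map directly onto the sampler configuration. Continuing from the pipeline sketch above, and again only as an assumed diffusers-based reconstruction with a placeholder prompt and blank pose maps:

```python
# Generation settings reported in the paper, applied to the `pipe`
# object from the previous sketch; prompt and pose maps are placeholders.
from PIL import Image
from diffusers import DDIMScheduler

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampler

prompt = "a full-body character, studio lighting"  # hypothetical prompt C
pose_sequence = [Image.new("RGB", (512, 512)) for _ in range(16)]  # N = 16 stand-in pose maps

frames = [
    pipe(
        prompt,
        image=pose_map,              # per-frame target pose p_i
        num_inference_steps=50,      # diffusion steps T = 50
        guidance_scale=7.5,          # guidance scale w = 7.5
        height=512, width=512,       # unified 512x512 resolution
    ).images[0]
    for pose_map in pose_sequence
]
```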