Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation
Authors: Tserendorj Adiya, Jae Shin Yoon, Jungeun Lee, Sanghun Kim, Hwasup Lim
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, our method demonstrates strong performance compared to existing unidirectional approaches with realistic temporal coherence. We validate our bidirectional temporal diffusion model on two tasks: |
| Researcher Affiliation | Collaboration | Tserendorj Adiya (AI Center, CJ Corporation), Jae Shin Yoon (Adobe), Jungeun Lee (Korea Institute of Science and Technology), Sanghun Kim (Korea Institute of Science and Technology), Hwasup Lim (Korea Institute of Science and Technology) |
| Pseudocode | Yes | Algorithm 1 Bidirectional Recursive Sampling. Input: initial noisy inputs Y^K = {y^K_1, ..., y^K_T}, driven pose sequence S = {s_1, ..., s_T}. Output: denoised animation Y^0 = {y^0_1, ..., y^0_T}. for k = K−1 to 0 step −1 do: if K−k is odd then (Direction: Forward) for t = 1 to T do y^{k−1}_t = f_θ(y^k_t, y^k_{t−1}, λ(k), s_t, d_f) end for; else (Direction: Backward) for t = T to 1 step −1 do y^{k−1}_{t−1} = f_θ(y^k_{t−1}, y^k_t, λ(k), s_{t−1}, d_b) end for; end if; end for. (A runnable sketch of this loop follows the table.) |
| Open Source Code | No | The paper does not provide a direct link to a code repository or an explicit statement about the release of its source code. |
| Open Datasets | Yes | UBC Fashion dataset Zablotskaia et al. (2019): it consists of 500 training and 100 testing videos of individuals wearing various outfits and rotating 360 degrees. Each video lasts approximately 12 seconds at 30 FPS. |
| Dataset Splits | Yes | The dataset includes a total of 80 training videos and 19 testing videos, each of which lasts 32 seconds at 30 FPS. |
| Hardware Specification | Yes | training the BTU-Net and SR3 models using the UBC fashion dataset requires 15 and 30 epochs, respectively, on a setup of four A100 GPUs, completed within 67 hours. |
| Software Dependencies | No | The paper mentions software tools like Character Creator 4, iClone8, Mixamo motion data, and Nvidia Omniverse, as well as SR3 (with a citation). However, it does not provide specific version numbers for these or other programming libraries/frameworks (e.g., Python, PyTorch, CUDA) required for reproducibility. |
| Experiment Setup | Yes | Our method is trained at a resolution of 256x256... for 50k and 100k iterations with a batch size of 32, respectively. We set the denoising step to K = 1000 and the learning rate to 1e-5. During testing, we fine-tune the model with the test appearance condition for 300 iterations with a learning rate of 1e-5. It should be noted that we employ K = 50 at test time for expedited generation. |
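
To make the bidirectional recursive sampling quoted in the Pseudocode row concrete, below is a minimal PyTorch sketch of the alternating forward/backward denoising sweep. The denoiser `f_theta`, the noise-level embedding `lam`, the direction flags `D_FWD`/`D_BWD`, the boundary handling, and all tensor shapes are stand-ins of my own and not the authors' implementation; only the loop structure follows the quoted algorithm.

```python
import torch

# Hypothetical sizes; the paper trains at 256x256.
K = 50                      # reverse-diffusion steps at test time (K = 1000 during training)
T = 16                      # number of animation frames
C, H, W = 3, 256, 256

D_FWD, D_BWD = 0, 1         # hypothetical stand-ins for the direction conditions d_f / d_b

def f_theta(y_t, y_neighbor, lam_k, s_t, direction):
    """Stand-in for the denoiser f_theta: one reverse step for frame y_t,
    conditioned on a temporal neighbor, the driving pose s_t, the noise
    level lam_k, and the pass direction. Returns a tensor of the same shape."""
    return y_t - 0.0 * (y_neighbor + s_t + lam_k + direction)   # placeholder update

def lam(k):
    """Stand-in for the noise-level embedding lambda(k)."""
    return torch.tensor(k / K)

# Y[t]: frame t at the current noise level, initialized from pure noise.
Y = [torch.randn(C, H, W) for _ in range(T)]
S = [torch.randn(C, H, W) for _ in range(T)]    # driving pose sequence

for k in range(K - 1, -1, -1):
    Y_k = list(Y)                               # frames at noise level k, read-only this pass
    if (K - k) % 2 == 1:                        # odd pass: sweep forward in time
        for t in range(T):
            prev = Y_k[t - 1] if t > 0 else Y_k[t]          # boundary: condition on itself
            Y[t] = f_theta(Y_k[t], prev, lam(k), S[t], D_FWD)
    else:                                       # even pass: sweep backward in time
        for t in range(T - 1, -1, -1):
            nxt = Y_k[t + 1] if t < T - 1 else Y_k[t]
            Y[t] = f_theta(Y_k[t], nxt, lam(k), S[t], D_BWD)
# Y now holds the (sketch-)denoised animation frames.
```

Snapshotting `Y_k` at the start of each pass keeps the conditioning neighbor at noise level k, which matches the quoted update rule rather than conditioning on frames already denoised within the same sweep.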
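For quick reference, the settings quoted in the Experiment Setup row can be grouped into a small configuration sketch. The key names below are illustrative and do not come from the paper or its code.

```python
# Hypothetical grouping of the reported hyperparameters; values are quoted
# from the paper, key names and structure are assumptions.
train_config = {
    "resolution": (256, 256),
    "iterations": (50_000, 100_000),   # "50k and 100k iterations ... respectively" (two training stages)
    "batch_size": 32,
    "denoising_steps_K": 1000,
    "learning_rate": 1e-5,
}
test_config = {
    "appearance_finetune_iterations": 300,
    "learning_rate": 1e-5,
    "denoising_steps_K": 50,           # reduced at test time for expedited generation
}
```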