Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos
Authors: Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, Qifeng Chen
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments compared with various baselines demonstrate the superiority of our approach in terms of generation quality, text-video alignment, pose-video alignment, and temporal coherence. Implementation Details: We implement our method based on the official codebase of Stable Diffusion (Rombach et al. 2022), the publicly available 1.4-billion-parameter T2I model, and several LoRA models from CivitAI. We freeze the image autoencoder to encode each video frame into a latent representation individually. We first train our model for 100k steps on LAION-Pose, then for 50k steps on HDVILA (Xue et al. 2022). Eight consecutive frames at a resolution of 512 × 512 are sampled from the input video for temporal-consistency learning. Training is performed on 8 NVIDIA A100 (40 GB) GPUs and completes within two days. At inference, we apply the DDIM sampler (Song, Meng, and Ermon 2020) with classifier-free guidance (Ho and Salimans 2022) for pose-guided T2V generation. Quantitative results: 1) CLIP score: We follow (Ho et al. 2022a) and evaluate our approach with the CLIP score (Nguyen et al. 2021; Park et al. 2021) for video-text alignment. We compute the CLIP score for each frame and average across all frames; the final score is calculated over 1024 video samples. The results are reported in Tab. 1. Our approach produces a higher CLIP score, demonstrating better video-text alignment than the other two approaches. 2) Quality: Following Make-A-Video (Singer et al. 2022), we conduct a human evaluation of video quality on a test set of 32 videos: we display three videos in random order and ask the evaluators to identify the one with superior quality. The raters favor the videos generated by our approach over Tune-A-Video and ControlNet in video quality. 3) Pose accuracy: We regard the input pose sequence as the ground truth and evaluate the average precision of the skeleton on 1024 video samples. For a fair comparison, we adopt the same pose detector (Sun et al. 2019) for both LAION-Pose collection and evaluation. The results show that our model achieves performance comparable to ControlNet. 4) Frame consistency: Following (Esser et al. 2023), we report frame consistency measured via the CLIP cosine similarity of consecutive frames (Tab. 1). Our model outperforms ControlNet in temporal consistency, which demonstrates the necessity of our temporal designs, and obtains a score comparable to Tune-A-Video; however, Tune-A-Video must overfit an input video, which makes it hard to serve as a general video generation model. Qualitative results: We compare our approach with four other methods using the same pose sequences and text prompts in Fig. 5; our approach achieves better consistency and artistry. In a similar setting, we also compare with ControlNet and T2I-Adapter in Fig. 7. ControlNet shows an inconsistent background (e.g., the color of the street and the boy's shirt), a phenomenon that also appears in the third row of Fig. 7 (the wall color). In contrast, our approach effectively addresses temporal consistency and learns good inter-frame coherence. Ablation study, effect of the residual pose encoder: We ablate the feature-residual design of the proposed pose encoder. A natural way to add control information to the diffusion model is to concatenate the condition directly onto the model's noisy input latent (Rombach et al. 2022); however, we find that concatenation performs worse than the residual approach (first vs. second row of Fig. 8). Injecting extra controls as feature residuals preserves more of the generation ability of pretrained Stable Diffusion, because concatenation requires retraining the first convolutional layer to match the new number of input channels, which sacrifices the pretrained high-quality image-synthesis prior. Number of layers for condition control: We also ablate the number of layers into which the control signals are injected (second vs. third row of Fig. 8). Adding controls to more layers leads to improved pose-frame alignment; adding the pose to a single layer results in a mismatch between the target arm and the generated arm of Iron Man. (Sketches of the CLIP-based metrics and of the residual vs. concatenation pose injection follow after this table.) |
| Researcher Affiliation | Collaboration | Yue Ma1*, Yingqing He2*, Xiaodong Cun3, Xintao Wang3, Siran Chen4, Xiu Li1, Qifeng Chen2. 1 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; 2 The Hong Kong University of Science and Technology, Hong Kong; 3 Tencent AI Lab, Shenzhen, China; 4 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. {y-ma21, lixiu}@mails.tsinghua.edu.cn, yhebm@connect.ust.hk, {vinthony,xintao.alpha}@gmail.com, Chensiran17@mails.ucas.ac.cn, cqf@ust.hk. *Equal contributions. Work done during an internship at Tencent AI Lab. |
| Pseudocode | No | The paper describes the method's stages and components in text and diagrams (e.g., Figure 3), but it does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available at https://follow-your-pose.github.io/. |
| Open Datasets | Yes | Training Stage 1: Pose-Controllable Text-to-Image Generation. In this stage, we train the pose-controllable text-to-image models. However, no existing dataset provides pose-image-caption pairs for pose-guided generation. We therefore collect human skeleton images from LAION (Schuhmann et al. 2021) using MMPose (Contributors 2020), retaining only images in which more than 50% of the keypoints could be detected. The result is an image-text-pose dataset named LAION-Pose, which contains diverse human-like characters in various background contexts. Training Stage 2: Video Generation via Pose-Free Videos. The stage-1 model can generate videos with similar poses, yet the background is inconsistent across frames. We therefore further finetune the stage-1 model on the pose-free video dataset HDVILA (Xue et al. 2022), which contains continuous in-the-wild video-text pairs. (A sketch of the keypoint-based filtering rule follows after the table.) |
| Dataset Splits | No | The paper states training steps for LAION-Pose and HDVILA, and evaluation metrics on a certain number of video samples for testing, but it does not provide explicit training, validation, and test splits (e.g., percentages or counts for each subset) for reproducibility. |
| Hardware Specification | Yes | The training process is performed on 8 NVIDIA A100 (40 GB) GPUs and can be completed within two days. |
| Software Dependencies | No | The paper mentions implementing the method based on 'the official codebase of Stable Diffusion (Rombach et al. 2022)' and 'the publicly available 1.4 billion parameter T2I model', but it does not specify exact version numbers for software components such as Stable Diffusion, PyTorch, Python, or CUDA. |
| Experiment Setup | Yes | We first train our model for 100k steps on LAION-Pose, then for 50k steps on HDVILA (Xue et al. 2022). Eight consecutive frames at a resolution of 512 × 512 are sampled from the input video for temporal-consistency learning. At inference, we apply the DDIM sampler (Song, Meng, and Ermon 2020) with classifier-free guidance (Ho and Salimans 2022) for pose-guided T2V generation. (An inference sketch follows after the table.) |
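
The quantitative evaluation quoted in the Research Type row rests on two CLIP-based metrics: a per-frame CLIP score averaged over all frames for text-video alignment, and frame consistency measured as the CLIP cosine similarity of consecutive frame embeddings. Below is a minimal sketch of both, assuming the frames are decoded as PIL images and using the `openai/clip-vit-large-patch14` checkpoint; the paper does not name the exact CLIP variant, and the function name `clip_scores` is illustrative.

```python
# Hedged sketch: per-frame CLIP text-video alignment and frame consistency.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_scores(frames: list[Image.Image], prompt: str) -> tuple[float, float]:
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    # Text-video alignment: cosine similarity of each frame with the prompt,
    # averaged over the frames of one video.
    clip_score = (img_emb @ txt_emb.T).squeeze(-1).mean().item()

    # Frame consistency: cosine similarity of consecutive frame embeddings,
    # averaged over adjacent frame pairs.
    consistency = (img_emb[:-1] * img_emb[1:]).sum(dim=-1).mean().item()
    return clip_score, consistency
```

In the paper both numbers are then averaged over the evaluation set (1024 videos for CLIP score and pose accuracy, per the quoted text).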
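
The ablation in the same row contrasts residual injection of pose features with channel concatenation onto the noisy latent. The sketch below illustrates the two variants in simplified form; `PoseEncoder`, `inject_residual`, and `inject_concat` are illustrative stand-ins rather than the authors' code, and the channel widths mirror the Stable Diffusion U-Net only by assumption.

```python
# Hedged sketch of the two pose-injection variants compared in the ablation.
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Toy downsampling CNN mapping a pose image to multi-scale feature maps."""
    def __init__(self, channels=(320, 640, 1280)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.SiLU()))
            in_ch = out_ch

    def forward(self, pose: torch.Tensor) -> list[torch.Tensor]:
        feats, x = [], pose
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

# (a) Residual injection (the design the paper keeps): add each pose feature map
#     to the matching U-Net encoder feature map, leaving pretrained layers untouched.
def inject_residual(unet_features, pose_features):
    return [f + p for f, p in zip(unet_features, pose_features)]

# (b) Channel concatenation (the ablated alternative): stack the pose map onto the
#     noisy latent, which forces retraining the U-Net's first convolution to accept
#     the extra channels and sacrifices the pretrained image-synthesis prior.
def inject_concat(noisy_latent, pose_latent):
    return torch.cat([noisy_latent, pose_latent], dim=1)
```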
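
The LAION-Pose construction rule quoted in the Open Datasets row (keep an image only if more than 50% of its keypoints are detected) could be implemented roughly as follows. This assumes MMPose 1.x's `MMPoseInferencer` interface and result layout, plus an assumed per-keypoint confidence threshold of 0.3; the paper only states that MMPose is used.

```python
# Rough sketch of the LAION-Pose keypoint filter (interface and threshold assumed).
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer("human")   # default COCO 17-keypoint body model
KPT_CONF = 0.3                           # assumed per-keypoint confidence cutoff

def keep_image(image_path: str, min_ratio: float = 0.5) -> bool:
    """Return True if any detected person has > min_ratio of keypoints detected."""
    result = next(inferencer(image_path))
    instances = result["predictions"][0]   # instance predictions for this image
    for inst in instances:
        scores = inst["keypoint_scores"]
        detected = sum(s > KPT_CONF for s in scores)
        if detected / len(scores) > min_ratio:
            return True
    return False
```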
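
The Experiment Setup row states that inference uses a DDIM sampler with classifier-free guidance. As a stand-in, the snippet below shows how that configuration is typically wired up with the `diffusers` text-to-image pipeline on the Stable Diffusion 1.4 backbone; the prompt, step count, and guidance scale are assumed values rather than settings reported in the paper, and the authors' actual pose-conditioned video pipeline is provided in their released code.

```python
# Hedged sketch: DDIM sampling with classifier-free guidance on Stable Diffusion 1.4.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # swap in DDIM
pipe = pipe.to("cuda")

image = pipe(
    "Iron Man dancing on the beach",  # illustrative prompt, not from the paper
    num_inference_steps=50,           # assumed DDIM step count
    guidance_scale=7.5,               # classifier-free guidance weight (assumed)
).images[0]
image.save("sample.png")
```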