Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Authors: Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental

Experiments — Experimental Setup

Datasets and Protocols. For subject customization, we select subjects from image customization papers (Ruiz et al. 2023; Kumari et al. 2023) for a total of 20 subjects. For each subject, we use ChatGPT to generate 10 related prompts, which are used to test the generation of specified motion videos for the subject. All experiments use VideoCrafter2 as the base model. When learning the subject, we use the AdamW optimizer, set the learning rate to 3×10⁻⁵ and the weight decay to 1×10⁻². We perform 10,000 iterations on 4 NVIDIA A100 GPUs. For the Class-specific Prior Preservation Loss, similar to (Ruiz et al. 2023; Kumari et al. 2023), we collected 200 images from LAION-400M (Schuhmann et al. 2021) for each subject as regularization data and set α to 1.0. During inference, we use DDIM (Song, Meng, and Ermon 2020) for 50-step sampling and classifier-free guidance with a cfg scale of 12.0 to generate videos at a resolution of 512×320. For all subjects, to facilitate experimentation and comparison, we uniformly set λs and λl to 0.4 and 0.8 respectively, and set K to 5 based on our observations. In actual use, these parameters can be adjusted by the user.

Baselines. Given that different base models are chosen in the current field of video customization, we reproduce Custom Diffusion (Kumari et al. 2023) and DreamVideo (Wei et al. 2024) based on VideoCrafter2. Since our method does not introduce additional videos as guidance, to ensure fairness, we only reproduce the subject-learning part of DreamVideo for fair comparison. In addition, considering that VDMs need more steps to learn the appearance of the subject, and the default settings of Custom Diffusion and DreamVideo cannot fit the subject's appearance features well, we accordingly extend the training steps of these methods.

Evaluation Metrics. Following (Wei et al. 2024; Wang et al. 2024b), we evaluate our approach with the following four metrics: (1) CLIP-T calculates the average cosine similarity between the CLIP (Radford et al. 2021) image embeddings of all generated frames and their text embedding. (2) CLIP-I measures the visual similarity between the generated and target subjects: we compute the average cosine similarity between the CLIP image embeddings of all generated frames and the target images. (3) DINO-I (Ruiz et al. 2023) is another metric for visual similarity, using ViT-S/16 DINO (Zhang et al. 2022); compared to CLIP, the self-supervised training encourages distinguishing features of individual subjects. (4) Temporal Consistency (Esser et al. 2023): we compute CLIP image embeddings on all generated frames and report the average cosine similarity between all pairs of consecutive frames.

Quantitative Results. We trained 20 subjects using Custom Diffusion, DreamVideo, and our method, respectively. After training, we used each method to generate videos for each subject using 10 prompts, with the same random seed and denoising steps. The results, shown in Table 1, indicate that our method outperforms existing methods on all four metrics. Text alignment and subject fidelity improve significantly, while the temporal consistency of the generated videos is roughly equivalent to that of other methods. The subject-fidelity metrics, CLIP-I and DINO-I, improve by 1.7% and 4.4%, respectively, over existing methods, and text alignment improves by 1.5% over the previous best result.

Qualitative Results. We also visualized some results for qualitative analysis. We used dynamic-video prompts to generate videos of specified subjects, observing the subject fidelity and motion fluency of the generated videos.
As shown in Figure 5(a), when we want to generate a video of a specified plush toy sitting on a child's bed while the camera slowly pans to the right, we find that existing methods overfit the reference image during training: without guidance from additional videos, the generated motions are almost static. Our method, however, can generate videos with fluent motions and the correct concept combination. Likewise, in Figure 5(b), only our method correctly generates the conceptual combination of the cat and the cardboard box, along with the motion of looking around, with high subject fidelity. Furthermore, in Figure 5(c), when we want to generate a video of a musician playing a given guitar, we find that existing methods greatly damage the model's ability to combine concepts: they cannot generate a musician playing the guitar, and the motion is frozen. Similarly, in Figure 5(d), when we want to generate a video of a child handing out the dice toy, a similar situation occurs. Our method successfully generates the combination of the concepts of a child and a toy dice, with smooth motions. Therefore, without guidance from additional videos, our method significantly outperforms existing methods in concept-combination ability and motion naturalness, and has better subject fidelity. Please refer to the supplementary material for more visualizations and demonstration videos. To further validate the effectiveness of our method, we conducted a human evaluation of our method and existing methods without using additional video data as guidance. We invited 20 professionals to evaluate 30 sets of generated video results. For each group, we provided subject images and videos generated under different methods with the same seed and the same text prompt for comparison.

Figure 6: User Study. Our CustomCrafter achieves the best human preference compared with other comparison methods.
We evaluated the quality of the generated videos along four dimensions: Text Alignment, Subject Fidelity, Motion Fluency, and Overall Quality. Text Alignment evaluates whether the generated video matches the text prompt. Subject Fidelity measures whether the generated object is close to the reference image. Motion Fluency evaluates the quality of the motions in the generated video. Overall Quality measures whether the generated video as a whole meets user expectations. As shown in Figure 6, our method gained significantly more user preference on all metrics, demonstrating its effectiveness.

Ablation Study. In this section, we construct ablation studies to validate the effectiveness of each component. As shown in Table 2, we choose Custom Diffusion as the baseline and report the quantitative results of our Spatial Subject Learning Module and Dynamic Weighted Video Sampling Strategy. It can be observed that adding the Spatial Subject Learning Module improves subject fidelity (CLIP-I, DINO-I), and the Dynamic Weighted Video Sampling Strategy further improves text alignment (CLIP-T).

Figure 7: Effect of each design of our method. It can be seen that each design achieves the expected effect. (Panels: (a) without updating SA and without DWVSS, (b) without DWVSS, (c) CustomCrafter; prompt: "plush toy is eating bamboo".)

Table 2: Ablation Study. SSLM is the Spatial Subject Learning Module; DWVSS is the Dynamic Weighted Video Sampling Strategy.

Variant           CLIP-T  CLIP-I  DINO-I  T.Cons.
Baseline          0.286   0.769   0.583   0.992
+ SSLM            0.294   0.790   0.631   0.993
+ SSLM + DWVSS    0.318   0.786   0.627   0.994
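The embedding-similarity metrics quoted above (CLIP-I and Temporal Consistency) reduce to averaged cosine similarities. A minimal sketch, assuming frame and target embeddings have already been extracted with a CLIP image encoder (the function names here are illustrative, not from the released code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_i(frame_embs, target_embs):
    """CLIP-I sketch: mean cosine similarity between every generated-frame
    embedding and every target-image embedding."""
    return float(np.mean([[cosine_sim(f, t) for t in target_embs]
                          for f in frame_embs]))

def temporal_consistency(frame_embs):
    """Temporal Consistency sketch: mean cosine similarity between
    all pairs of consecutive frame embeddings."""
    sims = [cosine_sim(frame_embs[i], frame_embs[i + 1])
            for i in range(len(frame_embs) - 1)]
    return float(np.mean(sims))
```

CLIP-T follows the same pattern, comparing each frame embedding against the prompt's text embedding instead of target images.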
Researcher Affiliation Collaboration Tao Wu 1, Yong Zhang 2*, Xintao Wang 2,4, Xianpan Zhou 3, Guangcong Zheng 1, Zhongang Qi 4, Ying Shan 2,4, Xi Li 1*. 1 College of Computer Science and Technology, Zhejiang University; 2 Tencent AI Lab; 3 Polytechnic Institute, Zhejiang University; 4 ARC Lab, Tencent PCG. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1: Dynamic Weighted Video Sampling Strategy
Input: a source prompt P, a random seed s, a small LoRA weight λs used in Phase 1, a large LoRA weight λl used in Phase 2, and the delimitation point K.
Output: latent code z_0 for generating the video.
  z_T ~ N(0, I), a unit Gaussian random variable sampled with random seed s
  Change(DM, λ, λs)            /* change λ to λs */
  for t = T, T-1, ..., 1 do
    if t == T/K then
      Change(DM, λ, λl)        /* change λ to λl */
    end
    z_{t-1} ← DM(z_t, P, t, s)
  end
  Return: z_0
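Algorithm 1 can be sketched as a short runnable loop, with `denoise_step` and `set_lora_weight` as hypothetical stand-ins for the video diffusion model call and the hook that rescales the LoRA weight:

```python
import numpy as np

def dwvss_sample(denoise_step, set_lora_weight, T=50, K=5,
                 lam_s=0.4, lam_l=0.8, seed=0, shape=(4, 8, 8)):
    """Sketch of the Dynamic Weighted Video Sampling Strategy:
    denoise with a small LoRA weight lam_s in Phase 1 (layout/motion),
    then switch to a large LoRA weight lam_l in Phase 2 (subject
    appearance) once t reaches T // K."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)       # z_T ~ N(0, I), seeded with s
    set_lora_weight(lam_s)               # Phase 1: small LoRA weight
    for t in range(T, 0, -1):
        if t == T // K:
            set_lora_weight(lam_l)       # Phase 2: large LoRA weight
        z = denoise_step(z, t)           # z_{t-1} <- DM(z_t, P, t)
    return z                             # z_0
```

With the paper's settings (T=50, K=5), the switch happens at t=10, so the last 10 denoising steps run with the larger LoRA weight.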
Open Source Code Yes Code: https://github.com/WuTao-CS/CustomCrafter
Open Datasets Yes For the Class-specific Prior Preservation Loss, similar to (Ruiz et al. 2023; Kumari et al. 2023), we collected 200 images from LAION-400M (Schuhmann et al. 2021) for each subject as regularization data and set α to 1.0.
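The Class-specific Prior Preservation Loss quoted above combines a reconstruction term on the subject data with an α-weighted term on the class regularization images. A minimal sketch, using plain MSE as a stand-in for the actual diffusion noise-prediction loss (the function name is illustrative):

```python
import numpy as np

def prior_preservation_loss(pred_subject, target_subject,
                            pred_prior, target_prior, alpha=1.0):
    """Total loss = reconstruction loss on the subject images
    + alpha * the same loss on the class-prior (regularization) images.
    MSE here stands in for the diffusion noise-prediction loss."""
    l_subject = np.mean((pred_subject - target_subject) ** 2)
    l_prior = np.mean((pred_prior - target_prior) ** 2)
    return float(l_subject + alpha * l_prior)
```

Setting α to 1.0, as in the paper, weights the regularization term equally with the subject term.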
Dataset Splits No For subject customization, we select subjects from image customization papers (Ruiz et al. 2023; Kumari et al. 2023) for a total of 20 subjects. For each subject, we use ChatGPT to generate 10 related prompts, which are used to test the generation of specified motion videos for the subject. For the Class-specific Prior Preservation Loss, similar to (Ruiz et al. 2023; Kumari et al. 2023), we collected 200 images from LAION-400M (Schuhmann et al. 2021) for each subject as regularization data and set α to 1.0. The paper describes how data is used for testing and regularization but does not provide specific training/validation/test splits for the main model training.
Hardware Specification Yes We perform 10,000 iterations on 4 NVIDIA A100 GPUs.
Software Dependencies No All experiments use VideoCrafter2 as the base model. When learning the subject, we use the AdamW optimizer, set the learning rate to 3×10⁻⁵ and the weight decay to 1×10⁻². During the inference process, we use DDIM (Song, Meng, and Ermon 2020) for 50-step sampling and classifier-free guidance with a cfg scale of 12.0 to generate videos at a resolution of 512×320. The paper mentions several software components (VideoCrafter2, the AdamW optimizer, DDIM, and a CLIP encoder implied by the CLIP metrics) but does not specify their version numbers.
Experiment Setup Yes When learning the subject, we use the AdamW optimizer, set the learning rate to 3×10⁻⁵ and the weight decay to 1×10⁻². We perform 10,000 iterations on 4 NVIDIA A100 GPUs. For the Class-specific Prior Preservation Loss, similar to (Ruiz et al. 2023; Kumari et al. 2023), we collected 200 images from LAION-400M (Schuhmann et al. 2021) for each subject as regularization data and set α to 1.0. During the inference process, we use DDIM (Song, Meng, and Ermon 2020) for 50-step sampling and classifier-free guidance with a cfg scale of 12.0 to generate videos at a resolution of 512×320. For all subjects, to facilitate experimentation and comparison, we uniformly set λs and λl to 0.4 and 0.8 respectively, and set K to 5 based on our observations.
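For reproducibility purposes, the hyperparameters quoted above can be collected into a single configuration sketch (key names are illustrative, not taken from the released code):

```python
# Hyperparameters as quoted from the paper's experimental setup.
CONFIG = {
    "base_model": "VideoCrafter2",
    "optimizer": "AdamW",
    "learning_rate": 3e-5,
    "weight_decay": 1e-2,
    "train_iterations": 10_000,
    "hardware": "4x NVIDIA A100",
    "prior_images_per_subject": 200,   # collected from LAION-400M
    "alpha_prior": 1.0,                # prior preservation weight
    "sampler": "DDIM",
    "sampling_steps": 50,
    "cfg_scale": 12.0,
    "resolution": (512, 320),          # width x height
    "lambda_s": 0.4,                   # small LoRA weight (Phase 1)
    "lambda_l": 0.8,                   # large LoRA weight (Phase 2)
    "K": 5,                            # delimitation point
}
```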