Consistent Video-to-Video Transfer Using Synthetic Dataset

Authors: Jiaxin Cheng, Tianjun Xiao, Tong He

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical evaluations confirm the effectiveness of LVSC and motion compensation in enhancing video quality and consistency." (Section 5, Experiments)
Researcher Affiliation | Industry | Jiaxin Cheng, Tianjun Xiao & Tong He, Amazon Web Services Shanghai AI Lab ({cjiaxin,tianjux,htong}@amazon.com)
Pseudocode | No | The paper describes its methods procedurally but does not include any formally labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/amazon-science/instruct-video-to-video/tree/main
Open Datasets | Yes | "Our synthetic dataset is constructed using paired prompts from two differentiated sources, each serving a specific purpose in the training process. The first source, LAION-IPTP, employs a finetuned GPT-3 model... The second source, WebVid-MPT, leverages video-specific captions from the WebVid-10M dataset (Bain et al., 2021)."
Dataset Splits | No | The paper does not explicitly provide the training/validation/test splits needed to reproduce its experiments. It describes the TGVE dataset used for evaluation, but not how data are split for training or validation.
Hardware Specification | Yes | "This training process takes approximately 30 hours to complete on four NVIDIA A10G GPUs."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, Stable Diffusion, GPT-3, MPT-30B, DDIM, and CLIP-based filtering, but does not specify their versions or the versions of underlying libraries and frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | "Our training procedure makes use of the Adam optimizer with a learning rate set at 5 × 10^-5. The model is trained with a batch size of 512 over a span of 2,000 iterations. This training process takes approximately 30 hours to complete on four NVIDIA A10G GPUs. During sampling, we experiment with varying hyperparameters for video classifier-free guidance (VCFG) within the choices [1.2, 1.5, 1.8], with text classifier-free guidance set to 10, and video resolutions of 256 and 384." (See the configuration sketch after this table.)
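
The quoted experiment setup is easiest to read as a configuration plus a small sampling sweep. The sketch below is not the authors' code: it restates the reported training settings, then shows one plausible way the VCFG/resolution sweep could be organized, assuming an InstructPix2Pix-style dual classifier-free guidance where the source video and the edit text each get their own scale. The names `dual_cfg`, `sweep`, and `pipeline.sample` are hypothetical and only illustrate the reported hyperparameters.

```python
from itertools import product

# Training settings as reported in the paper: Adam, lr 5e-5, batch size 512,
# 2,000 iterations, roughly 30 hours on four NVIDIA A10G GPUs.
TRAIN_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 5e-5,
    "batch_size": 512,
    "iterations": 2_000,
    "hardware": "4x NVIDIA A10G",
}

# Sampling sweep as reported: VCFG in {1.2, 1.5, 1.8}, text CFG fixed at 10,
# output resolutions of 256 and 384.
VCFG_SCALES = [1.2, 1.5, 1.8]
TEXT_CFG_SCALE = 10.0
RESOLUTIONS = [256, 384]


def dual_cfg(eps_uncond, eps_video, eps_full, s_video, s_text):
    """Combine noise predictions with separate video and text guidance scales.

    This mirrors InstructPix2Pix-style dual classifier-free guidance; whether the
    paper uses exactly this combination is an assumption, not a claim.
      eps_uncond : prediction with neither condition
      eps_video  : prediction conditioned on the source video only
      eps_full   : prediction conditioned on both the source video and the edit text
    """
    return (eps_uncond
            + s_video * (eps_video - eps_uncond)
            + s_text * (eps_full - eps_video))


def sweep(pipeline, video, prompt):
    """Run the reported hyperparameter sweep with a hypothetical pipeline object."""
    results = []
    for s_video, res in product(VCFG_SCALES, RESOLUTIONS):
        edited = pipeline.sample(          # hypothetical API, for illustration only
            video=video,
            prompt=prompt,
            video_guidance_scale=s_video,
            text_guidance_scale=TEXT_CFG_SCALE,
            resolution=res,
        )
        results.append(((s_video, res), edited))
    return results
```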