Consistent Video-to-Video Transfer Using Synthetic Dataset
Authors: Jiaxin Cheng, Tianjun Xiao, Tong He
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations confirm the effectiveness of LVSC and motion compensation in enhancing video quality and consistency. (Section 5, Experiments) |
| Researcher Affiliation | Industry | Jiaxin Cheng, Tianjun Xiao & Tong He, Amazon Web Services Shanghai AI Lab, {cjiaxin,tianjux,htong}@amazon.com |
| Pseudocode | No | The paper describes its methods procedurally but does not include any formally labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/amazon-science/instruct-video-to-video/tree/main |
| Open Datasets | Yes | Our synthetic dataset is constructed using paired prompts from two differentiated sources, each serving a specific purpose in the training process. The first source, LAION-IP2P, employs a finetuned GPT-3 model... The second source, WebVid-MPT, leverages video-specific captions from the WebVid-10M dataset (Bain et al., 2021). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits for reproducibility of their experiments. It describes the TGVE dataset for evaluation, but not how it's split for training or validation. |
| Hardware Specification | Yes | This training process takes approximately 30 hours to complete on four NVIDIA A10G GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'Stable Diffusion', 'GPT-3 model', 'MPT-30B', 'DDIM', and 'CLIP-based filtering' but does not specify their version numbers or the versions of underlying libraries/frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | Our training procedure makes use of the Adam optimizer with a learning rate of 5 × 10⁻⁵. The model is trained with a batch size of 512 over a span of 2,000 iterations. This training process takes approximately 30 hours to complete on four NVIDIA A10G GPUs. During sampling, we vary the video classifier-free guidance (VCFG) scale over [1.2, 1.5, 1.8], fix the text classifier-free guidance scale to 10, and use video resolutions of 256 and 384. An illustrative configuration sketch of these settings follows the table. |
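
To make the reported setup easier to scan, here is a minimal Python sketch that collects the training and sampling hyperparameters from the Experiment Setup row into a single configuration object. The `InsV2VConfig` class and its field names are our own illustration, not code from the authors' repository; only the values are taken from the paper.

```python
# Hedged sketch: the hyperparameters reported in the paper's Experiment Setup,
# gathered into a plain config object. The dataclass and its field names are
# hypothetical; only the values come from the table row above.
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple


@dataclass
class InsV2VConfig:
    # Training settings reported in the paper
    optimizer: str = "adam"
    learning_rate: float = 5e-5
    batch_size: int = 512
    iterations: int = 2_000
    hardware_note: str = "4x NVIDIA A10G, ~30 h total"
    # Sampling settings: text CFG is fixed; VCFG and resolution are swept
    text_cfg: float = 10.0
    video_cfg_choices: List[float] = field(default_factory=lambda: [1.2, 1.5, 1.8])
    resolutions: List[int] = field(default_factory=lambda: [256, 384])

    def sampling_grid(self) -> List[Tuple[float, int]]:
        """All (VCFG, resolution) combinations evaluated at sampling time."""
        return list(product(self.video_cfg_choices, self.resolutions))


if __name__ == "__main__":
    cfg = InsV2VConfig()
    # Enumerates the six sampling configurations implied by the paper's sweep.
    for vcfg, res in cfg.sampling_grid():
        print(f"sample: video_cfg={vcfg}, text_cfg={cfg.text_cfg}, resolution={res}")
```

Running the script prints the six (VCFG, resolution) combinations implied by the reported sweep; it is meant only as a compact restatement of the paper's settings, not as the authors' training or sampling code.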