Consistent Video-to-Video Transfer Using Synthetic Dataset

Authors: Jiaxin Cheng, Tianjun Xiao, Tong He

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical evaluations confirm the effectiveness of LVSC and motion compensation in enhancing video quality and consistency." (Section 5, Experiments)
Researcher Affiliation | Industry | Jiaxin Cheng, Tianjun Xiao & Tong He, Amazon Web Services Shanghai AI Lab ({cjiaxin,tianjux,htong}@amazon.com)
Pseudocode | No | The paper describes its methods procedurally but does not include any formally labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/amazon-science/instruct-video-to-video/tree/main
Open Datasets | Yes | "Our synthetic dataset is constructed using paired prompts from two differentiated sources, each serving a specific purpose in the training process. The first source, LAION-IPTP, employs a finetuned GPT-3 model... The second source, WebVid-MPT, leverages video-specific captions from the WebVid-10M dataset (Bain et al., 2021)."
Dataset Splits | No | The paper does not explicitly provide the training/validation/test splits needed to reproduce its experiments. It describes the TGVE dataset used for evaluation, but not how data are split for training or validation.
Hardware Specification | Yes | "This training process takes approximately 30 hours to complete on four NVIDIA A10G GPUs."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, Stable Diffusion, GPT-3, MPT-30B, DDIM, and CLIP-based filtering, but does not specify their versions or the versions of underlying libraries and frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | "Our training procedure makes use of the Adam optimizer with a learning rate set at 5 × 10^-5. The model is trained with a batch size of 512 over a span of 2,000 iterations. This training process takes approximately 30 hours to complete on four NVIDIA A10G GPUs. During sampling, we experiment with varying hyperparameters for video classifier-free guidance (VCFG) within the choices [1.2, 1.5, 1.8], with text classifier-free guidance set to 10, and video resolutions of 256 and 384." (See the configuration sketch after this table.)
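
The quoted experiment setup is easiest to read as a configuration plus a small sampling sweep. The sketch below is not the authors' code: it restates the reported training settings, then shows one plausible way the VCFG/resolution sweep could be organized, assuming an InstructPix2Pix-style dual classifier-free guidance where the source video and the edit text each get their own scale. The names `dual_cfg`, `sweep`, and `pipeline.sample` are hypothetical and only illustrate the reported hyperparameters.

```python
from itertools import product

# Training settings as reported in the paper: Adam, lr 5e-5, batch size 512,
# 2,000 iterations, roughly 30 hours on four NVIDIA A10G GPUs.
TRAIN_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 5e-5,
    "batch_size": 512,
    "iterations": 2_000,
    "hardware": "4x NVIDIA A10G",
}

# Sampling sweep as reported: VCFG in {1.2, 1.5, 1.8}, text CFG fixed at 10,
# output resolutions of 256 and 384.
VCFG_SCALES = [1.2, 1.5, 1.8]
TEXT_CFG_SCALE = 10.0
RESOLUTIONS = [256, 384]


def dual_cfg(eps_uncond, eps_video, eps_full, s_video, s_text):
    """Combine noise predictions with separate video and text guidance scales.

    This mirrors InstructPix2Pix-style dual classifier-free guidance; whether the
    paper uses exactly this combination is an assumption, not a claim.
      eps_uncond : prediction with neither condition
      eps_video  : prediction conditioned on the source video only
      eps_full   : prediction conditioned on both the source video and the edit text
    """
    return (eps_uncond
            + s_video * (eps_video - eps_uncond)
            + s_text * (eps_full - eps_video))


def sweep(pipeline, video, prompt):
    """Run the reported hyperparameter sweep with a hypothetical pipeline object."""
    results = []
    for s_video, res in product(VCFG_SCALES, RESOLUTIONS):
        edited = pipeline.sample(          # hypothetical API, for illustration only
            video=video,
            prompt=prompt,
            video_guidance_scale=s_video,
            text_guidance_scale=TEXT_CFG_SCALE,
            resolution=res,
        )
        results.append(((s_video, res), edited))
    return results
```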