Video-to-Video Synthesis

Authors: Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis. Finally, we apply our method to future video prediction, outperforming several competing systems.
Researcher Affiliation | Collaboration | Ting-Chun Wang (NVIDIA), Ming-Yu Liu (NVIDIA), Jun-Yan Zhu (MIT CSAIL), Guilin Liu (NVIDIA), Andrew Tao (NVIDIA), Jan Kautz (NVIDIA), Bryan Catanzaro (NVIDIA); {tingchunw,mingyul,guilinl,atao,jkautz,bcatanzaro}@nvidia.com, junyanz@mit.edu
Pseudocode | No | The paper describes the methodology and model architecture in detail, but it does not include a structured pseudocode block or a section explicitly labeled "Algorithm". (A hedged sketch of the paper's frame-synthesis step appears after this table.)
Open Source Code | Yes | Code, models, and more results are available at our website.
Open Datasets | Yes | Cityscapes [13] consists of 2048 × 1024 street-scene videos captured in several German cities. Apolloscape [29] consists of 73 street-scene videos captured in Beijing, with lengths varying from 100 to 1000 frames. Face video dataset [57]: the real videos in the FaceForensics dataset, which contains 854 news-briefing videos from different reporters. Dance video dataset: YouTube dance videos downloaded for the pose-to-human-motion synthesis task.
Dataset Splits | Yes | In summary, the training set contains 2975 videos, each with 30 frames, and the validation set consists of 500 videos, each with 30 frames (Cityscapes). We split the dataset into 704 videos for training and 150 videos for validation (face videos).
Hardware Specification | Yes | We train our model for 40 epochs using the ADAM optimizer [39] with lr = 0.0002 and (β1, β2) = (0.5, 0.999) on an NVIDIA DGX-1 machine. We use all the GPUs in the DGX-1 (8 V100 GPUs, each with 16 GB memory) for training.
Software Dependencies | No | The paper mentions several software components and algorithms (e.g., DeepLabV3, FlowNet2, Mask R-CNN, the ADAM optimizer, LSGAN) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We train our model for 40 epochs using the ADAM optimizer [39] with lr = 0.0002 and (β1, β2) = (0.5, 0.999) on an NVIDIA DGX-1 machine. Our coarse-to-fine generator consists of three scales: 512 × 256, 1024 × 512, and 2048 × 1024. (A minimal sketch of this training configuration follows the generator sketch below.)
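
Since the paper itself contains no pseudocode, the following is a minimal Python/PyTorch sketch of the sequential frame-synthesis step it describes: each output frame is composed from a flow-warped copy of the previous output and a hallucinated image, blended by a soft occlusion mask. All interfaces here (`flow_net`, `generator`, `synthesize_next_frame`, and the shapes of their outputs) are hypothetical placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def warp(frame, flow):
    """Backward-warp `frame` (N, C, H, W) with optical flow (N, 2, H, W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device),
        torch.arange(w, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()  # pixel grid, x before y
    coords = base.unsqueeze(0) + flow            # where to sample each pixel
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)         # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)


def synthesize_next_frame(flow_net, generator, src_frames, prev_frames):
    """One step of the sequential generator:
    x_t = (1 - m) * warp(x_{t-1}, w) + m * h,
    with predicted flow w, soft occlusion mask m in [0, 1], and
    hallucinated image h for newly revealed regions.
    """
    flow, mask = flow_net(src_frames, prev_frames)     # hypothetical module
    warped = warp(prev_frames[-1], flow)               # reuse previous pixels
    hallucinated = generator(src_frames, prev_frames)  # hypothetical module
    return (1.0 - mask) * warped + mask * hallucinated
```

Blending warped and hallucinated pixels means the model only synthesizes the occluded regions from scratch, while the rest of each frame is carried over coherently from the previous output.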
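
The training configuration quoted in the Hardware Specification and Experiment Setup rows is specific enough to sketch. Below is a minimal, hedged version of that setup in PyTorch; the `nn.Conv2d` stub stands in for the actual multi-scale generator, and the loop body is elided.

```python
import torch
import torch.nn as nn

# Stub standing in for the coarse-to-fine vid2vid generator (hypothetical).
generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# ADAM settings quoted in the paper: lr = 0.0002, (beta1, beta2) = (0.5, 0.999).
optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# The three generator scales reported in the paper (width x height).
scales = [(512, 256), (1024, 512), (2048, 1024)]

num_epochs = 40  # training length reported in the paper
for epoch in range(num_epochs):
    pass  # forward pass, GAN/flow losses, and optimizer.step() would go here
```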