SinFusion: Training Diffusion Models on a Single Image or Video

Authors: Yaniv Nikankin, Niv Haim, Michal Irani

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This section presents quantitative evaluations to support our main claim for the motion generalization capabilities of SinFusion. We measure the performance of our framework by training a model on a small portion of the original video, and testing it on unseen frames from a different portion of the same video (Sec. 7.1). In Table 1 we report the results of these metrics, as well as the SVFID (Gur et al., 2020) score, on 2 diverse video generation datasets (see details in Appendix B.1)."
Researcher Affiliation | Academia | "Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel. Correspondence to: Yaniv Nikankin <yaniv.nikankin@weizmann.ac.il>."
Pseudocode | Yes | "Algorithm 1: Training on a single image x"
Open Source Code | No | "Project Page: https://yanivnik.github.io/sinfusion"
Open Datasets | Yes | "We compare to two datasets of videos. One provided by SinGAN-GIF (Arora & Lee, 2021) and the other by HP-VAE-GAN (Gur et al., 2020). We perform the comparison on the Places50 benchmark dataset. We also use several videos from the MEAD Faces Dataset (Wang et al., 2020), and the Time-lapse Clouds Dataset (Jacobs et al., 2010; 2013)."
Dataset Splits | No | "Given a video with N frames, we train a model on n < N frames. At inference, we sample 100 frames from the rest of the N − n frames (not seen during training), and for each of them, use the trained model to predict its next (or a more distant) frame." (A sketch of this split protocol follows the table.)
Hardware Specification | Yes | "On a Tesla V100-PCIE-16GB, for images/videos of resolution 144×256, our model trains for about 1.5 minutes per 1000 iterations, where each iteration is running one diffusion step on a large image crop."
Software Dependencies | No | "Our code is implemented with PyTorch (Paszke et al., 2017)."
Experiment Setup | Yes | "We use a batch size of 1. Each large crop contains many large patches. Since our network is a fully convolutional network, each large patch is a single training example. We use the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 2×10⁻⁴, reduced to 2×10⁻⁵ after 100K iterations. We set the diffusion timesteps T = 50." (A sketch of this training step follows the table.)
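
The train/test protocol quoted in the Dataset Splits row can be made concrete with a short helper: train on the first n of N frames, then sample 100 unseen frames from the remainder for next-frame prediction. This is a hypothetical sketch, not the authors' code; the function name `split_and_sample` and the tensor layout (frames stacked along the first dimension) are assumptions.

```python
import torch

def split_and_sample(frames: torch.Tensor, n: int, k: int = 100):
    """Train on the first n of N frames; sample k unseen frames from the
    remaining N - n for next-frame prediction (hypothetical helper)."""
    N = frames.shape[0]
    assert 0 < n < N - 1, "need unseen frames left over for evaluation"
    train_frames = frames[:n]
    # Exclude the final index so every sampled frame has a ground-truth successor.
    candidates = torch.arange(n, N - 1)
    idx = candidates[torch.randperm(len(candidates))[:k]]
    test_pairs = [(frames[i], frames[i + 1]) for i in idx.tolist()]
    return train_frames, test_pairs
```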
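
The Pseudocode and Experiment Setup rows together outline the training loop: batch size 1, one diffusion step per iteration on a large random crop, Adam at 2×10⁻⁴ reduced to 2×10⁻⁵ after 100K iterations, and T = 50 timesteps. Below is a minimal sketch of that per-iteration step under assumptions not stated in the quotes: a linear noise schedule, a 128-pixel crop size, and a placeholder `denoiser` module standing in for the paper's fully convolutional backbone.

```python
import torch
from torch import nn, optim

T = 50                                   # diffusion timesteps (quoted above)
betas = torch.linspace(1e-4, 2e-2, T)    # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def random_crop(image: torch.Tensor, size: int) -> torch.Tensor:
    """Sample one large crop; every patch inside it acts as a training example."""
    _, h, w = image.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return image[:, top:top + size, left:left + size].unsqueeze(0)

def train(image: torch.Tensor, denoiser: nn.Module, iters: int = 200_000):
    opt = optim.Adam(denoiser.parameters(), lr=2e-4)
    # Drop the learning rate 2e-4 -> 2e-5 after 100K iterations, as quoted.
    sched = optim.lr_scheduler.MultiStepLR(opt, milestones=[100_000], gamma=0.1)
    for _ in range(iters):
        x0 = random_crop(image, size=128)            # batch of one large crop
        t = torch.randint(0, T, (1,))
        noise = torch.randn_like(x0)
        a = alphas_bar[t].view(-1, 1, 1, 1)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward diffusion q(x_t | x_0)
        loss = nn.functional.mse_loss(denoiser(xt, t), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```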