SinFusion: Training Diffusion Models on a Single Image or Video

Authors: Yaniv Nikankin, Niv Haim, Michal Irani

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This section presents quantitative evaluations to support our main claim for the motion generalization capabilities of SinFusion. We measure the performance of our framework by training a model on a small portion of the original video, and testing it on unseen frames from a different portion of the same video (Sec. 7.1). In Table 1 we report the results of these metrics, as well as the SVFID (Gur et al., 2020) score, on 2 diverse video generation datasets (see details in Appendix B.1)."
Researcher Affiliation | Academia | "Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel. Correspondence to: Yaniv Nikankin <yaniv.nikankin@weizmann.ac.il>."
Pseudocode | Yes | "Algorithm 1: Training on a single image x"
Open Source Code | No | "Project Page: https://yanivnik.github.io/sinfusion"
Open Datasets | Yes | "We compare to two datasets of videos. One provided by SinGAN-GIF (Arora & Lee, 2021) and the other by HP-VAE-GAN (Gur et al., 2020). We perform the comparison on the Places50 benchmark dataset. We also use several videos from the MEAD Faces Dataset (Wang et al., 2020), and the Time-lapse Clouds Dataset (Jacobs et al., 2010; 2013)."
Dataset Splits | No | "Given a video with N frames, we train a model on n < N frames. At inference, we sample 100 frames from the rest of the N − n frames (not seen during training), and for each of them, use the trained model to predict its next (or a more distant) frame." (A sketch of this split protocol follows the table.)
Hardware Specification | Yes | "On a Tesla V100-PCIE-16GB, for images/videos of resolution 144×256, our model trains for about 1.5 minutes per 1000 iterations, where each iteration is running one diffusion step on a large image crop."
Software Dependencies | No | "Our code is implemented with PyTorch (Paszke et al., 2017)."
Experiment Setup | Yes | "We use a batch size of 1. Each large crop contains many large patches. Since our network is a fully convolutional network, each large patch is a single training example. We use the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 2×10⁻⁴, reduced to 2×10⁻⁵ after 100K iterations. We set the diffusion timesteps T = 50." (A sketch of this training step follows the table.)
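
The train/test protocol quoted in the Dataset Splits row can be made concrete with a short helper: train on the first n of N frames, then sample 100 unseen frames from the remainder for next-frame prediction. This is a hypothetical sketch, not the authors' code; the function name `split_and_sample` and the tensor layout (frames stacked along the first dimension) are assumptions.

```python
import torch

def split_and_sample(frames: torch.Tensor, n: int, k: int = 100):
    """Train on the first n of N frames; sample k unseen frames from the
    remaining N - n for next-frame prediction (hypothetical helper)."""
    N = frames.shape[0]
    assert 0 < n < N - 1, "need unseen frames left over for evaluation"
    train_frames = frames[:n]
    # Exclude the final index so every sampled frame has a ground-truth successor.
    candidates = torch.arange(n, N - 1)
    idx = candidates[torch.randperm(len(candidates))[:k]]
    test_pairs = [(frames[i], frames[i + 1]) for i in idx.tolist()]
    return train_frames, test_pairs
```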
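
The Pseudocode and Experiment Setup rows together outline the training loop: batch size 1, one diffusion step per iteration on a large random crop, Adam at 2×10⁻⁴ reduced to 2×10⁻⁵ after 100K iterations, and T = 50 timesteps. Below is a minimal sketch of that per-iteration step under assumptions not stated in the quotes: a linear noise schedule, a 128-pixel crop size, and a placeholder `denoiser` module standing in for the paper's fully convolutional backbone.

```python
import torch
from torch import nn, optim

T = 50                                   # diffusion timesteps (quoted above)
betas = torch.linspace(1e-4, 2e-2, T)    # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def random_crop(image: torch.Tensor, size: int) -> torch.Tensor:
    """Sample one large crop; every patch inside it acts as a training example."""
    _, h, w = image.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return image[:, top:top + size, left:left + size].unsqueeze(0)

def train(image: torch.Tensor, denoiser: nn.Module, iters: int = 200_000):
    opt = optim.Adam(denoiser.parameters(), lr=2e-4)
    # Drop the learning rate 2e-4 -> 2e-5 after 100K iterations, as quoted.
    sched = optim.lr_scheduler.MultiStepLR(opt, milestones=[100_000], gamma=0.1)
    for _ in range(iters):
        x0 = random_crop(image, size=128)            # batch of one large crop
        t = torch.randint(0, T, (1,))
        noise = torch.randn_like(x0)
        a = alphas_bar[t].view(-1, 1, 1, 1)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward diffusion q(x_t | x_0)
        loss = nn.functional.mse_loss(denoiser(xt, t), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```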