SinFusion: Training Diffusion Models on a Single Image or Video
Authors: Yaniv Nikankin, Niv Haim, Michal Irani
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents quantitative evaluations to support our main claim for the motion generalization capabilities of SinFusion. We measure the performance of our framework by training a model on a small portion of the original video and testing it on unseen frames from a different portion of the same video (Sec. 7.1). In Table 1 we report the results of these metrics, as well as the SVFID (Gur et al., 2020) score, on 2 diverse video generation datasets (see details in Appendix B.1). |
| Researcher Affiliation | Academia | Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel. Correspondence to: Yaniv Nikankin <yaniv.nikankin@weizmann.ac.il>. |
| Pseudocode | Yes | Algorithm 1 Training on a single image x (a training-loop sketch in this spirit follows the table) |
| Open Source Code | No | Project Page: https://yanivnik.github.io/sinfusion |
| Open Datasets | Yes | We compare on two datasets of videos: one provided by SinGAN-GIF (Arora & Lee, 2021), the other by HP-VAE-GAN (Gur et al., 2020). We perform the comparison on the Places50 benchmark dataset. We also use several videos from the MEAD Faces Dataset (Wang et al., 2020) and the Timelapse Clouds Dataset (Jacobs et al., 2010; 2013). |
| Dataset Splits | No | Given a video with N frames, we train a model on n < N frames. At inference, we sample 100 frames from the rest of the N − n frames (not seen during training), and for each of them, use the trained model to predict its next (or a more distant) frame. (See the split sketch after the table.) |
| Hardware Specification | Yes | On a Tesla V100-PCIE-16GB, for images/videos of resolution 144×256, our model trains for about 1.5 minutes per 1000 iterations, where each iteration runs one diffusion step on a large image crop. |
| Software Dependencies | No | Our code is implemented with PyTorch (Paszke et al., 2017). |
| Experiment Setup | Yes | We use a batch size of 1. Each large crop contains many large patches. Since our network is fully convolutional, each large patch is a single training example. We use the ADAM optimizer (Kingma & Ba, 2014) with a learning rate of 2×10⁻⁴, reduced to 2×10⁻⁵ after 100K iterations. We set the diffusion timesteps T = 50. (These values are reflected in the training-loop sketch below.) |
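
For concreteness, here is a minimal PyTorch sketch of the single-image diffusion training loop that Algorithm 1 and the quoted experiment setup describe. Only batch size 1, T = 50, and the ADAM schedule (2×10⁻⁴ dropped to 2×10⁻⁵ after 100K iterations) come from the paper; the denoiser signature, crop size, linear noise schedule, and noise-prediction target are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

T = 50                                    # diffusion timesteps (from the paper)
betas = torch.linspace(1e-4, 0.02, T)     # ASSUMED linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def sample_large_crop(x, crop=128):
    """Random large crop; the fully convolutional net treats each large
    patch inside the crop as a separate training example."""
    _, _, h, w = x.shape
    i = torch.randint(0, h - crop + 1, (1,)).item()
    j = torch.randint(0, w - crop + 1, (1,)).item()
    return x[:, :, i:i + crop, j:j + crop]

def train(denoiser, x, iters=200_000, device="cuda"):
    """x: the single training image, shape (1, C, H, W)."""
    opt = torch.optim.Adam(denoiser.parameters(), lr=2e-4)
    for it in range(iters):
        if it == 100_000:                 # lr 2e-4 -> 2e-5 after 100K iterations
            for g in opt.param_groups:
                g["lr"] = 2e-5
        x0 = sample_large_crop(x).to(device)           # batch size 1
        t = torch.randint(0, T, (1,), device=device)   # random timestep
        eps = torch.randn_like(x0)
        ab = alphas_bar.to(device)[t].view(-1, 1, 1, 1)
        xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps    # forward noising
        # ASSUMPTION: denoiser(xt, t) predicts the added noise (DDPM-style);
        # the paper's exact prediction target may differ.
        loss = F.mse_loss(denoiser(xt, t), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```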
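
Similarly, a hedged sketch of the train/test frame protocol quoted in the Dataset Splits row: train on n of the N frames, then evaluate frame prediction on 100 frames drawn from the unseen remainder (the sketch predicts only the immediate next frame, though the paper also allows more distant ones). `predict_next_frame` is a placeholder for the trained predictor, and the contiguous-prefix split is an assumption; the paper states only that training and test frames come from different portions of the same video.

```python
import random

def split_and_eval(frames, n, predict_next_frame, num_eval=100):
    """frames: sequence of N frame tensors; the model is trained on n < N
    of them and evaluated on frames it never saw."""
    train_frames = frames[:n]             # ASSUMED contiguous prefix for training
    # unseen frames that still have a ground-truth next frame to compare against
    unseen = list(range(n, len(frames) - 1))
    eval_ids = random.sample(unseen, min(num_eval, len(unseen)))
    errors = []
    for i in eval_ids:
        pred = predict_next_frame(frames[i])           # predict frame i+1
        # mean squared error as an illustrative stand-in; the paper's
        # reported metrics (including SVFID) differ
        errors.append(((pred - frames[i + 1]) ** 2).mean().item())
    return train_frames, sum(errors) / len(errors)
```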