Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

Authors: Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, Steve Seitz

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques. We compare our work qualitatively and quantitatively to related methods on two curated difficult datasets targeted for generative inbetweening: Davis (Pont-Tuset et al., 2017) and Pexels, and our method produces notably higher quality videos with more coherent dynamics given distant keyframes. Quantitative evaluation: For each dataset, we evaluate the generated in-between videos using FID (Heusel et al., 2017) and FVD (Ge et al., 2024), widely used metrics for evaluating generative models. These two metrics measure the distance between the distributions of generated frames/videos and actual ones. The results are shown in Tab. 1, and our method outperforms all of the baselines by a significant margin.
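For context on the FID metric quoted above: FID is the Fréchet (2-Wasserstein) distance between Gaussians fitted to feature embeddings of real and generated frames. A minimal sketch of the underlying formula, reduced to one dimension for clarity (the function name and the 1-D reduction are illustrative, not from the paper):

```python
import math

def frechet_distance_1d(mu1, var1, mu2, var2):
    """Frechet (2-Wasserstein) distance between two 1-D Gaussians.

    FID applies the multivariate form of this formula to Gaussians
    fitted to Inception features of real vs. generated frames.
    """
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

d_same = frechet_distance_1d(0.0, 1.0, 0.0, 1.0)  # identical distributions -> 0.0
d_shift = frechet_distance_1d(0.0, 1.0, 2.0, 1.0)  # mean shift of 2 -> 4.0
```

In practice FID is computed over multivariate Gaussians fitted to Inception features, and FVD over video-network features; the 1-D case above only demonstrates the distance itself.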
Researcher Affiliation Collaboration 1University of Washington, 2Google DeepMind, 3UC Berkeley
Pseudocode Yes ALGORITHM 1: Light-weight backward motion fine-tuning. Input: fθ, p_data(x), E(·). while not converged do ... ALGORITHM 2: Dual-directional diffusion sampling. Input: I_0, I_{N-1}, fθ, fθ′, D(·). Compute conditions c_0, c_{N-1} from I_0, I_{N-1}; set z_T ∼ N(0, I); for t = T to 1 do ...
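The extracted pseudocode for Algorithm 2 can be sketched structurally in Python. This is a hedged sketch, not the authors' implementation: the latent is reduced to a list of per-frame scalars, the denoisers are toy stand-ins, and fusing the forward and time-reversed backward predictions by simple averaging is an assumption made for illustration:

```python
import random

def dual_directional_sample(z_T, c0, cN1, f_fwd, f_bwd, T=50):
    # f_fwd: base denoiser f_theta conditioned on the first keyframe I_0.
    # f_bwd: backward-motion fine-tuned denoiser f_theta' conditioned on I_{N-1}.
    z = z_T
    for t in range(T, 0, -1):
        eps_fwd = f_fwd(z, c0, t)
        # Run the backward branch on the temporally reversed latent,
        # then reverse its prediction back into forward frame order.
        eps_bwd = list(reversed(f_bwd(list(reversed(z)), cN1, t)))
        # Fuse the two directions (simple averaging; an illustrative choice).
        z = [zi - (ef + eb) / (2.0 * T) for zi, ef, eb in zip(z, eps_fwd, eps_bwd)]
    return z

# Toy stand-in denoisers: predict "noise" as the gap between latent and condition.
f_fwd = lambda z, c, t: [zi - c for zi in z]
f_bwd = lambda z, c, t: [zi - c for zi in z]

z_T = [random.gauss(0.0, 1.0) for _ in range(14)]  # a toy latent of 14 frames
sampled = dual_directional_sample(z_T, 0.0, 1.0, f_fwd, f_bwd, T=50)
```

The key structural point is the temporal reversal around the backward branch: the fine-tuned denoiser sees the frame sequence end-to-start, so its prediction must be flipped back before fusion with the forward branch.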
Open Source Code No The paper mentions using 'the publicly available model weights https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt' for Stable Video Diffusion, which is a base model. However, it does not provide any specific links or statements about open-sourcing the implementation code of their *own* method.
Open Datasets Yes We use two high-resolution (1080p) datasets for evaluations: (1) the Davis dataset (Pont-Tuset et al., 2017), where we create a total of 117 input pairs from all of the videos; this dataset mostly features articulated subject motions, such as animal or human motions. (2) The Pexels dataset, where we collect a total of 106 input keyframe pairs from a compiled collection of high-resolution videos on Pexels, featuring directional dynamic scene motions such as vehicles moving, animals or people running, surfing, wave movements, and time-lapse videos. (Footnote 3: https://www.pexels.com/)
Dataset Splits No The paper mentions that 'All input pairs are at least 25 frames apart and have the corresponding ground truth video clips.' and 'we create a total of 117 input pairs from all of the videos' for Davis and 'a total of 106 input keyframe pairs' for Pexels. However, it does not specify how these datasets were split into training, validation, or test sets for the experiments (e.g., percentages or exact counts for each split).
Hardware Specification Yes The training takes around 15K iterations with batch size of 4. We trained on 4 A100 GPUs.
Software Dependencies No The paper mentions using the 'Adam optimizer' and 'PyTorch pseudocode', but does not provide specific version numbers for Python, PyTorch, CUDA, or any other libraries or frameworks used in the implementation.
Experiment Setup Yes We use the Adam optimizer with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, and weight decay of 1e-2. The training takes around 15K iterations with batch size of 4. We trained on 4 A100 GPUs. For sampling, we apply 50 sampling steps. For other parameters in SVD, we use the default values: motion_bucket_id = 127, noise_aug_strength = 0.02.
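The reported optimizer settings (Adam with a separate weight-decay term) match a decoupled-weight-decay (AdamW-style) update, though the paper says only 'Adam'. A minimal pure-Python sketch of one such update step with the stated hyperparameters; the scalar-parameter reduction and the function name are illustrative, not from the paper:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One decoupled-weight-decay Adam update for a single scalar parameter,
    using the hyperparameters reported in the paper."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Demo: minimise f(theta) = (theta - 0.5)^2 for a few thousand steps.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (theta - 0.5)
    theta, m, v = adamw_step(theta, grad, m, v, t)
```

In a PyTorch implementation this would simply be `torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-2)`; the loop above only shows the update rule those hyperparameters plug into.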