Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

Authors: Hyeonho Jeong, Jong Chul Ye

ICLR 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
Researcher Affiliation | Academia | Hyeonho Jeong & Jong Chul Ye, Kim Jaechul Graduate School of AI, KAIST, {hyeonho.jeong,jong.ye}@kaist.ac.kr
Pseudocode | Yes | Algorithm 1: Optical Flow-guided Inverted Latents Smoothing (a sketch of this step follows the table).
Open Source Code | Yes | Further results and code are available at http://ground-a-video.github.io.
Open Datasets | Yes | We use a subset of 20 videos from the DAVIS dataset (Pont-Tuset et al., 2017).
Dataset Splits | No | The paper mentions using a subset of the DAVIS dataset but does not provide specific train/validation/test split percentages or sample counts for reproducibility.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU or CPU models, or cloud computing instances) used for running its experiments.
Software Dependencies | No | The paper mentions several software components and models (e.g., Stable Diffusion v1.4, ControlNet Depth, GLIGEN, RAFT-Large, ZoeDepth, BLIP-2, GLIP, DDIM scheduler) but does not provide specific version numbers for the underlying software stack (e.g., Python, PyTorch/TensorFlow, CUDA).
Experiment Setup | Yes | Generated videos are configured to consist of 8 frames, unless explicitly specified, with a uniform resolution of 512x512. ... In the flow-driven inverted latents smoothing stage, the magnitude threshold Mthres is set to 0.2. At inference, DDIM scheduler (Song et al., 2020a) with 50 steps and classifier-free guidance (Ho & Salimans, 2022) of 12.5 scale is used. (A hedged configuration sketch follows the table.)
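
The Pseudocode row refers to the paper's Algorithm 1, which smooths DDIM-inverted latents with optical flow so that near-static regions stay consistent across frames. Below is a minimal sketch of that idea, not the authors' released implementation: the function name, the bilinear downsampling of the flow magnitude to latent resolution, and the previous-frame propagation rule are assumptions, and the reported threshold Mthres = 0.2 is applied here without the normalization details the quoted text does not spell out.

```python
# Sketch of flow-guided inverted-latents smoothing (inspired by Algorithm 1).
# All names and the exact propagation rule are assumptions, not released code.
import torch
import torch.nn.functional as F


def smooth_inverted_latents(latents: torch.Tensor,
                            flows: torch.Tensor,
                            m_thres: float = 0.2) -> torch.Tensor:
    """Propagate previous-frame latent values into low-motion regions.

    latents: (num_frames, C, h, w) DDIM-inverted latents, one per video frame.
    flows:   (num_frames - 1, 2, H, W) optical flow from frame i-1 to frame i
             (e.g. estimated with RAFT-Large).
    m_thres: flow-magnitude threshold below which a pixel is treated as static
             (the paper sets Mthres = 0.2; any magnitude normalization is omitted here).
    """
    num_frames, _, h, w = latents.shape
    smoothed = latents.clone()
    for i in range(1, num_frames):
        # Per-pixel flow magnitude, resized to the latent resolution.
        mag = torch.linalg.norm(flows[i - 1], dim=0, keepdim=True)        # (1, H, W)
        mag = F.interpolate(mag[None], size=(h, w), mode="bilinear",
                            align_corners=False)[0]                       # (1, h, w)
        static = (mag < m_thres).expand_as(smoothed[i])                   # (C, h, w)
        # Static regions inherit the (already smoothed) previous frame's latent,
        # suppressing flicker; moving regions keep their own inverted latent.
        smoothed[i] = torch.where(static, smoothed[i - 1], smoothed[i])
    return smoothed
```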
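
The Experiment Setup row reports the inference hyperparameters but not the full pipeline wiring. The sketch below only pins down those reported values (Stable Diffusion v1.4 base, DDIM with 50 steps, classifier-free guidance scale 12.5, 512x512 per-frame resolution) on a plain Hugging Face diffusers text-to-image pipeline; Ground-A-Video itself additionally integrates ControlNet, GLIGEN's grounded attention, and per-frame inversion, none of which are reproduced here, and the prompt is illustrative rather than taken from the paper.

```python
# Reported inference hyperparameters applied to a vanilla SD v1.4 pipeline.
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

frame = pipe(
    "a rabbit eating a watermelon on the grass",  # illustrative prompt only
    num_inference_steps=50,   # DDIM steps reported in the paper
    guidance_scale=12.5,      # classifier-free guidance scale
    height=512, width=512,    # per-frame resolution (videos use 8 such frames)
).images[0]
```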