Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

Authors: Hyeonho Jeong, Jong Chul Ye

ICLR 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
Researcher Affiliation | Academia | Hyeonho Jeong & Jong Chul Ye, Kim Jaechul Graduate School of AI, KAIST, {hyeonho.jeong,jong.ye}@kaist.ac.kr
Pseudocode | Yes | Algorithm 1: Optical Flow-guided Inverted Latents Smoothing (a sketch of this step follows the table).
Open Source Code | Yes | Further results and code are available at http://ground-a-video.github.io.
Open Datasets | Yes | We use a subset of 20 videos from the DAVIS dataset (Pont-Tuset et al., 2017).
Dataset Splits | No | The paper mentions using a subset of the DAVIS dataset but does not provide specific train/validation/test split percentages or sample counts for reproducibility.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU or CPU models, or cloud computing instances) used for running its experiments.
Software Dependencies | No | The paper mentions several software components and models (e.g., Stable Diffusion v1.4, ControlNet Depth, GLIGEN, RAFT-Large, ZoeDepth, BLIP-2, GLIP, DDIM scheduler) but does not provide specific version numbers for the underlying software stack (e.g., Python, PyTorch/TensorFlow, CUDA).
Experiment Setup | Yes | Generated videos are configured to consist of 8 frames, unless explicitly specified, with a uniform resolution of 512x512. ... In the flow-driven inverted latents smoothing stage, the magnitude threshold Mthres is set to 0.2. At inference, DDIM scheduler (Song et al., 2020a) with 50 steps and classifier-free guidance (Ho & Salimans, 2022) of 12.5 scale is used. (A hedged configuration sketch follows the table.)
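
The Pseudocode row refers to the paper's Algorithm 1, which smooths DDIM-inverted latents with optical flow so that near-static regions stay consistent across frames. Below is a minimal sketch of that idea, not the authors' released implementation: the function name, the bilinear downsampling of the flow magnitude to latent resolution, and the previous-frame propagation rule are assumptions, and the reported threshold Mthres = 0.2 is applied here without the normalization details the quoted text does not spell out.

```python
# Sketch of flow-guided inverted-latents smoothing (inspired by Algorithm 1).
# All names and the exact propagation rule are assumptions, not released code.
import torch
import torch.nn.functional as F


def smooth_inverted_latents(latents: torch.Tensor,
                            flows: torch.Tensor,
                            m_thres: float = 0.2) -> torch.Tensor:
    """Propagate previous-frame latent values into low-motion regions.

    latents: (num_frames, C, h, w) DDIM-inverted latents, one per video frame.
    flows:   (num_frames - 1, 2, H, W) optical flow from frame i-1 to frame i
             (e.g. estimated with RAFT-Large).
    m_thres: flow-magnitude threshold below which a pixel is treated as static
             (the paper sets Mthres = 0.2; any magnitude normalization is omitted here).
    """
    num_frames, _, h, w = latents.shape
    smoothed = latents.clone()
    for i in range(1, num_frames):
        # Per-pixel flow magnitude, resized to the latent resolution.
        mag = torch.linalg.norm(flows[i - 1], dim=0, keepdim=True)        # (1, H, W)
        mag = F.interpolate(mag[None], size=(h, w), mode="bilinear",
                            align_corners=False)[0]                       # (1, h, w)
        static = (mag < m_thres).expand_as(smoothed[i])                   # (C, h, w)
        # Static regions inherit the (already smoothed) previous frame's latent,
        # suppressing flicker; moving regions keep their own inverted latent.
        smoothed[i] = torch.where(static, smoothed[i - 1], smoothed[i])
    return smoothed
```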
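
The Experiment Setup row reports the inference hyperparameters but not the full pipeline wiring. The sketch below only pins down those reported values (Stable Diffusion v1.4 base, DDIM with 50 steps, classifier-free guidance scale 12.5, 512x512 per-frame resolution) on a plain Hugging Face diffusers text-to-image pipeline; Ground-A-Video itself additionally integrates ControlNet, GLIGEN's grounded attention, and per-frame inversion, none of which are reproduced here, and the prompt is illustrative rather than taken from the paper.

```python
# Reported inference hyperparameters applied to a vanilla SD v1.4 pipeline.
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

frame = pipe(
    "a rabbit eating a watermelon on the grass",  # illustrative prompt only
    num_inference_steps=50,   # DDIM steps reported in the paper
    guidance_scale=12.5,      # classifier-free guidance scale
    height=512, width=512,    # per-frame resolution (videos use 8 such frames)
).images[0]
```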