Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
Authors: Hyeonho Jeong, Jong Chul Ye
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency. |
| Researcher Affiliation | Academia | Hyeonho Jeong & Jong Chul Ye, Kim Jaechul Graduate School of AI, KAIST, {hyeonho.jeong,jong.ye}@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1: Optical Flow guided Inverted Latents Smoothing (a hedged sketch of this step follows the table) |
| Open Source Code | Yes | Further results and code are available at http://ground-a-video.github.io. |
| Open Datasets | Yes | We use a subset of 20 videos from DAVIS dataset (Pont-Tuset et al., 2017). |
| Dataset Splits | No | The paper mentions using a subset of the DAVIS dataset but does not provide specific train/validation/test split percentages or sample counts for reproducibility. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU, CPU models, or cloud computing instances) used for running its experiments. |
| Software Dependencies | No | The paper mentions several software components and models (e.g., Stable Diffusion v1.4, ControlNet Depth, GLIGEN, RAFT-Large, ZoeDepth, BLIP-2, GLIP, DDIM scheduler) but does not provide specific version numbers for the underlying software stack (e.g., Python, PyTorch/TensorFlow, CUDA). |
| Experiment Setup | Yes | Generated videos are configured to consist of 8 frames, unless explicitly specified, with a uniform resolution of 512x512. ... In the flow-driven inverted latents smoothing stage, the magnitude threshold M_thres is set to 0.2. At inference, the DDIM scheduler (Song et al., 2020a) with 50 steps and classifier-free guidance (Ho & Salimans, 2022) at a scale of 12.5 is used. (A sketch of this inference configuration follows the table.) |
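The Pseudocode row references Algorithm 1 (Optical Flow guided Inverted Latents Smoothing), but only its title is excerpted here. Below is a minimal sketch of what such a smoothing step could look like, assuming the algorithm carries a previous frame's inverted latents forward in regions whose optical-flow magnitude falls below the threshold M_thres. The function name `smooth_inverted_latents`, the tensor shapes, and the assumption that the flow magnitude is already in the same units as M_thres are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def smooth_inverted_latents(latents: torch.Tensor,
                            flows: torch.Tensor,
                            m_thres: float = 0.2) -> torch.Tensor:
    """Hypothetical sketch of flow-guided inverted-latents smoothing.

    latents: (num_frames, C, h, w) DDIM-inverted latents, one per frame.
    flows:   (num_frames - 1, 2, H, W) optical flow between consecutive
             frames (e.g. from RAFT-Large), at pixel resolution.
    m_thres: magnitude threshold below which a region is treated as static
             (the paper reports 0.2; its normalization is not specified here).
    """
    num_frames, _, h, w = latents.shape
    smoothed = latents.clone()
    # Per-pixel flow magnitude, downsampled to the latent resolution.
    mag = flows.norm(dim=1, keepdim=True)  # (num_frames - 1, 1, H, W)
    mag = F.interpolate(mag, size=(h, w), mode="bilinear", align_corners=False)
    for t in range(1, num_frames):
        static = mag[t - 1] < m_thres  # (1, h, w) boolean mask
        # In static regions, carry the previous frame's latent forward so
        # that inversion noise does not flicker across frames.
        smoothed[t] = torch.where(static, smoothed[t - 1], smoothed[t])
    return smoothed
```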
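The Experiment Setup row reports the inference configuration: a Stable Diffusion v1.4 backbone, 8 frames at 512x512, DDIM with 50 steps, and classifier-free guidance at scale 12.5. Below is a minimal sketch of that configuration using the Hugging Face diffusers library; the prompt is a placeholder, and a vanilla StableDiffusionPipeline stands in for Ground-A-Video's actual pipeline, which additionally applies GLIGEN grounding and ControlNet depth conditioning and couples attention across frames.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load the Stable Diffusion v1.4 backbone reported by the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")
# Swap in the DDIM scheduler (Song et al., 2020a).
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

frames = pipe(
    "a dog running on the beach",  # placeholder edit prompt (ours)
    height=512, width=512,         # uniform 512x512 resolution
    num_inference_steps=50,        # DDIM with 50 steps
    guidance_scale=12.5,           # classifier-free guidance scale
    num_images_per_prompt=8,       # 8 frames; sampled independently here,
                                   # unlike the paper's cross-frame attention
).images
```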