Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
Authors: Daniel Geng, Andrew Owens
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method's ability to manipulate image structure, both qualitatively and quantitatively, on real and generated images. Additional results can be found in Appendix A3. |
| Researcher Affiliation | Academia | Daniel Geng, Andrew Owens (University of Michigan) |
| Pseudocode | No | The paper describes its method in text and through equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://dangeng.github.io/motion_guidance |
| Open Datasets | Yes | We evaluate on two different datasets. The first dataset is composed of examples with handcrafted target flows, a subset of which can be seen in Figures 1, 2, 3, 4, and 7. This dataset has the advantage of containing interesting motions that are of practical interest. In addition, we can write highly specific instructions for the Instruct Pix2Pix baseline for a fair comparison. However, this dataset is curated to an extent. We ameliorate this by performing an additional evaluation on an automatically generated dataset based on KITTI (Geiger et al., 2012), which contains egocentric driving videos with labeled bounding boxes on cars. |
| Dataset Splits | No | The paper describes the datasets used (KITTI, curated dataset) but does not provide specific train/validation/test split percentages or sample counts. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A40 GPU. |
| Software Dependencies | Yes | We use RAFT (Teed & Deng, 2020) as our flow model. ... For our experiments we use Stable Diffusion (Rombach et al., 2021). Rather than performing diffusion directly on pixels, Stable Diffusion performs diffusion in a latent space, with an encoder and decoder to convert between pixel and latent space. To accommodate this, we precompose the decoder with the motion guidance function, L(D(·)), so that the guidance function can accept latent codes. Additionally, we downsample our edit mask to 64×64, the spatial size of the Stable Diffusion latent space. ... We use Stable Diffusion v1.4 with a DDIM sampler for 500 steps, and we generate images at a resolution of 512×512. (A schematic sketch of this decoder composition appears after the table.) |
| Experiment Setup | Yes | We use Stable Diffusion v1.4 with a DDIM sampler for 500 steps, and we generate images at a resolution of 512×512. All experiments are conducted on a single NVIDIA A40 GPU. For our motion guidance function (Eq. 4) we found that setting λ_color to 100 and λ_flow to 3 worked well. In addition, in our implementation we scale the guidance gradients by a global weight of 300. We set the gradient clipping threshold c_g to be 200 and take K = 10 recursive denoising steps. (The sampling-loop sketch after the table illustrates these settings.) |
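
The guidance function quoted in the Software Dependencies and Experiment Setup rows can be pictured as a loss on decoded latents that combines a flow term and a color term, weighted by λ_flow and λ_color. The sketch below is a minimal illustration under assumptions, not the paper's implementation: `decoder` stands for any differentiable latent-to-pixel decoder (e.g. the Stable Diffusion VAE decoder), `flow_model` for a differentiable flow estimator such as RAFT, and `warp` is a generic backward-warping helper; the exact form of Eq. 4, including how the edit mask enters, follows the paper.

```python
import torch
import torch.nn.functional as F

# Weights and clipping quoted in the Experiment Setup row.
LAMBDA_COLOR = 100.0   # λ_color
LAMBDA_FLOW = 3.0      # λ_flow
GLOBAL_WEIGHT = 300.0  # global scale on the guidance gradient
GRAD_CLIP = 200.0      # c_g, gradient clipping threshold


def warp(image, flow):
    """Backward-warp `image` by `flow` (in pixels) with bilinear sampling."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device),
        torch.arange(w, device=image.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()        # (2, H, W)
    coords = grid.unsqueeze(0) + flow                   # (B, 2, H, W)
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0             # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(image, torch.stack((gx, gy), dim=-1), align_corners=True)


def motion_guidance_grad(latent, src_image, target_flow, edit_mask, decoder, flow_model):
    """Gradient of a flow + color guidance loss with respect to a latent code.

    The decoder is precomposed with the loss so guidance can act directly on
    latent codes, as described in the Software Dependencies row.
    """
    latent = latent.detach().requires_grad_(True)
    edited = decoder(latent)                             # latent -> pixels

    # Flow term: estimated motion from source to edited image should match the target flow.
    est_flow = flow_model(src_image, edited)
    flow_loss = (est_flow - target_flow).abs().mean()

    # Color term: moved pixels should keep their source appearance
    # (mask usage shown schematically; the paper's Eq. 4 gives the exact form).
    color_loss = ((edited - warp(src_image, target_flow)).abs() * edit_mask).mean()

    loss = LAMBDA_FLOW * flow_loss + LAMBDA_COLOR * color_loss
    (grad,) = torch.autograd.grad(loss, latent)
    # Scale and clip the guidance gradient (element-wise clipping shown here;
    # the paper's exact clipping scheme may differ).
    return (GLOBAL_WEIGHT * grad).clamp(-GRAD_CLIP, GRAD_CLIP)
```

Since Stable Diffusion's latent space is 64×64 for 512×512 images, a copy of the edit mask applied in latent space would also be downsampled to that resolution, e.g. with `F.interpolate(edit_mask, size=(64, 64), mode="nearest")`, matching the quote above.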
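
The Experiment Setup row also mentions 500 DDIM steps and K = 10 recursive denoising steps. The control flow below is a schematic sketch only: `unet`, `ddim_step`, and `renoise` are hypothetical placeholders for the Stable Diffusion v1.4 noise predictor, one deterministic DDIM update, and one forward-diffusion re-noising step, and subtracting the guidance gradient from the DDIM output is just one common convention; where exactly the gradient enters the update follows the paper.

```python
NUM_STEPS = 500  # DDIM sampling steps quoted above
K = 10           # recursive denoising steps per timestep


def guided_ddim_sample(z_T, unet, ddim_step, renoise, guidance_grad):
    """Schematic guided DDIM loop with recursive denoising.

    All callables are illustrative placeholders:
      unet(z, t)           -> predicted noise for latent z at timestep t
      ddim_step(z, eps, t) -> one deterministic DDIM update, z_t -> z_{t-1}
      renoise(z, t)        -> one forward-diffusion step, z_{t-1} -> z_t
      guidance_grad(z, t)  -> scaled, clipped motion-guidance gradient
    """
    z = z_T
    for t in reversed(range(NUM_STEPS)):
        for _ in range(K):
            eps = unet(z, t)
            # Guided update: nudge the denoising step against the guidance gradient.
            z_prev = ddim_step(z, eps, t) - guidance_grad(z, t)
            # Recursive denoising: re-noise back to level t and denoise again,
            # so the guidance can act repeatedly at every noise level.
            z = renoise(z_prev, t)
        z = z_prev
    return z
```

At K = 10 recursions over 500 steps, each edit involves on the order of 5,000 guided denoising passes, which is consistent with the single-GPU (NVIDIA A40) setup quoted in the table.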