Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing
Authors: Ziqi Jiang, Zhen Wang, Long Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that CLIPDrag outperforms existing single drag-based methods or text-based methods. ... To show the performance of CLIPDrag we compared both drag-based methods (DragDiffusion, FreeDrag, RegionDrag, StableDrag, InstantDrag, LightningDrag), and text-based method (DiffCLIP) on text-drag image editing tasks. ... All input images are from the DRAGBENCH datasets (Shi et al., 2024b). ... Quantitative results are shown in Figure 6(b). |
| Researcher Affiliation | Academia | Ziqi Jiang, Zhen Wang, Long Chen The Hong Kong University of Science and Technology EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and processes like 'Global-Local Motion Supervision' and 'Fast Point Tracking' in detailed text and mathematical formulations (e.g., equations 1-7), but it does not present any explicitly labeled pseudocode blocks or algorithms in a structured, code-like format. |
| Open Source Code | Yes | Codes: https://github.com/ZiQi-Jiang/CLIPDrag. |
| Open Datasets | Yes | All input images are from the DRAGBENCH datasets (Shi et al., 2024b). |
| Dataset Splits | No | The paper mentions using 'DRAGBENCH datasets' and comparing methods 'on the DRAGBENCH benchmark with five different max iteration step settings,' but it does not provide specific details about training, validation, or test splits (e.g., percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | the result is calculated on a single 3090 GPU by averaging over 100 examples sampled from the DragBench. |
| Software Dependencies | Yes | We used Stable Diffusion 1.5 (Rombach et al., 2022) and CLIP ViT-B/16 (Dosovitskiy et al., 2020) as the base model. |
| Experiment Setup | Yes | For the LoRA finetuning stage, we set the training steps as 80, and the rank as 16 with a small learning rate of 0.0005. In the DDIM inversion, we set the inversion strength to 0.7 and the total denoising steps to 50. In the Motion supervision, we had a large maximum optimization step of 2000, ensuring handles could reach the targets. The features were extracted from the last layer of the U-Net. The radius for motion supervision (r1) and point tracking (r2) were set to 4 and 12, respectively. The weight λ in the Global-Local Gradient Fusion process was 0.7. |
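The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. All names below are illustrative (they are not taken from the paper's released code), and `fuse_gradients` assumes the fusion is a simple convex combination with weight λ; the paper defines the exact rule in its equations.

```python
# Hypothetical configuration mirroring the reported CLIPDrag hyperparameters.
# Key names are assumptions; only the numeric values come from the paper.
CLIPDRAG_CONFIG = {
    "lora": {"train_steps": 80, "rank": 16, "learning_rate": 5e-4},
    "ddim_inversion": {"strength": 0.7, "denoising_steps": 50},
    "motion_supervision": {
        "max_optimization_steps": 2000,
        "feature_layer": "unet_last",      # features from the last U-Net layer
        "radius_supervision_r1": 4,
        "radius_tracking_r2": 12,
    },
    "gradient_fusion_lambda": 0.7,          # weight λ in Global-Local Gradient Fusion
}

def fuse_gradients(g_global, g_local,
                   lam=CLIPDRAG_CONFIG["gradient_fusion_lambda"]):
    """Assumed form of Global-Local Gradient Fusion: a convex combination
    of the global (text-driven) and local (drag-driven) gradients."""
    return lam * g_global + (1.0 - lam) * g_local
```

Keeping the settings in one dictionary makes it easy to log the exact configuration alongside results, which is the kind of detail this reproducibility variable checks for.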