Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing

Authors: Ziqi Jiang, Zhen Wang, Long Chen

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that CLIPDrag outperforms existing single drag-based methods or text-based methods. ... To show the performance of CLIPDrag we compared both drag-based methods (DragDiffusion, FreeDrag, RegionDrag, StableDrag, InstantDrag, LightningDrag), and text-based method (DiffCLIP) on text-drag image editing tasks. ... All input images are from the DRAGBENCH datasets (Shi et al., 2024b). ... Quantitative results are shown in Figure 6(b).
Researcher Affiliation | Academia | Ziqi Jiang, Zhen Wang, Long Chen, The Hong Kong University of Science and Technology, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes like 'Global-Local Motion Supervision' and 'Fast Point Tracking' in detailed text and mathematical formulations (e.g., equations 1-7), but it does not present any explicitly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code | Yes | Codes: https://github.com/ZiQi-Jiang/CLIPDrag.
Open Datasets | Yes | All input images are from the DRAGBENCH datasets (Shi et al., 2024b).
Dataset Splits | No | The paper mentions using 'DRAGBENCH datasets' and comparing methods 'on the DRAGBENCH benchmark with five different max iteration step settings,' but it does not provide specific details about training, validation, or test splits (e.g., percentages, sample counts, or explicit splitting methodology).
Hardware Specification | Yes | The result is calculated on a single 3090 GPU by averaging over 100 examples sampled from the DragBench.
Software Dependencies | Yes | We used Stable Diffusion 1.5 (Rombach et al., 2022) and CLIP-ViT-B/16 (Dosovitskiy et al., 2020) as the base model.
Experiment Setup | Yes | For the LoRA finetuning stage, we set the training steps as 80, and the rank as 16 with a small learning rate of 0.0005. In the DDIM inversion, we set the inversion strength to 0.7 and the total denoising steps to 50. In the motion supervision, we had a large maximum optimization step of 2000, ensuring handles could reach the targets. The features were extracted from the last layer of the U-Net. The radius for motion supervision (r1) and point tracking (r2) were set to 4 and 12, respectively. The weight λ in the Global-Local Gradient Fusion process was 0.7.
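The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is only an illustration: the dictionary keys below are hypothetical names chosen for readability, while the values are the ones stated in the paper. The derived step count assumes the common convention (used by DragDiffusion-style pipelines) that the inversion strength scales the number of denoising steps actually inverted.

```python
# Hypothetical configuration sketch of CLIPDrag's reported setup.
# Key names are illustrative; values are taken from the quoted Experiment Setup row.
clipdrag_config = {
    "lora": {
        "train_steps": 80,
        "rank": 16,
        "learning_rate": 5e-4,
    },
    "ddim": {
        "inversion_strength": 0.7,
        "denoising_steps": 50,
    },
    "motion_supervision": {
        "max_optimization_steps": 2000,   # large cap so handles can reach targets
        "feature_layer": "unet_last",     # features from the last U-Net layer
        "radius_r1": 4,                   # motion-supervision radius
        "radius_r2": 12,                  # point-tracking radius
    },
    "global_local_fusion": {
        "lambda": 0.7,                    # weight in Global-Local Gradient Fusion
    },
}

# Assuming inversion strength scales the inverted step count (a common
# convention, not stated explicitly in the quote): 0.7 * 50 = 35 steps.
inverted_steps = round(
    clipdrag_config["ddim"]["inversion_strength"]
    * clipdrag_config["ddim"]["denoising_steps"]
)
print(inverted_steps)  # 35
```

Under this reading, DDIM inversion stops at 35 of the 50 scheduled steps, which is where the drag optimization would operate before the remaining denoising completes the edit.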