Emergent Correspondence from Image Diffusion
Authors: Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, Bharath Hariharan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DIFT with two different types of diffusion models, on three groups of visual correspondence tasks including semantic correspondence, geometric correspondence, and temporal correspondence. We compare DIFT with other baselines, including task-specific methods, and other self-supervised models trained with similar datasets and similar amount of supervision (DINO [10] and OpenCLIP [36]). |
| Researcher Affiliation | Academia | Luming Tang Menglin Jia Qianqian Wang Cheng Perng Phoo Bharath Hariharan Cornell University |
| Pseudocode | No | The paper describes the methodology in narrative text and mathematical formulas, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://diffusionfeatures.github.io. |
| Open Datasets | Yes | We conduct evaluation on three popular benchmarks: SPair-71k [55], PF-WILLOW [27] and CUB-200-2011 [89]. ... SD is trained on the LAION [75] whereas ADM is trained on ImageNet [15] without labels. |
| Dataset Splits | No | The paper mentions testing portions of datasets (e.g., "12,234 image pairs on 18 categories for testing" for SPair-71k, "900 image pairs for testing" for PF-WILLOW, and "14 different splits of CUB (each containing 25 images)") and states they "grid search the hyper-parameters using SPair-71k", but it does not provide explicit training/validation splits or percentages for these datasets in its own experiments. The models themselves (SD, ADM, DINO, CLIP) were pre-trained elsewhere, so their original training splits are outside the scope of this paper's reproduction. |
| Hardware Specification | Yes | For example, when extracting features for semantic correspondence as in Sec. 5, on one single NVIDIA A6000 GPU, DIFT_sd takes 203 ms vs. OpenCLIP's 231 ms on one single 768 × 768 image; DIFT_adm takes 110 ms vs. DINO's 154 ms on one single 512 × 512 image. |
| Software Dependencies | No | The paper mentions using pre-trained models like Stable Diffusion, OpenCLIP, and DINO, and refers to cv2.findHomography() (OpenCV) for its geometric correspondence evaluation (see the usage sketch after this table), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | The total time step T for both diffusion models (ADM and SD) is 1000. ... We use t = 101 and n = 4 for DIFT_adm on input image resolution 512 × 512... we use t = 261 and n = 1 for DIFT_sd on input image resolution 768 × 768. ... when extracting features for one single image using DIFT, we use a batch of random noise to get an averaged feature map. The batch size is 8 by default. (A feature-extraction sketch based on these settings follows the table.) |
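The software dependencies row above notes that the paper refers to cv2.findHomography() (OpenCV) in its geometric correspondence evaluation. Below is a minimal usage sketch of that call, assuming matched keypoints have already been obtained (e.g., by nearest-neighbour matching of DIFT features between two images); the point arrays, image size, and RANSAC reprojection threshold are illustrative assumptions rather than values taken from the paper.

```python
import cv2
import numpy as np

# Hypothetical matched keypoints between a source and a target image
# (e.g., from nearest-neighbour matching of DIFT features).
# Shapes: (N, 2) arrays of (x, y) pixel coordinates, with N >= 4.
rng = np.random.default_rng(0)
src_pts = (rng.random((32, 2)) * 512).astype(np.float32)
dst_pts = (rng.random((32, 2)) * 512).astype(np.float32)

# RANSAC-based homography estimation via cv2.findHomography().
# The 3.0 px reprojection threshold is an illustrative choice.
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 3.0)

if H is not None:
    print(H)                       # estimated 3x3 homography matrix
    print(int(inlier_mask.sum()))  # number of RANSAC inliers
```

With real matches, H maps homogeneous source coordinates to target coordinates, and the inlier mask can be used to score the estimated correspondences.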
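The experiment setup row summarizes the DIFT recipe: add noise to the input at time step t, run a single denoising pass through the diffusion U-Net, read out the intermediate activations of the n-th upsampling block, and average the resulting feature map over a batch of 8 random noise samples. The sketch below illustrates that recipe for DIFT_sd (t = 261, n = 1, 768 × 768 input), assuming the Hugging Face diffusers API and a Stable Diffusion checkpoint id that the paper does not specify; inputs such as prompt_embeds and the dift_features helper are hypothetical, not the authors' released code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Settings reported for DIFT_sd: noise time step t = 261, features from
# U-Net upsampling block n = 1, 768 x 768 input, 8 noise samples averaged.
T_STEP, UP_BLOCK, N_NOISE = 261, 1, 8

# Assumed checkpoint; the paper only states that Stable Diffusion is used.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Capture the intermediate activations of the chosen upsampling block.
captured = {}
pipe.unet.up_blocks[UP_BLOCK].register_forward_hook(
    lambda module, inp, out: captured.update(dift=out)
)

@torch.no_grad()
def dift_features(image, prompt_embeds):
    """image: (1, 3, 768, 768) tensor in [-1, 1]; prompt_embeds: CLIP text embeddings."""
    # Encode to VAE latents, then replicate for a batch of noise samples.
    latents = pipe.vae.encode(image.half().to("cuda")).latent_dist.mode()
    latents = latents * pipe.vae.config.scaling_factor
    latents = latents.repeat(N_NOISE, 1, 1, 1)

    # Add noise at time step t and run one denoising pass through the U-Net;
    # only the hooked intermediate activations are used.
    t = torch.tensor([T_STEP], device="cuda")
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    pipe.unet(noisy, t, encoder_hidden_states=prompt_embeds.repeat(N_NOISE, 1, 1))

    # Average the up-block feature map over the batch of noise samples.
    return captured["dift"].float().mean(dim=0, keepdim=True)
```

Correspondences are then established by nearest-neighbour lookup between the per-pixel feature vectors of two images, as described in the paper.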