Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Unsupervised Semantic Correspondence Using Stable Diffusion

Authors: Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences: locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly or unsupervised method on PF-Willow, CUB-200 and SPair-71k datasets.
Researcher Affiliation	Collaboration	1 University of British Columbia, 2 Vector Institute for AI, 3 Google, 4 Simon Fraser University, 5 University of Toronto
Pseudocode	No	The paper describes its algorithm and methods in prose and with mathematical equations, but it does not include a distinct pseudocode or algorithm block.
Open Source Code	No	The paper does not provide any explicit statements about releasing source code or a link to a code repository.
Open Datasets	Yes	We evaluate semantic correspondence search on three standard benchmarks: SPair-71k [14] is the largest standard dataset for evaluating semantic correspondences composed of 70,958 image pairs of 18 different classes. Since we do not perform any training, we only use the 12,234 correspondences of the test set for evaluation; PF-Willow [13] comprises four classes (wine bottle, duck, motorcycle, and car) with 900 correspondences in the test set; CUB-200 [37] includes 200 different classes of birds. Following ASIC [46] we select the first three classes, yielding a total of 1,248 correspondences in the test set.
Dataset Splits	No	The paper mentions using a 'validation subset' for hyperparameter tuning ('We choose our hyperparameters based on the validation subset of SPair-71k and PCK@0.05 via a fully-randomized search and applied them to all datasets.'), but it does not specify the explicit split percentages or sample counts for this validation set.
Hardware Specification	Yes	On an NVIDIA RTX 3090 GPU, finding a single prompt for a single keypoint takes 30 seconds.
Software Dependencies	Yes	We use Stable Diffusion version 1.4 [30].
Experiment Setup	Yes	On CUB-200 and PF-Willow datasets we use 10 optimization rounds for the embeddings and 30 random crops for the inference. For the larger SPair-71k dataset we use less: 5 embeddings and 20 crops. ... We use the Adam [73] optimizer to find the prompts.
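
The Dataset Splits row quotes PCK@0.05 as the metric used for hyperparameter selection. As a rough illustration only, and not the authors' code, PCK (percentage of correct keypoints) is commonly computed as the fraction of predicted keypoints that fall within a distance of alpha times a reference size from the ground truth. The function name `pck`, the `ref_size` parameter, and the sample coordinates below are hypothetical; the quoted text does not say whether the paper normalizes by image size or bounding-box size, so the normalizer is left to the caller.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(pred: Sequence[Point], gt: Sequence[Point],
        ref_size: float, alpha: float = 0.05) -> float:
    """Return the fraction of predictions within alpha * ref_size of the ground truth.

    ref_size is an assumed normalizer (e.g. the longer side of the image or of
    the object bounding box); the source does not state which one is used.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if not gt:
        raise ValueError("at least one keypoint is required")

    threshold = alpha * ref_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Hypothetical keypoints: two predictions land within 5 px of the target, one does not.
predicted = [(10.0, 10.0), (50.0, 52.0), (90.0, 40.0)]
ground_truth = [(10.0, 12.0), (50.0, 50.0), (70.0, 40.0)]
score = pck(predicted, ground_truth, ref_size=100.0, alpha=0.05)
```

With a reference size of 100 and alpha = 0.05, the threshold is 5 pixels, so two of the three hypothetical keypoints count as correct. Tightening alpha lowers the score, which is why results are always reported together with the alpha value.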