DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

Authors: Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola

NeurIPS 2023

Reproducibility assessment. Each entry gives the variable, the result, and the LLM's supporting response:
Research Type: Experimental. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks. Our project page: https://dreamsim-nights.github.io/
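The paper does not include code here, but the core idea of an embedding-based perceptual metric such as DreamSim is easy to illustrate: embed both images with a pretrained vision backbone and take the cosine distance between the embeddings. The sketch below is an assumption-laden stand-in (a generic `timm` ViT rather than the paper's LoRA-tuned ensemble of ViT-based backbones):

```python
import torch
import torch.nn.functional as F
import timm
from PIL import Image

# Hypothetical stand-in backbone; DreamSim itself fine-tunes an
# ensemble of pretrained ViT models rather than this single network.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

cfg = timm.data.resolve_model_data_config(model)
preprocess = timm.data.create_transform(**cfg, is_training=False)

def embed(path: str) -> torch.Tensor:
    # Returns the pooled feature vector for one image, shape (1, D).
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(x)

def perceptual_distance(path_a: str, path_b: str) -> float:
    # Embedding-based metrics of this kind score a pair as
    # 1 minus the cosine similarity of the two embeddings.
    return 1.0 - F.cosine_similarity(embed(path_a), embed(path_b)).item()

print(perceptual_distance("img_a.png", "img_b.png"))
```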
Researcher Affiliation: Collaboration. 1: MIT; 2: Weizmann Institute of Science; 3: Adobe Research.
Pseudocode: No. The paper describes its methods in prose but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code: No. The paper states: "Our project page: https://dreamsim-nights.github.io/". While project pages often link to code, the text explicitly states only that "Our dataset is publicly available on our project page" and makes no comparable statement about releasing the code for the methodology described in the paper.
Open Datasets: Yes. Our dataset is publicly available on our project page. We sample images with a prompt of the same category, using the structure "An image of a <category>". The <category> is drawn from image labels in popular datasets: ImageNet [21], CIFAR-10 [46], CIFAR-100 [46], Oxford 102 Flower [60], Food-101 [8], and SUN397 [91].
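As an illustration of the generation recipe quoted above, the following sketch samples synthetic image pairs per category prompt with Stable Diffusion v1.4 (the model the paper mentions elsewhere) via the `diffusers` library. The sampling settings are assumptions for illustration; the paper's exact procedure for perturbing pairs along different dimensions is not reproduced here:

```python
import torch
from diffusers import StableDiffusionPipeline

# Stable Diffusion v1.4 checkpoint on the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Illustrative category labels, e.g. from ImageNet, Oxford 102 Flower, Food-101.
categories = ["goldfish", "tulip", "pizza"]
for category in categories:
    prompt = f"An image of a {category}"
    # Two independent samples from the same prompt yield a same-category pair.
    image_a = pipe(prompt).images[0]
    image_b = pipe(prompt).images[0]
    image_a.save(f"{category}_a.png")
    image_b.save(f"{category}_b.png")
```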
Dataset Splits: Yes. We partition our resulting dataset into train, validation, and test components with a random 80/10/10 split.
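A minimal sketch of one way to reproduce such a split, assuming the dataset is an in-memory list of records; the fixed seed and function name are illustrative, not from the paper:

```python
import random

def split_80_10_10(records, seed=0):
    # Shuffle a copy so the split is random but reproducible under a fixed seed.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```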
Hardware Specification: Yes. We train on a single NVIDIA GeForce RTX 3090 or NVIDIA TITAN RTX GPU with an Adam optimizer, learning rate of 3e-4, weight decay of 0, and batch size of 512 (non-ensemble models) and 16 (ensemble models).
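The reported optimizer settings translate directly into a PyTorch configuration; in this sketch the model is a placeholder, not the paper's architecture:

```python
import torch

model = torch.nn.Linear(768, 768)  # placeholder module, not the paper's metric
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,          # learning rate reported in the paper
    weight_decay=0.0, # weight decay reported in the paper
)
batch_size = 512  # non-ensemble models; the paper uses 16 for ensemble models
```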
Software Dependencies: No. The paper mentions specific software like "Stable Diffusion v1.4", "Adam optimizer", "Hugging Face transformers repository", and "Facebook repository", but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup: Yes. We train on a single NVIDIA GeForce RTX 3090 or NVIDIA TITAN RTX GPU with an Adam optimizer, learning rate of 3e-4, weight decay of 0, and batch size of 512 (non-ensemble models) and 16 (ensemble models). In MLP-tuned training, we use a width of 512. We tune the number of training epochs using the validation set; for the Tuned LoRA ensemble model (DreamSim) we train for 6 epochs. For Tuned LoRA models we use rank r = 16, scaling α = 0.5, and dropout p = 0.3.
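The LoRA hyperparameters map onto a standard low-rank adapter. Below is a minimal LoRA-wrapped linear layer in PyTorch using the reported rank, α, and dropout; note that the paper adapts weights inside pretrained ViT backbones rather than a standalone layer, and whether α is applied directly or as α/r varies by implementation (this sketch uses the original LoRA paper's α/r convention):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 0.5, dropout: float = 0.3):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        # A is small random, B is zero, so the adapter starts as a no-op.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r  # convention varies; assumed here, not confirmed by the paper
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = self.dropout(x) @ self.lora_a.T @ self.lora_b.T
        return self.base(x) + self.scaling * update

# Reported hyperparameters: r = 16, alpha = 0.5, dropout p = 0.3.
layer = LoRALinear(nn.Linear(768, 768), r=16, alpha=0.5, dropout=0.3)
```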