DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
Authors: Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks. Our project page: https://dreamsim-nights.github.io/ |
| Researcher Affiliation | Collaboration | 1MIT 2Weizmann Institute of Science 3Adobe Research |
| Pseudocode | No | The paper describes its methods in prose but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper states: "Our project page: https://dreamsim-nights.github.io/". While project pages often link to code, the text explicitly states only that "Our dataset is publicly available on our project page." and does not make a similar explicit statement regarding the release of the code for the methodology described in the paper. |
| Open Datasets | Yes | Our dataset is publicly available on our project page. We sample images with a prompt of the same category, using the structure `An image of a <category>`. The `<category>` is drawn from image labels in popular datasets: ImageNet [21], CIFAR-10 [46], CIFAR-100 [46], Oxford 102 Flower [60], Food-101 [8], and SUN397 [91]. (A prompt-sampling sketch follows this table.) |
| Dataset Splits | Yes | We partition our resulting dataset into train, validation, and test components with a random 80/10/10 split. (See the split sketch after this table.) |
| Hardware Specification | Yes | We train on a single NVIDIA GeForce RTX 3090 or NVIDIA TITAN RTX GPU with an Adam optimizer, learning rate of 3e-4, weight decay of 0, and batch size of 512 (non-ensemble models) and 16 (ensemble models). |
| Software Dependencies | No | The paper mentions specific software like "Stable Diffusion v1.4", "Adam optimizer", "Hugging Face transformers repository", and "Facebook repository", but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train on a single NVIDIA GeForce RTX 3090 or NVIDIA TITAN RTX GPU with an Adam optimizer, learning rate of 3e-4, weight decay of 0, and batch size of 512 (non-ensemble models) and 16 (ensemble models). In MLP-tuned training, we use a width of 512. We tune the number of training epochs using the validation set; for the Tuned LoRA ensemble model (DreamSim) we train for 6 epochs. For Tuned LoRA models we use rank r = 16, scaling α = 0.5, and dropout p = 0.3. (See the LoRA training sketch after this table.) |
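
For concreteness, here is a minimal sketch of the prompt-based sampling quoted in the Open Datasets row, using the `diffusers` library and Stable Diffusion v1.4 (the model the paper reports using). The model ID, seeds, and example category are illustrative; varying only the seed is a stand-in for the paper's perturbations along different visual dimensions, whose exact procedure is not specified in the excerpts above.

```python
# Sketch of same-category image-pair sampling; seeds and category are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def sample_pair(category: str, seed_a: int = 0, seed_b: int = 1):
    """Generate two images from the paper's prompt template
    'An image of a <category>'; different seeds approximate
    perturbations along different visual dimensions."""
    prompt = f"An image of a {category}"
    img_a = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed_a)).images[0]
    img_b = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed_b)).images[0]
    return img_a, img_b

ref, variant = sample_pair("golden retriever")  # category drawn from e.g. ImageNet labels
```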
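The random 80/10/10 split from the Dataset Splits row is straightforward to reproduce; the `triplets` argument and seed below are placeholders, not values from the paper.

```python
# Sketch of a random 80/10/10 train/validation/test partition.
import random

def split_dataset(triplets, seed: int = 0):
    items = list(triplets)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return (items[:n_train],                       # 80% train
            items[n_train:n_train + n_val],        # 10% validation
            items[n_train + n_val:])               # 10% test

train, val, test = split_dataset(range(1000))
```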
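Finally, a hedged sketch of the reported optimizer and LoRA hyperparameters, expressed with Hugging Face `peft`. The backbone and target modules are assumptions (DreamSim ensembles several ViT backbones, and its own code may configure LoRA differently), and mapping the paper's scaling α = 0.5 onto `peft` is an interpretation: `peft` scales LoRA updates by `lora_alpha / r`, so `lora_alpha = 8` with `r = 16` yields a 0.5 multiplier.

```python
# Hedged sketch of the reported fine-tuning setup; backbone, target modules,
# and the lora_alpha mapping are assumptions, not the paper's exact code.
import torch
from peft import LoraConfig, get_peft_model
from transformers import CLIPVisionModel  # one possible backbone; illustrative

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
lora_cfg = LoraConfig(
    r=16,               # rank r = 16 from the paper
    lora_alpha=8,       # lora_alpha / r = 0.5, matching the paper's scaling
    lora_dropout=0.3,   # dropout p = 0.3 from the paper
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)
model = get_peft_model(backbone, lora_cfg)

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.0)
batch_size = 16   # ensemble models; 512 for non-ensemble models per the paper
num_epochs = 6    # tuned on the validation set for the Tuned LoRA ensemble
```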