When does perceptual alignment benefit vision representations?
Authors: Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard benchmarks. |
| Researcher Affiliation | Collaboration | Shobhita Sundaram (1), Stephanie Fu (2), Lukas Muttenthaler (3,4), Netanel Y. Tamir (5), Lucy Chai (1), Simon Kornblith (6), Trevor Darrell (2), Phillip Isola (1); (1) MIT, (2) U.C. Berkeley, (3) TU Berlin, (4) BIFOLD, (5) Weizmann Institute of Science, (6) Anthropic |
| Pseudocode | No | The paper includes diagrams, such as Figure 2, to illustrate methods, but it does not contain any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our blog post and code are available at percep-align.github.io. |
| Open Datasets | Yes | We use the NIGHTS dataset to produce human-aligned variations of several large vision models [18]. The NIGHTS dataset consists of 20k synthetically generated image triplets, annotated with two-alternative forced-choice (2AFC) human similarity judgments. These triplets are collected so that each has 6-10 unanimous human ratings, thus eliminating ambiguous cases where humans are likely to disagree. |
| Dataset Splits | Yes | Train/val/test splits for NIGHTS, BAPPS, and THINGS were used as provided with each dataset. |
| Hardware Specification | Yes | All training and evaluation for dense prediction tasks is done on a single NVIDIA Titan RTX GPU. [...] This full research project required additional compute for experiments and results that are not included in this paper; these computations were also done on single NVIDIA Titan RTX, GeForce 2080, GeForce 3090, and V100 GPUs. |
| Software Dependencies | No | The paper mentions using "the sci-kit learn implementation" for VTAB classification but does not provide specific version numbers for scikit-learn or any other software dependencies. |
| Experiment Setup | Yes | We fine-tune its parameters $\theta$ on a dataset of triplets $D = \{(x, x_0, x_1), y\}$, where $x$ denotes a reference image, and $x_0$ and $x_1$ denote two variation images. The judgement $y \in \{0, 1\}$ indicates which of $x_0$ and $x_1$ is more similar to $x$. We measure distance (dissimilarity) between two images $(x, x_0)$ using the cosine distance between their respective image features $(f_\theta(x), f_\theta(x_0))$, which is defined as $d(x, x_0) = 1 - \frac{f_\theta(x) \cdot f_\theta(x_0)}{\lVert f_\theta(x)\rVert \, \lVert f_\theta(x_0)\rVert}$. (Code sketches of this distance, the 2AFC prediction, and a plausible training loss follow the table.) |
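
Below is a minimal PyTorch sketch of the cosine distance and 2AFC prediction described in the Experiment Setup row. The function names and the placeholder features in the usage example are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def cosine_distance(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # d(a, b) = 1 - (a . b) / (||a|| ||b||), matching the paper's definition.
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1)

def predict_2afc(f_x: torch.Tensor, f_x0: torch.Tensor, f_x1: torch.Tensor) -> torch.Tensor:
    # Predict which variation is more similar to the reference x:
    # 0 if x0 is closer (smaller cosine distance), 1 if x1 is closer,
    # matching the label convention y in {0, 1} above.
    d0 = cosine_distance(f_x, f_x0)
    d1 = cosine_distance(f_x, f_x1)
    return (d1 < d0).long()

# Usage with placeholder features (in practice these would be f_theta(image)):
f_x, f_x0, f_x1 = (torch.randn(8, 768) for _ in range(3))
y_hat = predict_2afc(f_x, f_x0, f_x1)  # shape (8,), values in {0, 1}
```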
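
The table does not quote the training objective. A margin (hinge) loss on the difference between the two cosine distances is a common choice for fitting 2AFC triplet judgments, so the following sketch assumes that form; the margin value is likewise an assumption:

```python
import torch
import torch.nn.functional as F

def triplet_hinge_loss(f_x, f_x0, f_x1, y, margin: float = 0.05):
    # Cosine distances from the reference to each variation.
    d0 = 1.0 - F.cosine_similarity(f_x, f_x0, dim=-1)
    d1 = 1.0 - F.cosine_similarity(f_x, f_x1, dim=-1)
    # y = 0 means x0 should be closer (d0 < d1); y = 1 means x1 should be closer.
    sign = 1.0 - 2.0 * y.float()   # +1 when y == 0, -1 when y == 1
    gap = sign * (d0 - d1)         # negative when the model agrees with the human label
    # Penalize triplets where the preferred image is not closer by at least `margin`.
    return torch.clamp(margin + gap, min=0.0).mean()
```

Minimizing this loss pushes the fine-tuned features $f_\theta$ to place the human-preferred variation closer to the reference, which is what the accuracy metric on the held-out NIGHTS split measures.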