When does perceptual alignment benefit vision representations?

Authors: Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard benchmarks.
Researcher Affiliation | Collaboration | Shobhita Sundaram (MIT), Stephanie Fu (U.C. Berkeley), Lukas Muttenthaler (TU Berlin, BIFOLD), Netanel Y. Tamir (Weizmann Institute of Science), Lucy Chai (MIT), Simon Kornblith (Anthropic), Trevor Darrell (U.C. Berkeley), Phillip Isola (MIT)
Pseudocode | No | The paper includes diagrams, such as Figure 2, to illustrate methods, but it does not contain any formal pseudocode blocks or algorithms.
Open Source Code | Yes | Our blog post and code are available at percep-align.github.io.
Open Datasets | Yes | We use the NIGHTS dataset to produce human-aligned variations of several large vision models [18]. The NIGHTS dataset consists of 20k synthetically generated image triplets, annotated with two-alternative forced-choice (2AFC) human similarity judgments. These triplets are collected so that each has 6-10 unanimous human ratings, thus eliminating ambiguous cases where humans are likely to disagree. (A 2AFC evaluation sketch follows this table.)
Dataset Splits | Yes | Train/val/test splits on NIGHTS, BAPPS, and THINGS were used as provided with each dataset.
Hardware Specification | Yes | All training and evaluation for dense prediction tasks is done on a single NVIDIA Titan RTX GPU. [...] This full research project required additional compute for experiments and results that are not included in this paper; these computations were also done on single NVIDIA Titan RTX, GeForce 2080, GeForce 3090, and V100 GPUs.
Software Dependencies | No | The paper mentions using "the sci-kit learn implementation" for VTAB classification but does not provide specific version numbers for scikit-learn or any other software dependency. (A linear-probe sketch follows this table.)
Experiment Setup | Yes | We fine-tune its parameters θ on a dataset of triplets D = {(x, x₀, x₁), y}, where x denotes a reference image, and x₀ and x₁ denote two variation images. The judgement y ∈ {0, 1} indicates which of x₀ and x₁ is more similar to x. We measure the distance (dissimilarity) between two images (x, x₀) using the cosine distance between their respective image features (fθ(x), fθ(x₀)), defined as d(x, x₀) = 1 − (fθ(x) · fθ(x₀)) / (‖fθ(x)‖ ‖fθ(x₀)‖).
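
The Experiment Setup row above specifies the cosine distance and the 2AFC triplet labels but not the exact training objective. The PyTorch sketch below is a minimal, hedged illustration of fine-tuning on such triplets: the hinge formulation, the `margin` value, and the function names are assumptions made for illustration, not the authors' confirmed loss.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # d(x, x') = 1 - cosine similarity of the image features, as defined above.
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def triplet_2afc_loss(f_x, f_x0, f_x1, y, margin: float = 0.05):
    """Hinge loss on the distance gap (an assumed objective, for illustration only).

    y == 0 means humans judged x0 more similar to the reference x; y == 1 means x1.
    """
    d0 = cosine_distance(f_x, f_x0)
    d1 = cosine_distance(f_x, f_x1)
    # The signed gap is positive when the model already agrees with the human judgement.
    gap = torch.where(y == 0, d1 - d0, d0 - d1)
    return F.relu(margin - gap).mean()

# Usage with any feature extractor f_theta being fine-tuned:
#   f_x, f_x0, f_x1 = f_theta(x), f_theta(x0), f_theta(x1)
#   loss = triplet_2afc_loss(f_x, f_x0, f_x1, y)
```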
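
For the NIGHTS 2AFC judgments described in the Open Datasets row, evaluation reduces to checking whether the model's cosine distances agree with the unanimous human labels. A minimal sketch, assuming batched feature tensors and the same label convention as above (names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def two_afc_accuracy(f_x: torch.Tensor, f_x0: torch.Tensor,
                     f_x1: torch.Tensor, y: torch.Tensor) -> float:
    # Predict 1 when the features place x1 closer to the reference than x0, else 0,
    # then score agreement with the human judgements y.
    d0 = 1.0 - F.cosine_similarity(f_x, f_x0, dim=-1)
    d1 = 1.0 - F.cosine_similarity(f_x, f_x1, dim=-1)
    pred = (d1 < d0).long()
    return (pred == y.long()).float().mean().item()
```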