Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Authors: Yihao Li, Saeed Salehi, Lyle H. Ungar, Konrad Kording

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We decode IsSameObject from patch embeddings across ViT layers using a quadratic similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in DINO, CLIP, and ImageNet-supervised ViTs, but is markedly weaker in MAE, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective.
Researcher Affiliation | Academia | Yihao Li (1), Saeed Salehi (2), Lyle Ungar (1), Konrad P. Kording (1). (1) University of Pennsylvania; (2) Machine Learning Group, Technical University of Berlin.
Pseudocode | Yes | Algorithm 1 Quadratic Probe (full rank); Algorithm 2 Quadratic Probe (with fixed rank r)
Open Source Code | Yes | Code available at: https://github.com/liyihao0302/vit-object-binding.
Open Datasets | Yes | We extract DINOv2-Large [13] activations at each layer and train the probes on the ADE20K dataset [56] using cross-entropy loss for all pairwise probes to classify same-object vs. different-object patch pairs (see Figure 2 for IsSameObject visualizations). Supervised ImageNet training. Although ImageNet labels correspond to the dominant object in each image [57], class-level supervision still provides useful signals for object identity, consistent with the strong performance of our object-class probes.
Dataset Splits | No | The paper states: "We extract DINOv2-Large [13] activations at each layer and train the probes on the ADE20K dataset [56] using cross-entropy loss for all pairwise probes to classify same-object vs. different-object patch pairs". It also mentions, "We evaluate the semantic and instance segmentation performance with retrained segmentation heads on a subset of ADE20K under these variations". However, it does not explicitly provide specific train/test/validation splits (percentages, counts, or references to standard splits) for the ADE20K dataset used for training and evaluation of the probes or segmentation heads.
Hardware Specification | Yes | All computations are performed using float32 precision on a NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions using "Adam optimizer" but does not specify version numbers for any software libraries, programming languages, or frameworks used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the Adam optimizer with a learning rate of 0.001 and a step learning rate scheduler with step size of 8 epochs and gamma decay factor of 0.2. All probes are trained for 16 epochs with a batch size of 256 or 128.
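The step learning rate schedule quoted under Experiment Setup can be written out explicitly. The sketch below is illustrative only and is not taken from the authors' code: the function name step_lr is hypothetical, and it assumes 0-indexed epochs with the decay applied once every 8 completed epochs, since the report does not state the framework or the exact stepping convention. The numeric values (0.001, 8, 0.2, 16) come from the quoted setup.

```python
# Illustrative sketch of the step learning-rate schedule in the quoted setup.
# Assumption: epochs are 0-indexed and the rate is multiplied by GAMMA after
# every STEP_SIZE completed epochs. The report does not confirm this convention.

BASE_LR = 0.001   # Adam learning rate (from the quoted setup)
STEP_SIZE = 8     # epochs between decays (from the quoted setup)
GAMMA = 0.2       # multiplicative decay factor (from the quoted setup)
NUM_EPOCHS = 16   # total probe-training epochs (from the quoted setup)


def step_lr(epoch, base_lr=BASE_LR, step_size=STEP_SIZE, gamma=GAMMA):
    """Return the learning rate in effect during the given 0-indexed epoch."""
    return base_lr * gamma ** (epoch // step_size)


# One learning rate per training epoch.
schedule = [step_lr(epoch) for epoch in range(NUM_EPOCHS)]

if __name__ == "__main__":
    for epoch, lr in enumerate(schedule):
        print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Under these assumptions the 16-epoch run has exactly one decay: the first 8 epochs use the base rate and the last 8 use the base rate multiplied by 0.2.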