Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
Authors: Yihao Li, Saeed Salehi, Lyle H. Ungar, Konrad Kording
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We decode IsSameObject from patch embeddings across ViT layers using a quadratic similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in DINO, CLIP, and ImageNet-supervised ViTs, but is markedly weaker in MAE, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. |
| Researcher Affiliation | Academia | Yihao Li¹, Saeed Salehi², Lyle Ungar¹, Konrad P. Kording¹. ¹University of Pennsylvania; ²Machine Learning Group, Technical University of Berlin. |
| Pseudocode | Yes | Algorithm 1: Quadratic Probe (full rank). Algorithm 2: Quadratic Probe (with fixed rank r). |
| Open Source Code | Yes | Code available at: https://github.com/liyihao0302/vit-object-binding. |
| Open Datasets | Yes | We extract DINOv2-Large [13] activations at each layer and train the probes on the ADE20K dataset [56] using cross-entropy loss for all pairwise probes to classify same-object vs. different-object patch pairs (see Figure 2 for IsSameObject visualizations). Supervised ImageNet training. Although ImageNet labels correspond to the dominant object in each image [57], class-level supervision still provides useful signals for object identity, consistent with the strong performance of our object-class probes. |
| Dataset Splits | No | The paper states: "We extract DINOv2-Large [13] activations at each layer and train the probes on the ADE20K dataset [56] using cross-entropy loss for all pairwise probes to classify same-object vs. different-object patch pairs". It also mentions, "We evaluate the semantic and instance segmentation performance with retrained segmentation heads on a subset of ADE20K under these variations". However, it does not explicitly provide specific train/test/validation splits (percentages, counts, or references to standard splits) for the ADE20K dataset used for training and evaluation of the probes or segmentation heads. |
| Hardware Specification | Yes | All computations are performed using float32 precision on an NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using "Adam optimizer" but does not specify version numbers for any software libraries, programming languages, or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the Adam optimizer with a learning rate of 0.001 and a step learning rate scheduler with step size of 8 epochs and gamma decay factor of 0.2. All probes are trained for 16 epochs with a batch size of 256 or 128. |
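
The step schedule quoted under Experiment Setup can be worked out by hand. The sketch below is illustrative only and is not taken from the paper's repository: it assumes the scheduler is stepped once per epoch, that epochs are counted from 0, and that the decay follows the usual step rule `base_lr * gamma ** (epoch // step_size)`. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=1e-3, step_size=8, gamma=0.2):
    """Learning rate in effect during a 0-indexed epoch under step decay.

    Defaults mirror the values quoted in the table (learning rate 0.001,
    step size 8 epochs, gamma 0.2). Stepping once per epoch is an assumption.
    """
    return base_lr * gamma ** (epoch // step_size)


# The table reports 16 training epochs in total.
schedule = [step_lr(epoch) for epoch in range(16)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Under these assumptions the rate stays at 0.001 for the first 8 epochs and drops to 0.0002 for the remaining 8, so a 16-epoch run sees exactly one decay.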