Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing
Authors: Timo Lüddecke, Alexander S. Ecker
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose dense attentive probing, a parameter-efficient readout method for dense prediction on arbitrary backbones independent of the size and resolution of their feature volume. To this end, we extend cross-attention with distance-based masks of learnable sizes. We employ this method to evaluate 18 common backbones on dense prediction tasks in three dimensions: instance awareness, local semantics and spatial understanding. We find that DINOv2 outperforms all other backbones tested, including those supervised with masks and language, across all three task categories. |
| Researcher Affiliation | Academia | The provided paper text lists the authors 'Timo Lüddecke and Alexander Ecker' but does not explicitly state their institutional affiliations, departments, cities, countries, or email addresses. The affiliation type therefore cannot be classified from the provided text alone; academic affiliation is assumed by default, as is common for research papers. |
| Pseudocode | No | The paper describes the Dense Attentive Probing (DeAP) method in Section 3 with equations and a diagram (Figure 2), but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at https://eckerlab.org/code/deap. |
| Open Datasets | Yes | We use the COCO dataset (Lin et al., 2014), with the 5,000 images from the validation set being used for testing. For the instance discrimination task, we compute the ARI (adjusted rand index) test scores only on images with at least three large objects (resulting in a subset of 754 images). A natural choice for evaluating local semantics is a semantic segmentation task. Here we rely on two benchmarks: Pascal VOC 2012 (Everingham et al., 2015) and COCO Stuff (Caesar et al., 2018). ... We frame this as a depth map estimation problem, i. e. Dout = 1, relying on the NYUv2 dataset (Nathan Silberman & Fergus, 2012) for training and testing the depth estimation readout. |
| Dataset Splits | Yes | For these experiments, we use the COCO dataset (Lin et al., 2014), with the 5,000 images from the validation set being used for testing. For the instance discrimination task, we compute the ARI (adjusted rand index) test scores only on images with at least three large objects (resulting in a subset of 754 images). On COCO and Pascal we use the validation sets for testing, while model selection is carried out on a separate part of the training set via validation loss. |
| Hardware Specification | Yes | For example, our standard training for a readout on a ViT-B/16 224-pixel backbone on Pascal VOC adds less than 70,000 parameters and trains in less than 16 minutes (using a single Nvidia RTX 2080 GPU). |
| Software Dependencies | No | The paper mentions using 'PyTorch vision (Paszke et al., 2019)' and the 'timm package (Wightman, 2019)' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use the Adam optimizer with a learning rate of 0.001, except for boundary prediction and depth where it is set to 0.002. We use 8 attention heads in all models. On COCO and Pascal we use the validation sets for testing, while model selection is carried out on a separate part of the training set via validation loss. Per-task settings (BS, LR and WD correspond to batch size, learning rate and weight decay; all tasks use BS 32, LR 0.001, WD 0.010, 8 heads, dim 16): Pascal VOC2012: 6,000 iterations, val. interval 250, img size 224, base size 28; COCO Stuff: 20,000 iterations, val. interval 250, img size 224, base size 28; NYUv2 Depth: 3,000 iterations, val. interval 100, img size [216, 288], base size 28; Instance Discrimination: 20,000 iterations, val. interval -1, img size 224, base size 28; Boundaries: 10,000 iterations, val. interval 250, img size 448, base size 56; CenterNet: 20,000 iterations, val. interval 1000, img size 448, base size 56. |
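The paper's core mechanism, as quoted above, is "cross-attention with distance-based masks of learnable sizes." The authors' implementation is not reproduced in this report; below is a minimal NumPy sketch of the general idea only, where the function name, argument layout, and the hard-cutoff masking rule are our assumptions, not the published method:

```python
import numpy as np

def distance_masked_cross_attention(q_pos, k_pos, Q, K, V, radius):
    """Sketch: cross-attention where each query may only attend to
    backbone tokens within a (learnable) spatial radius.

    q_pos: (Nq, 2) query grid coordinates, k_pos: (Nk, 2) token coordinates,
    Q: (Nq, d) queries, K: (Nk, d) keys, V: (Nk, dv) values.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                       # (Nq, Nk) scaled dot-product scores
    dist = np.linalg.norm(q_pos[:, None] - k_pos[None], axis=-1)
    logits = np.where(dist <= radius, logits, -1e9)     # mask tokens beyond the radius
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # softmax over the unmasked tokens
    return w @ V                                        # (Nq, dv) readout features
```

In a trainable version, `radius` would be a learned parameter (the "learnable sizes" in the abstract), likely with a soft rather than hard cutoff so gradients can flow; the sketch uses a hard cutoff purely for clarity.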