Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Vision Transformers with Self-Distilled Registers
Authors: Zipeng Yan, Yinjie Chen, Chong Zhou, Bo Dai, Andrew Luo
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing. Our code is publicly available at this repository. ... We demonstrate that PH-Reg effectively improves the consistency of dense feature representations in ViTs, leading to quantifiable improvements on downstream tasks that rely on fine-grained spatial understanding (e.g., semantic segmentation or depth prediction). ... We comprehensively evaluate the performance of PH-Reg on a diverse set of dense tasks, first using a zero-shot setup for open-vocabulary segmentation in section 4.1, followed by linear probe based segmentation and depth tasks in section 4.2. Finally we perform ablation studies to explore design decisions and investigate the nature of artifacts across different models in section 4.3. |
| Researcher Affiliation | Academia | Zipeng Yan*¹, Yinjie Chen*¹˒², Chong Zhou³, Bo Dai¹, Andrew F. Luo¹. ¹University of Hong Kong, ²Zhejiang University, ³Nanyang Technological University |
| Pseudocode | Yes | Algorithm 1 Denoising Process. Input: image $I \in \mathbb{R}^{H \times W \times 3}$; image-space coordinates $C \in [0, 1] \times [0, 1]$; augmentation parameters $\theta_1, \theta_2, \ldots, \theta_n$; augmentation function $T$; ViT teacher model $f_{\text{teacher}}$. (1) Zero-init clean feature tensor $Q$. (2) Zero-init count tensor $K$. (3) For $i$ in $\{1, \ldots, n\}$: (4) $\theta_i = (x_i, y_i, \text{flip}_i)$; (5) $(I_i, C_i) = T(I, C, \theta_i)$; (6) dense feature $F_i = f_{\text{teacher}}(I_i)$; (7) $(F_i^{\text{valid}}, C_i^{\text{valid}}) = T^{-1}(F_i, C_i, \theta_i)$; (8) $Q[C_i^{\text{valid}}] = Q[C_i^{\text{valid}}] + F_i^{\text{valid}}$; (9) $K[C_i^{\text{valid}}] = K[C_i^{\text{valid}}] + 1$. (10) Return $Q / K$. |
| Open Source Code | Yes | Our code is publicly available at this repository. ... Our codes and weights are available in https://github.com/0raiser0/PH-Reg.git. |
| Open Datasets | Yes | Datasets. In this section, we follow prior works [59, 62, 61] to evaluate our approach on six semantic segmentation datasets, with their names abbreviated (in parentheses) to conserve table space: PASCAL VOC 2012 (VOC21) [63], PASCAL Context (PC 60) [64], COCO-Object (Object) [65], COCOStuff (Stuff) [66], Cityscape (City) [67], ADE20K-150 (ADE) [68]. In addition to these standard benchmarks, we also evaluate on two commonly used variants, PASCAL VOC 2012 (VOC20) and PASCAL Context (PC 59), in which the background class is excluded from the evaluation. ... In this section, we evaluate our approach in two semantic segmentation datasets: PASCAL VOC 2012 (VOC21) [63] and ADE20K-150 (ADE) [68] and one monocular depth estimation dataset: NYUv2-Depth dataset (NYUd) [69]. |
| Dataset Splits | Yes | We resize input images such that the shorter side is scaled to a specific resolution, while maintaining the original aspect ratio for the longer side. Additionally, we set fixed crop sizes and strides during evaluation. All evaluation parameters are summarized in Table S.5, while all other settings follow their default configurations. ... Table S.5: Dataset Specific Details for Open-vocabulary Semantic Segmentation. We list the per-dataset resolution, crop size, and stride used for each dataset. We maintain the same settings for all methods within a given dataset. |
| Hardware Specification | Yes | Training is conducted on 4 NVIDIA Ada 6000 GPUs, with mixed-precision optimization to balance computational efficiency and numerical stability. |
| Software Dependencies | No | The distillation framework is implemented in PyTorch, with distributed training managed via PyTorch Accelerate. ... Models with their library and weight. CLIP: library clip (OpenAI), weight ViT-B-16; OpenCLIP: library open_clip, weight hf-hub:laion/CLIP-ViT-B-16-laion2B-s34B-b88K; DFN-CLIP: library open_clip, weight hf-hub:apple/DFN2B-CLIP-ViT-B-16; DINOv2: library transformers (Hugging Face), weight facebook/dinov2-base |
| Experiment Setup | Yes | Table S.2: Configs for CLIP-based models. optimizer: AdamW; initial learning rate: 3e-4; final learning rate: 1e-5; weight decay: 1e-2; optimizer momentum (β1, β2): (0.9, 0.999); learning rate scheduler: Exponential Scheduler; batch size: 16; training epochs: 100; augmentation: Random Square Crop ... Table S.3: Configs for DINOv2. optimizer: AdamW; initial learning rate: 1e-4; final learning rate: 5e-6; weight decay: 1e-2; optimizer momentum (β1, β2): (0.9, 0.999); learning rate scheduler: Exponential Scheduler; batch size: 8; training epochs: 100; augmentation: Random Square Crop |
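
The denoising loop quoted in the Pseudocode row (Algorithm 1) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: the image is a single-channel grid, the teacher returns one scalar feature per pixel (no patching), augmentations are integer shifts with zero padding plus an optional horizontal flip, and coordinates are integer pixel indices rather than values in [0, 1]. The names `denoise` and `toy_teacher` are hypothetical.

```python
def denoise(image, teacher, thetas):
    """Average teacher features over augmented views, in original coordinates.

    image:   H x W grid (list of rows) of floats
    teacher: callable mapping a view (list of rows) to a same-sized feature grid
    thetas:  iterable of (x, y, flip) augmentation parameters
    """
    H, W = len(image), len(image[0])
    Q = [[0.0] * W for _ in range(H)]  # step 1: accumulated clean features
    K = [[0] * W for _ in range(H)]    # step 2: per-pixel visit counts

    def valid(r, c):
        return 0 <= r < H and 0 <= c < W

    for x, y, flip in thetas:  # steps 3-4
        # Step 5: apply T to image and coordinates together. coords[r][c] is
        # the source pixel that view position (r, c) was taken from.
        coords = [[(r + y, c + x) for c in range(W)] for r in range(H)]
        if flip:
            coords = [row[::-1] for row in coords]
        view = [[image[r][c] if valid(r, c) else 0.0 for r, c in row]
                for row in coords]
        feats = teacher(view)  # step 6
        # Steps 7-9: the carried coordinates act as T^{-1}; features whose
        # source pixel lies outside the image are dropped as invalid.
        for feat_row, coord_row in zip(feats, coords):
            for f, (r, c) in zip(feat_row, coord_row):
                if valid(r, c):
                    Q[r][c] += f
                    K[r][c] += 1

    # Step 10: Q / K; pixels never covered by a valid view are left at 0.0.
    return [[q / k if k else 0.0 for q, k in zip(q_row, k_row)]
            for q_row, k_row in zip(Q, K)]


def toy_teacher(view, artifact=8.0):
    """Hypothetical teacher: identity features plus a fixed-position artifact."""
    feats = [row[:] for row in view]
    feats[0][0] += artifact  # artifact tied to the view position, not the content
    return feats


if __name__ == "__main__":
    image = [[0.0] * 4 for _ in range(4)]
    thetas = [(x, y, False) for x in (-1, 0, 1) for y in (-1, 0, 1)]
    raw = toy_teacher(image)
    clean = denoise(image, toy_teacher, thetas)
    print("raw peak:", max(map(max, raw)), "denoised peak:", max(map(max, clean)))
```

Because the toy artifact is attached to a fixed position in each view rather than to image content, it lands on a different source pixel for each shift, so averaging in original coordinates dilutes it. That is the intuition the sketch is meant to convey; the paper's actual augmentation function and feature resolution may differ.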