Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed
Authors: Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With extensive experiments, our method achieves not only faster convergence, but also higher robustness on models and datasets with various scales. This work demonstrates that robustness need not be an emergent by-product of extreme scale but can instead be built into SSL through careful curriculum design and augmentation. We believe this opens the door to more accessible, reproducible, and robust self-supervised training. In summary, our contributions are the following: We conduct a comprehensive robustness and frequency-based analysis on models pretrained with a low-frequency data curriculum, an underexplored direction. We identify the low-frequency bias introduced by this training scheme and propose Gaussian noise patching as a complementary augmentation to enhance robustness. The proposed curriculum accelerates convergence on ImageNet-1K [27] with DINOv2 and a ViT-B backbone, reducing pretraining time by 1.66× and FLOPs by 2.25×, while maintaining matching robustness and competitive clean linear probing accuracy. |
| Researcher Affiliation | Academia | ¹Brown University, ²Cornell University |
| Pseudocode | No | The paper describes the methods in text and provides mathematical formulas for bicubic interpolation in Appendix A.3, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, it defines the cubic kernel w(t) and the bicubic interpolation f(x, y) with formulas, but not as a procedural algorithm. |
| Open Source Code | Yes | The code is available at https://github.com/KevinZ0217/fast_dinov2 |
| Open Datasets | Yes | For faster experimentation, we primarily use ImageNet-100 [32], a subset of ImageNet-1K [27]. The training set consists of 100 randomly selected classes from ImageNet-1K, with the first 500 images from each class. Similarly, the validation set contains the corresponding 100 classes from the original validation set, with 50 images per class. This results in a total of 50,000 training images and 5,000 validation images. For robustness evaluation, we use ImageNet-100-C, derived from ImageNet-C [17], which benchmarks model resilience to common corruptions. We maintain the exact image selection from the ImageNet-100 validation set across all corruption levels and types. Additionally, we employ ADE20K for semantic segmentation tasks. Finally, we scale our approach to full ImageNet-1K and evaluate robustness on ImageNet-C. |
| Dataset Splits | Yes | For faster experimentation, we primarily use ImageNet-100 [32], a subset of ImageNet-1K [27]. The training set consists of 100 randomly selected classes from ImageNet-1K, with the first 500 images from each class. Similarly, the validation set contains the corresponding 100 classes from the original validation set, with 50 images per class. This results in a total of 50,000 training images and 5,000 validation images. For robustness evaluation, we use ImageNet-100-C, derived from ImageNet-C [17], which benchmarks model resilience to common corruptions. We maintain the exact image selection from the ImageNet-100 validation set across all corruption levels and types. |
| Hardware Specification | Yes | All training runs are distributed across 4 NVIDIA L40S GPUs, while evaluations use either NVIDIA A6000 or NVIDIA A5500 GPUs. |
| Software Dependencies | No | The paper mentions using 'AdamW optimizer' and 'ViT-S/16 as the DINOv2 backbone' and 'SimCLR model with ResNet backbone'. While these are specific software components/models, it does not provide explicit version numbers for core software like Python, PyTorch, or CUDA, which are necessary for full reproducibility of the software environment. |
| Experiment Setup | Yes | For ImageNet-100 experiments, we use ViT-S/16 as the DINOv2 backbone with a total batch size of 40, distributed across 4 GPUs (10 per GPU). ... We train baseline models for 500 epochs and training curriculum experiments for 200 epochs, ensuring the baseline converges to optimal performance. All ImageNet-100 experiments use a fixed epoch length of 1,250 iterations. For ImageNet-1K experiments, we employ ViT-B/16 with a total batch size of 512 (128 per GPU), with epoch length of 2,500 iterations. The baseline and FastDINOv2 are trained for 250 and 200 epochs on ImageNet-1K, respectively. Following the official DINOv2 implementation, we use AdamW optimizer with square root learning rate scaling based on batch size, yielding a base learning rate of 7.9 × 10⁻⁴. For linear probing evaluation, we use a batch size of 128 with 12.5k total iterations. |
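
The figures quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code. The reference learning rate of 0.004 at a reference batch size of 1024 is an assumption drawn from the public DINOv2 default configuration, not from the table above; the table's ImageNet-100 setup (batch size 40) reproduces the quoted 7.9 × 10⁻⁴ under that assumption. All function and variable names are hypothetical.

```python
import math

# Figures quoted in the table rows above.
IMAGENET100 = {"classes": 100, "train_per_class": 500, "val_per_class": 50}
SETUPS = {
    "ImageNet-100": {"gpus": 4, "per_gpu_batch": 10, "epoch_len": 1250,
                     "epochs": {"baseline": 500, "curriculum": 200}},
    "ImageNet-1K": {"gpus": 4, "per_gpu_batch": 128, "epoch_len": 2500,
                    "epochs": {"baseline": 250, "FastDINOv2": 200}},
}

# ASSUMPTION: reference values from the public DINOv2 default config,
# not stated in the table above.
REFERENCE_LR = 4e-3
REFERENCE_BATCH = 1024


def split_sizes(spec):
    """Return (train_images, val_images) for a class-balanced subset."""
    return (spec["classes"] * spec["train_per_class"],
            spec["classes"] * spec["val_per_class"])


def total_batch(setup):
    """Total batch size across all GPUs."""
    return setup["gpus"] * setup["per_gpu_batch"]


def total_iterations(setup, run):
    """Optimizer steps for a run, given a fixed epoch length in iterations."""
    return setup["epochs"][run] * setup["epoch_len"]


def sqrt_scaled_lr(batch, ref_lr=REFERENCE_LR, ref_batch=REFERENCE_BATCH):
    """Square-root learning-rate scaling with respect to a reference batch size."""
    return ref_lr * math.sqrt(batch / ref_batch)


if __name__ == "__main__":
    print("ImageNet-100 split sizes:", split_sizes(IMAGENET100))
    for name, setup in SETUPS.items():
        batch = total_batch(setup)
        print(f"{name}: total batch {batch}, scaled lr {sqrt_scaled_lr(batch):.2e}")
        for run in setup["epochs"]:
            print(f"  {run}: {total_iterations(setup, run):,} iterations")
```

Because the epoch length is fixed in iterations rather than derived from the dataset size, the total step count depends only on the number of epochs and the epoch length, which is why the 200-epoch curriculum runs take fewer steps than the 500-epoch and 250-epoch baselines.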