Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

Authors: Jesimon Barreto, Carlos Caetano, Andre Araujo, William Schwartz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods.
Researcher Affiliation | Collaboration | Jesimon Barreto (1), Carlos Caetano (2), André Araujo (3), William Robson Schwartz (1). (1) Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG); (2) Recod.ai, Instituto de Computação, Universidade Estadual de Campinas (UNICAMP); (3) Google DeepMind
Pseudocode | No | The paper describes the methodology in Sections 3 and 3.2, and presents a diagram of the training pipeline in Figure 2, but does not include a structured pseudocode or algorithm block.
Open Source Code | Yes | Code is publicly available at https://github.com/jesimonbarreto/VESSA.
Open Datasets | Yes | MVImageNet [40] and CO3D [41] are large-scale video datasets offering multi-view images.
Dataset Splits | Yes | To adapt them for classification, we designed a protocol that splits each class into training and testing sets (75%-25%), selecting one frame per video and using k-Nearest Neighbors (KNN) to evaluate the quality of learned embeddings across different views and instances.
Hardware Specification | Yes | Our experiments were performed on TPU v3-8, featuring 8 cores and 128 GB of high-bandwidth memory.
Software Dependencies | No | The paper mentions using the 'scenic library [42] in JAX' but does not specify version numbers for either.
Experiment Setup | Yes | Our training followed the base hyperparameter configuration of the DINO protocol [3], except for the specific settings detailed below. As a reference, we adopted 10 training epochs for both the initial projection head adaptation and the subsequent full model training, using a batch size of 256 and an input image resolution of 224 × 224. For each video, we sampled 3 frame pairs. The hyperparameter γ, which controls the weight of the distillation loss, was set to 1.
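The Dataset Splits entry describes a per-class 75%-25% split with one frame per video and a KNN evaluation of the embeddings. The sketch below illustrates that kind of protocol in plain Python; it is not taken from the VESSA repository. The function names, the cosine similarity measure, the value of k, and the random seed are all assumptions made for illustration, and each sample is assumed to be the embedding of a single frame drawn from one video.

```python
import math
import random
from collections import Counter, defaultdict


def split_per_class(samples, train_frac=0.75, seed=0):
    """Split (embedding, label) pairs into train/test sets, class by class.

    Each sample is assumed to be one frame taken from one video.
    """
    by_class = defaultdict(list)
    for emb, label in samples:
        by_class[label].append((emb, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        cut = int(round(train_frac * len(items)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, query, k=5):
    """Majority vote over the k training embeddings most similar to query."""
    neighbours = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def knn_accuracy(train, test, k=5):
    """Fraction of test embeddings whose KNN prediction matches the label."""
    correct = sum(knn_predict(train, emb, k) == label for emb, label in test)
    return correct / len(test)
```

In practice the embeddings would come from the adapted foundation model, and a vectorized implementation (for example in JAX, which the paper reports using) would replace the pure-Python loops.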