Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection

Authors: Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. [...] 5 Experimental Studies
Researcher Affiliation | Academia | AImageLab, University of Modena and Reggio Emilia EMAIL
Pseudocode | Yes | The pseudocode of DitHub is provided in Algorithm 1. [...] Algorithm 1: DitHub Training
Open Source Code | Yes | The code is provided in the supplementary material, accompanied by a README.md file that details the functioning of the codebase and enables the reproduction of the experimental results.
Open Datasets | Yes | Following [4], we evaluate DitHub on the ODinW-13 [20] benchmark, where it outperforms the state-of-the-art by a substantial +4.21 mAP in the Incremental Vision-Language Object Detection setting. We also introduce ODinW-O (Overlapped), a curated subset of ODinW-35 [19] focusing on classes re-appearing across multiple tasks. [...] Following the convention established by [4], we use the term ZCOCO to denote the zero-shot evaluation conducted on the MS COCO [23] validation set comprising 5000 samples and consisting of annotations for 80 common object categories. [...] For this purpose, we selected AODRaw [22], a challenging and recently introduced benchmark.
Dataset Splits | Yes | Following the convention established by [4], we use the term ZCOCO to denote the zero-shot evaluation conducted on the MS COCO validation set comprising 5000 samples and consisting of annotations for 80 common object categories. [...] We consider a sequence of tasks {1, ..., T}, where each task t corresponds to a dataset D_t.
Hardware Specification | Yes | All experiments for DitHub were conducted on a single RTX A5000 GPU with 24 GB of VRAM.
Software Dependencies | No | Our PyTorch implementation efficiently utilized around 10 GB of VRAM during training. We employed Grounding DINO Tiny as an Open-Vocabulary object detector, which was pre-trained on the O365 [45], GoldG [16], and Cap4M [20] datasets.
Experiment Setup | Yes | For LoRA fine-tuning, we applied low-rank updates to the encoder layers with a rank of 16. The optimization was performed using AdamW [27], with a learning rate of 1 × 10⁻³ and a weight decay of 1 × 10⁻². We exclusively relied on Grounding DINO's contrastive classification loss and localization loss, without introducing any additional loss functions. [...] The merging coefficient λ for the A matrix is set to 0.3 for ODinW-13 and 0.1 for ODinW-O; for the matrix B, we set λ = 0.7. [...] During training, we allocate an equal number of epochs to the warmup and the subsequent specialization phases (see Section E for further details). [...] Each training run, with a batch size of 2, required approximately 8 hours.
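
The hyperparameters quoted in the Experiment Setup row can be collected in one place, which may help when comparing them against the released code. The following is a minimal Python sketch, not the authors' implementation: the class name, field names, and both helper functions are hypothetical. In particular, the quoted text does not define how the merging coefficient λ is applied, so `interpolate` assumes a simple convex combination purely for illustration. `lora_extra_params` uses the standard count for a rank-r LoRA update on a single weight matrix; the layer dimensions passed to it are placeholders, since the excerpt does not state them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (field names are illustrative)."""
    lora_rank: int = 16
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3
    weight_decay: float = 1e-2
    batch_size: int = 2
    # Merging coefficient for the B matrix.
    lambda_b: float = 0.7
    # Merging coefficient for the A matrix, keyed by benchmark.
    lambda_a: Dict[str, float] = field(
        default_factory=lambda: {"ODinW-13": 0.3, "ODinW-O": 0.1}
    )


def interpolate(old: List[float], new: List[float], lam: float) -> List[float]:
    """ASSUMPTION: lam weights the new values in a convex combination.

    The quoted text does not specify the merge rule; this is only one
    plausible reading, used here for illustration.
    """
    if len(old) != len(new):
        raise ValueError("old and new must have the same length")
    return [(1.0 - lam) * o + lam * n for o, n in zip(old, new)]


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one rank-`rank` LoRA pair (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)
```

For example, `ReportedSetup().lambda_a["ODinW-O"]` looks up the A-matrix coefficient for the ODinW-O benchmark, and `lora_extra_params(d_in, d_out, 16)` gives the per-layer parameter overhead once the actual layer sizes are known.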