Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Authors: Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. [...] 5 Experimental Studies |
| Researcher Affiliation | Academia | AImageLab, University of Modena and Reggio Emilia EMAIL |
| Pseudocode | Yes | The pseudocode of DitHub is provided in Algorithm 1. [...] Algorithm 1: DitHub Training |
| Open Source Code | Yes | The code is provided in the supplementary material, accompanied by a README.md file that details the functioning of the codebase and enables the reproduction of the experimental results. |
| Open Datasets | Yes | Following [4], we evaluate DitHub on the ODinW-13 [20] benchmark, where it outperforms the state-of-the-art by a substantial +4.21 mAP in the Incremental Vision-Language Object Detection setting. We also introduce ODinW-O (Overlapped), a curated subset of ODinW-35 [19] focusing on classes re-appearing across multiple tasks. [...] Following the convention established by [4], we use the term ZCOCO to denote the zero-shot evaluation conducted on the MS COCO [23] validation set comprising 5000 samples and consisting of annotations for 80 common object categories. [...] For this purpose, we selected AODRaw [22], a challenging and recently introduced benchmark. |
| Dataset Splits | Yes | Following the convention established by [4], we use the term ZCOCO to denote the zero-shot evaluation conducted on the MS COCO validation set comprising 5000 samples and consisting of annotations for 80 common object categories. [...] We consider a sequence of tasks {1, . . . , T}, where each task t corresponds to a dataset Dt. |
| Hardware Specification | Yes | All experiments for DitHub were conducted on a single RTX A5000 GPU with 24 GB of VRAM. |
| Software Dependencies | No | Our PyTorch implementation efficiently utilized around 10 GB of VRAM during training. We employed Grounding DINO Tiny as an Open-Vocabulary object detector, which was pre-trained on the O365 [45], GoldG [16], and Cap4M [20] datasets. |
| Experiment Setup | Yes | For LoRA fine-tuning, we applied low-rank updates to the encoder layers with a rank of 16. The optimization was performed using AdamW [27], with a learning rate of 1 × 10⁻³ and a weight decay of 1 × 10⁻². We exclusively relied on Grounding DINO's contrastive classification loss and localization loss, without introducing any additional loss functions. [...] The merging coefficient λ for the A matrix is set to 0.3 for ODinW-13 and 0.1 for ODinW-O; for the matrix B, we set λ = 0.7. [...] During training, we allocate an equal number of epochs to the warmup and the subsequent specialization phases (see Section E for further details). [...] Each training run, with a batch size of 2, required approximately 8 hours. |
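
The values quoted in the Experiment Setup row can be gathered into a single configuration object, which may help when trying to reproduce the runs. The following is a minimal sketch, not the authors' code: the class name `DitHubSetup`, its field names, and the helpers `setup_for` and `split_epochs` are hypothetical. The learning rate and weight decay are read here as 1e-3 and 1e-2, which is an assumption about the extracted text, and the total number of epochs is not stated in the row, so it is left as a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DitHubSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    Class and field names are hypothetical; only the values come from the row.
    """
    lora_rank: int = 16           # rank of the LoRA updates on the encoder layers
    learning_rate: float = 1e-3   # AdamW learning rate (assumed reading of the row)
    weight_decay: float = 1e-2    # AdamW weight decay (assumed reading of the row)
    batch_size: int = 2
    lambda_a: float = 0.3         # merging coefficient for the A matrix
    lambda_b: float = 0.7         # merging coefficient for the B matrix


# The row reports a benchmark-dependent coefficient for the A matrix only.
_LAMBDA_A_BY_BENCHMARK = {"ODinW-13": 0.3, "ODinW-O": 0.1}


def setup_for(benchmark: str) -> DitHubSetup:
    """Return the quoted setup for a benchmark named in the row."""
    if benchmark not in _LAMBDA_A_BY_BENCHMARK:
        raise ValueError(f"no quoted setup for benchmark {benchmark!r}")
    return DitHubSetup(lambda_a=_LAMBDA_A_BY_BENCHMARK[benchmark])


def split_epochs(total_epochs: int) -> tuple[int, int]:
    """Split epochs equally between warmup and specialization phases."""
    if total_epochs <= 0 or total_epochs % 2 != 0:
        raise ValueError("total_epochs must be a positive even number")
    half = total_epochs // 2
    return half, half
```

For example, `setup_for("ODinW-O")` yields a setup whose `lambda_a` is 0.1, while `lambda_b` stays at 0.7 for both benchmarks. How the coefficients enter the merge itself is described in the paper and is deliberately not reproduced here.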