Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection

Authors: Jieren Deng, Haojian Zhang, Kun Ding, Jianhua Hu, Xingxuan Zhang, Yunkuan Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on COCO and ODinW-13 datasets demonstrate that ZiRa effectively safeguards the zero-shot generalization ability of VLODMs while continuously adapting to new tasks.
Researcher Affiliation | Academia | Jieren Deng1,2, Haojian Zhang1, Kun Ding1, Jianhua Hu1, Xingxuan Zhang3, and Yunkuan Wang1. 1Institute of Automation, Chinese Academy of Sciences (CAS), {dengjieren2019, jianhua.hu, zhanghaojian2014, yunkuan.wang}@ia.ac.cn; 2School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); 3Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, zhangxingxuan@sjtu.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the method and architecture using text and diagrams, but provides no formal pseudocode.
Open Source Code | Yes | Our code is available at https://github.com/JarintotionDin/ZiRaGroundingDINO.
Open Datasets | Yes | Datasets. We conduct our experiments on the COCO [21] dataset and the Object Detection in the Wild (ODinW) [18] benchmark. ODinW is a more challenging benchmark designed to test model performance under real-world scenarios. It comprises numerous sub-datasets from various domains for evaluation, such as Thermal (to detect objects in heat-map images) and Aquarium (to detect different marine animals). Following GLIP [19], we use the ODinW-13 datasets, labeled as Ae (Aerial Maritime Drone), Aq (Aquarium), Co (Cottontail Rabbits), Eg (Egohands), Mu (Mushrooms), Pa (Packages), Pv (Pascal VOC), Pi (Pistols), Po (Pothole), Ra (Raccoon), Sh (Shellfish), Th (Thermal Dogs and People), Ve (Vehicles). The 13 sub-datasets of ODinW-13 are trained sequentially, one by one, and are tested after all sub-datasets have been trained. (A sketch of this sequential protocol appears after the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, and testing within each dataset used. It states that the ODinW-13 sub-datasets are trained sequentially and tested after all sub-datasets have been trained, but not the internal splits.
Hardware Specification | Yes | Our proposed method is implemented with PyTorch and trained on two Nvidia RTX 3090 GPUs.
Software Dependencies | No | The paper mentions 'implemented with PyTorch' and 'AdamW is used as the optimizer,' but it does not specify version numbers for PyTorch or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | Each downstream task is trained for a total of two epochs with a batch size of 2. For Grounding DINO, we employ an initial learning rate of 10^-3, which decays to 0.1 times the original value after the first epoch to ensure effective convergence. For OV-DINO, we employ an initial learning rate of 10^-4, which also decays to 0.1 times the original value after the first epoch. AdamW is used as the optimizer, and the weight decay is 10^-4. (An optimizer and schedule sketch based on these values follows below.)
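
The sequential protocol quoted in the Open Datasets row (train the 13 ODinW sub-datasets one by one, then evaluate only after the whole sequence) can be summarized as a short driver loop. The sketch below is illustrative only: the load_subset, finetune, and evaluate callables are hypothetical placeholders (a possible finetune routine is sketched in the next block), not functions from the released ZiRaGroundingDINO repository.

```python
# Minimal sketch of the sequential ODinW-13 protocol described in the Open Datasets row:
# the 13 sub-datasets are trained one by one, and testing happens only after the whole
# sequence. The callables passed in (load_subset, finetune, evaluate) are hypothetical
# placeholders, not part of the released ZiRa code.

ODINW13 = ["Ae", "Aq", "Co", "Eg", "Mu", "Pa", "Pv", "Pi", "Po", "Ra", "Sh", "Th", "Ve"]

def run_sequential_protocol(model, load_subset, finetune, evaluate):
    # Incremental phase: each task is seen once; earlier tasks are never revisited.
    for name in ODINW13:
        finetune(model, load_subset(name, split="train"))

    # Evaluation phase: test on every sub-dataset only after all 13 have been learned.
    return {name: evaluate(model, load_subset(name, split="test")) for name in ODINW13}
```

Reporting per-task scores only at the end of the sequence is what exposes forgetting: a sub-dataset learned early must still be detectable after twelve further adaptation steps.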
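
The Experiment Setup row pins down the per-task optimization: two epochs, batch size 2, AdamW with weight decay 10^-4, and a learning rate that drops to 0.1 times its initial value after the first epoch (base 10^-3 for Grounding DINO, 10^-4 for OV-DINO). Below is a minimal PyTorch sketch of such a per-task fine-tuning routine; the assumption that calling the model on a batch returns a scalar detection loss, and the base_lr default of 1e-3 (the Grounding DINO setting), are illustrative choices rather than code from the paper.

```python
import torch
from torch.utils.data import DataLoader

def finetune(model, train_set, base_lr=1e-3, weight_decay=1e-4):
    """Hypothetical per-task fine-tuning loop matching the quoted hyperparameters."""
    loader = DataLoader(train_set, batch_size=2, shuffle=True)
    # AdamW optimizer with weight decay 1e-4, as stated in the paper.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    # Decay the learning rate to 0.1x its initial value after the first epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1], gamma=0.1)
    for epoch in range(2):                 # two epochs per downstream task
        for batch in loader:
            loss = model(batch)            # assumed to return a scalar detection loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```

For OV-DINO the same routine would be called with base_lr=1e-4; everything else in the quoted setup is unchanged.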