HASSOD: Hierarchical Adaptive Self-Supervised Object Detection

Authors: Shengcao Cao, Dhiraj Joshi, Liangyan Gui, Yu-Xiong Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection."
Researcher Affiliation | Collaboration | Shengcao Cao¹, Dhiraj Joshi², Liang-Yan Gui¹, Yu-Xiong Wang¹ (¹University of Illinois at Urbana-Champaign; ²IBM Research). Emails: ¹{cao44,lgui,yxw}@illinois.edu, ²djoshi@us.ibm.com
Pseudocode | No | The paper describes its procedures with descriptive text and flowcharts (e.g., Figure 2, Figure 3), but it contains no structured pseudocode or algorithm blocks labeled "Pseudocode" or "Algorithm".
Open Source Code | Yes | Project page: https://HASSOD-NeurIPS23.github.io
Open Datasets | Yes | "We train a Cascade Mask R-CNN [4] with a ResNet-50 [13] backbone on MS-COCO [20] images. The backbone is initialized from DINO [5] self-supervised pre-training... We mainly conduct our experiments in a zero-shot manner on the validation sets of three benchmark datasets, namely Objects365 [27], LVIS [11], and SA-1B [18]."
Dataset Splits | Yes | "We use both the train and unlabeled splits of MS-COCO, totaling about 0.24 million images. We mainly conduct our experiments in a zero-shot manner on the validation sets of three benchmark datasets, namely Objects365 [27], LVIS [11], and SA-1B [18]. As SA-1B does not provide a validation split, we utilize a random subset of 50,000 images for our assessment."
Hardware Specification | Yes | "The whole training process spans 40,000 iterations, taking about 20 hours on 4 NVIDIA A100 GPUs."
Software Dependencies | No | The paper states "Our code is developed based on PyTorch [24] and Detectron2 [40]", but it does not specify version numbers for PyTorch, Detectron2, or any other ancillary software components.
Experiment Setup | Yes | "The whole training process starts with a burn-in stage, during which the student model is only trained on the initial pseudo-labels with a fixed learning rate 0.01 and fixed loss weights. After the burn-in stage, the teacher model is introduced, and we gradually adjust the learning rate from 0.01 to 0, the loss weight in the label-to-student branch from 1.0 to 0.0, and the loss weight in the teacher-to-student branch from 2.0 to 3.0, all following a cosine schedule. The whole training process spans 40,000 iterations with a batch size of 16 images. We resize the resolution of each image to 480 × 480... The merging process stops at three thresholds θmerge,1 = 0.4, θmerge,2 = 0.2, θmerge,3 = 0.1... The coverage threshold is set to θcover = 90%."
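The cosine-scheduled hyperparameters described in the experiment setup can be sketched as below. This is a minimal illustration, not the authors' code: the burn-in length is not stated in this excerpt, so `BURN_IN` is a hypothetical placeholder, and the exact step accounting is an assumption.

```python
import math

def cosine_schedule(start, end, step, total_steps):
    """Cosine interpolation from `start` to `end` over `total_steps` steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * progress))

BURN_IN = 10_000   # hypothetical: burn-in length is not given in the excerpt
TOTAL = 40_000     # total training iterations, per the paper

def hyperparams(step):
    if step < BURN_IN:
        # Burn-in: fixed learning rate and loss weights; teacher not yet used.
        return dict(lr=0.01, w_label=1.0, w_teacher=0.0)
    t, T = step - BURN_IN, TOTAL - BURN_IN
    return dict(
        lr=cosine_schedule(0.01, 0.0, t, T),       # learning rate: 0.01 -> 0
        w_label=cosine_schedule(1.0, 0.0, t, T),   # label-to-student: 1.0 -> 0.0
        w_teacher=cosine_schedule(2.0, 3.0, t, T), # teacher-to-student: 2.0 -> 3.0
    )
```

At the final iteration the schedule reaches its stated endpoints (lr = 0, label weight = 0.0, teacher weight = 3.0), with halfway values falling at the cosine midpoint rather than the linear one.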