Open-Vocabulary Object Detection via Language Hierarchy

Authors: Jiaxing Huang, Jingyi Zhang, Kai Jiang, Shijian Lu

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets. |
| Researcher Affiliation | Academia | Jiaxing Huang, Jingyi Zhang, Kai Jiang, Shijian Lu, College of Computing and Data Science, Nanyang Technological University, Singapore |
| Pseudocode | No | The paper includes diagrams and algorithmic descriptions in the text but does not present a formal pseudocode block or algorithm box. |
| Open Source Code | No | The paper contains no explicit statement about releasing the source code and no link to a code repository for the described methodology. |
| Open Datasets | Yes | LVIS [41] is a large-vocabulary dataset designed for long-tailed instance segmentation, containing 100K images and 1,203 categories. LVIS provides high-quality instance-wise annotations, including instance masks, class labels, and bounding boxes. ImageNet-21K [16] is a large and diverse dataset with over 14M images across more than 21K categories. |
| Dataset Splits | No | The paper mentions training, validation, and test images for various datasets (e.g., Objects365, Pascal VOC, Cityscapes) but does not provide explicit split proportions or counts uniformly across all experiments, nor a general rule for reproducibility. |
| Hardware Specification | No | The paper mentions specific backbones such as Swin-B, ConvNeXt-T, ResNet-50, and ResNet-18 and reports "Runtime (ms)" in efficiency comparisons, but it does not provide hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory. |
| Software Dependencies | No | The paper mentions using "SGD [87] as the optimizer", "CenterNet2 [54]", and "CLIP text embeddings [22]", but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | "We employ SGD [87] as the optimizer and adopt the cosine learning rate scheduler with a warm-up of 1000 iterations [15]. We set the input sizes of box-level annotated images (i.e., LVIS) and image-level annotated images (i.e., ImageNet-21K) as 896×896 and 448×448, respectively... During training, we sample box-level and image-level mini-batches in a 1:16 ratio. We set the confidence threshold t (in pseudo box label generation in Eq. 3) as 0.75 in all experiments except in parameter analysis." |
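The two concrete hyperparameters in the quoted setup — the cosine schedule with a 1000-iteration linear warm-up and the t = 0.75 confidence threshold for pseudo box labels — can be sketched as follows. This is a minimal illustration, not the authors' code: `base_lr` and `total_iters` are placeholder assumptions (the paper does not state them here), and `filter_pseudo_boxes` only mirrors the thresholding step of Eq. 3, not the full pseudo-label generation.

```python
import math

def lr_at(step, base_lr=0.01, warmup_iters=1000, total_iters=90000):
    """Learning rate at a given iteration: linear warm-up for the first
    `warmup_iters` steps, then cosine decay toward zero.
    base_lr and total_iters are illustrative placeholders."""
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters  # linear warm-up
    # cosine decay over the remaining iterations
    progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

def filter_pseudo_boxes(scored_boxes, t=0.75):
    """Keep predicted boxes whose confidence reaches threshold t,
    mirroring the thresholding in pseudo box label generation (Eq. 3)."""
    return [box for box, score in scored_boxes if score >= t]
```

With these defaults, the learning rate ramps linearly to `base_lr` by iteration 1000 and then follows a half-cosine to zero; boxes scoring below 0.75 are discarded when forming pseudo labels.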