Open-Vocabulary Object Detection via Language Hierarchy
Authors: Jiaxing Huang, Jingyi Zhang, Kai Jiang, Shijian Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets. |
| Researcher Affiliation | Academia | Jiaxing Huang, Jingyi Zhang, Kai Jiang, Shijian Lu College of Computing and Data Science Nanyang Technological University, Singapore |
| Pseudocode | No | The paper includes diagrams and algorithmic descriptions within the text but does not present a formal pseudocode block or algorithm box. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | LVIS [41] is a large-vocabulary dataset designed for long-tailed instance segmentation, containing 100K images and 1203 categories. LVIS provides high-quality instance-wise annotations, including instance masks, class labels, and bounding boxes. ImageNet-21K [16] is a large and diverse dataset of over 14M images across more than 21K categories. |
| Dataset Splits | No | The paper mentions training, validation, and test images for various datasets (e.g., Object365, Pascal VOC, Cityscapes) but does not provide explicit proportions or counts for training/validation/test splits uniformly across all experiments or as a general rule for reproducibility. |
| Hardware Specification | No | The paper mentions using specific backbones like "Swin-B", "ConvNeXt-T", "ResNet-50", "ResNet-18" and refers to "Runtime (ms)" in efficiency comparisons, but it does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory. |
| Software Dependencies | No | The paper mentions using "SGD [87] as the optimizer", "CenterNet2 [54]", and "CLIP text embeddings [22]" but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | We employ SGD [87] as the optimizer and adopt the cosine learning rate scheduler with a warm-up of 1000 iterations [15]. We set the input sizes of box-level annotated images (i.e., LVIS) and image-level annotated images (i.e., ImageNet-21K) as 896 × 896 and 448 × 448, respectively... During training, we sample box-level and image-level mini-batches in a 1 : 16 ratio. We set the confidence threshold t (in pseudo box label generation in Eq. 3) as 0.75 in all experiments except in parameter analysis. |
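The quoted setup describes two reproducible pieces: a cosine learning rate schedule with a 1000-iteration linear warm-up, and pseudo box label generation gated by a confidence threshold t = 0.75. A minimal Python sketch of both is below; note that `base_lr` and `max_iters` are illustrative assumptions not stated in the excerpt, while the warm-up length and threshold come from the paper's setup.

```python
import math

def lr_at_iter(it, base_lr=0.02, warmup_iters=1000, max_iters=90000):
    """Cosine LR schedule with linear warm-up, per the quoted setup.

    base_lr and max_iters are illustrative placeholders; the paper
    specifies only the 1000-iteration warm-up and the cosine decay.
    """
    if it < warmup_iters:
        # Linear warm-up from ~0 to base_lr over the first 1000 iterations.
        return base_lr * (it + 1) / warmup_iters
    # Cosine decay from base_lr down to 0 over the remaining iterations.
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def filter_pseudo_boxes(boxes, scores, t=0.75):
    """Keep predicted boxes whose confidence meets the threshold t = 0.75
    used in the paper's pseudo box label generation (Eq. 3)."""
    return [b for b, s in zip(boxes, scores) if s >= t]
```

For example, `lr_at_iter(999)` returns the full `base_lr` at the end of warm-up, and `filter_pseudo_boxes` with scores `[0.9, 0.5]` keeps only the first box.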