Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
Authors: Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alexander J. Smola, Xu Sun
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Experimental results in Figure 1 show that POMP outperforms previous state-of-the-art (SOTA) models on a broad range of visual recognition tasks and datasets. |
| Researcher Affiliation | Collaboration | Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Amazon Web Services |
| Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/amazon-science/prompt-pretraining. |
| Open Datasets | Yes | We conduct prompt pre-training on the ImageNet-21K dataset (official winter 2021 released version). We also evaluate on CIFAR10, FGVC Aircraft, Stanford Cars, SUN397, ImageNet-1K, Oxford-Pets, Oxford Flowers 102, Food-101, EuroSAT, DTD, UCF-101, COCO-Stuff, Pascal VOC, ADE20K, PASCAL Context, LVIS, COCO, and Objects365. |
| Dataset Splits | Yes | We follow the processing methods in [47], which involves cleaning invalid classes, allocating 50 images per class for a validation split, and crop-resizing all the images to 224 resolution. |
| Hardware Specification | Yes | We conduct all the experiments on 8 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions software like CLIP, MaskFormer, and CenterNet2, but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | The number of training samples for each class is 16 (16 shots), and the prompt length is 16. We sample 1,000 classes at each training step, i.e., K = 1000 in (4). We use the SGD optimizer with an initial learning rate of 0.002, decayed by the cosine annealing rule. The batch size is 32, and the maximum epoch is 20. |
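
For orientation, the reported training configuration translates into a fairly standard PyTorch setup. The sketch below is a minimal, non-authoritative illustration assuming a 512-dimensional CLIP token-embedding space and roughly 20,000 ImageNet-21K classes; the `sample_classes` helper and the preprocessing pipeline are hypothetical stand-ins for the paper's class sampling in Eq. (4) and the 224-resolution crop-resizing described in the dataset-splits row, not the authors' exact implementation.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Preprocessing: crop-resize every image to 224x224, as in the dataset-splits row
# (the exact crop strategy and interpolation are assumptions).
preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Learnable shared prompt: 16 context tokens of (assumed) dimension 512.
prompt_ctx = torch.nn.Parameter(0.02 * torch.randn(16, 512))

# Optimizer and schedule as reported: SGD, initial LR 0.002, cosine annealing,
# batch size 32, at most 20 epochs.
optimizer = SGD([prompt_ctx], lr=0.002)
max_epochs, batch_size = 20, 32
scheduler = CosineAnnealingLR(optimizer, T_max=max_epochs)

num_classes = 20_000  # rough count of ImageNet-21K classes after cleaning (approximate)
K = 1_000             # classes sampled per training step, as in the setup row


def sample_classes(labels: torch.Tensor) -> torch.Tensor:
    """Return K class indices: the batch's ground-truth classes plus random
    negatives, so the softmax is computed over K classes instead of all ~20K.
    This only sketches the class-sampling idea, not the paper's exact rule."""
    positives = labels.unique()
    candidates = torch.randperm(num_classes)
    negatives = candidates[~torch.isin(candidates, positives)]
    return torch.cat([positives, negatives[: K - positives.numel()]])
```

As a usage example, `sample_classes(torch.tensor([3, 17, 42]))` returns a 1,000-element tensor that keeps classes 3, 17, and 42 and fills the remainder with randomly drawn class indices; at that step, class-specific text features would only need to be built for these K classes rather than for the full label set.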