Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Authors: Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alexander J. Smola, Xu Sun

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Experimental results in Figure 1 show that POMP outperforms previous state-of-the-art (SOTA) models on a broad range of visual recognition tasks and datasets.
Researcher Affiliation | Collaboration | Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Amazon Web Services
Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Our code is available at https://github.com/amazon-science/prompt-pretraining.
Open Datasets | Yes | We conduct prompt pre-training on the ImageNet-21K dataset (official winter 2021 released version). We also evaluate on CIFAR10, FGVC Aircraft, Stanford Cars, SUN397, ImageNet-1K, Oxford-Pets, Oxford Flowers 102, Food-101, EuroSAT, DTD, UCF-101, COCO-Stuff, Pascal VOC, ADE20K, PASCAL Context, LVIS, COCO, and Objects365.
Dataset Splits | Yes | We follow the processing methods in [47], which involve cleaning invalid classes, allocating 50 images per class for a validation split, and crop-resizing all the images to 224×224 resolution. (See the split-and-preprocessing sketch after this table.)
Hardware Specification | Yes | We conduct all the experiments on 8 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions software like CLIP, MaskFormer, and CenterNet2, but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | The number of training samples for each class is 16 (16 shots), and the prompt length is 16. We sample 1,000 classes at each training step, i.e., K = 1000 in Eq. (4). We use the SGD optimizer with an initial learning rate of 0.002, decayed by the cosine annealing rule. The batch size is 32, and the maximum epoch is 20. (See the training-configuration sketch after this table.)
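
For the Dataset Splits row: the reported processing allocates 50 images per class to a validation split and crop-resizes everything to 224×224. Below is a minimal Python sketch of that kind of split, assuming a directory-per-class ImageNet-21K layout; the function name, the "invalid class" filter, and the torchvision preprocessing pipeline are illustrative assumptions, not the authors' released code.

    import os
    import random

    from torchvision import transforms

    def build_splits(root, val_per_class=50):
        """Allocate val_per_class images per class for validation; the rest
        form the training split. Classes too small to split are dropped, a
        stand-in for the paper's "cleaning invalid classes" step."""
        train, val = {}, {}
        for cls in sorted(os.listdir(root)):              # one directory per class
            images = sorted(os.listdir(os.path.join(root, cls)))
            if len(images) <= val_per_class:              # assumed validity criterion
                continue
            random.shuffle(images)
            val[cls] = images[:val_per_class]
            train[cls] = images[val_per_class:]
        return train, val

    # Crop-resize every image to 224x224, matching the reported resolution.
    preprocess = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])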
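
For the Experiment Setup row: the reported configuration (prompt length 16, K = 1000 sampled classes per step, SGD at 0.002 with cosine annealing, batch size 32, 20 epochs) can be sketched in PyTorch as below. The text encoder and data loader are replaced by placeholders, the temperature is assumed, and POMP's local-contrast correction term is omitted, so this is a schematic of the sampling-based training loop, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    NUM_CLASSES, K = 21000, 1000     # ~ImageNet-21K class count; classes sampled per step
    PROMPT_LEN, DIM = 16, 512        # prompt length 16; CLIP-like feature dim (assumed)

    # The shared prompt context is the only trainable parameter.
    prompt_ctx = torch.nn.Parameter(0.02 * torch.randn(PROMPT_LEN, DIM))

    optimizer = torch.optim.SGD([prompt_ctx], lr=0.002)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

    def class_embeddings(class_ids):
        """Placeholder for the frozen CLIP text encoder applied to
        [prompt_ctx; class-name tokens] for each sampled class."""
        name_emb = torch.randn(len(class_ids), DIM)       # stand-in for token features
        return F.normalize(name_emb + prompt_ctx.mean(0), dim=-1)

    # Toy stand-in for the 16-shot ImageNet-21K loader (batch size 32).
    loader = [(F.normalize(torch.randn(32, DIM), dim=-1),
               torch.randint(NUM_CLASSES, (32,))) for _ in range(4)]

    for epoch in range(20):                               # maximum epoch is 20
        for image_feats, labels in loader:
            # Sample ~K candidate classes per step, keeping the batch's
            # ground-truth labels inside the candidate set.
            neg = torch.randperm(NUM_CLASSES)[:K]
            cand = torch.unique(torch.cat([labels, neg]))            # size close to K
            target = (cand.unsqueeze(0) == labels.unsqueeze(1)).float().argmax(dim=1)
            logits = image_feats @ class_embeddings(cand).t() / 0.01 # temperature assumed
            loss = F.cross_entropy(logits, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

Sampling roughly K of the ~20K classes per step is what keeps the softmax over class embeddings tractable on the 8 V100 GPUs reported above; the full-vocabulary softmax would not fit the same memory budget.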