Could Giant Pre-trained Image Models Extract Universal Representations?
Authors: Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, Yue Cao
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. (A minimal frozen-backbone sketch follows the table.) |
| Researcher Affiliation | Collaboration | ¹Xi'an Jiaotong University ²University of Science and Technology of China ³Microsoft Research Asia |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We use public datasets and our code will be released. |
| Open Datasets | Yes | For evaluation on this task, we adopt COCO 2017 [34], the most widely-used benchmark for object detection and instance segmentation, which contains 118K training, 5K validation and 20K test-dev images. |
| Dataset Splits | Yes | For evaluation on this task, we adopt COCO 2017 [34], the most widely-used benchmark for object detection and instance segmentation, which contains 118K training, 5K validation and 20K test-dev images. |
| Hardware Specification | No | The paper's checklist entry points to 'See Appendix' for hardware details, but the main body of the paper does not contain specific hardware specifications such as GPU models or processor types. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'). |
| Experiment Setup | Yes | For fair comparison, the same base network capacity is maintained across the pretraining tasks, such as using Swin Transformer with 224×224 input and a window size of 7. ... We deal with this issue by adopting a multi-scale augmentation similar to that of image classification. It randomly resizes the original image, then randomly crops a square part of the resized image [15]. ... For object detection, we adopt the framework of Mask R-CNN [20] with FPN [33] as the head network. ... For semantic segmentation, we adopt Mask2Former [9] with a one-block pixel decoder and a four-block transformer decoder as the head network. (Hedged sketches of the augmentation and the detection-head setup follow the table.) |
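
The frozen setting quoted under Research Type — one shared, frozen base network with lightweight trainable heads per task — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the `timm` model name and the toy linear head are assumptions for the example.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; any source of pretrained backbones works

# Load a pretrained Swin backbone as a feature extractor and freeze it.
backbone = timm.create_model("swin_tiny_patch4_window7_224",
                             pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # the shared base network stays frozen

# Only the lightweight task-specific head receives gradient updates.
head = nn.Linear(backbone.num_features, 80)  # hypothetical 80-way head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

x = torch.randn(2, 3, 224, 224)  # dummy batch at the paper's 224×224 input
with torch.no_grad():            # no gradients flow into the backbone
    feats = backbone(x)
logits = head(feats)
```

In the paper the heads are full detection/segmentation networks (Mask R-CNN, Mask2Former); the linear head here only illustrates where the gradient-flow boundary sits in the frozen setting.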
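
The multi-scale augmentation quoted under Experiment Setup — randomly resize the image, then randomly crop a square region — can be approximated with torchvision. The crop size and scale range below are assumptions; the excerpt does not specify them.

```python
import random
from PIL import Image
from torchvision import transforms as T
from torchvision.transforms import functional as F

def multiscale_square_crop(img: Image.Image, crop: int = 224,
                           scales: tuple = (0.5, 2.0)) -> Image.Image:
    """Randomly resize, then randomly crop a square region.

    `crop` and `scales` are illustrative assumptions; the paper only
    states resize-then-square-crop, following [15].
    """
    s = random.uniform(*scales)
    w, h = img.size
    # Resize by the sampled scale, never below the crop size.
    img = F.resize(img, (max(crop, round(h * s)), max(crop, round(w * s))))
    # Take a random square crop from the resized image.
    top, left, ch, cw = T.RandomCrop.get_params(img, (crop, crop))
    return F.crop(img, top, left, ch, cw)
```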
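
For the detection head named in the same row (Mask R-CNN with FPN), torchvision ships a ready-made variant; its ResNet-50 backbone stands in for the paper's Swin, which is an assumption of this sketch. Freezing the backbone body mirrors the frozen-base-network setting.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN with an FPN neck; ResNet-50 stands in for the paper's Swin.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")

# Freeze the backbone body so only FPN and head parameters train,
# mirroring the frozen shared base network studied in the paper.
for p in model.backbone.body.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```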