Could Giant Pre-trained Image Models Extract Universal Representations?
Authors: Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, Yue Cao
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. (A minimal frozen-backbone sketch follows the table.) |
| Researcher Affiliation | Collaboration | ¹Xi'an Jiaotong University ²University of Science and Technology of China ³Microsoft Research Asia |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We use public datasets and our code will be released. |
| Open Datasets | Yes | For evaluation on this task, we adopt COCO 2017 [34], the most widely-used benchmark for object detection and instance segmentation, which contains 118K training, 5K validation and 20K test-dev images. |
| Dataset Splits | Yes | For evaluation on this task, we adopt COCO 2017 [34], the most widely-used benchmark for object detection and instance segmentation, which contains 118K training, 5K validation and 20K test-dev images. |
| Hardware Specification | No | The paper's checklist entry points to 'See Appendix' for hardware details, but the main body of the paper does not contain specific hardware specifications such as GPU models or processor types. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'). |
| Experiment Setup | Yes | For fair comparison, the same base network capacity is maintained across the pretraining tasks, such as using Swin Transformer with 224×224 input and a window size of 7. ... We deal with this issue by adopting a multi-scale augmentation similar to that of image classification. It randomly resizes the original image, then randomly crops a square part of the resized image [15]. ... For object detection, we adopt the framework of Mask R-CNN [20] with FPN [33] as the head network. ... For semantic segmentation, we adopt Mask2Former [9] with a one-block pixel decoder and a four-block transformer decoder as the head network. (Hedged sketches of the augmentation and the detection-head setup follow the table.) |
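
The frozen setting quoted under Research Type — one shared, frozen base network with lightweight trainable heads per task — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the `timm` model name and the toy linear head are assumptions for the example.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; any source of pretrained backbones works

# Load a pretrained Swin backbone as a feature extractor and freeze it.
backbone = timm.create_model("swin_tiny_patch4_window7_224",
                             pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # the shared base network stays frozen

# Only the lightweight task-specific head receives gradient updates.
head = nn.Linear(backbone.num_features, 80)  # hypothetical 80-way head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

x = torch.randn(2, 3, 224, 224)  # dummy batch at the paper's 224×224 input
with torch.no_grad():            # no gradients flow into the backbone
    feats = backbone(x)
logits = head(feats)
```

In the paper the heads are full detection/segmentation networks (Mask R-CNN, Mask2Former); the linear head here only illustrates where the gradient-flow boundary sits in the frozen setting.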
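
The multi-scale augmentation quoted under Experiment Setup — randomly resize the image, then randomly crop a square region — can be approximated with torchvision. The crop size and scale range below are assumptions; the excerpt does not specify them.

```python
import random
from PIL import Image
from torchvision import transforms as T
from torchvision.transforms import functional as F

def multiscale_square_crop(img: Image.Image, crop: int = 224,
                           scales: tuple = (0.5, 2.0)) -> Image.Image:
    """Randomly resize, then randomly crop a square region.

    `crop` and `scales` are illustrative assumptions; the paper only
    states resize-then-square-crop, following [15].
    """
    s = random.uniform(*scales)
    w, h = img.size
    # Resize by the sampled scale, never below the crop size.
    img = F.resize(img, (max(crop, round(h * s)), max(crop, round(w * s))))
    # Take a random square crop from the resized image.
    top, left, ch, cw = T.RandomCrop.get_params(img, (crop, crop))
    return F.crop(img, top, left, ch, cw)
```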
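
For the detection head named in the same row (Mask R-CNN with FPN), torchvision ships a ready-made variant; its ResNet-50 backbone stands in for the paper's Swin, which is an assumption of this sketch. Freezing the backbone body mirrors the frozen-base-network setting.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN with an FPN neck; ResNet-50 stands in for the paper's Swin.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")

# Freeze the backbone body so only FPN and head parameters train,
# mirroring the frozen shared base network studied in the paper.
for p in model.backbone.body.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```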