Rejuvenating image-GPT as Strong Visual Representation Learners
Authors: Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations. A notable achievement is its compelling performance on the ImageNet-1K dataset: by training on publicly available datasets, D-iGPT unprecedentedly achieves 90.0% top-1 accuracy with a vanilla ViT-H. Additionally, D-iGPT shows strong generalization on downstream tasks. |
| Researcher Affiliation | Academia | *Equal contribution. 1Johns Hopkins University, 2UC Santa Cruz. Correspondence to: Cihang Xie <cixie@ucsc.edu>. |
| Pseudocode | No | The paper describes the model architecture and methodology in text and through equations, but it does not include any explicitly labeled pseudocode or algorithm blocks/figures. |
| Open Source Code | Yes | Code is available at https://github.com/OliverRensu/D-iGPT. |
| Open Datasets | Yes | With ImageNet-1K as the sole pretraining dataset, our base-size model achieves an 86.2% top-1 classification accuracy... [...] When further scaling the pretraining to the ImageNet-21K dataset... [...] The model is then finetuned on the LAION-400M dataset (Schuhmann et al., 2021; 2022). |
| Dataset Splits | Yes | Following (He et al., 2022), we finetune pretrained models using the ImageNet-1K training set and test on the ImageNet-1K validation set with an input size of 224×224. [...] For semantic segmentation, we evaluate D-iGPT using the ADE20K dataset (Zhou et al., 2019), which comprises 150 categories with 20,000 training images and 2,000 validation images. |
| Hardware Specification | No | This work is supported by ONR with N00014-23-1-2641, TPU Research Cloud (TRC) program and Google Cloud Research Credits program. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | Implementation details. In our experiments, we use CLIP to provide semantic tokens. We pretrain, by default, all models on the ImageNet-1K dataset for 300 epochs. We set the batch size to 4096 and the peak learning rate to lr = 1.5e-4 × batchsize/256. We adopt a cosine learning rate decay schedule with a warm-up period of 40 epochs, and utilize the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.05. We use random resized cropping and random horizontal flipping, with the input size set to 224×224. When further scaling the pretraining to the ImageNet-21K dataset, all models undergo 150 epochs of pretraining with a warm-up stage of 5 epochs, a learning rate lr = 1.5e-3, and a batch size of 4096. (A minimal sketch of this recipe follows the table.) |
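
To make the quoted recipe concrete, below is a minimal, hypothetical PyTorch sketch of the ImageNet-1K pretraining schedule: AdamW with weight decay 0.05, a peak learning rate of 1.5e-4 × 4096/256 = 2.4e-3, cosine decay after a 40-epoch linear warmup, and the stated augmentations. The model stand-in and the per-epoch stepping granularity are assumptions for illustration; this is not the authors' released code (see the repository linked above for that).

```python
# Hypothetical rendering of the quoted pretraining hyperparameters,
# not the authors' released implementation.
import math

import torch
from torch.optim import AdamW
from torchvision import transforms

batch_size = 4096
peak_lr = 1.5e-4 * batch_size / 256          # linear scaling rule: 2.4e-3
warmup_epochs, total_epochs = 40, 300

# Augmentations quoted in the setup: random resized crop to 224x224 + horizontal flip.
pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = torch.nn.Linear(768, 768)            # stand-in for the ViT encoder (assumption)
optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.05)

def lr_at(epoch: int) -> float:
    """Cosine decay with a linear warmup, stepped per epoch (assumption:
    the paper does not say whether the schedule steps per epoch or per iteration)."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... one pretraining epoch over ImageNet-1K would run here ...
```

For the ImageNet-21K scaling quoted above, the same sketch would use total_epochs = 150, warmup_epochs = 5, and a base rate of 1.5e-3 in place of 1.5e-4.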